Merge pull request #34 from snayfach/dev

Dev

snayfach authored Nov 2, 2016
2 parents 5b977f6 + e5ca6ba commit f022f04
Showing 26 changed files with 687 additions and 621 deletions.
23 changes: 10 additions & 13 deletions docs/build_db.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,11 @@ This script will allow you to build a MIDAS database using your own genomes or m

This script uses user-defined species and user-supplied genomes to build database files for running MIDAS. This will allow you to estimate the abundance of these species in a shotgun metagenome and quantify gene content and SNPs for species with sufficient data.

First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is used for clustering genes. The default clustering identity is 95%
First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is first used to cluster genes at 99% identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, USEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with 'run_midas.py genes', reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with 'merge_midas.py genes'.
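
The multi-level clustering scheme can be pictured with a toy greedy clustering routine. This is a sketch of the idea only: MIDAS itself calls USEARCH, and the naive ungapped `sequence_identity` below stands in for real sequence alignment.

```python
def sequence_identity(a, b):
    """Naive ungapped percent identity between two sequences (illustration only)."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Greedy centroid clustering: each sequence joins the first centroid
    it matches at >= threshold identity, otherwise it founds a new cluster."""
    clusters = []  # list of (centroid_seq, members) pairs
    for seq in seqs:
        for centroid, members in clusters:
            if sequence_identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Cluster toy genes at 99% identity, then re-cluster the centroids at 95%
genes = ["A" * 20, "A" * 20, "A" * 19 + "C", "G" * 20]
clusters99 = greedy_cluster(genes, 0.99)           # 3 clusters
centroids99 = [c for c, _ in clusters99]
clusters95 = greedy_cluster(centroids99, 0.95)     # the 1-mismatch centroid merges
```

The point of the hierarchy is visible here: the sequence with one mismatch in twenty (95% identity) founds its own 99% cluster but collapses into the first centroid at the 95% threshold.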

Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user needs to define which genome they want to use as the representative for each species.
Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user simply needs to define which genome they want to use as the representative for each species.

Finally, a marker genes database is built. Marker genes are defined as universal, single-copy gene families. These are genes that occur once per genome and in all genomes (of bacteria). MIDAS uses a set of 15 such gene families. These are a subset of the PhyEco gene families described here: http://dx.doi.org/10.1371/journal.pone.0077033. To identify these genes, HMMER (http://hmmer.org) is used to scan each species' pan-genome. Once identified, an HS-BLASTN (http://dx.doi.org/10.1093/nar/gkv784) database is built for mapping short reads.
Finally, a marker genes database is built. Marker genes are defined as universal, single-copy gene families. These are genes that occur once per genome and in all genomes (of bacteria). MIDAS uses a set of 15 such gene families. These are a subset of the PhyEco gene families described here: http://dx.doi.org/10.1371/journal.pone.0077033. To identify these genes, HMMER (http://hmmer.org) is used to scan genes from the representative genome. Once identified, an HS-BLASTN (http://dx.doi.org/10.1093/nar/gkv784) database is built for mapping short reads.
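
The "universal, single-copy" criterion can be stated compactly: keep the families that occur exactly once in every genome. The sketch below is illustrative only (the real pipeline uses HMMER profile searches against the PhyEco families; the family names here are made up):

```python
def single_copy_universal(genomes):
    """Return gene families that occur exactly once in every genome.

    genomes: dict mapping genome_id -> dict of family_id -> copy number.
    """
    all_families = set()
    for counts in genomes.values():
        all_families.update(counts)
    return sorted(
        fam for fam in all_families
        if all(counts.get(fam, 0) == 1 for counts in genomes.values())
    )

# famC is excluded: duplicated in genome_1 and missing from genome_2
genomes = {
    "genome_1": {"famA": 1, "famB": 1, "famC": 2},
    "genome_2": {"famA": 1, "famB": 1},
}
markers = single_copy_universal(genomes)  # ['famA', 'famB']
```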

## Requirements
As with all scripts, MIDAS and its dependencies will need to be installed.
Expand Down Expand Up @@ -39,19 +39,23 @@ Path to directory of genomes. Each subdirectory should be named with a genome id
\<genome_id>.fna
Genomic DNA sequence in FASTA format

\<genome_id>.faa
Protein sequences in FASTA format

\<genome_id>.ffn
Gene DNA sequences in FASTA format

\<genome_id>.features
File specificy genomic coordinates of genes.
File specifies genomic coordinates of genes.
This file should be tab-delimited with a header and 6 fields:

* scaffold_id (CHAR)
* gene_id (CHAR)
* start (INT)
* end (INT)
* strand ('+' or '-')
* gene_type ('CDS' or 'RNA')
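
A minimal reader for this format might look like the following sketch. The field names follow the list above; the inline string stands in for a real `.features` file:

```python
import csv
import io

def read_features(handle):
    """Parse a tab-delimited .features file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Coordinates are integers; everything else stays a string
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        yield row

example = io.StringIO(
    "scaffold_id\tgene_id\tstart\tend\tstrand\tgene_type\n"
    "scaffold_1\tgene_1\t100\t400\t+\tCDS\n"
)
features = list(read_features(example))
```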

<b>mapfile PATH</b>
Path to mapping file that specifies which genomes belong to the same species.
The file should be a tab-delimited file with a header and 3 fields:
Expand All @@ -68,13 +72,6 @@ Directory to store MIDAS database
<b>--threads INT</b>
Number of threads to use

<b>--pid FLOAT</b>
Percent identity threshold for clustering genes (default=0.95)

<b>--iter_size INT</b>
Maximum number of genomes to process at one time to prevent exceeding USEARCH's 4G memory limit (default=500).
If the number of genomes per species exceeds this parameter, USEARCH will be run in batches of <iter_size>. After all batches have completed, the resulting centroids from each batch are combined and clustered one final time.

<b>--max_species INT</b>
Maximum number of species to process from input (default=use all).
Useful for quick tests
Expand Down
3 changes: 1 addition & 2 deletions docs/cnvs.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,8 @@ Read alignment options (if using --align):
Quantify genes options (if using --call_genes):
--readq INT Discard reads with mean quality < READQ (20)
--mapid FLOAT Discard reads with alignment identity < MAPID (94.0)
--mapq INT Discard reads with mapping quality < MAPQ (10)
--aln_cov FLOAT Discard reads with alignment coverage < ALN_COV (0.75)
--trim INT Trim N base-pairs from read-tails (0)
--trim INT Trim N base-pairs from 3'/right end of reads (0)
```
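
Ignoring the trim step, the three discard options above amount to simple per-read predicates. The sketch below is not MIDAS's actual implementation; the alignment record and its field names are assumptions for illustration:

```python
def keep_read(aln, readq=20, mapid=94.0, aln_cov=0.75):
    """Apply the default quality filters to one alignment record.

    aln: dict with assumed keys 'mean_quality', 'percent_identity',
    'aligned_length', and 'read_length'.
    """
    if aln["mean_quality"] < readq:          # --readq
        return False
    if aln["percent_identity"] < mapid:      # --mapid
        return False
    if aln["aligned_length"] / aln["read_length"] < aln_cov:  # --aln_cov
        return False
    return True

good = {"mean_quality": 30, "percent_identity": 97.0,
        "aligned_length": 95, "read_length": 100}
bad = {"mean_quality": 30, "percent_identity": 90.0,   # fails --mapid
       "aligned_length": 95, "read_length": 100}
```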

## Examples
Expand Down
4 changes: 2 additions & 2 deletions docs/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ Or, download the latest source code:
https://github.com/snayfach/MIDAS/archive/master.zip
And unpack the project: `unzip MIDAS-master.zip`

Or, download a stable releases of the source code:
Or, download a stable release of the source code:
[https://github.com/snayfach/MIDAS/releases](https://github.com/snayfach/MIDAS/releases)

### Install dependencies
Expand All @@ -45,7 +45,7 @@ If you still have installation issues, please open a new [issue] (https://github

This will enable you to run MIDAS scripts:
`export PYTHONPATH=$PYTHONPATH:/path/to/MIDAS`
`export PATH=$PATH:/path/to/MIDAS/scripts`
`export MIDAS_DB=/path/to/midas_database`

Be sure to replace '/path/to' with the appropriate path on your system.
Expand Down
17 changes: 12 additions & 5 deletions docs/merge_cnvs.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,26 +34,33 @@ Sample filters (select subset of samples from INPUT):
--max_samples INT Maximum number of samples to process. Useful for testing (use all)
Presence/Absence:
--cluster_pid {75,80,85,90,95,99}
In the database, pan-genomes are defined at 6 different % identity clustering cutoffs
CLUSTER_PID allows you to quantify gene content for any of these sets of gene clusters
By default, gene content is reported for genes clustered at 95% identity (95)
--min_copy FLOAT Genes >= MIN_COPY are classified as present
Genes < MIN_COPY are classified as absent (0.35)
```
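
The presence/absence call is a simple threshold on estimated copy number. A sketch under the default `--min_copy` cutoff of 0.35 (the gene names and copy numbers below are invented):

```python
def call_presence(copy_numbers, min_copy=0.35):
    """Classify genes as present (True) or absent (False) by copy number."""
    return {gene: cn >= min_copy for gene, cn in copy_numbers.items()}

copy_numbers = {"gene_1": 1.10, "gene_2": 0.02, "gene_3": 0.35}
presence = call_presence(copy_numbers)
# gene_3 sits exactly at the cutoff, so it is called present
```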

## Examples

Examples:
1) Merge results for all species. Provide list of paths to sample directories:
`merge_midas.py genes /path/to/outdir -i sample_1,sample_2 -t list`

2) Merge results for one species:
`merge_midas.py genes /path/to/outdir --species_id Bacteroides_vulgatus_57955 -i sample_1,sample_2 -t list`

3) Exclude low-coverage samples in output matrix:
3) Quantify gene clusters at 90% identity:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --cluster_pid 90`

4) Exclude low-coverage samples in output matrix:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --sample_depth 5.0`

4) Use lenient threshold for determining gene presence-absence:
5) Use lenient threshold for determining gene presence-absence:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --min_copy 0.1`

5) Run a quick test:
6) Run a quick test:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --max_species 1 --max_samples 10`


Expand Down
27 changes: 15 additions & 12 deletions docs/ref_db.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,27 +4,27 @@ Description of how the MIDAS database was constructed, how the species groups co
## Install reference database

#### Step 1. download default database
Download from your browser:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz)
Download the latest version from your browser:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz)

Or, download from the command line:
on Unix: `wget http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz`
on OSX: `curl http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz > midas_db_v1.1.tar.gz`
on Unix: `wget http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz`
on OSX: `curl http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz > midas_db_v1.2.tar.gz`

* This may take several minutes to several hours, depending on your internet speed
* The entire database requires 32 GB of free space to download and decompress
* Once decompressed, it takes 16G of free space
* See midas_db_v1.1/README for more information
* The entire database requires 36 GB of free space to download and decompress
* Once decompressed, it takes 18 GB of free space
* See midas_db_v1.2/README for more information

#### Step 2. unpack tarball
`tar -zxvf midas_db_v1.1.tar.gz`
`tar -zxvf midas_db_v1.2.tar.gz`

#### Step 3. create MIDAS_DB environmental variable
The MIDAS_DB variable tells MIDAS where the reference database is located:
`export MIDAS_DB=midas_db_v1.1`
`export MIDAS_DB=midas_db_v1.2`

Alternatively, you can manually specify the database location when you run MIDAS:
ex: `run_midas.py species outdir -d midas_db_v1.1 [options]`
ex: `run_midas.py species outdir -d midas_db_v1.2 [options]`

## Build custom database
Alternatively, you can build a custom database with your own genome sequences [read more](build_db.md)
Expand Down Expand Up @@ -53,8 +53,11 @@ Each genome-cluster was annotated according to the consensus (i.e., most common)

**Pan-genome**

* The set of non-redundant genes (95% DNA identity) across all genomes within species
* Metagenomic reads are mapped to pan-genome database to determine the gene-content of strains in a sample
* The set of non-redundant genes (99% DNA identity) across all genomes within each species
* Mapping between 99% identity gene clusters and gene clusters at lower % identity clustering thresholds (95, 90, 85, 80, and 75)
* Metagenomic reads are initially mapped to centroid gene sequences from 99% gene families
* Reads can be easily aggregated into gene clusters at lower % identity clustering thresholds
* Mapped reads are used to determine the gene content of strains in a sample
* Gene clustering was performed with USEARCH
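
The aggregation step described above can be pictured as a group-by over the centroid-to-cluster mapping. This is a sketch only; the centroid IDs, cluster IDs, and read counts below are invented:

```python
from collections import defaultdict

def aggregate_counts(centroid_counts, centroid_to_cluster):
    """Sum per-99%-centroid read counts into lower-identity gene clusters."""
    totals = defaultdict(int)
    for centroid, count in centroid_counts.items():
        totals[centroid_to_cluster[centroid]] += count
    return dict(totals)

# Reads mapped per 99% centroid, and each centroid's 95% identity cluster
centroid_counts = {"c99_1": 10, "c99_2": 5, "c99_3": 7}
to_95 = {"c99_1": "c95_1", "c99_2": "c95_1", "c99_3": "c95_2"}
counts_95 = aggregate_counts(centroid_counts, to_95)  # {'c95_1': 15, 'c95_2': 7}
```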

### Database coverage across biomes
Expand Down
6 changes: 3 additions & 3 deletions docs/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,14 @@ Install python dependencies as needed:
[read more...](install.md)

Download & unpack reference database:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz)
`tar -zxvf midas_db_v1.1.tar.gz`
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz)
`tar -zxvf midas_db_v1.2.tar.gz`
[read more...](ref_db.md)

Update your environment:
`export PYTHONPATH=$PYTHONPATH:MIDAS`
`export PATH=$PATH:MIDAS/scripts`
`export MIDAS_DB=midas_db_v1.1`
`export MIDAS_DB=midas_db_v1.2`

Optionally, download & unpack example dataset:
[http://lighthouse.ucsf.edu/MIDAS/example.tar.gz](http://lighthouse.ucsf.edu/MIDAS/example.tar.gz)
Expand Down
2 changes: 2 additions & 0 deletions midas/analyze/track_strains.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,8 @@ def call_markers(species, samples, args):
# open marker list
markers = utility.parse_file(species.paths['markers'])
marker = fetch_marker(markers) # dictionary for 1st marker allele
if marker is None:
sys.exit("\nError: no marker alleles found in file: %s\n" % species.paths['markers'])

# init markers per sample
for sample in samples.values():
Expand Down