Merge pull request #34 from snayfach/dev

Dev

snayfach authored Nov 2, 2016
2 parents 5b977f6 + e5ca6ba commit f022f04
Showing 26 changed files with 687 additions and 621 deletions.
23 changes: 10 additions & 13 deletions docs/build_db.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,11 @@ This script will allow you to build a MIDAS database using your own genomes or m

This script uses user-defined species and user-supplied genomes to build database files for running MIDAS. This will allow you to estimate the abundance of these species in a shotgun metagenome and quantify gene content and SNPs for species with sufficient data.

First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is used for clustering genes. The default clustering identity is 95%
First, a pan-genome is built for each species. A pan-genome is defined as the set of non-redundant genes across all genomes for each species. The program USEARCH (http://www.drive5.com/usearch) is first used to cluster genes at 99% identity and identify a centroid gene sequence from each cluster. These "centroids" are used for recruiting metagenomic reads. This is done to reduce the number of genes searched while maintaining maximum mapping sensitivity. Next, USEARCH is used to cluster the centroids from 99% identity gene clusters further at 95, 90, 85, 80, and 75 percent identity. After mapping reads to 99% identity centroids with 'run_midas.py genes', reads can be optionally aggregated into gene clusters at any of the lower clustering thresholds with 'merge_midas.py genes'.
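
The multi-level clustering scheme can be pictured with a toy greedy clustering routine. This is a sketch of the idea only: MIDAS itself calls USEARCH, and the naive ungapped `sequence_identity` below stands in for real sequence alignment.

```python
def sequence_identity(a, b):
    """Naive ungapped percent identity between two sequences (illustration only)."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Greedy centroid clustering: each sequence joins the first centroid
    it matches at >= threshold identity, otherwise it founds a new cluster."""
    clusters = []  # list of (centroid_seq, members) pairs
    for seq in seqs:
        for centroid, members in clusters:
            if sequence_identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Cluster toy genes at 99% identity, then re-cluster the centroids at 95%
genes = ["A" * 20, "A" * 20, "A" * 19 + "C", "G" * 20]
clusters99 = greedy_cluster(genes, 0.99)           # 3 clusters
centroids99 = [c for c, _ in clusters99]
clusters95 = greedy_cluster(centroids99, 0.95)     # the 1-mismatch centroid merges
```

The point of the hierarchy is visible here: the sequence with one mismatch in twenty (95% identity) founds its own 99% cluster but collapses into the first centroid at the 95% threshold.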

Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user needs to define which genome they want to use as the representative for each species.
Next, a representative-genome database is built. Representative genomes are used for mapping reads and calling SNPs. The user simply needs to define which genome they want to use as the representative for each species.

Finally, a marker genes database is built. Marker genes are defined as universal, single-copy gene families. These are genes that occur once per genome and in all genomes (of bacteria). MIDAS uses a set of 15 such gene families. These are a subset of the PhyEco gene families described here: http://dx.doi.org/10.1371/journal.pone.0077033. To identify these genes, HMMER (http://hmmer.org) is used to scan each species' pan-genome. Once identified, an HS-BLASTN (http://dx.doi.org/10.1093/nar/gkv784) database is built for mapping short reads.
Finally, a marker genes database is built. Marker genes are defined as universal, single-copy gene families. These are genes that occur once per genome and in all genomes (of bacteria). MIDAS uses a set of 15 such gene families. These are a subset of the PhyEco gene families described here: http://dx.doi.org/10.1371/journal.pone.0077033. To identify these genes, HMMER (http://hmmer.org) is used to scan genes from the representative genome. Once identified, an HS-BLASTN (http://dx.doi.org/10.1093/nar/gkv784) database is built for mapping short reads.
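
The "universal, single-copy" criterion can be stated compactly: keep the families that occur exactly once in every genome. The sketch below is illustrative only (the real pipeline uses HMMER profile searches against the PhyEco families; the family names here are made up):

```python
def single_copy_universal(genomes):
    """Return gene families that occur exactly once in every genome.

    genomes: dict mapping genome_id -> dict of family_id -> copy number.
    """
    all_families = set()
    for counts in genomes.values():
        all_families.update(counts)
    return sorted(
        fam for fam in all_families
        if all(counts.get(fam, 0) == 1 for counts in genomes.values())
    )

# famC is excluded: duplicated in genome_1 and missing from genome_2
genomes = {
    "genome_1": {"famA": 1, "famB": 1, "famC": 2},
    "genome_2": {"famA": 1, "famB": 1},
}
markers = single_copy_universal(genomes)  # ['famA', 'famB']
```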

## Requirements
As with all scripts, MIDAS and its dependencies will need to be installed.
Expand Down Expand Up @@ -39,19 +39,23 @@ Path to directory of genomes. Each subdirectory should be named with a genome id
\<genome_id>.fna
Genomic DNA sequence in FASTA format

\<genome_id>.faa
Protein sequences in FASTA format

\<genome_id>.ffn
Gene DNA sequences in FASTA format

\<genome_id>.features
File specificy genomic coordinates of genes.
File specifies genomic coordinates of genes.
This file should be tab-delimited with a header and 6 fields:

* scaffold_id (CHAR)
* gene_id (CHAR)
* start (INT)
* end (INT)
* strand ('+' or '-')
* gene_type ('CDS' or 'RNA')
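
A minimal reader for this format might look like the following sketch. The field names follow the list above; the inline string stands in for a real `.features` file:

```python
import csv
import io

def read_features(handle):
    """Parse a tab-delimited .features file with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Coordinates are integers; everything else stays a string
        row["start"] = int(row["start"])
        row["end"] = int(row["end"])
        yield row

example = io.StringIO(
    "scaffold_id\tgene_id\tstart\tend\tstrand\tgene_type\n"
    "scaffold_1\tgene_1\t100\t400\t+\tCDS\n"
)
features = list(read_features(example))
```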

<b>mapfile PATH</b>
Path to mapping file that specifies which genomes belong to the same species.
The file should be a tab-delimited file with a header and 3 fields:
Expand All @@ -68,13 +72,6 @@ Directory to store MIDAS database
<b>--threads INT</b>
Number of threads to use

<b>--pid FLOAT</b>
Percent identity threshold for clustering genes (default=0.95)

<b>--iter_size INT</b>
Maximum number of genomes to process at one time to prevent exceeding USEARCH's 4G memory limit (default=500).
If the number of genomes per species exceeds this parameter, USEARCH will be run in batches of <iter_size>. After all batches have completed, the resulting centroids from each batch are combined and clustered one final time.

<b>--max_species INT</b>
Maximum number of species to process from input (default=use all).
Useful for quick tests
Expand Down
3 changes: 1 addition & 2 deletions docs/cnvs.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,8 @@ Read alignment options (if using --align):
Quantify genes options (if using --call_genes):
--readq INT Discard reads with mean quality < READQ (20)
--mapid FLOAT Discard reads with alignment identity < MAPID (94.0)
--mapq INT Discard reads with mapping quality < MAPQ (10)
--aln_cov FLOAT Discard reads with alignment coverage < ALN_COV (0.75)
--trim INT Trim N base-pairs from read-tails (0)
--trim INT Trim N base-pairs from 3'/right end of reads (0)
```
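
Ignoring the trim step, the three discard options above amount to simple per-read predicates. The sketch below is not MIDAS's actual implementation; the alignment record and its field names are assumptions for illustration:

```python
def keep_read(aln, readq=20, mapid=94.0, aln_cov=0.75):
    """Apply the default quality filters to one alignment record.

    aln: dict with assumed keys 'mean_quality', 'percent_identity',
    'aligned_length', and 'read_length'.
    """
    if aln["mean_quality"] < readq:          # --readq
        return False
    if aln["percent_identity"] < mapid:      # --mapid
        return False
    if aln["aligned_length"] / aln["read_length"] < aln_cov:  # --aln_cov
        return False
    return True

good = {"mean_quality": 30, "percent_identity": 97.0,
        "aligned_length": 95, "read_length": 100}
bad = {"mean_quality": 30, "percent_identity": 90.0,   # fails --mapid
       "aligned_length": 95, "read_length": 100}
```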

## Examples
Expand Down
4 changes: 2 additions & 2 deletions docs/install.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ Or, download the latest source code:
https://github.com/snayfach/MIDAS/archive/master.zip
And unpack the project: `unzip MIDAS-master.zip`

Or, download a stable releases of the source code:
Or, download a stable release of the source code:
[https://github.com/snayfach/MIDAS/releases](https://github.com/snayfach/MIDAS/releases)

### Install dependencies
Expand All @@ -45,7 +45,7 @@ If you still have installation issues, please open a new [issue] (https://github

This will enable you to run MIDAS scripts:
`export PYTHONPATH=$PYTHONPATH:/path/to/MIDAS`
`export PATH=$PATH:/path/to/MIDAS/scripts`
`export MIDAS_DB=/path/to/midas_database`

Be sure to replace '/path/to' with the appropriate path on your system.
Expand Down
17 changes: 12 additions & 5 deletions docs/merge_cnvs.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,26 +34,33 @@ Sample filters (select subset of samples from INPUT):
--max_samples INT Maximum number of samples to process. Useful for testing (use all)
Presence/Absence:
--cluster_pid {75,80,85,90,95,99}
In the database, pan-genomes are defined at 6 different % identity clustering cutoffs
CLUSTER_PID allows you to quantify gene content for any of these sets of gene clusters
By default, gene content is reported for genes clustered at 95% identity (95)
--min_copy FLOAT Genes >= MIN_COPY are classified as present
Genes < MIN_COPY are classified as absent (0.35)
```
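
The presence/absence call is a simple threshold on estimated copy number. A sketch under the default `--min_copy` cutoff of 0.35 (the gene names and copy numbers below are invented):

```python
def call_presence(copy_numbers, min_copy=0.35):
    """Classify genes as present (True) or absent (False) by copy number."""
    return {gene: cn >= min_copy for gene, cn in copy_numbers.items()}

copy_numbers = {"gene_1": 1.10, "gene_2": 0.02, "gene_3": 0.35}
presence = call_presence(copy_numbers)
# gene_3 sits exactly at the cutoff, so it is called present
```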

## Examples

Examples:
1) Merge results for all species. Provide list of paths to sample directories:
`merge_midas.py genes /path/to/outdir -i sample_1,sample_2 -t list`

2) Merge results for one species:
`merge_midas.py genes /path/to/outdir --species_id Bacteroides_vulgatus_57955 -i sample_1,sample_2 -t list`

3) Exclude low-coverage samples in output matrix:
3) Quantify gene clusters at 90% identity:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --cluster_pid 90`

4) Exclude low-coverage samples in output matrix:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --sample_depth 5.0`

4) Use lenient threshold for determining gene presence-absence:
5) Use lenient threshold for determining gene presence-absence:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --min_copy 0.1`

5) Run a quick test:
6) Run a quick test:
`merge_midas.py genes /path/to/outdir -i /path/to/samples -t dir --max_species 1 --max_samples 10`


Expand Down
27 changes: 15 additions & 12 deletions docs/ref_db.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,27 +4,27 @@ Description of how the MIDAS database was constructed, how the species groups co
## Install reference database

#### Step 1. download default database
Download from your browser:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz)
Download the latest version from your browser:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz)

Or, download from the command line:
on Unix: `wget http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz`
on OSX: `curl http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz > midas_db_v1.1.tar.gz`
on Unix: `wget http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz`
on OSX: `curl http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz > midas_db_v1.2.tar.gz`

* This may take several minutes to several hours, depending on your internet speed
* The entire database requires 32 GB of free space to download and decompress
* Once decompressed, it takes 16G of free space
* See midas_db_v1.1/README for more information
* The entire database requires 36 GB of free space to download and decompress
* Once decompressed, it takes 18 GB of free space
* See midas_db_v1.2/README for more information

#### Step 2. unpack tarball
`tar -zxvf midas_db_v1.1.tar.gz`
`tar -zxvf midas_db_v1.2.tar.gz`

#### Step 3. create MIDAS_DB environmental variable
The MIDAS_DB variable tells MIDAS where the reference database is located:
`export MIDAS_DB=midas_db_v1.1`
`export MIDAS_DB=midas_db_v1.2`

Alternatively, you can manually specify the database location when you run MIDAS:
ex: `run_midas.py species outdir -d midas_db_v1.1 [options]`
ex: `run_midas.py species outdir -d midas_db_v1.2 [options]`

## Build custom database
Alternatively, you can build a custom database with your own genome sequences [read more](build_db.md)
Expand Down Expand Up @@ -53,8 +53,11 @@ Each genome-cluster was annotated according to the consensus (i.e., most common)

**Pan-genome**

* The set of non-redundant genes (95% DNA identity) across all genomes within species
* Metagenomic reads are mapped to pan-genome database to determine the gene-content of strains in a sample
* The set of non-redundant genes (99% DNA identity) across all genomes within each species
* Mapping between 99% identity gene clusters and gene clusters at lower % identity clustering thresholds (95, 90, 85, 80, and 75)
* Metagenomic reads are initially mapped to centroid gene sequences from 99% gene families
* Reads can be easily aggregated into gene clusters at lower % identity clustering thresholds
* Mapped reads are used to determine the gene content of strains in a sample
* Gene clustering was performed with USEARCH
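
The aggregation step described above can be pictured as a group-by over the centroid-to-cluster mapping. This is a sketch only; the centroid IDs, cluster IDs, and read counts below are invented:

```python
from collections import defaultdict

def aggregate_counts(centroid_counts, centroid_to_cluster):
    """Sum per-99%-centroid read counts into lower-identity gene clusters."""
    totals = defaultdict(int)
    for centroid, count in centroid_counts.items():
        totals[centroid_to_cluster[centroid]] += count
    return dict(totals)

# Reads mapped per 99% centroid, and each centroid's 95% identity cluster
centroid_counts = {"c99_1": 10, "c99_2": 5, "c99_3": 7}
to_95 = {"c99_1": "c95_1", "c99_2": "c95_1", "c99_3": "c95_2"}
counts_95 = aggregate_counts(centroid_counts, to_95)  # {'c95_1': 15, 'c95_2': 7}
```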

### Database coverage across biomes
Expand Down
6 changes: 3 additions & 3 deletions docs/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,14 @@ Install python dependencies as needed:
[read more...](install.md)

Download & unpack reference database:
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.1.tar.gz)
`tar -zxvf midas_db_v1.1.tar.gz`
[http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz](http://lighthouse.ucsf.edu/MIDAS/midas_db_v1.2.tar.gz)
`tar -zxvf midas_db_v1.2.tar.gz`
[read more...](ref_db.md)

Update your environment:
`export PYTHONPATH=$PYTHONPATH:MIDAS`
`export PATH=$PATH:MIDAS/scripts`
`export MIDAS_DB=midas_db_v1.1`
`export MIDAS_DB=midas_db_v1.2`

Optionally, download & unpack example dataset:
[http://lighthouse.ucsf.edu/MIDAS/example.tar.gz](http://lighthouse.ucsf.edu/MIDAS/example.tar.gz)
Expand Down
2 changes: 2 additions & 0 deletions midas/analyze/track_strains.py
Original file line number Diff line number Diff line change
Expand Up @@ -90,6 +90,8 @@ def call_markers(species, samples, args):
# open marker list
markers = utility.parse_file(species.paths['markers'])
marker = fetch_marker(markers) # dictionary for 1st marker allele
if marker is None:
sys.exit("\nError: no marker alleles found in file: %s\n" % species.paths['markers'])

# init markers per sample
for sample in samples.values():
Expand Down