Commit

Merge pull request #262 from VDBWRAIR/v423
Closes #261, closes #259, #258. Related to #243
necrolyte2 committed Jul 6, 2015
2 parents 6620c4d + 15d38c9 commit 9f93a36
Showing 6 changed files with 62 additions and 49 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,12 @@
Changelog
=========

Version 4.2.3
-------------

* Fixed a bug with make_summary looking for missing order column
* Fixed documentation for database generation

Version 4.2.2
-------------

72 changes: 41 additions & 31 deletions docs/source/databases.rst
@@ -6,9 +6,13 @@ The pipeline requires that you have blast databases and host genome indexes
available for some of the stages such as :doc:`stages/host_map`
and :doc:`stages/iterative_blast_phylo`

*Note* Most of the commands below may not exist on your system until after you
have completed the normal installation in which the commands are placed into
your virtualenv's bin directory.
Prereqs
=======

Most of the commands below may not exist on your system until after you
have completed the normal installation of the pipeline in which the commands
are placed into your virtualenv's bin directory.

It is fine to complete the installation(``python setup.py install``) prior
to setting up your databases, but just know that ``verifydatabases`` will report
errors until after you have completed setting up your databases.
@@ -38,7 +42,7 @@ modify all the instructions below replacing ``~/databases`` with
the path you choose.

You will also need to change the ``pathdiscov/files/config.yaml``
file to point to that location as well. It by defualt also points to
file to point to that location as well. It by default also points to
~/databases.

Create databases directory structure
@@ -47,16 +51,17 @@ Create databases directory structure
.. code-block:: bash
mkdir -p ~/databases/{humandna,humanrna,ncbi}
mkdir -p ~/databases/ncbi/blast/{nt,nr,taxonomy}
mkdir -p ~/databases/ncbi/blast/{nt,nr}
mkdir -p ~/databases/ncbi/taxonomy
Blast
=====

Blast databases coorespond to :doc:`stages/iterative_blast_phylo`'s
Blast databases correspond to :doc:`stages/iterative_blast_phylo`'s
``blast_db_list`` and ``blast_pro_db``

In general you just need to unpack the nt/nr databases from ncbi(or wherever)
into ~/databases/ncbi/blast/nt,nr,taxdb
into ~/databases/ncbi/blast/nt,nr

There is a shell script included that you can use to do this for you.
This may take a long time depending on your network connection.
@@ -68,22 +73,22 @@ This may take a long time depending on your network connection.
Taxonomy
========

Taxonomy databases coorespond to :doc:`stages/iterative_blast_phylo`'s
Taxonomy databases correspond to :doc:`stages/iterative_blast_phylo`'s
``taxonomy_nodes`` and ``taxonomy_names``

You need to download and extract the taxonomy databases as well so the pipeline
can extract taxonomy names for each of the blast results

.. code-block:: bash
pushd ~/databases/ncbi/blast/taxonomy
pushd ~/databases/ncbi/taxonomy
wget http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz -O - | tar xzvf -
popd
Diamond
=======

Diamond coorespond to :doc:`stages/iterative_blast_phylo`'s
Diamond corresponds to :doc:`stages/iterative_blast_phylo`'s
``blast_db_list``

Download and index protein database for diamond blastx
@@ -111,7 +116,7 @@ blast nr database
Host Genome Setup
=================

The host genome setup cooresponds to the :doc:`stages/host_map`'s
The host genome setup corresponds to the :doc:`stages/host_map`'s
``mapper_db_list``

General steps to build host genome
@@ -150,44 +155,44 @@ DNA

.. code-block:: bash
_cwd=$(pwd)
pushd ~/databases/humandna
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
tar -xzvf hg38.chromFa.tar.gz
#. Clean up download

.. code-block:: bash
rm chroms/\*_random.fa
rm chroms/\*alt.fa
rm -rf chroms
rm hg38.chromFa.tar.gz
#. Concatenate all host fasta [Optional]

If you have multiple hosts, you may download the fasta files of all
hosts to same folder ('chroms/') and concatinate as show below.
You may also modify the names accordingly, exmaple instead of hg38, you may
hosts to same folder ('chroms/') and concatenate as show below.
You may also modify the names accordingly, example instead of hg38, you may
name 'allHost.fa'

.. code-block:: bash
cat chroms/*.fa > hg38_all.fa
cat chroms/\*.fa > hg38_all.fa
#. Index the downloaded fasta

* Bowtie

.. code-block:: bash
${_cwd}/pathdiscov/download/bowtie2/bowtie2-build hg38_all.fa hg38
bowtie2-build hg38_all.fa hg38
* Snap

.. code-block:: bash
${_cwd}/pathdiscov/download/snap/snap index hg38_all.fa hg38 -s 20
snap index hg38_all.fa hg38 -s 20
#. Clean up download

.. code-block:: bash
rm chroms/\*_random.fa
rm chroms/\*alt.fa
rm -rf chroms
rm hg38.chromFa.tar.gz
rm hg38_all.fa
#. Setup config.yaml to utilize indexed database

@@ -199,13 +204,12 @@ DNA
RNA
^^^

Download human rna from the same URL, the version of the geome might be different.
Download human rna from the same URL, the version of the genome might be different.

#. Download and unpack

.. code-block:: bash
_cwd=$(pwd)
pushd ~/databases/humanrna
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/mrna.fa.gz
gunzip mrna.fa.gz
@@ -216,13 +220,19 @@ Download human rna from the same URL, the version of the geome might be different.

.. code-block:: bash
${_cwd}/pathdiscov/download/bowtie2/bowtie2-build mrna.fa hg38_mrna
bowtie2-build mrna.fa hg38_mrna
* Snap

.. code-block:: bash
${_cwd}/pathdiscov/download/snap/snap index mrna.fa hg38_mrna -s 20
snap index mrna.fa hg38_mrna -s 20
#. Cleanu up download

.. code-block:: bash
rm mrna.fa
#. Setup config.yaml to utilize indexed database

@@ -234,7 +244,7 @@ Download human rna from the same URL, the version of the geome might be different.
Verify Databases
================

Note: This command is only available after you install. Unfortuneatly at this point you cannot use verifydatabases until after you have finished the entire installation.
Note: This command is only available after you install. Unfortunately at this point you cannot use verifydatabases until after you have finished the entire installation.

You will probably want to ensure that the pipeline can find all of your databases. There is now a handy script that you can use to do this prior to installing.

12 changes: 6 additions & 6 deletions docs/source/install.rst
@@ -92,19 +92,19 @@ Installation
SEQUENCE_PLATFORM: illumina #choices are: illumina,454
#. Databases setup

You must refer to built documentation to set up these databases. These databases must be built before you can verify below.

See :doc:`databases` or `<databases.rst>`_ if you have not built the docs

#. Install the pipeline into the virtualenv

.. code-block:: bash
python setup.py install
#. Databases setup

You must refer to built documentation to set up these databases. These databases must be built before you can verify below.

See :doc:`databases` or `<databases.rst>`_ if you have not built the docs

#. Quick verify of a few things

* See if required executables are available
9 changes: 4 additions & 5 deletions pathdiscov/make_summary.py
@@ -179,7 +179,6 @@ def contigs_for( projdir, blastcol, blastval ):
info['family'] = contig['family']
info['genus'] = contig['genus']
info['superkingdom'] = contig['superkingdom']
info['order'] = contig['order']
info['description'] = contig['descrip']
yield info

@@ -258,8 +257,8 @@ def format_summary( summary ):
rows = []
import itertools
# Iterate over longsest of the two and fill the other in with ''
contigkeys = ('contigname','length','numreads','accession','superkingdom', 'order', 'family','genus','description')
unasskeys = ('count','accession','superkingdom', 'order', 'family','genus','descrip')
contigkeys = ('contigname','length','numreads','accession','superkingdom', 'family','genus','description')
unasskeys = ('count','accession','superkingdom', 'family','genus','descrip')
prefix = format_dict( summary, ('numreads','nonhostreads','numcontig','numblastcontig','n50','assemblylength') )
unassembled = sorted( summary['unassembled'].items(), key=lambda x: x[1]['count'], reverse=True )
for contig, unassembled in itertools.izip_longest( summary['contigs'], unassembled, fillvalue=None ):
@@ -300,11 +299,11 @@ def main( ):
# These come from summary
hdr += ['Num Reads', 'Non-Host Num reads', 'Num Ctg', 'Num blast0 Ctg', 'N50', 'Assembly Length']
# These come from summary['contig']
hdr += ['Ctg#', 'Ctg bp', 'numReads', 'Accession', 'Superkingdom', 'Order', 'Family', 'Genus', 'description']
hdr += ['Ctg#', 'Ctg bp', 'numReads', 'Accession', 'Superkingdom', 'Family', 'Genus', 'description']
# These come from summary
hdr += ['Num unassem', 'Num blast0 Unassem']
# These come from summary['unassembled']
hdr += ['num reads', 'Accession', 'Superkingdom', 'Order', 'Family', 'Genus', 'descrip']
hdr += ['num reads', 'Accession', 'Superkingdom', 'Family', 'Genus', 'descrip']
print '\t'.join( hdr )
for p in args.projdir:
try:
2 changes: 1 addition & 1 deletion pathdiscov/metadata.py
@@ -8,7 +8,7 @@
package = 'pathdiscov'
project = "Pathogen Discovery Pipeline"
project_no_spaces = project.replace(' ', '')
version = '4.2.2'
version = '4.2.3'
description = 'Pathogen Discovery Pipleline'
authors = [
'Mike Wiley',
10 changes: 4 additions & 6 deletions pathdiscov/test/test_make_summary.py
@@ -61,7 +61,6 @@ def mock_contig( self, *args, **kwargs ):
'contigname': 'contigname',
'description': 'description',
'superkingdom' : 'sk',
'order' : 'ord',
'family': 'family',
'genus': 'genus',
'length': 1,
@@ -82,7 +81,6 @@ def mock_unassembled( self, *args, **blastcols ):
'genus': 'Betabaculovirus',
'length': '82',
'mismatch': '7',
'order': '-',
'pident': '85.37',
'qend': '150',
'qlen': '150',
@@ -556,18 +554,18 @@ def test_check_csv( self ):
summary = self.mock_summary(contigs=contig, unassembled=una)
# Check summary line with contig and unassembled read
r = self._C( summary )
e = ['','1','2','3','4','7','8','c1','1','2','ca', 'sk', 'ord', 'cfam','cgen','cdesc','5','6','3','acc3', 'Bacteria', '-', 'family3','genus3','descrip3']
e = ['','1','2','3','4','7','8','c1','1','2','ca', 'sk', 'cfam','cgen','cdesc','5','6','3','acc3', 'Bacteria', 'family3','genus3','descrip3']
print e
print r[0].split('\t')
print '---------'
eq_(e, r[0].split('\t'))
# Check summary line with contig and unassembled read
e = ['','','','','','','','c2','10','20','cb','sk','ord','cfam','cgen','cdesc','','','2','acc2','Bacteria', '-', 'family2','genus2','descrip2']
e = ['','','','','','','','c2','10','20','cb','sk','cfam','cgen','cdesc','','','2','acc2','Bacteria', 'family2','genus2','descrip2']
print e
print r[1].split('\t')
eq_( e, r[1].split('\t') )
# Check summary line with only unassembled read
e = ['','','','','','','','','','','','','','','','','','','1','acc1', 'Bacteria', '-', 'family1','genus1','descrip1']
e = ['','','','','','','','','','','','','','','','','','1','acc1', 'Bacteria', 'family1','genus1','descrip1']
print e
print r[2].split('\t')
eq_( e, r[2].split('\t') )
@@ -586,7 +584,7 @@ def test_sorted_unassembled( self ):
print r
line = r.split('\t')
print line
count = line[18]
count = line[17]
eq_( e, int(count), 'Count should be {0} but got {1} at index {2}'.format(e,count,i) )

class TestFormatDic( BaseTest ):
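
Why the unassembled read count moves from index 18 to 17 in ``test_sorted_unassembled`` can be checked with a small sketch. It is not part of the commit. The column groups are copied from the header built in ``main()``; the single leading column is inferred from the one empty field that precedes the summary values in the test's expected rows, and the names ``leading``, ``contig_old``, ``contig_new`` and ``count_index`` are hypothetical.

```python
# Column groups of one summary row, in output order.
leading = ['<leading column>']  # assumption: one field precedes the summary values
summary = ['Num Reads', 'Non-Host Num reads', 'Num Ctg',
           'Num blast0 Ctg', 'N50', 'Assembly Length']
contig_new = ['Ctg#', 'Ctg bp', 'numReads', 'Accession',
              'Superkingdom', 'Family', 'Genus', 'description']
# Before this commit the contig group also carried an 'Order' column.
contig_old = contig_new[:5] + ['Order'] + contig_new[5:]
unassembled_prefix = ['Num unassem', 'Num blast0 Unassem']


def count_index(contig_cols):
    """Return the 0-based index of the unassembled read count column."""
    return (len(leading) + len(summary)
            + len(contig_cols) + len(unassembled_prefix))


print(count_index(contig_old))  # 18
print(count_index(contig_new))  # 17
```

Dropping one column from the contig group shifts every later field left by one, which matches the change from ``line[18]`` to ``line[17]`` in the test.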