Merge pull request #262 from VDBWRAIR/v423

Closes #261, closes #259, #258. Related to #243
VDBWRAIR · Jul 6, 2015 · 9f93a36
2 parents 6620c4d + 15d38c9
commit 9f93a36

Showing 6 changed files with 62 additions and 49 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,12 @@
 Changelog
 =========
 
+Version 4.2.3
+-------------
+
+* Fixed a bug with make_summary looking for missing order column
+* Fixed documentation for database generation
+
 Version 4.2.2
 -------------
 

diff --git a/docs/source/databases.rst b/docs/source/databases.rst
@@ -6,9 +6,13 @@ The pipeline requires that you have blast databases and host genome indexes
 available for some of the stages such as :doc:`stages/host_map` 
 and :doc:`stages/iterative_blast_phylo`
 
-*Note* Most of the commands below may not exist on your system until after you
-have completed the normal installation in which the commands are placed into
-your virtualenv's bin directory.
+Prereqs
+=======
+
+Most of the commands below may not exist on your system until after you
+have completed the normal installation of the pipeline in which the commands 
+are placed into your virtualenv's bin directory.
+
 It is fine to complete the installation(``python setup.py install``) prior
 to setting up your databases, but just know that ``verifydatabases`` will report
 errors until after you have completed setting up your databases.
@@ -38,7 +42,7 @@ modify all the instructions below replacing ``~/databases`` with
 the path you choose.
 
 You will also need to change the ``pathdiscov/files/config.yaml``
-file to point to that location as well. It by defualt also points to
+file to point to that location as well. It by default also points to
 ~/databases.
 
 Create databases directory structure
@@ -47,16 +51,17 @@ Create databases directory structure
 .. code-block:: bash
     
     mkdir -p ~/databases/{humandna,humanrna,ncbi}
-    mkdir -p ~/databases/ncbi/blast/{nt,nr,taxonomy}
+    mkdir -p ~/databases/ncbi/blast/{nt,nr}
+    mkdir -p ~/databases/ncbi/taxonomy
 
 Blast
 =====
 
-Blast databases coorespond to :doc:`stages/iterative_blast_phylo`'s 
+Blast databases correspond to :doc:`stages/iterative_blast_phylo`'s 
 ``blast_db_list`` and ``blast_pro_db``
 
 In general you just need to unpack the nt/nr databases from ncbi(or wherever) 
-into ~/databases/ncbi/blast/nt,nr,taxdb
+into ~/databases/ncbi/blast/nt,nr
 
 There is a shell script included that you can use to do this for you.
 This may take a long time depending on your network connection.
@@ -68,22 +73,22 @@ This may take a long time depending on your network connection.
 Taxonomy
 ========
 
-Taxonomy databases coorespond to :doc:`stages/iterative_blast_phylo`'s 
+Taxonomy databases correspond to :doc:`stages/iterative_blast_phylo`'s 
 ``taxonomy_nodes`` and ``taxonomy_names``
 
 You need to download and extract the taxonomy databases as well so the pipeline
 can extract taxonomy names for each of the blast results
 
 .. code-block:: bash
 
-    pushd ~/databases/ncbi/blast/taxonomy
+    pushd ~/databases/ncbi/taxonomy
     wget http://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz -O - | tar xzvf -
     popd
 
 Diamond
 =======
 
-Diamond coorespond to :doc:`stages/iterative_blast_phylo`'s 
+Diamond corresponds to :doc:`stages/iterative_blast_phylo`'s 
 ``blast_db_list``
 
 Download and index protein database for diamond blastx
@@ -111,7 +116,7 @@ blast nr database
 Host Genome Setup
 =================
 
-The host genome setup cooresponds to the :doc:`stages/host_map`'s
+The host genome setup corresponds to the :doc:`stages/host_map`'s
 ``mapper_db_list``
 
 General steps to build host genome
@@ -150,44 +155,44 @@ DNA
 
     .. code-block:: bash
 
-        _cwd=$(pwd)
         pushd ~/databases/humandna
         wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
         tar -xzvf hg38.chromFa.tar.gz
 
-#. Clean up download
-
-    .. code-block:: bash
-
-        rm chroms/\*_random.fa
-        rm chroms/\*alt.fa
-        rm -rf chroms
-        rm hg38.chromFa.tar.gz
-
 #. Concatenate all host fasta [Optional]
 
     If you have multiple hosts, you may download the fasta files of all 
-    hosts to same folder ('chroms/') and concatinate as show below.
-    You may also modify the names accordingly, exmaple instead of hg38, you may 
+    hosts to the same folder ('chroms/') and concatenate as shown below.
+    You may also modify the names accordingly; for example, instead of hg38, you may 
     name 'allHost.fa'
 
     .. code-block:: bash
 
-        cat chroms/*.fa > hg38_all.fa
+        cat chroms/\*.fa > hg38_all.fa
 
 #. Index the downloaded fasta
 
     * Bowtie
 
         .. code-block:: bash
 
-            ${_cwd}/pathdiscov/download/bowtie2/bowtie2-build hg38_all.fa hg38
+            bowtie2-build hg38_all.fa hg38
 
     * Snap
 
         .. code-block:: bash
 
-            ${_cwd}/pathdiscov/download/snap/snap index hg38_all.fa hg38 -s 20
+            snap index hg38_all.fa hg38 -s 20
+
+#. Clean up download
+
+    .. code-block:: bash
+
+        rm chroms/\*_random.fa
+        rm chroms/\*alt.fa
+        rm -rf chroms
+        rm hg38.chromFa.tar.gz
+        rm hg38_all.fa
 
 #. Setup config.yaml to utilize indexed database
 
@@ -199,13 +204,12 @@ DNA
 RNA
 ^^^
 
-Download human rna from the same URL, the version of the geome might be different.
+Download human rna from the same URL, the version of the genome might be different.
 
 #. Download and unpack
 
     .. code-block:: bash
        
-        _cwd=$(pwd)
         pushd ~/databases/humanrna
         wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/mrna.fa.gz
         gunzip mrna.fa.gz
@@ -216,13 +220,19 @@ Download human rna from the same URL, the version of the geome might be differen
 
         .. code-block:: bash
 
-            ${_cwd}/pathdiscov/download/bowtie2/bowtie2-build mrna.fa hg38_mrna
+            bowtie2-build mrna.fa hg38_mrna
 
     * Snap
 
         .. code-block:: bash
 
-            ${_cwd}/pathdiscov/download/snap/snap index mrna.fa hg38_mrna -s 20
+            snap index mrna.fa hg38_mrna -s 20
+
+#. Clean up download
+
+    .. code-block:: bash
+
+        rm mrna.fa
 
 #. Setup config.yaml to utilize indexed database
 
@@ -234,7 +244,7 @@ Download human rna from the same URL, the version of the geome might be differen
 Verify Databases
 ================
 
-Note: This command is only available after you install. Unfortuneatly at this point you cannot use verifydatabases until after you have finished the entire installation.
+Note: This command is only available after you install. Unfortunately at this point you cannot use verifydatabases until after you have finished the entire installation.
 
 You will probably want to ensure that the pipeline can find all of your databases. There is now a handy script that you can use to do this prior to installing.
 

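Note on the databases.rst change above: the taxonomy directory moves from ``ncbi/blast/taxonomy`` to ``ncbi/taxonomy``. The following is a minimal Python 3 sketch of the resulting layout, equivalent to the two updated ``mkdir -p`` commands. The function name ``create_database_tree`` and the constant ``DATABASE_SUBDIRS`` are hypothetical and are not part of the pipeline.

```python
from pathlib import Path

# Sub-directories taken from the updated mkdir commands in databases.rst.
# Note that taxonomy now sits beside blast/, not inside it.
DATABASE_SUBDIRS = (
    "humandna",
    "humanrna",
    "ncbi/blast/nt",
    "ncbi/blast/nr",
    "ncbi/taxonomy",
)


def create_database_tree(root):
    """Create the documented layout under root and return the created paths.

    Hypothetical helper for illustration; the docs use plain mkdir -p.
    """
    root = Path(root).expanduser()
    created = []
    for sub in DATABASE_SUBDIRS:
        path = root / sub
        path.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    # Use a throwaway directory so the sketch never touches ~/databases.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_database_tree(tmp):
            print(path.relative_to(tmp))
```

Passing ``"~/databases"`` as ``root`` would reproduce the default location named in the docs. If a different root is chosen, ``pathdiscov/files/config.yaml`` still has to be pointed at it, as the documentation states.
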
diff --git a/docs/source/install.rst b/docs/source/install.rst
@@ -92,19 +92,19 @@ Installation
 
         SEQUENCE_PLATFORM: illumina #choices are: illumina,454
 
-#. Databases setup
-
-    You must refer to built documentation to set up these databases. These databases must be built before you can verify below.
-
-    See :doc:`databases` or `<databases.rst>`_ if you have not built the docs
-
 
 #. Install the pipeline into the virtualenv
 
     .. code-block:: bash
 
         python setup.py install
 
+#. Databases setup
+
+    You must refer to built documentation to set up these databases. These databases must be built before you can verify below.
+
+    See :doc:`databases` or `<databases.rst>`_ if you have not built the docs
+
 #. Quick verify of a few things
 
     * See if required executables are available

diff --git a/pathdiscov/make_summary.py b/pathdiscov/make_summary.py
@@ -179,7 +179,6 @@ def contigs_for( projdir, blastcol, blastval ):
         info['family'] = contig['family']
         info['genus'] = contig['genus']
         info['superkingdom'] = contig['superkingdom']
-        info['order'] = contig['order']
         info['description'] = contig['descrip']
         yield info
 
@@ -258,8 +257,8 @@ def format_summary( summary ):
     rows = []
     import itertools
     # Iterate over longsest of the two and fill the other in with ''
-    contigkeys = ('contigname','length','numreads','accession','superkingdom', 'order', 'family','genus','description')
-    unasskeys = ('count','accession','superkingdom', 'order', 'family','genus','descrip')
+    contigkeys = ('contigname','length','numreads','accession','superkingdom', 'family','genus','description')
+    unasskeys = ('count','accession','superkingdom', 'family','genus','descrip')
     prefix = format_dict( summary, ('numreads','nonhostreads','numcontig','numblastcontig','n50','assemblylength') )
     unassembled = sorted( summary['unassembled'].items(), key=lambda x: x[1]['count'], reverse=True )
     for contig, unassembled in itertools.izip_longest( summary['contigs'], unassembled, fillvalue=None ):
@@ -300,11 +299,11 @@ def main( ):
     # These come from summary
     hdr += ['Num Reads', 'Non-Host Num reads', 'Num Ctg', 'Num blast0 Ctg', 'N50', 'Assembly Length']
     # These come from summary['contig']
-    hdr += ['Ctg#', 'Ctg bp', 'numReads', 'Accession', 'Superkingdom', 'Order', 'Family', 'Genus', 'description']
+    hdr += ['Ctg#', 'Ctg bp', 'numReads', 'Accession', 'Superkingdom', 'Family', 'Genus', 'description']
     # These come from summary
     hdr += ['Num unassem', 'Num blast0 Unassem']
     # These come from summary['unassembled']
-    hdr += ['num reads', 'Accession', 'Superkingdom', 'Order', 'Family', 'Genus', 'descrip']
+    hdr += ['num reads', 'Accession', 'Superkingdom', 'Family', 'Genus', 'descrip']
     print '\t'.join( hdr )
     for p in args.projdir:
         try:

diff --git a/pathdiscov/metadata.py b/pathdiscov/metadata.py
@@ -8,7 +8,7 @@
 package = 'pathdiscov'
 project = "Pathogen Discovery Pipeline"
 project_no_spaces = project.replace(' ', '')
-version = '4.2.2'
+version = '4.2.3'
 description = 'Pathogen Discovery Pipleline'
 authors = [
     'Mike Wiley',

diff --git a/pathdiscov/test/test_make_summary.py b/pathdiscov/test/test_make_summary.py
@@ -61,7 +61,6 @@ def mock_contig( self, *args, **kwargs ):
             'contigname': 'contigname',
             'description': 'description',
             'superkingdom' : 'sk',
-            'order' : 'ord',
             'family': 'family',
             'genus': 'genus',
             'length': 1,
@@ -82,7 +81,6 @@ def mock_unassembled( self, *args, **blastcols ):
             'genus': 'Betabaculovirus',
             'length': '82',
             'mismatch': '7',
-            'order': '-',
             'pident': '85.37',
             'qend': '150',
             'qlen': '150',
@@ -556,18 +554,18 @@ def test_check_csv( self ):
         summary = self.mock_summary(contigs=contig, unassembled=una)
         # Check summary line with contig and unassembled read
         r = self._C( summary )
-        e = ['','1','2','3','4','7','8','c1','1','2','ca', 'sk', 'ord', 'cfam','cgen','cdesc','5','6','3','acc3', 'Bacteria', '-', 'family3','genus3','descrip3']
+        e = ['','1','2','3','4','7','8','c1','1','2','ca', 'sk', 'cfam','cgen','cdesc','5','6','3','acc3', 'Bacteria', 'family3','genus3','descrip3']
         print e
         print r[0].split('\t')
         print '---------'
         eq_(e, r[0].split('\t'))
         # Check summary line with contig and unassembled read
-        e = ['','','','','','','','c2','10','20','cb','sk','ord','cfam','cgen','cdesc','','','2','acc2','Bacteria', '-', 'family2','genus2','descrip2']
+        e = ['','','','','','','','c2','10','20','cb','sk','cfam','cgen','cdesc','','','2','acc2','Bacteria', 'family2','genus2','descrip2']
         print e
         print r[1].split('\t')
         eq_( e, r[1].split('\t') )
         # Check summary line with only unassembled read
-        e = ['','','','','','','','','','','','','','','','','','','1','acc1', 'Bacteria', '-', 'family1','genus1','descrip1']
+        e = ['','','','','','','','','','','','','','','','','','1','acc1', 'Bacteria', 'family1','genus1','descrip1']
         print e
         print r[2].split('\t')
         eq_( e, r[2].split('\t') )
@@ -586,7 +584,7 @@ def test_sorted_unassembled( self ):
             print r
             line = r.split('\t')
             print line
-            count = line[18]
+            count = line[17]
             eq_( e, int(count), 'Count should be {0} but got {1} at index {2}'.format(e,count,i) )
 
 class TestFormatDic( BaseTest ):
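
Note on the ``line[18]`` to ``line[17]`` change in ``test_sorted_unassembled``: dropping ``'order'`` from ``contigkeys`` removes one column ahead of the unassembled block, so the unassembled read count shifts left by one. The sketch below (Python 3, whereas the pipeline code in this diff uses Python 2 ``print`` statements) recomputes that index from the key tuples shown in the diff. It assumes one leading project column and two summary columns between the contig and unassembled blocks, which is inferred from the expected rows in ``test_check_csv`` and not from the full source. The function name is hypothetical.

```python
# Key tuples copied from the format_summary hunk in make_summary.py.
PREFIX_KEYS = ('numreads', 'nonhostreads', 'numcontig',
               'numblastcontig', 'n50', 'assemblylength')
OLD_CONTIG_KEYS = ('contigname', 'length', 'numreads', 'accession',
                   'superkingdom', 'order', 'family', 'genus', 'description')
NEW_CONTIG_KEYS = ('contigname', 'length', 'numreads', 'accession',
                   'superkingdom', 'family', 'genus', 'description')

# Assumptions inferred from the expected rows in test_check_csv:
LEADING_COLUMNS = 1   # project name column at the start of each row
BETWEEN_COLUMNS = 2   # 'Num unassem' and 'Num blast0 Unassem'


def unassembled_count_index(contigkeys):
    """Return the 0-based column index of the unassembled read count."""
    return (LEADING_COLUMNS + len(PREFIX_KEYS)
            + len(contigkeys) + BETWEEN_COLUMNS)


if __name__ == "__main__":
    print(unassembled_count_index(OLD_CONTIG_KEYS))  # 18
    print(unassembled_count_index(NEW_CONTIG_KEYS))  # 17
```

The same arithmetic explains why each expected row in ``test_check_csv`` loses two entries: one ``'order'`` value from the contig block and one from the unassembled block.
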