Fix #1 update taxonomy db creation
marieBvr committed Jul 5, 2021
1 parent 070640d commit fa9f6a7
Showing 2 changed files with 29 additions and 24 deletions.
45 changes: 24 additions & 21 deletions db/loadTaxonomy.pl
@@ -28,6 +28,8 @@
 "acc_prot=s" => \$data_acc_prot,
 "acc_wgs=s"=> \$data_acc_wgs,
 "acc_nucl=s"=> \$data_acc_nucl,
+"dead_prot=s"=> \$data_dead_acc_prot,
+"dead_nucl=s"=> \$data_dead_acc_nucl,
 # "gi_nucl=s" => \$gi_nucl,
 "gi_prot=s" => \$gi_prot,
 "names=s" => \$data_names,
@@ -246,27 +248,28 @@ sub _set_options {
 }


-sub help {
-my $prog = basename($0);
-print STDERR <<EOF ;
-#### $prog ####
-#
-# AUTHOR: Sebastien THEIL and Marie LEFEBVRE
-# LAST MODIF: 07/02/2020
-# PURPOSE: This script is used to load NCBI taxonomy file into a SQLite database.
+sub help {
+my $prog = basename($0);
+print STDERR <<EOF ;
+#### $prog ####
+#
+# AUTHOR: Sebastien THEIL and Marie LEFEBVRE
+# LAST MODIF: 07/02/2020
+# PURPOSE: This script is used to load NCBI taxonomy file into a SQLite database.
-USAGE:
-$prog -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_nucl gi_taxid_nucl.dmp -gi_prot gi_taxid_prot.dmp
+USAGE:
+$prog -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_nucl gi_taxid_nucl.dmp -gi_prot gi_taxid_prot.dmp
-### OPTIONS ###
--struct <path> taxonomyStructure.sql path. (Default: $taxo_struct_dmp)
--index <path> taxonomyIndex.sql path. (Default: $taxo_index_dmp)
--acc_prot <path> prot.accession2taxid. (Default: $data_acc_prot)
--acc_nucl <path> nucl.accession2taxid. (Default: $data_acc_wgs)
--names <path> names.dmp file. (Default: $data_names)
--nodes <path> nodes.dmp file. (Default: $data_nodes)
--gi_prot <path> gi_taxid_prot.dmp file (Default: $gi_prot)
--v <int> Verbosity level. (0 -> 4).
-EOF
-exit(1);
+### OPTIONS ###
+-struct <path> taxonomyStructure.sql path. (Default: $taxo_struct_dmp)
+-index <path> taxonomyIndex.sql path. (Default: $taxo_index_dmp)
+-acc_prot <path> prot.accession2taxid. (Default: $data_acc_prot)
+-acc_nucl <path> nucl.accession2taxid. (Default: $data_acc_wgs)
+-names <path> names.dmp file. (Default: $data_names)
+-nodes <path> nodes.dmp file. (Default: $data_nodes)
+-gi_prot <path> gi_taxid_prot.dmp file (Default: $gi_prot)
+-v <int> Verbosity level. (0 -> 4).
+EOF
+exit(1);
-}
+}
8 changes: 5 additions & 3 deletions docs/source/prerequisite.rst
@@ -44,6 +44,8 @@ Perl external libraries
 * String::Random
 * Bio::SearchIO:blastxml
 * Bio::SeqIO
+* Expect
+* GD:Simple


 Perl included libraries
@@ -116,15 +118,15 @@ NCBI Taxonomy
 wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_wgs.accession2taxid.gz ; gunzip dead_wgs.accession2taxid.gz
 cat nucl_wgs.accession2taxid nucl_gb.accession2taxid dead_wgs.accession2taxid > acc2taxid.nucl
 wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/dead_nucl.accession2taxid.gz; gunzip dead_nucl.accession2taxid.gz;
-wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz; gunzip gi_taxid_prot.dmp.gz;
+wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/obsolete/gi_taxid_prot.dmp.gz; gunzip gi_taxid_prot.dmp.gz;
 Optionally you can combine multiple accession2taxid file with a simple cat. But keep separated nucl and prot accessions as they will be loaded in two different tables.

 Launch the loadTaxonomy.pl script that will create the sqlite database. The script needs two provided sqlite files: ``taxonomyIndex.sql`` and ``taxonomyStructure.sql`` that describe the database struture. All these files are in virAnnot/db/.

 .. code-block:: bash
-./loadTaxonomy.pl -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_prot gi_taxid_prot.dmp
+./loadTaxonomy.pl -struct taxonomyStructure.sql -index taxonomyIndex.sql -acc_prot acc2taxid.prot -acc_nucl acc2taxid.nucl -names names.dmp -nodes nodes.dmp -gi_prot gi_taxid_prot.dmp -acc_wgs acc2taxid.nucl
 PFAM taxonomy
@@ -140,7 +142,7 @@ Be carefull, the files you will download have a size of ~900Mo.
 ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/fasta.tar.gz
 tar -xzf fasta.tar.gz;
-mkdir pfam
+mkdir fasta
 mv pfam*.FASTA fasta/
 cd pfam/
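
The taxonomy loading command documented in prerequisite.rst takes many input files, so a pre-flight check can help catch a missing download before the database build starts. The following is a minimal sketch, not part of virAnnot: `build_load_taxonomy_args` is a hypothetical helper, the required file names are taken from the documented command, and the file names used for the new `-dead_prot` and `-dead_nucl` options (`dead_prot.accession2taxid`, `dead_nucl.accession2taxid`) are assumptions.

```shell
# Sketch only: build_load_taxonomy_args is a hypothetical helper, not part of virAnnot.
# It prints the loadTaxonomy.pl arguments for the input files found in directory $1.
# The body runs in a subshell, so its variables do not leak into the caller.
build_load_taxonomy_args() (
  dir=$1

  # Inputs named in the documented command; each must exist and be non-empty.
  for f in taxonomyStructure.sql taxonomyIndex.sql acc2taxid.prot acc2taxid.nucl \
           names.dmp nodes.dmp gi_taxid_prot.dmp; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty input: $dir/$f" >&2
      exit 1
    fi
  done

  args="-struct $dir/taxonomyStructure.sql -index $dir/taxonomyIndex.sql"
  args="$args -acc_prot $dir/acc2taxid.prot -acc_nucl $dir/acc2taxid.nucl"
  args="$args -names $dir/names.dmp -nodes $dir/nodes.dmp"
  args="$args -gi_prot $dir/gi_taxid_prot.dmp -acc_wgs $dir/acc2taxid.nucl"

  # Options added in this commit; the file names below are assumptions.
  if [ -s "$dir/dead_prot.accession2taxid" ]; then
    args="$args -dead_prot $dir/dead_prot.accession2taxid"
  fi
  if [ -s "$dir/dead_nucl.accession2taxid" ]; then
    args="$args -dead_nucl $dir/dead_nucl.accession2taxid"
  fi

  echo "$args"
)
```

A possible use is `./loadTaxonomy.pl $(build_load_taxonomy_args .)`, where the command substitution is left unquoted on purpose so the arguments are split on whitespace. The sketch therefore does not support paths that contain spaces.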