Skip to content

Latest commit

 

History

History
94 lines (70 loc) · 3.01 KB

README.md

File metadata and controls

94 lines (70 loc) · 3.01 KB



This tool is design to classify metagenomic sequences (marker genes, genomes and amplicon reads) using a Hierarchical Taxonomic Classifier.

Please check also the wiki for more information.





Dependencies

The stag classifier requires:

  • Python 3.7 (or higher)
  • HMMER3 (or Infernal)
  • Easel (link)
  • seqtk
  • prodigal (to predict genes in genomes)
  • python library:
    • numpy
    • pandas
    • sklearn
    • h5py = 2.10.0

If you have conda, you can install all the dependencies in conda_env_stag.yaml. See Installation wiki for more info.

Installation

git clone https://github.com/zellerlab/stag.git
cd stag
# if environment is needed
conda env create -f conda_env_stag.yaml
python setup.py bdist_wheel
pip install --no-deps --force-reinstall dist/*.whl

Note: in the following examples we assume that the python script stag is in the system path.

Execution

# if environment was installed
conda activate stag
# test the installation
stag test

Taxonomically annotate gene sequences

Given a fasta file (let's say unknown_seq.fasta), you can find the taxonomy annotation of these sequences using:

stag classify -d test_db.stagDB -i unknown_seq.fasta

The output is:

sequence	taxonomy
geneA	d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus
geneB	d__Bacteria
geneC	d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria

You can either create a database (see Create a database), or use one that we already compiled:

Taxonomically annotate genomes

Given a fasta file (let's say unknown_genome.fasta), you can find the taxonomy annotation of this genome with:

stag classify_genome -i unknown_genome.fasta -d gtdb_30.stagDB -o res_dir

The output is saved in the directory res_dir. Inside you will find the file genome_annotation with the annotation in the same format as in the gene classification. More information on the other files can be found here.

To classify multiple genomes, you can use:

stag classify_genome -D all/genomes/dir -d gtdb_30.stagDB -o res_dir

Where all/genomes/dir is a directory, and all fasta files inside the directory will be classified.

Finally, you can find some databases to classify genomes (gtdb_30.stagDB in the examples) here.