This repo contains the key codes and logic for generating the absolute quantification results in the paper "Rapid Absolute Quantification of Pathogens and ARGs by Nanopore Sequencing" by Yang, Yu et al. 2021.
- This paper is published here
- Raw sequencing data files are available under BioProject: PRJNA728386
- Intended use for research only
- Construction of the
Structured Average Genome Size (SAGS)
Database - End-to-End
Absolute Quantification
workflow - Citations
- Tools used:
- SAGS is built upon the bacterial and archaeal taxonomy and metadata files from
GTDB_r95
- seqtk
- seqkit
- Kraken2
- In case of Illumina metagenomic shotgun reads, Braken2
- Minimap2
- filter_fasta_by_list_of_headers.py
B) Additional files besides original sequence files required: (files bracketed by * should be provided by users):
- Kraken2_gtdb_db:
*your Kraken2-compatible GTDB index database files*
- mClover3 fasta file:
./fasta/mClover3.fa
- nucleotide ARG database and the structure file:
*nucleotide-ARG-DB.fasta*
&*ARG_structure*
- Structured Avg Genome Size (AGS) database:
*SAGS*
constructed as above - Nanopore DNA CS fasta file:
./fasta/DCS.fasta
- Pathogen list:
*pathogen.list*
original list
Please refer to our manuscript for details of the conversion to GTDB taxonomy nomenclature - Original data can be obtained upon request
- Prepare sequencing reads
seqtk seq -a input.fq > input.fa
seqkit fx2tab -l input.fa
seqkit seq -m 1000 input.fa > input_1kb.fa
minimap2 -cx map-ont ./fasta/DCS.fasta input.fasta > output_DCS_minimap.paf
- Kraken2 for rapid taxonomic classification using GTDB r95 database
- compile and stratify taxonomic abundance results into different taxonomic resolutions at the number of bases and the number of genome copy levels
kraken2 --db Kraken2_gtdb_db input_1kb.fa --output kraken2_gtdb_r95 --use-names --report kraken2_report_gtdbr95 --unclassified-out kraken2_gtdb_r95_unclassified --classified-out kraken2_gtdb_r95_classified
- Spiked marker gene alignment by minimap2
minimap2 -cx map-ont ./fasta/mClover3.fa input_1kb.fa > minimap_mClover3_algn.paf
- ARG identification by Minimap2 against nucleotide ARG database
minimap2 -cx map-ont nucleotide-ARG-DB.fasta input_1kb.fa > minimap2_ARG_algn.paf
- Calculation of the absolute abundance of microbial cells in unit sample volumn
- refer to our paper for the calculation of scaling factor for converting seqenced genome copy number into cell number per unit sample volumn
- absolute abundance of pathogens and ARG-carrying hosts can then be extracted
For an input fasta file with size 10 Gb
, an approximated 3-4 hr
data processing time would be expected to generate the final microbial absolute quantification results.
Kraken2
for taxonomic classification --30 min
with 10 threads and 300 G memory pre-allocated.Minimap2
for mClover3 (spiked gene) identification --2.5 min
with 10 threads and 150 G memory pre-allocated.Processing kraken2 output
to convert the sequenced genome copy numbers to the final absolute cell abundance per unit sample volumn:- Summing bases for all the classified reads to different Kraken2-assigned LCA taxonomic lineages --
2.5 hr
with 10 threads and 150 G memory pre-allocated. - Stratifying the summation results above into different taxonomic levels --
5 min
with 10 threads and 150 G memory pre-allocated. - Convert the sequenced genome copy numbers into the asbolute cell abundance per unit sample volumn -- untimed, but approx.
15 min
with single thread.
- Summing bases for all the classified reads to different Kraken2-assigned LCA taxonomic lineages --
GTDB
Kraken2
In case of Illumina metagenomic shotgun reads, Braken2
Nucleotide ARG database
Minimap2
taxonkit
seqtk
seqkit
MetaPhlAn: merge_metaphlan_tables.py
Pathogen list
I try hard to credit all the third-party resources/tools/codes. If any unintentional infringements, please contact [email protected].