This repo contains the commands to reproduce the absolute quantification results in the paper "Rapid Absolute Quantification of Pathogens and ARGs by Nanopore Sequencing" by Yang, Yu et al. 2021.

ellyyuyang/abs-quanti-nanopore

abs-quanti-nanopore

This repo contains the key code and logic for generating the absolute quantification results in the paper "Rapid Absolute Quantification of Pathogens and ARGs by Nanopore Sequencing" by Yang, Yu et al. 2021.


Components for reproducible analysis

1. Construction of the Structured Average Genome Size (SAGS) Database

2. End-to-End Absolute Quantification workflow

A) Tools used:

B) Additional files required besides the original sequence files (files bracketed by * should be provided by users):

  • Kraken2_gtdb_db: *your Kraken2-compatible GTDB index database files*
  • mClover3 fasta file: ./fasta/mClover3.fa
  • nucleotide ARG database and the structure file: *nucleotide-ARG-DB.fasta* & *ARG_structure*
  • Structured Average Genome Size (SAGS) database: *SAGS* constructed as above
  • Nanopore DNA CS fasta file: ./fasta/DCS.fasta
  • Pathogen list: *pathogen.list* original list
    Please refer to our manuscript for details of the conversion to GTDB taxonomy nomenclature
  • Original data can be obtained upon request

C) Logic flow and key codes:

  • Prepare sequencing reads
    • merge reads, convert file types, length filtering by seqtk and seqkit;
    • identify (Minimap2) and remove DCS reads if DCS is used in ONT library preparation
seqtk seq -a input.fq > input.fa
seqkit fx2tab -l input.fa
seqkit seq -m 1000 input.fa > input_1kb.fa
minimap2 -cx map-ont ./fasta/DCS.fasta input_1kb.fa > output_DCS_minimap.paf
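The DCS removal described above can be sketched as below. This is a minimal illustration, not the authors' script: the function name `dcs_read_ids` and both default cutoffs are assumptions, and only the standard PAF columns written by minimap2 are used.

```shell
# Print the unique IDs of reads whose alignment to the DCS reference passes
# two illustrative cutoffs (ASSUMPTIONS, not values from the paper):
#   min_ident = matching bases / alignment block length  (PAF col 10 / col 11)
#   min_cov   = aligned fraction of the read              ((col 4 - col 3) / col 2)
# Arguments: PAF file, optional min_ident (default 0.8), optional min_cov (default 0.5)
dcs_read_ids() {
    awk -F'\t' -v i="${2:-0.8}" -v c="${3:-0.5}" '
        $11 > 0 && $2 > 0 &&
        ($10 / $11) >= (i + 0) &&
        (($4 - $3) / $2) >= (c + 0) { print $1 }
    ' "$1" | sort -u
}
```

A possible use with the file names from the commands above is `dcs_read_ids output_DCS_minimap.paf > DCS_read_ids.txt`, followed by `seqkit grep -v -f DCS_read_ids.txt` on the read file to drop the listed reads.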
  • Kraken2 for rapid taxonomic classification using GTDB r95 database
    • compile and stratify the taxonomic abundance results at different taxonomic ranks, expressed both as number of bases and as number of genome copies
kraken2 --db Kraken2_gtdb_db input_1kb.fa  --output kraken2_gtdb_r95 --use-names --report kraken2_report_gtdbr95 --unclassified-out kraken2_gtdb_r95_unclassified --classified-out kraken2_gtdb_r95_classified
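The base-level and genome-copy-level summaries can be sketched from the per-read Kraken2 output (the `--output` file) as follows. This is a hedged sketch: both function names are hypothetical, and the SAGS table is assumed to be a two-column TSV of taxon label and average genome size in bp, which may not match the real SAGS layout.

```shell
# Sum the read lengths (Kraken2 output column 4) of classified reads ("C" in
# column 1) per assigned taxon label (column 3); prints "taxon<TAB>bases".
sum_bases_per_taxon() {
    awk -F'\t' '$1 == "C" { bases[$3] += $4 }
                END { for (t in bases) printf "%s\t%.0f\n", t, bases[t] }' "$1" | sort
}

# Divide summed bases by an average genome size to approximate genome copies.
# ASSUMPTION: the second file is "taxon<TAB>avg_genome_size_bp" with labels
# identical to those in the first file. Taxa missing from it are skipped.
# Arguments: bases-per-taxon TSV, genome-size TSV
bases_to_genome_copies() {
    awk -F'\t' 'NR == FNR { ags[$1] = $2; next }
                ($1 in ags) && ags[$1] > 0 { printf "%s\t%.4f\n", $1, $2 / ags[$1] }' "$2" "$1"
}
```

For example, `sum_bases_per_taxon kraken2_gtdb_r95 > bases.tsv` would produce the per-taxon base counts that `bases_to_genome_copies` then converts.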
  • Spiked marker gene alignment by minimap2
    • Identify mClover3 reads by Minimap2 and filter results with parameters described in our paper;
    • calculate the mClover3 gene copy number to approximate the genome copy number of the spiked cells;
minimap2 -cx map-ont ./fasta/mClover3.fa input_1kb.fa > minimap_mClover3_algn.paf
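One plausible way to turn the PAF hits into a gene copy number is sketched below. The coverage-based definition, the function name `gene_copies_from_paf`, and the default identity cutoff are all assumptions made for illustration; the filtering parameters described in the paper should be substituted.

```shell
# Estimate gene copies per target sequence from a PAF file as the sum, over
# hits passing an identity cutoff, of the covered fraction of the target:
#   (target end - target start) / target length = (col 9 - col 8) / col 7
# ASSUMPTIONS: this definition and the default cutoff (0.8) are illustrative.
# Arguments: PAF file, optional minimum identity (PAF col 10 / col 11)
gene_copies_from_paf() {
    awk -F'\t' -v min_ident="${2:-0.8}" '
        $11 > 0 && $7 > 0 && ($10 / $11) >= (min_ident + 0) {
            copies[$6] += ($9 - $8) / $7
        }
        END { for (t in copies) printf "%s\t%.3f\n", t, copies[t] }
    ' "$1" | sort
}
```

Called as `gene_copies_from_paf minimap_mClover3_algn.paf`, it prints one line per reference sequence. Reads with several hits to the same target are counted once per hit, so secondary alignments may need to be removed first.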
  • ARG identification by Minimap2 against nucleotide ARG database
    • align reads to nucleotide ARG database by Minimap2 and filter results with cutoffs from (here)
    • calculate the gene copy number of different ARGs
    • keep only ARG-carrying reads with at least 1 kb of additional walkout distance (read sequence beyond the ARG alignment) for ARG host tracking
minimap2  -cx map-ont nucleotide-ARG-DB.fasta input_1kb.fa > minimap2_ARG_algn.paf
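The walkout filter can be sketched as below. Here the walkout distance is interpreted as the read length minus the aligned span on the read, which is an assumption about the definition; the function name `arg_reads_with_walkout` is hypothetical, and the identity and coverage cutoffs mentioned above are not applied in this sketch.

```shell
# Print "read_id<TAB>ARG_id" for PAF hits where the read extends at least
# `walkout` bases beyond the aligned region on the read.
# ASSUMPTION: walkout = read length - aligned query span = col 2 - (col 4 - col 3),
# i.e. the flanking sequence on both sides combined.
# Arguments: PAF file, optional walkout in bp (default 1000)
arg_reads_with_walkout() {
    awk -F'\t' -v w="${2:-1000}" '
        ($2 - ($4 - $3)) >= (w + 0) { print $1 "\t" $6 }
    ' "$1" | sort -u
}
```

For instance, `arg_reads_with_walkout minimap2_ARG_algn.paf 1000` lists the reads that keep enough flanking sequence to be classified taxonomically.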
  • Calculation of the absolute abundance of microbial cells per unit sample volume
    • refer to our paper for the calculation of the scaling factor that converts the sequenced genome copy number into the cell number per unit sample volume
    • absolute abundance of pathogens and ARG-carrying hosts can then be extracted
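The general spike-in logic behind such a scaling factor can be sketched as follows. This is not the paper's formula, which should be taken from the manuscript; it assumes one mClover3 copy per spiked cell genome, a known number of spiked cells, and a known sample volume, and the function name `scale_to_cells` is hypothetical.

```shell
# Convert sequenced genome copies into cells per unit sample volume using
#   factor = (spiked cells / sample volume) / sequenced spike genome copies
# ASSUMPTIONS: generic spike-in scaling with one marker copy per spiked cell;
# input is "taxon<TAB>genome_copies".
# Arguments: copies TSV, sequenced spike copies, spiked cell count, sample volume
scale_to_cells() {
    awk -F'\t' -v spike_copies="$2" -v spiked_cells="$3" -v volume="$4" '
        BEGIN {
            # Refuse to scale when the spike was not detected or the volume is invalid
            if (spike_copies + 0 <= 0 || volume + 0 <= 0) exit 1
            factor = (spiked_cells / volume) / spike_copies
        }
        { printf "%s\t%.2f\n", $1, $2 * factor }
    ' "$1"
}
```

The output keeps the taxon labels, so the rows for pathogens and ARG-carrying hosts can be selected afterwards by matching against the corresponding lists.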

D) Running time estimation for major steps:

For a 10 Gb input fasta file, expect roughly 3-4 hr of data processing to generate the final microbial absolute quantification results.

  • Kraken2 for taxonomic classification -- 30 min with 10 threads and 300 G memory pre-allocated.
  • Minimap2 for mClover3 (spiked gene) identification -- 2.5 min with 10 threads and 150 G memory pre-allocated.
  • Processing Kraken2 output to convert the sequenced genome copy numbers to the final absolute cell abundance per unit sample volume:
    • Summing bases for all the classified reads to different Kraken2-assigned LCA taxonomic lineages -- 2.5 hr with 10 threads and 150 G memory pre-allocated.
    • Stratifying the summation results above into different taxonomic levels -- 5 min with 10 threads and 150 G memory pre-allocated.
    • Converting the sequenced genome copy numbers into the absolute cell abundance per unit sample volume -- untimed, but approx. 15 min with a single thread.

If you intend to use these commands, please cite these resources:

GTDB
Kraken2
In case of Illumina metagenomic shotgun reads, Bracken
Nucleotide ARG database
Minimap2
taxonkit
seqtk
seqkit
MetaPhlAn: merge_metaphlan_tables.py
Pathogen list

I try hard to credit all third-party resources, tools, and code. If you notice any unintentional infringement, please contact [email protected].
