Kraken2 dataset.md
This page describes a metagenomics dataset analysed using Kraken2 and other profilers.
Samples were downloaded from a public dataset released with a paper from Mark Pallen's group (Loss of microbial diversity and pathogen domination of the gut microbiota in critically ill patients), a comprehensive analysis of faecal metagenome samples from critically ill patients.
As the title indicates, the authors found that the gut microbiota of these patients lost microbial diversity and became dominated by pathogens.
Sample | #Seq | Total bp |
---|---|---|
SRX5707173 | 1,981,484 | 251,334,383 |
SRX5707285 | 4,534,703 | 552,669,038 |
SRX5707290 | 63,972 | 6,961,885 |
SRX5707359 | 561,807 | 62,895,743 |
SRX5707366 | 23,921,451 | 2,832,669,901 |
SRX5707374 | 10,304,335 | 1,252,385,596 |
SRX5707377 | 10,580,411 | 1,279,917,137 |
If you find an interesting dataset from a paper, it will usually be deposited to the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) or the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI).
In both cases, once you have determined a list of accession numbers, you can download the reads using command-line tools like:
- `fastq-dump` from the SRA Toolkit
- or a more complex pipeline like `nf-core/fetchngs` (see docs)
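As an alternative to the tools above, the FASTQ locations for an accession can be looked up through the ENA Portal API. The sketch below only builds the query URL and parses a tab-separated response; it makes no network call. It assumes the `filereport` endpoint with the `read_run` result type and the `fastq_ftp` field (semicolon-separated paths without a scheme); the function names are hypothetical, and the `ftp://` prefix is an assumption.

```python
import csv
import io
from urllib.parse import urlencode

# ENA Portal API endpoint for per-run file reports
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build the file report URL for a study, experiment or run accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map each run accession to its list of FASTQ URLs."""
    links = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # fastq_ftp holds one path per file, separated by semicolons
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        links[row["run_accession"]] = [f"ftp://{p}" for p in paths]
    return links


print(filereport_url("SRX5707377"))
```

The text returned by that URL (for example via `urllib.request.urlopen`) can be passed to `parse_filereport` to get the download links per run.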
The raw reads have been profiled using different tools and databases:
- Kraken2, using two databases (k2_nt_20230502 and k2_standard_20230314)
- KMCP (GTDB database)
- SingleM (default db)
- MetaPhlAn 4 (default db)
Using these files, you can try some typical analyses, either numerical (e.g. rarefaction, normalization) or visual (plots and other visualizations).
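To make the two numerical operations concrete, here is a minimal Python sketch of normalization to relative abundance and of rarefaction by random subsampling without replacement. It works on a plain `{taxon: count}` dictionary; the function names and the toy counts are made up for illustration, and `random.sample(..., counts=...)` requires Python 3.9 or later.

```python
import random
from collections import Counter


def relative_abundance(counts):
    """Convert raw counts to fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a count profile."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"cannot rarefy {total} reads to depth {depth}")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    taxa = list(counts)
    drawn = rng.sample(taxa, k=depth, counts=[counts[t] for t in taxa])
    return dict(Counter(drawn))


# Toy profile (made-up counts)
toy = {"taxon_A": 60, "taxon_B": 30, "taxon_C": 10}
print(relative_abundance(toy))
print(rarefy(toy, depth=50))
```

Rarefying every sample to the depth of the shallowest one discards a lot of data when depths differ widely, so relative abundances are often the more practical first step.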
Some simple examples:
- Compare the two databases used with Kraken2 (using a single sample)
- Compare Kraken2 with another tool (using a single sample)
- Merge multiple samples into a single data frame
💡 You can use R or Python, but if you are unfamiliar with them, you can still use user-friendly tools such as Pavian, which can import multiple Kraken reports.
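For those who prefer Python, the sketch below reads Kraken2 reports and merges several of them into one table. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxid, indented name), not the extended layout produced with `--report-minimizer-data`. The function names and the toy report lines are invented for illustration.

```python
import io


def read_kraken_report(handle, rank="S"):
    """Return {taxon name: clade read count} for one rank code (default: species)."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # not a standard 6-column report line
        _percent, clade_reads, _direct_reads, rank_code, _taxid, name = fields
        if rank_code == rank:
            counts[name.strip()] = int(clade_reads)
    return counts


def merge_profiles(profiles):
    """Merge {label: {taxon: count}} into (labels, {taxon: [count per label]})."""
    columns = sorted(profiles)
    taxa = sorted({taxon for counts in profiles.values() for taxon in counts})
    table = {taxon: [profiles[c].get(taxon, 0) for c in columns] for taxon in taxa}
    return columns, table


# Made-up report lines, only to show the expected layout
TOY_A = (
    " 40.00\t40\t40\tU\t0\tunclassified\n"
    " 60.00\t60\t0\tR\t1\troot\n"
    " 35.00\t35\t35\tS\t562\t      Escherichia coli\n"
    " 25.00\t25\t25\tS\t1351\t      Enterococcus faecalis\n"
)
TOY_B = (
    " 90.00\t90\t0\tR\t1\troot\n"
    " 90.00\t90\t90\tS\t562\t      Escherichia coli\n"
)

profiles = {
    "sampleA": read_kraken_report(io.StringIO(TOY_A)),
    "sampleB": read_kraken_report(io.StringIO(TOY_B)),
}
columns, table = merge_profiles(profiles)
print("taxon", *columns, sep="\t")
for taxon, row in table.items():
    print(taxon, *row, sep="\t")
```

With real files, pass `open(path)` instead of the `io.StringIO` objects. Using database names rather than sample names as the keys gives a side-by-side comparison of the two Kraken2 databases for a single sample. If pandas is available, `pandas.DataFrame(table, index=columns).T` should turn the result into a data frame with taxa as rows.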
The files are in the outgoing location:
/qib/platforms/Informatics/transfer/outgoing/datascience-icu-profiles/
They are also publicly available on Zenodo: https://zenodo.org/records/10684408
The structure of the directory is:

    ├── kmcp        Files in CAMI format (.cami.profile) and raw KMCP profiles (.tsv)
    ├── kraken2     Individual Kraken reports and combined table (raw and polished)
    │   ├── k2_nt_20230502
    │   └── k2_standard_20230314
    ├── metaphlan   MetaPhlAn 4 profiles and combined table
    └── singlem     HTML profiles, OTUs (marker genes) and profiles (.tsv)
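Example commands for the profilers:

    # SingleM (HPC), example:
    singlem pipe -1 reads/SRX5707377_R1.fastq.gz -2 reads/SRX5707377_R2.fastq.gz \
      -p singlem/SRX5707377.tsv \
      --taxonomic-profile-krona singlem/SRX5707377.html \
      --otu-table singlem/SRX5707377.otu \
      --threads 32

    # Kraken2 example. ${DB} is the database name (the results directory lists
    # k2_nt_20230502 and k2_standard_20230314). The --report path is illustrative:
    # without --report, Kraken2 only prints per-read classifications to stdout.
    kraken2 reads/SRX5707377_R1.fastq.gz reads/SRX5707377_R2.fastq.gz \
      --db "/qib/platforms/Informatics/transfer/outgoing/databases/kraken2/benlangmead/${DB}" \
      --threads 16 --memory-mapping --paired \
      --report "kraken2/${DB}/SRX5707377.report"

    # MetaPhlAn (locally). Note that only the forward reads (R1) are profiled.
    for i in reads/*_R1*gz; do
      echo "$i"
      metaphlan --input_type fastq --nproc 64 "$i" "profiles/metaphlan/$(basename "$i").tsv"
    done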
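The MetaPhlAn profiles written by the loop above can be read back to compare the species they report with those from another tool. The sketch below is a minimal Python example under two assumptions: the profile uses the default four-column layout (clade_name, NCBI_tax_id, relative_abundance, additional_species), and species names can be matched across tools by replacing underscores with spaces. The second assumption is only a heuristic, since different tools may follow different taxonomies. The function names and toy lines are invented.

```python
import io


def read_metaphlan_species(handle):
    """Return {species name: relative abundance in %} from a MetaPhlAn profile."""
    species = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # header/comment or empty line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        last_level = fields[0].split("|")[-1]
        # Keep species-level rows only (skip kingdom..genus and t__ SGB rows)
        if last_level.startswith("s__"):
            name = last_level[3:].replace("_", " ")
            species[name] = float(fields[2])
    return species


def compare_taxa(a, b):
    """Return (shared, only in a, only in b) as sorted lists of names."""
    a, b = set(a), set(b)
    return sorted(a & b), sorted(a - b), sorted(b - a)


# Made-up profile lines, only to show the expected layout
TOY = (
    "#clade_name\tNCBI_tax_id\trelative_abundance\tadditional_species\n"
    "k__Bacteria\t2\t100.0\t\n"
    "k__Bacteria|g__Enterococcus|s__Enterococcus_faecium\t2|1350|1352\t100.0\t\n"
    "k__Bacteria|g__Enterococcus|s__Enterococcus_faecium|t__SGB0000\t2|1350|1352|\t100.0\t\n"
)

metaphlan = read_metaphlan_species(io.StringIO(TOY))
other_tool = {"Enterococcus faecium": 120, "Escherichia coli": 30}  # hypothetical counts
shared, only_metaphlan, only_other = compare_taxa(metaphlan, other_tool)
print("shared:", shared)
print("only MetaPhlAn:", only_metaphlan)
print("only other tool:", only_other)
```

Because MetaPhlAn reports relative abundances and Kraken2 reports read counts, a presence/absence comparison like this is a reasonable first step before comparing abundances.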
Quadram Institute Bioscience, working group on Data Science