Kraken2 dataset.md

Andrea Telatin edited this page Feb 27, 2024 · 7 revisions

A dataset to explore

This page describes a metagenomics dataset analysed using Kraken2 and other profilers.

The paper

Samples were downloaded from a public dataset released with a paper from Mark Pallen's group (Loss of microbial diversity and pathogen domination of the gut microbiota in critically ill patients), a shotgun metagenomic analysis of faecal samples from critically ill patients.

As the title indicates, the authors found that critically ill patients showed marked alterations in their faecal microbiome composition, including a loss of microbial diversity, a depletion of beneficial bacteria and, in many cases, domination of the gut microbiota by potentially pathogenic species.

Sample       Sequences       Total bp
SRX5707173   1,981,484    251,334,383
SRX5707285   4,534,703    552,669,038
SRX5707290      63,972      6,961,885
SRX5707359     561,807     62,895,743
SRX5707366  23,921,451  2,832,669,901
SRX5707374  10,304,335  1,252,385,596
SRX5707377  10,580,411  1,279,917,137

Downloading the data

⚠️ You don't need to download the raw reads: they have already been downloaded and analysed.

If you find an interesting dataset in a paper, the raw reads will usually be deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) or in the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI).

In both cases, once you have the list of accession numbers, you can download the reads using command line tools like:

  • fastq-dump from the SRA Toolkit
  • or a more complex pipeline like nf-core/fetchngs (see docs)
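
As one possible way to go from an accession number to downloadable files, the sketch below builds a query URL for the ENA Portal API file report, which lists the FASTQ locations for an accession as a tab-separated table. The helper name `ena_filereport_url` and the selected fields are illustrative assumptions, not part of the workflow used for this dataset; the script only prints URLs and does not download anything.

```python
from urllib.parse import urlencode

# ENA Portal API endpoint that reports the files attached to an accession.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession, fields=("run_accession", "fastq_ftp", "fastq_md5")):
    """Build the file report URL for one accession (hypothetical helper)."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",        # one row per sequencing run
        "fields": ",".join(fields),  # columns to include in the table
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


if __name__ == "__main__":
    # Two of the accessions listed in the table of samples.
    for accession in ["SRX5707173", "SRX5707285"]:
        print(ena_filereport_url(accession))
```

Opening one of the printed URLs in a browser (or fetching it with a tool of your choice) should return a small table whose `fastq_ftp` column contains the paths of the read files.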

Reads profiling

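The raw reads have been profiled using different tools and databases: Kraken2 (with two databases, k2_standard_20230314 and k2_nt_20230502), MetaPhlAn 4, KMCP and SingleM.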

What to do?

Using these files, you can try some typical analyses, either numerical (e.g. rarefaction or normalization of the counts) or visual (e.g. plotting the composition of each sample).
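
The sequencing depth of the samples listed above differs by more than two orders of magnitude, which is why raw counts are usually normalized or rarefied before samples are compared. The following is a minimal sketch of both operations, assuming the counts of one sample are held in a plain dictionary; the function names and the toy counts are hypothetical and not taken from the dataset.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def relative_abundance(counts):
    """Total-sum scaling: divide each count by the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Randomly subsample `depth` reads without replacement."""
    taxa = list(counts)
    # Cumulative counts: read index i belongs to the first taxon whose bound exceeds i.
    bounds = list(accumulate(counts[t] for t in taxa))
    total = bounds[-1] if bounds else 0
    if depth > total:
        raise ValueError(f"depth {depth} exceeds the {total} reads available")
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    rarefied = dict.fromkeys(taxa, 0)
    for read_index in rng.sample(range(total), depth):
        rarefied[taxa[bisect_right(bounds, read_index)]] += 1
    return rarefied


if __name__ == "__main__":
    toy = {"Taxon A": 50, "Taxon B": 30, "Taxon C": 20}  # made-up counts
    print(relative_abundance(toy))
    print(rarefy(toy, depth=10))
```

Rarefying every sample to the same depth discards reads from the deeper samples, so a very shallow sample may be better excluded than used to set the common depth.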

Some simple examples:

  1. Compare the two databases used with Kraken2 (using a single sample)
  2. Compare Kraken2 with another tool (using a single sample)
  3. Merge multiple samples into a single data frame
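
As a starting point for these exercises, here is a minimal sketch that reads Kraken2 reports and merges them into one table, using only the Python standard library. It assumes the standard tab-separated report layout (percentage, clade reads, direct reads, rank code, NCBI taxid, indented name); the function names are hypothetical and the two toy reports contain made-up numbers, not results from this dataset.

```python
import io


def parse_kraken2_report(handle, rank="S"):
    """Return {taxon name: clade read count} for one rank code (S = species)."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # Column 2 is the clade read count; the last three columns are
        # rank code, NCBI taxid and the (indented) scientific name.
        if fields[-3] == rank:
            counts[fields[-1].strip()] = int(fields[1])
    return counts


def merge_samples(per_sample):
    """Turn {sample: {taxon: count}} into (samples, {taxon: [count per sample]})."""
    samples = sorted(per_sample)
    taxa = sorted({t for counts in per_sample.values() for t in counts})
    table = {t: [per_sample[s].get(t, 0) for s in samples] for t in taxa}
    return samples, table


# Toy reports with made-up numbers, only to show the expected layout.
REPORT_A = (
    " 20.00\t20\t20\tU\t0\tunclassified\n"
    " 80.00\t80\t0\tR\t1\troot\n"
    " 60.00\t60\t0\tG\t561\t  Escherichia\n"
    " 60.00\t60\t60\tS\t562\t    Escherichia coli\n"
    " 20.00\t20\t20\tS\t1352\t    Enterococcus faecium\n"
)
REPORT_B = (
    " 50.00\t5\t5\tU\t0\tunclassified\n"
    " 50.00\t5\t0\tR\t1\troot\n"
    " 50.00\t5\t5\tS\t1352\t    Enterococcus faecium\n"
)

if __name__ == "__main__":
    per_sample = {
        "sample_a": parse_kraken2_report(io.StringIO(REPORT_A)),
        "sample_b": parse_kraken2_report(io.StringIO(REPORT_B)),
    }
    samples, table = merge_samples(per_sample)
    print("taxon", *samples, sep="\t")
    for taxon, row in table.items():
        print(taxon, *row, sep="\t")
```

The same merge also covers the first exercise: parse the two reports of one sample (one per database) and pass them as two "samples". Comparing Kraken2 with another tool is harder, because taxon names and taxonomies may differ between profilers, so matching on taxonomy identifiers is likely to be more robust than matching on names.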

💡 You can use R or Python, but if you are unfamiliar with both, you can still use user-friendly tools such as Pavian, which can import multiple Kraken reports.

Where are the files?

The files are in the outgoing location:

/qib/platforms/Informatics/transfer/outgoing/datascience-icu-profiles/

They are also publicly available on Zenodo: https://zenodo.org/records/10684408

The structure of the directory is:

├── kmcp                         Files in CAMI format (.cami.profile) and raw KMCP profiles (.tsv)
├── kraken2                      Individual Kraken reports and combined table (raw and polished)
│   ├── k2_nt_20230502
│   └── k2_standard_20230314
├── metaphlan                    Metaphlan4 profiles and combined table
├── singlem                      HTML profiles, OTUs (marker genes) and profiles (.tsv)

Notes

# SingleM (HPC), example:
singlem pipe -1 reads/SRX5707377_R1.fastq.gz -2 reads/SRX5707377_R2.fastq.gz \
  -p singlem/SRX5707377.tsv \
  --taxonomic-profile-krona singlem/SRX5707377.html  \
  --otu-table singlem/SRX5707377.otu \
  --threads 32

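# Kraken2 example (set DB to a database directory name, e.g. k2_standard_20230314 or k2_nt_20230502).
# The --report path is an assumption, adjust it as needed; "--output -" suppresses the per-read output.
kraken2 --db "/qib/platforms/Informatics/transfer/outgoing/databases/kraken2/benlangmead/${DB}" \
  --threads 16 --memory-mapping --paired \
  --report "kraken2/${DB}/SRX5707377.report" --output - \
  reads/SRX5707377_R1.fastq.gz reads/SRX5707377_R2.fastq.gz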

# MetaPhlAn (locally): one profile per R1 file; the output directory must already exist
for i in reads/*_R1*gz; do
  echo "$i"
  metaphlan --input_type fastq --nproc 64 "$i" "profiles/metaphlan/$(basename "$i").tsv"
done