A dataset to explore

This page describes a metagenomics dataset analysed using Kraken2 and other profilers.

The paper

Samples were downloaded from a public dataset released with a paper from Mark Pallen's group, "Loss of microbial diversity and pathogen domination of the gut microbiota in critically ill patients", a metagenomic analysis of faecal samples from critically ill patients.

The authors found that the faecal microbiome composition of these patients was markedly altered, with a loss of microbial diversity and domination of the gut microbiota by potentially pathogenic species.

Sample	Sequences	Total bp
SRX5707173	1,981,484	251,334,383
SRX5707285	4,534,703	552,669,038
SRX5707290	63,972	6,961,885
SRX5707359	561,807	62,895,743
SRX5707366	23,921,451	2,832,669,901
SRX5707374	10,304,335	1,252,385,596
SRX5707377	10,580,411	1,279,917,137

Downloading the data

⚠️ You don't need to download the raw reads: they have already been downloaded and analysed.

If you find an interesting dataset in a paper, its reads will usually be deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) or the European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI).

In both cases, once you have the list of accession numbers, you can download the reads with command-line tools such as:

fastq-dump from the SRA Toolkit
a more complex pipeline like nf-core/fetchngs (see docs)
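As a minimal sketch of the download step, the snippet below only builds the `fastq-dump` command lines (it does not run them), so you can inspect them first. The function name and the accessions are hypothetical placeholders. Depending on your SRA Toolkit version you may need run accessions (SRR...) rather than the experiment accessions (SRX...) listed in the table.

```python
import shlex


def fastq_dump_command(accession, outdir="reads"):
    """Build (but do not run) a fastq-dump command for one accession.

    --split-files writes paired reads to separate files,
    --gzip compresses the output, --outdir sets the destination.
    """
    return ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, accession]


# Hypothetical accessions: replace them with the ones for your dataset.
accessions = ["SRR0000001", "SRR0000002"]

for accession in accessions:
    # shlex.join produces a shell-safe string you can copy into a terminal
    print(shlex.join(fastq_dump_command(accession)))
```

To actually run a command, you could pass the returned list to `subprocess.run(cmd, check=True)`.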

Reads profiling

The raw reads have been profiled using different tools and databases:

Kraken2, using these databases
- k2_nt_20230502
- k2_standard_20230314
KMCP (gtdb db)
SingleM (default db)
MetaPhlAn 4 (default db)
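To work with the Kraken2 results programmatically, the reports need to be parsed first. The sketch below assumes the standard six-column, tab-separated Kraken2 report (percentage, reads in clade, reads assigned directly, rank code, NCBI taxid, indented name) with no header and no extra minimizer columns. The function name is hypothetical and the example lines are made up only to show the layout.

```python
import io


def parse_kraken_report(handle, rank="S"):
    """Return {taxon name: reads in clade} for one rank code (S = species)."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1])
        rank_code = fields[3]
        name = fields[5].strip()  # names are indented to show the hierarchy
        if rank_code == rank:
            counts[name] = clade_reads
    return counts


# Made-up report lines, for illustration only
example = io.StringIO(
    " 10.00\t10\t10\tU\t0\tunclassified\n"
    " 90.00\t90\t0\tR\t1\troot\n"
    " 60.00\t60\t60\tS\t562\t      Escherichia coli\n"
    " 30.00\t30\t30\tS\t1351\t      Enterococcus faecalis\n"
)
print(parse_kraken_report(example))
```

With a real file you would call `parse_kraken_report(open(path))`, where `path` points to one of the individual reports.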

What to do?

Using these files, you can try some typical analyses, either numerical (e.g. rarefaction, normalization) or visual.
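As an illustration of the two numerical analyses mentioned, here is a minimal sketch using only the Python standard library. It assumes a sample is represented as a dictionary of read counts per taxon. The function names and counts are hypothetical, and the rarefaction expands every read into a list, so it is only practical for small tables (dedicated packages do this more efficiently).

```python
import random


def relative_abundance(counts):
    """Normalize raw counts so that they sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a count table."""
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    sampled = {}
    for taxon in rng.sample(pool, depth):
        sampled[taxon] = sampled.get(taxon, 0) + 1
    return sampled


# Made-up counts, for illustration only
counts = {"Taxon A": 60, "Taxon B": 30, "Taxon C": 10}
print(relative_abundance(counts))
print(rarefy(counts, depth=50))
```

Rarefying all samples to the same depth (for instance that of the smallest sample you decide to keep) makes richness estimates more comparable across samples of very different sizes.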

Some simple examples:

Compare the two databases used with Kraken2 (using a single sample)
Compare Kraken2 with another tool (using a single sample)
Merge multiple samples into a single data frame
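The comparisons and the merge listed above could be sketched as follows, assuming each profile has already been loaded as a dictionary of counts per taxon. The function names, sample names and counts are hypothetical, and a plain dictionary of rows stands in for a data frame to keep the example free of dependencies.

```python
def compare_profiles(first, second):
    """List the taxa shared by two profiles and those unique to each."""
    first_taxa, second_taxa = set(first), set(second)
    return {
        "shared": sorted(first_taxa & second_taxa),
        "only_first": sorted(first_taxa - second_taxa),
        "only_second": sorted(second_taxa - first_taxa),
    }


def merge_profiles(profiles):
    """Turn {sample: {taxon: count}} into a taxon-by-sample table.

    Taxa missing from a sample get a count of 0.
    """
    samples = sorted(profiles)
    taxa = sorted({taxon for counts in profiles.values() for taxon in counts})
    rows = {taxon: [profiles[s].get(taxon, 0) for s in samples] for taxon in taxa}
    return samples, rows


# Made-up profiles, for illustration only
profiles = {
    "sample_1": {"Escherichia coli": 60, "Enterococcus faecium": 30},
    "sample_2": {"Escherichia coli": 5, "Bacteroides fragilis": 45},
}

samples, rows = merge_profiles(profiles)
print("taxon", *samples, sep="\t")
for taxon, values in rows.items():
    print(taxon, *values, sep="\t")

print(compare_profiles(profiles["sample_1"], profiles["sample_2"]))
```

The same `compare_profiles` call works for two databases or two tools on one sample, provided the taxon names follow the same taxonomy. Profiles based on different taxonomies (for example NCBI and GTDB) would need their names harmonized first.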

💡 You can use R or Python but, if you are unfamiliar with them, you can still use user-friendly tools such as Pavian, which can import multiple Kraken reports.

Where are the files?

The files are in the outgoing location:

/qib/platforms/Informatics/transfer/outgoing/datascience-icu-profiles/

They are also publicly available on Zenodo: https://zenodo.org/records/10684408

The structure of the directory is:

├── kmcp                         Files in CAMI format (.cami.profile) and raw KMCP profiles (.tsv)
├── kraken2                      Individual Kraken reports and combined table (raw and polished)
│   ├── k2_nt_20230502
│   └── k2_standard_20230314
├── metaphlan                    MetaPhlAn 4 profiles and combined table
└── singlem                      HTML profiles, OTUs (marker genes) and profiles (.tsv)
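A quick way to check that a local copy matches this layout is to count the files under each top-level directory. This is a small sketch: the function name is hypothetical, and the folder name is an assumption (wherever you copied or unpacked the files).

```python
from pathlib import Path


def count_files(root):
    """Count regular files under each top-level subdirectory of `root`."""
    root = Path(root)
    return {
        sub.name: sum(1 for path in sub.rglob("*") if path.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


# Assumption: a local copy of the profiles in this folder
root = Path("datascience-icu-profiles")
if root.is_dir():
    for tool, n_files in count_files(root).items():
        print(f"{tool}\t{n_files}")
```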

Notes

# SingleM (HPC), example:
singlem pipe -1 reads/SRX5707377_R1.fastq.gz -2 reads/SRX5707377_R2.fastq.gz \
  -p singlem/SRX5707377.tsv \
  --taxonomic-profile-krona singlem/SRX5707377.html  \
  --otu-table singlem/SRX5707377.otu \
  --threads 32

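# Kraken2 example (set DB to one of the two databases listed above):
DB=k2_standard_20230314
kraken2 reads/SRX5707377_R1.fastq.gz reads/SRX5707377_R2.fastq.gz \
  --db /qib/platforms/Informatics/transfer/outgoing/databases/kraken2/benlangmead/${DB} \
  --threads 16 --memory-mapping --paired \
  --report kraken2/${DB}/SRX5707377.report  # report file name is illustrative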

# MetaPhlAn (locally):
for i in reads/*_R1*gz; do
  echo "$i"
  metaphlan --input_type fastq --nproc 64 "$i" "profiles/metaphlan/$(basename "$i").tsv"
done

Quadram Institute Bioscience, working group on Data Science
