Home

Metagenomic Analysis on Ancient Samples

Introduction

In this workshop, we will perform metagenomic analysis on ancient samples which are generated by Hannes’s group ( unpublished ). It mainly contains three parts.

Perform the metagenomic screening using Metaphlan4
Validate and authenticate the ancient genomes
Visualize the ancient genomes

After the exercise, you will be able to:

Identify the metagenomic profiles/taxonomic composition within the ancient sample
Authenticate the target genomes based on damage pattern/edit distance/evenness coverage, etc
Generate the plot for the ancient metagenomic analysis

Preparation steps

Let's create a folder and copy the toy datasets to your working folder.

mkdir -p ./metagenomic_workshop 
cd ./metagenomic_workshop
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw .

Extend to check files within the folder

GIL-1.fq.gz
HOR-1.fq.gz
MSW-2765.fq.gz

Install packages

MetaPhlan4

# copy the directory of MetaPhlan 4 to your working directory.
cp -r /projects/mjolnir1/people/hsf378/data/bin/MetaPhlAn-4.beta.1 .
# activate the environment of MetaPhlan 3
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
# enter the metaphlan 4 folder
cd /MetaPhlAn-4.beta.1
# update the software to version 4 by using pip install
pip install .

hclust2
Hclust2 is a package for plotting the heat-maps, which could be applied to visualize the abundance of species across samples from the Metaphlan results.

# create a conda environment for hclust2
conda create --name hclust2
# activate the environment
conda activate hclust2
# install packages for running hclust2
conda install -c bioconda hclust2

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/hclust2 /home/$USER-ID/.conda/envs
conda activate hclust2

filterBAM

# create a new conda environment for filterBAM
conda create --name bam-filter
# activate the environment
conda activate bam-filter
# install the packages required for running filterBAM
conda install -c conda-forge -c bioconda -c genomewalker bam-filter

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/bam-filter /home/$USER-ID/.conda/envs
conda activate bam-filter

metaDMG

# download the environment setting file for metaDMG
wget https://github.com/metaDMG-dev/metaDMG-core/raw/main/environment.yaml 
# install the packages required for running metaDMG and creat the corresponding conda environment
conda env create --file environment.yaml

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/metaDMG /home/$USER-ID/.conda/envs
conda activate metaDMG

Metagenomic screening

Load the working environment and make a folder for the results

mkdir Metaphlan
cd Metaphlan
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/

Run metagenomic screening on Metaphlan

if you only have one sample, eg. GIL-1.fq.gz

metaphlan  GIL-1.fq.gz \
 --nproc 16 \
 --read_min_len 30 \
 --input_type fastq \
 --bowtie2out GIL-1.bowtie2.out \
 -o GIL-1.metaphlan4.txt \
 --bowtie2db /projects/mjolnir1/people/tmk528/SharedData/Databases/Metaphlan4

By default, metaphlan perform the metagenomic profiling on bacteria, archaea and eukaryotes. If you want to include the viruses, you have to change the bowtie2 database which set up for Metaphlan 3 and add extra parameters --mpa3 --add_viruses as the following command:

metaphlan  GIL-1.fq.gz \
 --nproc 16 \
 --read_min_len 30 \
 --input_type fastq \
 --bowtie2out GIL-1.bowtie2.out \
 -o GIL-1.metaphlan4.txt \
 --bowtie2db /home/tmk528/data/SharedData/Databases/Metaphlan3 \
 --mpa3 \
 --add_viruses

if you perform metagenomic profiling on multiple samples

Prepare a file including all samples for the downstream metagenomic screening.

cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt .

Run the script for running metaphlan in parallel.

cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/metaphlan4.array.sh .
sbatch metaphlan4.array.sh

Note: Before you run the scripts, remember to load the conda environment by using the command: conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/

Check the output files

Extend to check files within the folder

GIL-1.bowtie2.out
GIL-1.metaphlan4.txt
HOR-1.bowtie2.out
MSW-2765.metaphlan4.txt
MSW-2765.bowtie2.out
MSW-2765.metaphlan4.txt

Check the bowtie2.out

Click

header of reads	marker gene
M_A00706:484:H25WWDSX3:2:1451:20401:33599_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	194702__GCA_900460955.1_00324
M_A00706:484:H25WWDSX3:2:2554:4698:1297_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	563038__E9FKL8__HMPREF0851_00827
M_A00706:484:H25WWDSX3:2:2519:31412:28025_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	999425__K8MRR9__HMPREF9186_01101
M_A00706:484:H25WWDSX3:2:2619:28140:6605_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	544581__G9PQL4__HMPREF1979_01803
M_A00706:484:H25WWDSX3:2:2166:3170:31532_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	43675__D2NSN9__RM6536_1480
M_A00706:484:H25WWDSX3:2:2173:24080:36401_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	2047__E3H1D8__HMPREF0733_10607
M_A00706:484:H25WWDSX3:2:2224:10366:15060_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	487__A0A0Y5FL51__galM
M_A00706:484:H25WWDSX3:2:2532:18945:21371_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	487__A0A378WD25__NM65014_0252
M_A00706:357:HKNKTDRXY:2:2101:32633:5118_RG:Z:ILLUMINA-MSW-2765_SCR1_4Y2F7	1263109__GCA_000438135.1_00036
M_A00706:484:H25WWDSX3:2:2158:4083:8030_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2	712121__L1PF62__HMPREF9061_01305

Check the metaphlan4.txt

Click

#mpa_v30_CHOCOPhlAn_201901
#/projects/mjolnir1/apps/conda/metaphlan-3.0.14/bin/metaphlan /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw/MSW-2765.fq.gz --nproc 16 --read_min_len 30 --input_type fastq --bowtie2out MSW-2765.bowtie2.out -o MSW-2765.metaphlan4.txt --bowtie2db /home/jql545/data/FromJonas/Bowtie2DB --mpa3 --add_viruses
#SampleID Metaphlan_Analysis

#clade_name	NCBI_tax_id	relative_abundance
k__Bacteria	2	81.89136
k__Viruses	10239	18.03597
k__Archaea	2157	0.07268
k__Bacteria\|p__Firmicutes	2\|1239	30.84009
k__Bacteria\|p__Proteobacteria	2\|1224	26.65484
k__Viruses\|p__Viruses_unclassified	10239\|	18.03597

Merge all output tables

/projects/mjolnir1/people/hsf378/data/MetaPhlAn-4.beta.1/metaphlan/utils/merge_metaphlan_tables.py *.metaphlan4.txt > merge.metaphlan4.txt

Check the merged abundance file

head merge.metaphlan4.txt

Click

#mpa_v30_CHOCOPhlAn_201901

clade_name	NCBI_tax_id	MSW-2765.metaphlan4	HOR-1.metaphlan4	GIL-1.metaphlan4
k__Archaea	2157	0.07268	0.27318	0.13221
k__Archaea\|p__Candidatus_Bathyarchaeota	2157\|928852	0.02647	0	0.01343
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified	2157\|928852\|	0.02647	0	0.01343
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified\|o__Candidatus_Bathyarchaeota_unclassified	2157\|928852\|\|	0.02647	0	0.01343
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified\|o__Candidatus_Bathyarchaeota_unclassified\|f__Candidatus_Bathyarchaeota_unclassified	2157\|928852\|\|\|	0.02647	0	0.01343
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified\|o__Candidatus_Bathyarchaeota_unclassified\|f__Candidatus_Bathyarchaeota_unclassified\|g__Candidatus_Bathyarchaeota_unclassified	2157\|928852\|\|\|\|	0.02647	0	0.01343
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified\|o__Candidatus_Bathyarchaeota_unclassified\|f__Candidatus_Bathyarchaeota_unclassified\|g__Candidatus_Bathyarchaeota_unclassified\|s__Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8	2157\|928852\|\|\|\|\|1974384	0.02647	0	0
k__Archaea\|p__Candidatus_Bathyarchaeota\|c__Candidatus_Bathyarchaeota_unclassified\|o__Candidatus_Bathyarchaeota_unclassified\|f__Candidatus_Bathyarchaeota_unclassified\|g__Candidatus_Bathyarchaeota_unclassified\|s__Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b	2157\|928852\|\|\|\|\|1797378	0	0	0.01343

Modify the output for the downstream visualization.

grep -E "(^ID)|(s__)" merge.metaphlan4.txt | grep -v "t__" | sed "s/.*s__//g" > modify.merge.metaphlan4.txt
sed -n "2,2p" merge.metaphlan4.txt | cat - modify.merge.metaphlan4.txt | cut -f 1,3-5 | sed 's/.metaphlan4//g' > final.merge.metaphlan4.txt

Check the modified abundance file

head final.merge.metaphlan4.txt

Click

clade_name	MSW-2765	HOR-1	GIL-1
Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8	0.02647	0	0
Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b	0	0	0.01343
Candidatus_Lokiarchaeota_archaeon_CR_4	0.0462	0.27318	0.11878
Pyrinomonas_methylaliphatogenes	0.08442	0	0
Actinobaculum_sp_oral_taxon_183	0.57967	0.10377	0
Actinomyces_georgiae	0.00688	0	0
Actinomyces_graevenitzii	0	0.25428	0
Actinomyces_johnsonii	0.11789	0.05306	0
Actinomyces_meyeri	0.09658	0	0

Visualize the abundance profiles

Here, hclust2 is applied to generate the heatmap plot for the merged metaphlan abundance profiles.

conda deactivate 
conda activate /home/hsf378/.conda/envs/hclust2
hclust2.py \
 -i final.merge.metaphlan4.txt \
 -o metaphlan4.sqrt_scale.png \
 --ftop 50 \
 --f_dist_f braycurtis \
 --s_dist_f braycurtis \
 --cell_aspect_ratio 0.5 \
 -l --flabel_size 2 \
 --max_flabel_len 100 \
 --max_slabel_len 100 \
 --slabel_size 2 \
 --dpi 1000 \

Check the heatmap plot

scp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/final.metaphlan4_add_viruses.png .

Note: Here is the final results for the MetaPhlan
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/

Validation and Authentication

Align reads to the GTDB database including representative reference genomes of bacteria, archaea, viruses and eukaryotes.

bowtie2 -x gtdb-r202-organelles-viruses \
    -p 8 \
    -q -U GIL-1.fq.gz \
    --maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \
    -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" --very-sensitive --no-unal \
    | samtools view -@ 24 -b -F 4 \
    | samtools sort -@ 2 -O bam - > GIL-1.vanilla.sorted.bam

Here the parameter -k 16: maximum 16 alignments per reads. --maxins 500: the maximum of reads length is 500bp. --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE: set the flag for header. --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05": specify how bowtie2 scores the alignments.

Check the alignment results:

10073353 reads; of these:
10073353 (100.00%) were unpaired; of these:
8532659 (84.71%) aligned 0 times
741841 (7.36%) aligned exactly 1 time
798853 (7.93%) aligned >1 times
15.29% overall alignment rate

After the mapping steps, we will remove the duplicates reads with Picard MarkDuplicates.

module load picard/2.15.0
picard MarkDuplicates I=GIL-1.vanilla.sorted.bam O=GIL-1.vanilla.rmdup.bam M=GIL-1.rmdup.txt AS=True REMOVE_DUPLICATES=True

Note: If we have multiple samples, we could run the alignment in parallel by using the following scripts

Click

#!/bin/bash
#SBATCH -c 24
#SBATCH --mem-per-cpu 10000
#SBATCH --time 7-00
#SBATCH --array=1-3%3 ##replace 3 with the number of samples you run
 
module load openjdk/13.0.1 bowtie2/2.4.2 picard/2.15.0

FASTQ_DIRECTORY=/PATH TO SEQUENCING RESULTS/
SAMPLE_LIST=/PATH TO SAMPLES/Samples.txt
DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses #/PATH TO THE DATABASE/prefix of the db index
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)

bowtie2 -x $DB -U $FASTQ_DIRECTORY/$SAMPLE.fq.gz -p 24 \
--maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \ 
-k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" \
--very-sensitive --no-unal  | samtools view -@ 24 -b -F 4 - \
| samtools sort -@ 2 -O bam - > $SAMPLE.vanilla.sorted.bam

picard MarkDuplicates I=$SAMPLE.vanilla.sorted.bam O=$SAMPLE.vanilla.rmdup.bam M=$SAMPLE.rmdup.txt AS=True REMOVE_DUPLICATES=True

Filter out the noise from multiple aligned reads, and those reference genomes with uneven coverage by using FilterBAM.

#!/bin/bash

#SBATCH -c 24
#SBATCH --mem-per-cpu 3000
#SBATCH --time 7-00
#SBATCH --array=1-3%3

SAMPLE_LIST=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt
#DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses
DB=/projects/mjolnir1/data/databases/GTDB-Hires/2022-05-15/gtdb-r202-organelles-viruses-hires
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)
path=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/hires/bowtie2

filterBAM -t 24 -p $SAMPLE.hires -r $DB.len.map $path/$SAMPLE.hires.rmdup.bam

For example on the GIL-1 sample:

INFO ::: 2022-12-07 05:37:49 ::: Loading BAM file
INFO ::: 2022-12-07 05:37:51 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:37:55 ::: Reloading BAM file
INFO ::: 2022-12-07 05:37:56 ::: Found 140,239 reference sequences
INFO ::: 2022-12-07 05:37:56 ::: Removing references without mappings or less than 10 reads...
INFO ::: 2022-12-07 05:37:56 ::: Keeping 22,010 references
INFO ::: 2022-12-07 05:37:56 ::: Processing 96 chunks of 230 references each
INFO ::: 2022-12-07 05:47:05 ::: Writing reference statistics to GIL-1.vanilla_stats.tsv.gz
INFO ::: 2022-12-07 05:47:08 ::: Filtering stats...
INFO ::: 2022-12-07 05:47:08 ::: ::: min_read_count >= 10 & min_read_length >= 30 & min_avg_read_ani >= 90.0 & min_expected_breadth_ratio >= 0.5 & min_breadth >= 0 & min_coverage_evenness >= 0 & min_coverage_mean >= 0
INFO ::: 2022-12-07 05:47:08 ::: Saving filtered stats...
INFO ::: 2022-12-07 05:47:10 ::: Writing filtered BAM file... (be patient)
INFO ::: 2022-12-07 05:47:11 ::: ::: Filtering 20,858 references sequentially...
INFO ::: 2022-12-07 05:47:58 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:48:01 ::: ALL DONE.

Check how much noise did we filter out:

### The number of aligned reads

# 7,703,509
samtools view -c ../bowtie2/GIL-1.hires.rmdup.bam
# 6,888,515
samtools view -c GIL-1.hires.filtered.bam 

### The number of reference genomes

# 57,576
samtools view ../bowtie2/GIL-1.hires.rmdup.bam | cut -f3 | sort -u | wc -l
# 20,858
samtools view  GIL-1.hires.filtered.bam |cut -f3 | sort -u | wc -l

# 22,010
zcat GIL-1.hires_stats.tsv.gz | sed "1d"  | wc -l
# 20,858
zcat GIL-1.hires_stats-filtered.tsv.gz | sed "1d"  | wc -l

Check the summary statistic result of filterBAM

# overview
zcat GIL-1.hires_stats-filtered.tsv.gz | less -S
# extract the columns we may be interested in
zcat GIL-1.hires_stats-filtered.tsv.gz \
| csvtk cut -t -T -f "reference,n_reads,read_ani_mean,read_ani_std,coverage_mean,breadth,breadth_exp_ratio,cov_evenness,tax_abund_read" \
| csvtk sort -t -T -k "n_reads:Nr" | less -S

Assess the damage pattern

Here, we apply the metaDMG to estimate the damage over taxa within samples (LCA mode).

Load the metaDMG working environment

conda activate /home/hsf378/.conda/envs/metaDMG

Generate the config file

metaDMG config --config-file meta_workshop.hires.lca.yaml \
  --custom-database \
  --names /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/names.dmp \
  --nodes /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/nodes.dmp \
  --acc2tax /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/acc2taxid.map.gz \
  --metaDMG-cpp /home/tmk528/data/bin/metaDMG-cpp/metaDMG-cpp \
  --parallel-samples 1 \
  --cores-per-sample 8 \
  --output-dir ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG \ ### change the path
  --max-position 15 \
  --lca-rank '' ' ' \
  --min-similarity-score 0.95 \
  --max-similarity-score 1 \
  --damage-mode lca \
  --weight-type 1 \
  --forward-only GIL-1.hires.filtered.bam HOR-1.hires.filtered.bam MSW-2765.hires.filtered.bam

Compute the results

sbatch metaDMG.hires.compute.sh

The details on metaDMG.hires.compute.sh

#!/bin/bash

#SBATCH -c 24
#SBATCH --mem-per-cpu 8000
#SBATCH --time=07:00:00

conda activate /projects/mjolnir1/people/hsf378/.conda/envs/metaDMG
metaDMG compute ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG/meta_workshop.hires.lca.yaml

Check the results
1. open a new terminal and run the following code:

ssh -L 8050:localhost:8050 [email protected]
conda activate /home/hsf378/.conda/envs/metaDMG
metaDMG dashboard --results /PATH TO METADMG RESULTS FOLDER/ --port 8050 --server

open your browser

http://localhost:8050/

Note: Here is the final results for the filterBAM and
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/

Visualization

TBC

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Home

Metagenomic Analysis on Ancient Samples

Introduction

Preparation steps

Metagenomic screening

Validation and Authentication

Visualization

Reference

Clone this wiki locally