Skip to content
hsf378 edited this page Jan 3, 2023 · 35 revisions

Metagenomic Analysis on Ancient Samples

Introduction

In this workshop, we will perform metagenomic analysis on ancient samples which are generated by Hannes’s group ( unpublished ). It mainly contains three parts.

  1. Perform the metagenomic screening using Metaphlan4
  2. Validate and authenticate the ancient genomes
  3. Visualize the ancient genomes

After the exercise, you will be able to:

  • Identify the metagenomic profiles/taxonomic composition within the ancient sample
  • Authenticate the target genomes based on damage pattern/edit distance/evenness coverage, etc
  • Generate the plot for the ancient metagenomic analysis

Preparation steps

  1. Let's create a folder and copy the toy datasets to your working folder.
mkdir -p ./metagenomic_workshop 
cd ./metagenomic_workshop
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw .
Extend to check files within the folder

GIL-1.fq.gz
HOR-1.fq.gz
MSW-2765.fq.gz

  1. Install packages
# copy the directory of MetaPhlan 4 to your working directory.
cp -r /projects/mjolnir1/people/hsf378/data/bin/MetaPhlAn-4.beta.1 .
# activate the environment of MetaPhlan 3
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
# enter the metaphlan 4 folder
cd /MetaPhlAn-4.beta.1
# update the software to version 4 by using pip install
pip install .
  • hclust2

    Hclust2 is a package for plotting the heat-maps, which could be applied to visualize the abundance of species across samples from the Metaphlan results.
# create a conda environment for hclust2
conda create --name hclust2
# activate the environment
conda activate hclust2
# install packages for running hclust2
conda install -c bioconda hclust2

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/hclust2 /home/$USER-ID/.conda/envs
conda activate hclust2
# create a new conda environment for filterBAM
conda create --name bam-filter
# activate the environment
conda activate bam-filter
# install the packages required for running filterBAM
conda install -c conda-forge -c bioconda -c genomewalker bam-filter

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/bam-filter /home/$USER-ID/.conda/envs
conda activate bam-filter
# download the environment setting file for metaDMG
wget https://github.com/metaDMG-dev/metaDMG-core/raw/main/environment.yaml 
# install the packages required for running metaDMG and creat the corresponding conda environment
conda env create --file environment.yaml

or you could just copy and paste the preinstalled conda environment to your conda environment

cp -r /home/hsf378/.conda/envs/metaDMG /home/$USER-ID/.conda/envs
conda activate metaDMG

Metagenomic screening

  1. Load the working environment and make a folder for the results
mkdir Metaphlan
cd Metaphlan
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
  1. Run metagenomic screening on Metaphlan
  • if you only have one sample, eg. GIL-1.fq.gz
metaphlan  GIL-1.fq.gz \
 --nproc 16 \
 --read_min_len 30 \
 --input_type fastq \
 --bowtie2out GIL-1.bowtie2.out \
 -o GIL-1.metaphlan4.txt \
 --bowtie2db /projects/mjolnir1/people/tmk528/SharedData/Databases/Metaphlan4 

By default, metaphlan perform the metagenomic profiling on bacteria, archaea and eukaryotes. If you want to include the viruses, you have to change the bowtie2 database which set up for Metaphlan 3 and add extra parameters --mpa3 --add_viruses as the following command:

metaphlan  GIL-1.fq.gz \
 --nproc 16 \
 --read_min_len 30 \
 --input_type fastq \
 --bowtie2out GIL-1.bowtie2.out \
 -o GIL-1.metaphlan4.txt \
 --bowtie2db /home/tmk528/data/SharedData/Databases/Metaphlan3 \
 --mpa3 \
 --add_viruses
  • if you perform metagenomic profiling on multiple samples
  1. Prepare a file including all samples for the downstream metagenomic screening.
cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt .
  1. Run the script for running metaphlan in parallel.
cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/metaphlan4.array.sh .
sbatch metaphlan4.array.sh

Note: Before you run the scripts, remember to load the conda environment by using the command: conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/

  • Check the output files
Extend to check files within the folder

GIL-1.bowtie2.out
GIL-1.metaphlan4.txt
HOR-1.bowtie2.out
MSW-2765.metaphlan4.txt
MSW-2765.bowtie2.out
MSW-2765.metaphlan4.txt

  • Check the bowtie2.out
Click
header of reads marker gene
M_A00706:484:H25WWDSX3:2:1451:20401:33599_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 194702__GCA_900460955.1_00324
M_A00706:484:H25WWDSX3:2:2554:4698:1297_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 563038__E9FKL8__HMPREF0851_00827
M_A00706:484:H25WWDSX3:2:2519:31412:28025_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 999425__K8MRR9__HMPREF9186_01101
M_A00706:484:H25WWDSX3:2:2619:28140:6605_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 544581__G9PQL4__HMPREF1979_01803
M_A00706:484:H25WWDSX3:2:2166:3170:31532_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 43675__D2NSN9__RM6536_1480
M_A00706:484:H25WWDSX3:2:2173:24080:36401_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 2047__E3H1D8__HMPREF0733_10607
M_A00706:484:H25WWDSX3:2:2224:10366:15060_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 487__A0A0Y5FL51__galM
M_A00706:484:H25WWDSX3:2:2532:18945:21371_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 487__A0A378WD25__NM65014_0252
M_A00706:357:HKNKTDRXY:2:2101:32633:5118_RG:Z:ILLUMINA-MSW-2765_SCR1_4Y2F7 1263109__GCA_000438135.1_00036
M_A00706:484:H25WWDSX3:2:2158:4083:8030_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 712121__L1PF62__HMPREF9061_01305
  • Check the metaphlan4.txt
Click

#mpa_v30_CHOCOPhlAn_201901
#/projects/mjolnir1/apps/conda/metaphlan-3.0.14/bin/metaphlan /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw/MSW-2765.fq.gz --nproc 16 --read_min_len 30 --input_type fastq --bowtie2out MSW-2765.bowtie2.out -o MSW-2765.metaphlan4.txt --bowtie2db /home/jql545/data/FromJonas/Bowtie2DB --mpa3 --add_viruses
#SampleID Metaphlan_Analysis

#clade_name NCBI_tax_id relative_abundance additional_species
k__Bacteria 2 81.89136
k__Viruses 10239 18.03597
k__Archaea 2157 0.07268
k__Bacteria|p__Firmicutes 2|1239 30.84009
k__Bacteria|p__Proteobacteria 2|1224 26.65484
k__Viruses|p__Viruses_unclassified 10239| 18.03597
  1. Merge all output tables
/projects/mjolnir1/people/hsf378/data/MetaPhlAn-4.beta.1/metaphlan/utils/merge_metaphlan_tables.py *.metaphlan4.txt > merge.metaphlan4.txt 
  • Check the merged abundance file
head merge.metaphlan4.txt
Click

#mpa_v30_CHOCOPhlAn_201901

clade_name NCBI_tax_id MSW-2765.metaphlan4 HOR-1.metaphlan4 GIL-1.metaphlan4
k__Archaea 2157 0.07268 0.27318 0.13221
k__Archaea|p__Candidatus_Bathyarchaeota 2157|928852 0.02647 0 0.01343
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified 2157|928852| 0.02647 0 0.01343
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified 2157|928852|| 0.02647 0 0.01343
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified 2157|928852||| 0.02647 0 0.01343
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified 2157|928852|||| 0.02647 0 0.01343
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified|s__Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8 2157|928852|||||1974384 0.02647 0 0
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified|s__Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b 2157|928852|||||1797378 0 0 0.01343
  • Modify the output for the downstream visualization.
grep -E "(^ID)|(s__)" merge.metaphlan4.txt | grep -v "t__" | sed "s/.*s__//g" > modify.merge.metaphlan4.txt
sed -n "2,2p" merge.metaphlan4.txt | cat - modify.merge.metaphlan4.txt | cut -f 1,3-5 | sed 's/.metaphlan4//g' > final.merge.metaphlan4.txt
  • Check the modified abundance file
head final.merge.metaphlan4.txt
Click
clade_name MSW-2765 HOR-1 GIL-1
Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8 0.02647 0 0
Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b 0 0 0.01343
Candidatus_Lokiarchaeota_archaeon_CR_4 0.0462 0.27318 0.11878
Pyrinomonas_methylaliphatogenes 0.08442 0 0
Actinobaculum_sp_oral_taxon_183 0.57967 0.10377 0
Actinomyces_georgiae 0.00688 0 0
Actinomyces_graevenitzii 0 0.25428 0
Actinomyces_johnsonii 0.11789 0.05306 0
Actinomyces_meyeri 0.09658 0 0
  1. Visualize the abundance profiles

Here, hclust2 is applied to generate the heatmap plot for the merged metaphlan abundance profiles.

conda deactivate 
conda activate /home/hsf378/.conda/envs/hclust2
hclust2.py \
 -i final.merge.metaphlan4.txt \
 -o metaphlan4.sqrt_scale.png \
 --ftop 50 \
 --f_dist_f braycurtis \
 --s_dist_f braycurtis \
 --cell_aspect_ratio 0.5 \
 -l --flabel_size 2 \
 --max_flabel_len 100 \
 --max_slabel_len 100 \
 --slabel_size 2 \
 --dpi 1000 \
  • Check the heatmap plot
scp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/final.metaphlan4_add_viruses.png .

Note: Here is the final results for the MetaPhlan

ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/

Validation and Authentication

  1. Align reads to the GTDB database including representative reference genomes of bacteria, archaea, viruses and eukaryotes.
bowtie2 -x gtdb-r202-organelles-viruses \
    -p 8 \
    -q -U GIL-1.fq.gz \
    --maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \
    -k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" --very-sensitive --no-unal \
    | samtools view -@ 24 -b -F 4 \
    | samtools sort -@ 2 -O bam - > GIL-1.vanilla.sorted.bam

Here the parameter -k 16: maximum 16 alignments per reads. --maxins 500: the maximum of reads length is 500bp. --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE: set the flag for header. --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05": specify how bowtie2 scores the alignments.

  • Check the alignment results:

10073353 reads; of these:
10073353 (100.00%) were unpaired; of these:
8532659 (84.71%) aligned 0 times
741841 (7.36%) aligned exactly 1 time
798853 (7.93%) aligned >1 times
15.29% overall alignment rate

After the mapping steps, we will remove the duplicates reads with Picard MarkDuplicates.

module load picard/2.15.0
picard MarkDuplicates I=GIL-1.vanilla.sorted.bam O=GIL-1.vanilla.rmdup.bam M=GIL-1.rmdup.txt AS=True REMOVE_DUPLICATES=True

Note: If we have multiple samples, we could run the alignment in parallel by using the following scripts

Click
#!/bin/bash
#SBATCH -c 24
#SBATCH --mem-per-cpu 10000
#SBATCH --time 7-00
#SBATCH --array=1-3%3 ##replace 3 with the number of samples you run
 
module load openjdk/13.0.1 bowtie2/2.4.2 picard/2.15.0

FASTQ_DIRECTORY=/PATH TO SEQUENCING RESULTS/
SAMPLE_LIST=/PATH TO SAMPLES/Samples.txt
DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses #/PATH TO THE DATABASE/prefix of the db index
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)

bowtie2 -x $DB -U $FASTQ_DIRECTORY/$SAMPLE.fq.gz -p 24 \
--maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \ 
-k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" \
--very-sensitive --no-unal  | samtools view -@ 24 -b -F 4 - \
| samtools sort -@ 2 -O bam - > $SAMPLE.vanilla.sorted.bam

picard MarkDuplicates I=$SAMPLE.vanilla.sorted.bam O=$SAMPLE.vanilla.rmdup.bam M=$SAMPLE.rmdup.txt AS=True REMOVE_DUPLICATES=True

  1. Filter out the noise from multiple aligned reads, and those reference genomes with uneven coverage by using FilterBAM.
#!/bin/bash

#SBATCH -c 24
#SBATCH --mem-per-cpu 3000
#SBATCH --time 7-00
#SBATCH --array=1-3%3

SAMPLE_LIST=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt
#DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses
DB=/projects/mjolnir1/data/databases/GTDB-Hires/2022-05-15/gtdb-r202-organelles-viruses-hires
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)
path=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/hires/bowtie2

filterBAM -t 24 -p $SAMPLE.hires -r $DB.len.map $path/$SAMPLE.hires.rmdup.bam

For example on the GIL-1 sample:

INFO ::: 2022-12-07 05:37:49 ::: Loading BAM file
INFO ::: 2022-12-07 05:37:51 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:37:55 ::: Reloading BAM file
INFO ::: 2022-12-07 05:37:56 ::: Found 140,239 reference sequences
INFO ::: 2022-12-07 05:37:56 ::: Removing references without mappings or less than 10 reads...
INFO ::: 2022-12-07 05:37:56 ::: Keeping 22,010 references
INFO ::: 2022-12-07 05:37:56 ::: Processing 96 chunks of 230 references each
INFO ::: 2022-12-07 05:47:05 ::: Writing reference statistics to GIL-1.vanilla_stats.tsv.gz
INFO ::: 2022-12-07 05:47:08 ::: Filtering stats...
INFO ::: 2022-12-07 05:47:08 ::: ::: min_read_count >= 10 & min_read_length >= 30 & min_avg_read_ani >= 90.0 & min_expected_breadth_ratio >= 0.5 & min_breadth >= 0 & min_coverage_evenness >= 0 & min_coverage_mean >= 0
INFO ::: 2022-12-07 05:47:08 ::: Saving filtered stats...
INFO ::: 2022-12-07 05:47:10 ::: Writing filtered BAM file... (be patient)
INFO ::: 2022-12-07 05:47:11 ::: ::: Filtering 20,858 references sequentially...
INFO ::: 2022-12-07 05:47:58 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:48:01 ::: ALL DONE.

  • Check how much noise did we filter out:
### The number of aligned reads

# 7,703,509
samtools view -c ../bowtie2/GIL-1.hires.rmdup.bam
# 6,888,515
samtools view -c GIL-1.hires.filtered.bam 

### The number of reference genomes

# 57,576
samtools view ../bowtie2/GIL-1.hires.rmdup.bam | cut -f3 | sort -u | wc -l
# 20,858
samtools view  GIL-1.hires.filtered.bam |cut -f3 | sort -u | wc -l

# 22,010
zcat GIL-1.hires_stats.tsv.gz | sed "1d"  | wc -l
# 20,858
zcat GIL-1.hires_stats-filtered.tsv.gz | sed "1d"  | wc -l

  • Check the summary statistic result of filterBAM
# overview
zcat GIL-1.hires_stats-filtered.tsv.gz | less -S
# extract the columns we may be interested in
zcat GIL-1.hires_stats-filtered.tsv.gz \
| csvtk cut -t -T -f "reference,n_reads,read_ani_mean,read_ani_std,coverage_mean,breadth,breadth_exp_ratio,cov_evenness,tax_abund_read" \
| csvtk sort -t -T -k "n_reads:Nr" | less -S
  1. Assess the damage pattern

Here, we apply the metaDMG to estimate the damage over taxa within samples (LCA mode).

  • Load the metaDMG working environment
conda activate /home/hsf378/.conda/envs/metaDMG
  • Generate the config file
metaDMG config --config-file meta_workshop.hires.lca.yaml \
  --custom-database \
  --names /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/names.dmp \
  --nodes /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/nodes.dmp \
  --acc2tax /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/acc2taxid.map.gz \
  --metaDMG-cpp /home/tmk528/data/bin/metaDMG-cpp/metaDMG-cpp \
  --parallel-samples 1 \
  --cores-per-sample 8 \
  --output-dir ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG \ ### change the path
  --max-position 15 \
  --lca-rank '' ' ' \
  --min-similarity-score 0.95 \
  --max-similarity-score 1 \
  --damage-mode lca \
  --weight-type 1 \
  --forward-only GIL-1.hires.filtered.bam HOR-1.hires.filtered.bam MSW-2765.hires.filtered.bam
  • Compute the results
sbatch metaDMG.hires.compute.sh
The details on metaDMG.hires.compute.sh
#!/bin/bash

#SBATCH -c 24
#SBATCH --mem-per-cpu 8000
#SBATCH --time=07:00:00

conda activate /projects/mjolnir1/people/hsf378/.conda/envs/metaDMG
metaDMG compute ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG/meta_workshop.hires.lca.yaml

  • Check the results
    1. open a new terminal and run the following code:
ssh -L 8050:localhost:8050 [email protected]
conda activate /home/hsf378/.conda/envs/metaDMG
metaDMG dashboard --results /PATH TO METADMG RESULTS FOLDER/ --port 8050 --server
  1. open your browser
http://localhost:8050/

Note: Here is the final results for the filterBAM and

ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/

Visualization

TBC

Reference