-
Notifications
You must be signed in to change notification settings - Fork 0
Home
In this workshop, we will perform metagenomic analysis on ancient samples which are generated by Hannes’s group ( unpublished ). It mainly contains three parts.
- Perform the metagenomic screening using Metaphlan4
- Validate and authenticate the ancient genomes
- Visualize the ancient genomes
After the exercise, you will be able to:
- Identify the metagenomic profiles/taxonomic composition within the ancient sample
- Authenticate the target genomes based on damage pattern/edit distance/evenness coverage, etc
- Generate the plot for the ancient metagenomic analysis
- Let's create a folder and copy the toy datasets to your working folder.
mkdir -p ./metagenomic_workshop
cd ./metagenomic_workshop
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw .
Extend to check files within the folder
GIL-1.fq.gz
HOR-1.fq.gz
MSW-2765.fq.gz
- Install packages
# copy the directory of MetaPhlan 4 to your working directory.
cp -r /projects/mjolnir1/people/hsf378/data/bin/MetaPhlAn-4.beta.1 .
# activate the environment of MetaPhlan 3
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
# enter the metaphlan 4 folder
cd /MetaPhlAn-4.beta.1
# update the software to version 4 by using pip install
pip install .
- Hclust2 is a package for plotting the heat-maps, which could be applied to visualize the abundance of species across samples from the Metaphlan results.
# create a conda environment for hclust2
conda create --name hclust2
# activate the environment
conda activate hclust2
# install packages for running hclust2
conda install -c bioconda hclust2
or you could just copy and paste the preinstalled conda environment to your conda environment
cp -r /home/hsf378/.conda/envs/hclust2 /home/$USER-ID/.conda/envs
conda activate hclust2
# create a new conda environment for filterBAM
conda create --name bam-filter
# activate the environment
conda activate bam-filter
# install the packages required for running filterBAM
conda install -c conda-forge -c bioconda -c genomewalker bam-filter
or you could just copy and paste the preinstalled conda environment to your conda environment
cp -r /home/hsf378/.conda/envs/bam-filter /home/$USER-ID/.conda/envs
conda activate bam-filter
# download the environment setting file for metaDMG
wget https://github.com/metaDMG-dev/metaDMG-core/raw/main/environment.yaml
# install the packages required for running metaDMG and creat the corresponding conda environment
conda env create --file environment.yaml
or you could just copy and paste the preinstalled conda environment to your conda environment
cp -r /home/hsf378/.conda/envs/metaDMG /home/$USER-ID/.conda/envs
conda activate metaDMG
- Load the working environment and make a folder for the results
mkdir Metaphlan
cd Metaphlan
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
- Run metagenomic screening on Metaphlan
- if you only have one sample, eg. GIL-1.fq.gz
metaphlan GIL-1.fq.gz \
--nproc 16 \
--read_min_len 30 \
--input_type fastq \
--bowtie2out GIL-1.bowtie2.out \
-o GIL-1.metaphlan4.txt \
--bowtie2db /projects/mjolnir1/people/tmk528/SharedData/Databases/Metaphlan4
By default, metaphlan perform the metagenomic profiling on bacteria, archaea and eukaryotes. If you want to include the viruses, you have to change the bowtie2 database which set up for Metaphlan 3 and add extra parameters --mpa3 --add_viruses
as the following command:
metaphlan GIL-1.fq.gz \
--nproc 16 \
--read_min_len 30 \
--input_type fastq \
--bowtie2out GIL-1.bowtie2.out \
-o GIL-1.metaphlan4.txt \
--bowtie2db /home/tmk528/data/SharedData/Databases/Metaphlan3 \
--mpa3 \
--add_viruses
- if you perform metagenomic profiling on multiple samples
- Prepare a file including all samples for the downstream metagenomic screening.
cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt .
- Run the script for running metaphlan in parallel.
cp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/metaphlan4.array.sh .
sbatch metaphlan4.array.sh
Note: Before you run the scripts, remember to load the conda environment by using the command:
conda activate /projects/mjolnir1/apps/conda/metaphlan-3.0.14/
- Check the output files
Extend to check files within the folder
GIL-1.bowtie2.out
GIL-1.metaphlan4.txt
HOR-1.bowtie2.out
MSW-2765.metaphlan4.txt
MSW-2765.bowtie2.out
MSW-2765.metaphlan4.txt
- Check the bowtie2.out
Click
header of reads | marker gene |
---|---|
M_A00706:484:H25WWDSX3:2:1451:20401:33599_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 194702__GCA_900460955.1_00324 |
M_A00706:484:H25WWDSX3:2:2554:4698:1297_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 563038__E9FKL8__HMPREF0851_00827 |
M_A00706:484:H25WWDSX3:2:2519:31412:28025_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 999425__K8MRR9__HMPREF9186_01101 |
M_A00706:484:H25WWDSX3:2:2619:28140:6605_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 544581__G9PQL4__HMPREF1979_01803 |
M_A00706:484:H25WWDSX3:2:2166:3170:31532_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 43675__D2NSN9__RM6536_1480 |
M_A00706:484:H25WWDSX3:2:2173:24080:36401_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 2047__E3H1D8__HMPREF0733_10607 |
M_A00706:484:H25WWDSX3:2:2224:10366:15060_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 487__A0A0Y5FL51__galM |
M_A00706:484:H25WWDSX3:2:2532:18945:21371_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 487__A0A378WD25__NM65014_0252 |
M_A00706:357:HKNKTDRXY:2:2101:32633:5118_RG:Z:ILLUMINA-MSW-2765_SCR1_4Y2F7 | 1263109__GCA_000438135.1_00036 |
M_A00706:484:H25WWDSX3:2:2158:4083:8030_RG:Z:ILLUMINA-MSW-2765_SCR1_HXZD2 | 712121__L1PF62__HMPREF9061_01305 |
- Check the metaphlan4.txt
Click
#mpa_v30_CHOCOPhlAn_201901
#/projects/mjolnir1/apps/conda/metaphlan-3.0.14/bin/metaphlan /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/raw/MSW-2765.fq.gz --nproc 16 --read_min_len 30 --input_type fastq --bowtie2out MSW-2765.bowtie2.out -o MSW-2765.metaphlan4.txt --bowtie2db /home/jql545/data/FromJonas/Bowtie2DB --mpa3 --add_viruses
#SampleID Metaphlan_Analysis
#clade_name | NCBI_tax_id | relative_abundance | additional_species |
---|---|---|---|
k__Bacteria | 2 | 81.89136 | |
k__Viruses | 10239 | 18.03597 | |
k__Archaea | 2157 | 0.07268 | |
k__Bacteria|p__Firmicutes | 2|1239 | 30.84009 | |
k__Bacteria|p__Proteobacteria | 2|1224 | 26.65484 | |
k__Viruses|p__Viruses_unclassified | 10239| | 18.03597 |
- Merge all output tables
/projects/mjolnir1/people/hsf378/data/MetaPhlAn-4.beta.1/metaphlan/utils/merge_metaphlan_tables.py *.metaphlan4.txt > merge.metaphlan4.txt
- Check the merged abundance file
head merge.metaphlan4.txt
Click
#mpa_v30_CHOCOPhlAn_201901
clade_name | NCBI_tax_id | MSW-2765.metaphlan4 | HOR-1.metaphlan4 | GIL-1.metaphlan4 |
---|---|---|---|---|
k__Archaea | 2157 | 0.07268 | 0.27318 | 0.13221 |
k__Archaea|p__Candidatus_Bathyarchaeota | 2157|928852 | 0.02647 | 0 | 0.01343 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified | 2157|928852| | 0.02647 | 0 | 0.01343 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified | 2157|928852|| | 0.02647 | 0 | 0.01343 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified | 2157|928852||| | 0.02647 | 0 | 0.01343 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified | 2157|928852|||| | 0.02647 | 0 | 0.01343 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified|s__Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8 | 2157|928852|||||1974384 | 0.02647 | 0 | 0 |
k__Archaea|p__Candidatus_Bathyarchaeota|c__Candidatus_Bathyarchaeota_unclassified|o__Candidatus_Bathyarchaeota_unclassified|f__Candidatus_Bathyarchaeota_unclassified|g__Candidatus_Bathyarchaeota_unclassified|s__Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b | 2157|928852|||||1797378 | 0 | 0 | 0.01343 |
- Modify the output for the downstream visualization.
grep -E "(^ID)|(s__)" merge.metaphlan4.txt | grep -v "t__" | sed "s/.*s__//g" > modify.merge.metaphlan4.txt
sed -n "2,2p" merge.metaphlan4.txt | cat - modify.merge.metaphlan4.txt | cut -f 1,3-5 | sed 's/.metaphlan4//g' > final.merge.metaphlan4.txt
- Check the modified abundance file
head final.merge.metaphlan4.txt
Click
clade_name | MSW-2765 | HOR-1 | GIL-1 |
---|---|---|---|
Candidatus_Bathyarchaeota_archaeon_CG_4_8_14_3_um_filter_42_8 | 0.02647 | 0 | 0 |
Candidatus_Bathyarchaeota_archaeon_RBG_13_46_16b | 0 | 0 | 0.01343 |
Candidatus_Lokiarchaeota_archaeon_CR_4 | 0.0462 | 0.27318 | 0.11878 |
Pyrinomonas_methylaliphatogenes | 0.08442 | 0 | 0 |
Actinobaculum_sp_oral_taxon_183 | 0.57967 | 0.10377 | 0 |
Actinomyces_georgiae | 0.00688 | 0 | 0 |
Actinomyces_graevenitzii | 0 | 0.25428 | 0 |
Actinomyces_johnsonii | 0.11789 | 0.05306 | 0 |
Actinomyces_meyeri | 0.09658 | 0 | 0 |
- Visualize the abundance profiles
Here, hclust2 is applied to generate the heatmap plot for the merged metaphlan abundance profiles.
conda deactivate
conda activate /home/hsf378/.conda/envs/hclust2
hclust2.py \
-i final.merge.metaphlan4.txt \
-o metaphlan4.sqrt_scale.png \
--ftop 50 \
--f_dist_f braycurtis \
--s_dist_f braycurtis \
--cell_aspect_ratio 0.5 \
-l --flabel_size 2 \
--max_flabel_len 100 \
--max_slabel_len 100 \
--slabel_size 2 \
--dpi 1000 \
- Check the heatmap plot
scp /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/final.metaphlan4_add_viruses.png .
Note: Here is the final results for the MetaPhlan
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Metaphlan/
- Align reads to the GTDB database including representative reference genomes of bacteria, archaea, viruses and eukaryotes.
bowtie2 -x gtdb-r202-organelles-viruses \
-p 8 \
-q -U GIL-1.fq.gz \
--maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \
-k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" --very-sensitive --no-unal \
| samtools view -@ 24 -b -F 4 \
| samtools sort -@ 2 -O bam - > GIL-1.vanilla.sorted.bam
Here the parameter
-k 16
: maximum 16 alignments per reads.--maxins 500
: the maximum of reads length is 500bp.--rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE
: set the flag for header.--np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05"
: specify how bowtie2 scores the alignments.
- Check the alignment results:
10073353 reads; of these:
10073353 (100.00%) were unpaired; of these:
8532659 (84.71%) aligned 0 times
741841 (7.36%) aligned exactly 1 time
798853 (7.93%) aligned >1 times
15.29% overall alignment rate
After the mapping steps, we will remove the duplicates reads with Picard MarkDuplicates.
module load picard/2.15.0
picard MarkDuplicates I=GIL-1.vanilla.sorted.bam O=GIL-1.vanilla.rmdup.bam M=GIL-1.rmdup.txt AS=True REMOVE_DUPLICATES=True
Note: If we have multiple samples, we could run the alignment in parallel by using the following scripts
Click
#!/bin/bash
#SBATCH -c 24
#SBATCH --mem-per-cpu 10000
#SBATCH --time 7-00
#SBATCH --array=1-3%3 ##replace 3 with the number of samples you run
module load openjdk/13.0.1 bowtie2/2.4.2 picard/2.15.0
FASTQ_DIRECTORY=/PATH TO SEQUENCING RESULTS/
SAMPLE_LIST=/PATH TO SAMPLES/Samples.txt
DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses #/PATH TO THE DATABASE/prefix of the db index
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)
bowtie2 -x $DB -U $FASTQ_DIRECTORY/$SAMPLE.fq.gz -p 24 \
--maxins 500 --rg-id ILLUMINA-$SAMPLE --rg SM:$SAMPLE --rg PL:illumina --rg PU:ILLUMINA-$SAMPLE \
-k 16 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.05" \
--very-sensitive --no-unal | samtools view -@ 24 -b -F 4 - \
| samtools sort -@ 2 -O bam - > $SAMPLE.vanilla.sorted.bam
picard MarkDuplicates I=$SAMPLE.vanilla.sorted.bam O=$SAMPLE.vanilla.rmdup.bam M=$SAMPLE.rmdup.txt AS=True REMOVE_DUPLICATES=True
- Filter out the noise from multiple aligned reads, and those reference genomes with uneven coverage by using FilterBAM.
#!/bin/bash
#SBATCH -c 24
#SBATCH --mem-per-cpu 3000
#SBATCH --time 7-00
#SBATCH --array=1-3%3
SAMPLE_LIST=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/Samples.txt
#DB=/home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses
DB=/projects/mjolnir1/data/databases/GTDB-Hires/2022-05-15/gtdb-r202-organelles-viruses-hires
SAMPLE=$(sed -n "$SLURM_ARRAY_TASK_ID"p $SAMPLE_LIST)
path=/projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/hires/bowtie2
filterBAM -t 24 -p $SAMPLE.hires -r $DB.len.map $path/$SAMPLE.hires.rmdup.bam
For example on the GIL-1 sample:
INFO ::: 2022-12-07 05:37:49 ::: Loading BAM file
INFO ::: 2022-12-07 05:37:51 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:37:55 ::: Reloading BAM file
INFO ::: 2022-12-07 05:37:56 ::: Found 140,239 reference sequences
INFO ::: 2022-12-07 05:37:56 ::: Removing references without mappings or less than 10 reads...
INFO ::: 2022-12-07 05:37:56 ::: Keeping 22,010 references
INFO ::: 2022-12-07 05:37:56 ::: Processing 96 chunks of 230 references each
INFO ::: 2022-12-07 05:47:05 ::: Writing reference statistics to GIL-1.vanilla_stats.tsv.gz
INFO ::: 2022-12-07 05:47:08 ::: Filtering stats...
INFO ::: 2022-12-07 05:47:08 ::: ::: min_read_count >= 10 & min_read_length >= 30 & min_avg_read_ani >= 90.0 & min_expected_breadth_ratio >= 0.5 & min_breadth >= 0 & min_coverage_evenness >= 0 & min_coverage_mean >= 0
INFO ::: 2022-12-07 05:47:08 ::: Saving filtered stats...
INFO ::: 2022-12-07 05:47:10 ::: Writing filtered BAM file... (be patient)
INFO ::: 2022-12-07 05:47:11 ::: ::: Filtering 20,858 references sequentially...
INFO ::: 2022-12-07 05:47:58 ::: BAM index not found. Indexing...
INFO ::: 2022-12-07 05:48:01 ::: ALL DONE.
- Check how much noise did we filter out:
### The number of aligned reads
# 7,703,509
samtools view -c ../bowtie2/GIL-1.hires.rmdup.bam
# 6,888,515
samtools view -c GIL-1.hires.filtered.bam
### The number of reference genomes
# 57,576
samtools view ../bowtie2/GIL-1.hires.rmdup.bam | cut -f3 | sort -u | wc -l
# 20,858
samtools view GIL-1.hires.filtered.bam |cut -f3 | sort -u | wc -l
# 22,010
zcat GIL-1.hires_stats.tsv.gz | sed "1d" | wc -l
# 20,858
zcat GIL-1.hires_stats-filtered.tsv.gz | sed "1d" | wc -l
- Check the summary statistic result of filterBAM
# overview
zcat GIL-1.hires_stats-filtered.tsv.gz | less -S
# extract the columns we may be interested in
zcat GIL-1.hires_stats-filtered.tsv.gz \
| csvtk cut -t -T -f "reference,n_reads,read_ani_mean,read_ani_std,coverage_mean,breadth,breadth_exp_ratio,cov_evenness,tax_abund_read" \
| csvtk sort -t -T -k "n_reads:Nr" | less -S
- Assess the damage pattern
Here, we apply the metaDMG to estimate the damage over taxa within samples (LCA mode).
- Load the metaDMG working environment
conda activate /home/hsf378/.conda/envs/metaDMG
- Generate the config file
metaDMG config --config-file meta_workshop.hires.lca.yaml \
--custom-database \
--names /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/names.dmp \
--nodes /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/nodes.dmp \
--acc2tax /home/tmk528/data/db/Vanilla/gtdb-r202-organelles-viruses/acc2taxid.map.gz \
--metaDMG-cpp /home/tmk528/data/bin/metaDMG-cpp/metaDMG-cpp \
--parallel-samples 1 \
--cores-per-sample 8 \
--output-dir ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG \ ### change the path
--max-position 15 \
--lca-rank '' ' ' \
--min-similarity-score 0.95 \
--max-similarity-score 1 \
--damage-mode lca \
--weight-type 1 \
--forward-only GIL-1.hires.filtered.bam HOR-1.hires.filtered.bam MSW-2765.hires.filtered.bam
- Compute the results
sbatch metaDMG.hires.compute.sh
The details on metaDMG.hires.compute.sh
#!/bin/bash
#SBATCH -c 24
#SBATCH --mem-per-cpu 8000
#SBATCH --time=07:00:00
conda activate /projects/mjolnir1/people/hsf378/.conda/envs/metaDMG
metaDMG compute ~/hsf378/data/Metagenomic_workshop/filterBAM/hires/metaDMG/meta_workshop.hires.lca.yaml
- Check the results
- open a new terminal and run the following code:
ssh -L 8050:localhost:8050 [email protected]
conda activate /home/hsf378/.conda/envs/metaDMG
metaDMG dashboard --results /PATH TO METADMG RESULTS FOLDER/ --port 8050 --server
- open your browser
http://localhost:8050/
Note: Here is the final results for the filterBAM and
ln -s /projects/mjolnir1/people/hsf378/data/Metagenomic_workshop/filterBAM/
TBC