-
Notifications
You must be signed in to change notification settings - Fork 1
Day 4
In this tutorial, we will replicate some current analyses using the new tools we have developed. By the end of this tutorial, you will be able to:
- Map ancient metagenomic reads to a database of reference genomes to understand the sample's composition.
- Identify specific genomes in the ancient metagenome.
- Authenticate genomes based on damage patterns.
First, let's create the necessary folder structure in our working directory (wdir
):
cd ~/course/wdir
mkdir -p day4
cd day4
ln -s ~/course/data/day4/ data
Now, activate the environment for today's session:
conda activate env4
Before moving forward, we need to install a missing package:
pip install tabview
If everything went as expected, we will have our system ready for today.
Our approach for analysing ancient microbial data is a bit different from what is the norm in the field. After QC'ing our sequences we follow the following steps:
extension -> dereplication -> mapping -> reassign -> filtering -> damage estimation
One of the first steps we do is to extend the reads by doing very gentle assemblies, but before proceeding let's get some statistics for our fastQ files:
seqkit stats -j 5 -T data/fq/PRI-TJPGK-CATN-160-162.fq.gz | csvtk -t pretty
ℹ️ seqkit, csvtk and taxonkit are very useful tools. Check them at https://bioinf.shenwei.me/
And extend the reads using bbmap. We will need to do two different steps, first calculate how much memory we need so the extension is deterministic and then do the extension using the estimates:
MEM=$(loglog.sh seed=1234 k=17 in=data/fq/PRI-TJPGK-CATN-160-162.fq.gz ignorebadquality 2> >(grep Cardinality) \
| awk -vP=0.5 -vB=16 -vH=3 '{{print int( (((B*H)/8)*$2)/P )}}')
tadpole.sh -Xmx10G \
k=17 \
in=data/fq/PRI-TJPGK-CATN-160-162.fq.gz \
out=PRI-TJPGK-CATN-160-162.extended.fq.gz \
mode=extend \
ibb=f \
prefilter=0 \
el=100 er=100 \
threads=5 \
overwrite=true \
trimends=9 \
ecc=f ecco=f \
filtermem="${MEM}" \
conservative=t \
ignorebadquality
Get the stats of PRI-TJPGK-CATN-160-162.extended.fq.gz
Once the reads have been extended, we will dereplicate them:
seqkit rmdup -j 5 -s -o PRI-TJPGK-CATN-160-162.extended.derep.fq.gz PRI-TJPGK-CATN-160-162.extended.fq.gz
Get the stats of PRI-TJPGK-CATN-160-162.extended.derep.fq.gz
Before mapping, we will get the original reads using the extended-dereplicated as a guide. Although we could use this set of reads for mapping, at the moment we prefer not to (any ideas why?):
filterbyname.sh in=data/fq/PRI-TJPGK-CATN-160-162.fq.gz out=PRI-TJPGK-CATN-160-162.mapping.fastq.gz names=PRI-TJPGK-CATN-160-162.extended.derep.fq.gz threads=5 overwrite=t include=t
So finally we have the reads we can use for mapping. Although we use Bowtie2 we follow a slightly different strategy for mapping. First let's activate the conda environment that we need:
conda activate mapping
And let's do the mapping:
bowtie2 -p 5 -k 100 \
-N 1 -D 5 -R 1 -L 22 -i S,0,2.50 \
--np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1" \
-x data/db/aegenomics.db \
-q PRI-TJPGK-CATN-160-162.mapping.fastq.gz --no-unal \
| samtools view -F 4 -b \
| samtools sort -@ 32 -m 8G -o PRI-TJPGK-CATN-160-162.sorted.bam
We can have a look at the specific parameters at the bowtie2 manual: https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
Expand for an explanation of the parameters
There's a lot going on in this last command. First, with -k we ask for a maximum of 100 alignments per read. This is because many of those reads are going to be mapping equally well to multiple references.Seed: We will do a fast search by using -N 1 -D 5 -R 1 -L 22 -i S,0,2.50. -i S,0,2.5 sets the interval function f to f(x) = 1 + 2.5 * sqrt(x), where x is the read length, which set how many seed substrings will be generated. Those are the preset parameters for a fast search in Bowtie2, but with the difference that we are using -N 1 to allow a mismatch in the seed to increase sensitivity.
The parameters --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" determine the reward/penalization values. We can adjust the % identity in --score-min "L,0,-0.1", where the -0.1 == 90, 0.05 == 95%
This is just an example for the tutorial, in real life you might want to use:
recommended, good compromise between sensitivity, specificity and speed: -D 15 -R 2 -N 1 -L 22 -i S,1,1.15 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
faster: -D 10 -R 2 -N 1 -L 22 -i S,0,2.50 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
more sensitive, but ~10X slower: -D 15 -R 2 -N 1 -L 20 -i S,1,1.15 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
super sensitive, but ~20X slower (Reduce -L if you want to be more sensitive): -D 15 -R 3 -N 1 -L 20 -i S,1,0.5 --np 1 --mp "1,1" --rdg "0,1" --rfg "0,1" --score-min "L,0,-0.1"
Now we have the reads mapped and ready for the next step!
Now is time to process the BAM file, filter out references with uneven coverages and estimate multiple metrics -including taxonomic abundances- and estimate the LCA.
First we will remove duplicates from the BAM files with sambamba, which similar heuristics than Picard, but is way faster:
sambamba markdup -r -t 5 -p PRI-TJPGK-CATN-160-162.sorted.bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam
At this point, duplicates are defined by the alignment position of the reads.
Then we reassign reads to the reference they (most likely) belong using an E-M algorithm that takes into account the alignment score.
filterBAM reassign \
-r data/misc/aegenomics.db.len.map \
--bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam \
-t 5 \
-i 0 \
-m "8G" \
-o PRI-TJPGK-CATN-160-162.reassigned.bam \
-n 3
Expand for an explanation of the printed output
INFO ::: 2024-09-25 20:29:30 ::: Using filterBAM version: 1.6.1 INFO ::: 2024-09-25 20:29:30 ::: Temporary directory: /home/ta-4/course/wdir/day4/tmpeh3_2i6w INFO ::: 2024-09-25 20:29:30 ::: Checking BAM file status INFO ::: 2024-09-25 20:29:30 ::: ::: Found 2,689 reference sequences INFO ::: 2024-09-25 20:29:30 ::: ::: BAM file looks good. INFO ::: 2024-09-25 20:29:30 ::: Resolving multi-mapping reads... INFO ::: 2024-09-25 20:29:30 ::: ::: Loading BAM file INFO ::: 2024-09-25 20:29:30 ::: ::: IO Threads: 1 | Processing Threads: 5 INFO ::: 2024-09-25 20:29:30 ::: ::: Found 2,689 reference sequences INFO ::: 2024-09-25 20:29:30 ::: ::: Removing references with less than 3 reads... INFO ::: 2024-09-25 20:29:30 ::: ::: Kept 11,265,278 alignments
INFO ::: 2024-09-25 20:29:30 ::: ::: Keeping 1,508 references INFO ::: 2024-09-25 20:29:30 ::: ::: Creating reference chunks with uniform read amounts... INFO ::: 2024-09-25 20:29:30 ::: ::: ::: Created 5 chunks
INFO ::: 2024-09-25 20:29:30 ::: ::: Extracting reads from BAM file... INFO ::: 2024-09-25 20:29:50 ::: ::: Collecting results...
INFO ::: 2024-09-25 20:29:57 ::: ::: ::: Removed 0 references without alignments INFO ::: 2024-09-25 20:29:57 ::: ::: Indexing references... INFO ::: 2024-09-25 20:29:57 ::: ::: Indexing reads... INFO ::: 2024-09-25 20:30:06 ::: ::: Allocating data... INFO ::: 2024-09-25 20:30:14 ::: ::: References: 1,508 | Reads: 2,237,078 | Alignments: 10,930,074(...)
Iter: n - R = removed alignments on the iteration; U = remaining unique mapping reads; NU = new unique reads; L = remaining alignments
(...)
INFO ::: 2024-09-25 20:31:00 ::: ::: ::: No more alignments to remove. Stopping. INFO ::: 2024-09-25 20:31:01 ::: ::: References: 498 | Reads: 2,237,078 | Alignments: 2,237,092 INFO ::: 2024-09-25 20:31:01 ::: ::: Unique mapping reads: 2,237,064 | Multimapping reads: 14 INFO ::: 2024-09-25 20:31:01 ::: ::: Mapping back indices... INFO ::: 2024-09-25 20:31:02 ::: ::: Calculating reads per subject... INFO ::: 2024-09-25 20:31:03 ::: ::: Removing references with less than 3... INFO ::: 2024-09-25 20:31:03 ::: ::: Total refs/reads combination: 2,237,021 INFO ::: 2024-09-25 20:31:03 ::: ::: Total references: 437
(...)
Removing references with less than 3... Total refs/reads combination: 3,052,050 Total references: 1,052
Then we run the filtering step, which estimates several metrics for each reference in our BAM file and filters out those that do not meet the defined criteria:
filterBAM filter \
--bam PRI-TJPGK-CATN-160-162.reassigned.bam \
-N \
-r data/misc/aegenomics.db.len.map \
-A 92 \
-a 94 \
-n 100 \
-b 0.75 \
-B 0.01 \
-e 0.6 \
-t 5 \
--sort-memory 8G \
--include-low-detection \
--stats PRI-TJPGK-CATN-160-162.stats.tsv.gz \
--stats-filtered PRI-TJPGK-CATN-160-162.stats-filtered.tsv.gz \
--bam-filtered PRI-TJPGK-CATN-160-162.filtered.bam
Expand for an explanation of the parameters
There's a lot going on in this last command: We use -N to sort the filtered BAM files by name, so it can be used by metaDMG. The `-r` specifies where we can find the non-concatenated length of the references for the abundance estimation. With `-A` we will only keep those reads with at least 92% ANI and `-a` will keep those references with at least 94% of average ANI. The `-n` will keep only those references with at least 100 reads mapping.And... what are the -b and -B accounting for?
Lastly, we perform an LCA analysis using the reads that passed the filtering criteria and estimate the taxonomic abundances:
filterBAM lca \
--bam PRI-TJPGK-CATN-160-162.filtered.bam \
-r data/misc/aegenomics.db.len.map \
-m "8G" \
-t 5 \
--names data/taxonomy/names.dmp \
--nodes data/taxonomy/nodes.dmp \
--acc2taxid data/taxonomy/acc2taxid.map.gz \
--lca-summary PRI-TJPGK-CATN-160-162.lca-summary.tsv.gz \
--stats PRI-TJPGK-CATN-160-162.stats-filtered.tsv.gz
Let's deactivate the mapping
environment:
conda deactivate
Let's have a look at the results of the filtering:
zcat PRI-TJPGK-CATN-160-162.stats-filtered.tsv.gz \
| csvtk cut -t -T -f "reference,n_reads,read_ani_mean,read_ani_std,coverage_mean,breadth,exp_breadth,breadth_exp_ratio,norm_entropy,norm_gini,cov_evenness,tax_abund_tad" \
| csvtk grep -r -t -v -f reference -p _plas -p _mito \
| csvtk sort -t -T -k "n_reads:Nr" \
| tabview -
We will explore four different examples to understand the effect of each filter. We will get the references GCA_002781685.1
, IMGVR_UViG_3300027782_000260
and GCA_014380485.1
.
printf "GCA_002781685.1\nIMGVR_UViG_3300027782_000260\nGCA_014380485.1" > ref-list.txt
getRPercId --bam PRI-TJPGK-CATN-160-162.sorted.rmdup.bam --reference-list ref-list.txt --threads 5 --sort-memory 8G
Activate the environment with the necessary tools:
conda activate env2
And let's explore the coverage patterns:
bamcov -w 0 -m GCA_002781685.1.bam
bamcov -w 0 -m GCA_014380485.1.bam
bamcov -w 0 -m IMGVR_UViG_3300027782_000260.bam
Now that we have our set of refined references, is time to authenticate them, this means to look for damage patterns. We recently developed a new method that can estimate damage over thousands of taxa (LCA mode), references (local mode) or provide a global estimate (global mode).
Let's first activate the needed environment:
conda activate metaDMG
We will use the LCA mode which allows us retrieving damage estimates for the different taxonomic levels across the tree structure. For this, we first run the lca command. This will create multiple objects as an lca file, the mismatch matrices which we be use as an input of the dfit function, and multiple stats.
metaDMG-cpp lca \
--bam PRI-TJPGK-CATN-160-162.filtered.bam \
--out_prefix PRI-TJPGK-CATN-160-162 \
--threads 5 \
--names data/taxonomy/names.dmp \
--nodes data/taxonomy/nodes.dmp \
--acc2tax data/taxonomy/acc2taxid.map.gz \
--fix_ncbi 0 \
--how_many 25 \
--sim_score_low 0.92 \
--weight_type 1 \
--out_prefix PRI-TJPGK-CATN-160-162
mv PRI-TJPGK-CATN-160-162.stat.gz PRI-TJPGK-CATN-160-162.lca.stat.gz
Next, we run the dfit function which performs numerical optimization of the deamination frequencies based on the mismatch matrix (PRI-TJPGK-CATN-160-162.bdamage.gz) to estimate four parameters: A,q,c,phi.
metaDMG-cpp dfit PRI-TJPGK-CATN-160-162.bdamage.gz \
--names data/taxonomy/names.dmp \
--nodes data/taxonomy/nodes.dmp \
--threads 5 \
--showfits 2 \
--nopt 5 \
--seed 1234 \
--nbootstrap 3 \
--out_prefix PRI-TJPGK-CATN-160-162
Lastly, we run the aggregate function to obtain some lca statistics through the tree structure:
metaDMG-cpp aggregate \
PRI-TJPGK-CATN-160-162.bdamage.gz \
--lcastat PRI-TJPGK-CATN-160-162.lca.stat.gz \
--names data/taxonomy/names.dmp \
--nodes data/taxonomy/nodes.dmp \
-o PRI-TJPGK-CATN-160-162
conda deactivate metaDMG
Let's load the R environment and lunch a r terminal attached to the VSC workspace:
conda activate r
radian
We will open the R scripts in the folder ~/course/data/day4/src
in VS Code and load the r
extension.