-
Notifications
You must be signed in to change notification settings - Fork 7
2. Read based analysis
One of the main tasks in metagenomics is to find which potential functions are present in our samples. There are different approaches to achieve this goal, for example the most straightforward one we can do, is to perform a translated similarity search (blastx) against a protein database using the reads. However, this approach has some caveats: for example, our results will be biased to the content of the database; doing this search using all reads against a large database like the NCBI protein non-redundant (nr) is computationally really expensive; or we are losing the higher sensitivity and specificity power of using profile based methods for searching protein domain databases like PFAM. A good approach to avoid all this problems is to do an ab-initio gene prediction with tools designed specifically to deal with fragmented sequences like the ones we will find in short reads or assembled sequences. A good review about different gene prediction tools applied to short-reads and the particularities of this prediction can be found here
For our reads we will use FragGeneScan. FragGeneScan is used by MG-RAST and the EBI MG-Portal. Analysing OSD102 will take roughly 32 minutes using only one core in our BioLinux virtual machine. You already will find the pre-computed results of this sample in your folder. Nevertheless, we will learn using FragGeneScan with a smaller file which we will create.
First we need to transform the merged sequences from fastq to fasta to be able to be process the sequences using FragGeneScan. This is an easy task with one of the scripts present in BBtools:
cd ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged
reformat.sh in=OSD102.qc.fastq out=OSD102.qc.fasta
Next, we will extract 1000 reads and we will do the gene prediction:
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction
mkdir FGS-test
cd FGS-test
reformat.sh in=~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta out=OSD102.qc.1000.fasta reads=1000
Now we can run FragGeneScan on these 1000 merged reads in fasta format:
run_FragGeneScan.pl -genome OSD102.qc.1000.fasta -out OSD102.1000 -complete 0 -train illumina_5
In case we would like to run the gene prediction on the whole dataset we should do:
run_FragGeneScan.pl -genome ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta -out OSD102 -complete 0 -train illumina_5
We can easily get some statistics on the predicted sequences using one tiny utility from Sean Eddy in SQUID
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction
seqstat OSD102.faa
seqstat OSD102.ffn
We can count sequences easily using grep :
grep -c '>' OSD102.faa
How many genes does FGS predict? What is the mean length of the nucleotide sequences? and of the amino acid sequences?
In shotgun metagenomic sequences we can also identify rDNA fragments and perform a taxonomic classification in a similar fashion we do with amplicon data. There are different approaches to identify rDNAs in metagenomic samples, we can use profile methods approaches like rRNASelector, filtering algorithms like SortMeRNA,sequence similarity searches using blastn or more complex using more complex sequence aligners like PyNAST or SINA. All this methods rely on some of the specialised rRNA databases like RDP, GreenGenes or SILVA. We will use SortMeRNA, it has a trade-off between accuracy and speed using SILVA as reference database. Afterwards, we will perform the taxonomic annotation using SILVAngs.
To use SortMeRNA first we need to index the databases:
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification
indexdb_rna --ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db
And then we will be ready to identify the reads which are potentially rDNAs:
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification
sortmerna --reads ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta \
--ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db \
--fastx \
--aligned OSD102_rDNA
Before submitting the sequences to SILVAngs we should check if we have duplicated headers. SILVAngs only uses the first field of the fasta header. If we have a look at our sequences, we see that we have duplicates:
cd /home/manager/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification
grep '>' OSD102_rDNA.fasta | cut -f1 -d ' '| sort | uniq -d | head -n 10
We can use some awk magic to remove the space on the fasta header:
awk '{$0 ~ /^>/ ? gsub(" ","_") : $0; print}' OSD102_rDNA.fasta > OSD102_rDNA.silvangs.fasta
or we can follow a much easier approach using reformat.sh from BBtools:
reformat.sh in=OSD102_rDNA.fasta out=OSD102_rDNA.silvangs.fasta underscore=t
Now we will be ready to upload the file OSD102_rDNA.silvangs.fasta to SILVAngs. The sequences have been already analysed and you can get the results from:
wget 'https://owncloud.mpi-bremen.de/index.php/s/eaB3ChiDG6i9C6M/download?path=%2Fcourses%2Fmetagenomics%2Fmicrobeco2015%2FSILVAngs&files=resultarchive-Microbeco_course_2015.zip' -O Microbeco2015-SILVAngs.zip
We will have a look at the Krona plots to see the different taxonomic classifications:
unzip Microbeco2015-SILVAngs.zip
firefox results/ssu/tax_breakdown/krona/microbeco_course_2015---ssu---krona----Total---sim_93---tax_silva---td_20.html &
Now, that we identified the potential genes in our metagenomic samples, we can start to look for the functions they encode. Depending on the questions we have, we can use a more targeted approach. For example focusing on one single gene or taxonomic group, or we can do the annotation of the whole metagenomic sample. Depending on our computational resources, we can do it using our own facilities or we can use any of the public resources like MG-RAST or EBI Metagenomics Portal. Usually these resources use a variety of biological databases, which we can use too - in a local manner - depending on what we want to ask our metagenomic samples. Some of those databases are PFAM, KEGG, eggNOG, COG....
OSD102 has been already analysed at EBI-MG portal. And also in MG-RAST (login details provided during the course)
In the other hand, instead of having the full annotation of our metagenome, we can be interested in one function or for the functions of one specific organism. As an example, we will use protein domains in combination with the tool HMMER to search for potential bacterial ammonia monooxygenase (AmoA) proteins in our metagenome. Also, we will look for the proteins that might belong to Prochlorococcus genus; we extracted all aminoacid from the NCBI non-redundant database corresponding to the taxid 1218. We will use the blastx implemented in DIAMOND, a fast alternative to BLAST, although has a lower sensitivity than BLAST, can be 20,000X faster.
We will start searching the AmoA domain in our predicted aminoacid sequences. First we will donwload the HMM profile from the Pfam website:
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/2-sequence-annotation
wget http://pfam.xfam.org/family/PF05145/hmm -O AmoA.hmm
then we will use hmmsearch from HMMER to analyse our predicted aminoacid sequences.
hmmsearch --cut_ga --domtblout OSD102.txt AmoA.hmm ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction/OSD102.faa
The parameters used for hmmer are --cut_ga, that means that HMMER will the gathering threshold defined by the PFAM curators to decide if the match is significant or not; with --domtblout we specify the output format for HMMER, in this case we want to save into a file a parseable table of per-domain hits.
Here you can find a nice explanation about the differences of using hmmscan or hmmsearch.
How many potential AmoA sequences are in our metagenomic sample? We easily can get the number of lines in the file using a combination of grep and wc.
Next, we will use DIAMOND to query our Prochlorococcus database using blastp and the aminoacid sequences as query sequences. We would be able to use the quality trimmed reads and blastx to achieve similar results. Before using DIAMOND we need to create a database using the fasta file located in ~/Desktop/microbeco_course/data/DB
:
cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/2-sequence-annotation
diamond makedb --in ~/Desktop/microbeco_course/data/DB/prochlo.fasta -d prochlo
The DIAMOND database, prochlo.dmnd
will be created in the folder. Now we are ready to search the database using our aminoacid sequences as query:
diamond blastp -e 1e-5 -d prochlo.dmnd -q ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction/OSD102.faa -a OSD102.prochlo
We set -e 1e-5 to report the hits with an e-value lower than 1e-25 . The output from the search (prochlo.daa
) is in DIAMOND alignment archive format. We can transform the file in a more human accessible format like the BLAST tabular:
diamond view -a OSD102.prochlo.daa -o OSD102.prochlo.txt
The columns for the new file are:
1. query
2. subject
3. % id
4. alignment length
5. mistmatches
6. gap openings
7. query start
8. query end
9. subject start
10. subject end
11. e-value
12. bit score
From the results, we can get the best hit for each of our sequences using a bit of awk:
awk '!a[$1]++' OSD102.prochlo.txt > OSD102.prochlo.bh.txt
Or we can be more strict and filter the best hits to have a lower e-value, let's say 1e-25:
awk '$11 <= 1e-25' OSD102.prochlo.bh.txt > OSD102.prochlo.bh.1e-25.txt
How many hits to Prochlorococcus proteins do we have? How many best hits? And how many they have a lower e-value than 1e-25?