
2. Read-based analysis

Antonio Fernandez-Guerra edited this page Nov 10, 2015 · 15 revisions

Session 2: 14:00-15:30

Table of contents

  1. Gene Prediction
  2. Identification of rDNA in metagenomic samples
  3. Gene annotation

Gene prediction

One of the main tasks in metagenomics is to find out which potential functions are present in our samples. There are different approaches to achieve this goal. The most straightforward one is to perform a translated similarity search (blastx) of the reads against a protein database. However, this approach has some caveats: our results will be biased towards the content of the database; searching all reads against a large database like the NCBI protein non-redundant (nr) database is computationally very expensive; and we lose the higher sensitivity and specificity of profile-based methods that search protein domain databases like PFAM. A good way to avoid all these problems is to perform an ab initio gene prediction with tools designed specifically to deal with fragmented sequences, like the ones we will find in short reads or assembled sequences. A good review of the different gene prediction tools applied to short reads, and the particularities of this kind of prediction, can be found here

For our reads we will use FragGeneScan, which is also used by MG-RAST and the EBI Metagenomics portal. Analysing OSD102 takes roughly 32 minutes using a single core in our BioLinux virtual machine, so you will find the pre-computed results for this sample already in your folder. Nevertheless, we will learn how to use FragGeneScan on a smaller file, which we will create first.

First we need to convert the merged sequences from fastq to fasta so that FragGeneScan can process them. This is an easy task with one of the scripts included in BBtools:

cd ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged

reformat.sh in=OSD102.qc.fastq out=OSD102.qc.fasta

Next, we will extract 1000 reads and do the gene prediction on them:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction

mkdir FGS-test

cd FGS-test

reformat.sh in=~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta out=OSD102.qc.1000.fasta reads=1000

Now we can run FragGeneScan on these 1000 merged reads in fasta format:

run_FragGeneScan.pl -genome OSD102.qc.1000.fasta -out OSD102.1000 -complete 0 -train illumina_5

If we wanted to run the gene prediction on the whole dataset, we would do:

run_FragGeneScan.pl -genome ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta -out OSD102 -complete 0 -train illumina_5

We can easily get some statistics on the predicted sequences using seqstat, a tiny utility from Sean Eddy's SQUID package:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction

seqstat OSD102.faa

seqstat OSD102.ffn

We can count sequences easily using grep:

grep -c '>' OSD102.faa

How many genes does FragGeneScan predict? What is the mean length of the nucleotide sequences? And of the amino acid sequences?
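To answer questions like these without seqstat, a short awk one-liner over the fasta file is enough. This is a sketch on a hypothetical toy file (toy.faa is not part of the course data); the same command works on OSD102.faa or OSD102.ffn:

```shell
# Hypothetical two-sequence FASTA standing in for OSD102.faa
printf '>g1\nMKV\n>g2\nMKVLT\n' > toy.faa

# Count the sequences and compute their mean length:
# header lines increment the counter, other lines add to the total length
awk '/^>/ {n++; next} {len += length($0)} END {print n, len/n}' toy.faa
```

Note that this assumes the residues of a sequence may span several lines, which is why all non-header lines are summed before dividing in the END block.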

Identification of rDNA in metagenomic samples

In shotgun metagenomic data we can also identify rDNA fragments and perform a taxonomic classification in a similar fashion to what we do with amplicon data. There are different approaches to identify rDNAs in metagenomic samples: we can use profile-based methods like rRNASelector, filtering algorithms like SortMeRNA, sequence similarity searches using blastn, or more complex sequence aligners like PyNAST or SINA. All these methods rely on one of the specialised rRNA databases such as RDP, GreenGenes or SILVA. We will use SortMeRNA, which offers a good trade-off between accuracy and speed, with SILVA as the reference database. Afterwards, we will perform the taxonomic annotation using SILVAngs.

To use SortMeRNA first we need to index the databases:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification

indexdb_rna --ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db

And then we will be ready to identify the reads that are potential rDNAs:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification

sortmerna --reads ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta \
--ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
/usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db \
--fastx \
--aligned OSD102_rDNA

Before submitting the sequences to SILVAngs we should check whether we have duplicated headers, because SILVAngs only uses the first field of the fasta header. If we have a look at our sequences, we see that we do have duplicates:

cd /home/manager/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification

grep '>' OSD102_rDNA.fasta | cut -f1 -d ' '| sort | uniq -d | head -n 10
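The pipeline above can be sanity-checked on a hypothetical toy file (toy.fa and the read names are made up here) where two headers share the same first field:

```shell
# Hypothetical FASTA with two headers that collide on the first field
printf '>readA 1:1\nACGT\n>readA 2:2\nTTGA\n' > toy.fa

# Keep only the first space-separated field of each header,
# then let uniq -d report the fields that occur more than once
grep '>' toy.fa | cut -f1 -d ' ' | sort | uniq -d
```

If the output is empty, all first fields are unique and the file is safe to submit as-is.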

We can use some awk magic to replace the spaces in the fasta headers with underscores:

awk '{$0 ~ /^>/ ? gsub(" ","_") : $0; print}' OSD102_rDNA.fasta > OSD102_rDNA.silvangs.fasta

or we can follow a much easier approach using reformat.sh from BBtools:

reformat.sh in=OSD102_rDNA.fasta out=OSD102_rDNA.silvangs.fasta underscore=t
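We can check that the awk one-liner behaves as expected by running it on a hypothetical two-line record (the header content below is invented for illustration):

```shell
# Apply the same substitution to a toy FASTA record:
# gsub() only fires on lines matching /^>/, the sequence line passes through
printf '>read1 1:N:0 OSD102\nACGT\n' \
  | awk '{$0 ~ /^>/ ? gsub(" ","_") : $0; print}'
```

The spaces in the header become underscores, so the whole original header survives as the first (and only) field.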

Now we are ready to upload the file OSD102_rDNA.silvangs.fasta to SILVAngs. The sequences have already been analysed, and you can get the results with:

wget 'https://owncloud.mpi-bremen.de/index.php/s/eaB3ChiDG6i9C6M/download?path=%2Fcourses%2Fmetagenomics%2Fmicrobeco2015%2FSILVAngs&files=resultarchive-Microbeco_course_2015.zip' -O Microbeco2015-SILVAngs.zip

We will have a look at the Krona plots to see the different taxonomic classifications:

unzip Microbeco2015-SILVAngs.zip

firefox results/ssu/tax_breakdown/krona/microbeco_course_2015---ssu---krona----Total---sim_93---tax_silva---td_20.html &

Gene annotation

Now that we have identified the potential genes in our metagenomic samples, we can start to look for the functions they encode. Depending on the questions we have, we can use a more targeted approach, for example focusing on a single gene or taxonomic group, or we can annotate the whole metagenomic sample. Depending on our computational resources, we can do this using our own facilities or any of the public resources like MG-RAST or the EBI Metagenomics Portal. These resources usually rely on a variety of biological databases, which we can also use locally, depending on what we want to ask of our metagenomic samples. Some of those databases are PFAM, KEGG, eggNOG and COG.

OSD102 has already been analysed at the EBI Metagenomics portal, and also in MG-RAST (login details provided during the course).

On the other hand, instead of producing the full annotation of our metagenome, we may be interested in one function, or in the functions of one specific organism. As an example, we will use protein domains in combination with the tool HMMER to search for potential bacterial ammonia monooxygenase (AmoA) proteins in our metagenome. We will also look for the proteins that might belong to the genus Prochlorococcus; for this we extracted all amino acid sequences corresponding to taxid 1218 from the NCBI non-redundant database. We will search them with the blastp implemented in DIAMOND, a fast alternative to BLAST that, although less sensitive, can be up to 20,000X faster.

We will start by searching for the AmoA domain in our predicted amino acid sequences. First we will download the HMM profile from the Pfam website:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/2-sequence-annotation

wget http://pfam.xfam.org/family/PF05145/hmm -O AmoA.hmm

Then we will use hmmsearch from HMMER to analyse our predicted amino acid sequences:

hmmsearch --cut_ga --domtblout OSD102.txt AmoA.hmm ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction/OSD102.faa

The parameters used for hmmsearch are --cut_ga, which means that HMMER will use the gathering threshold defined by the PFAM curators to decide whether a match is significant or not, and --domtblout, which specifies the output format: in this case we want to save a parseable table of per-domain hits to a file.

Here you can find a nice explanation of the differences between using hmmscan and hmmsearch.

How many potential AmoA sequences are in our metagenomic sample? We can easily get the number of hits in the file using a combination of grep and wc.
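Lines starting with '#' in a --domtblout file are headers and comments, so we only want to count the remaining lines. A sketch on a hypothetical miniature file (toy.domtbl and its contents are invented; run the same pipeline on OSD102.txt):

```shell
# Hypothetical miniature --domtblout file: '#' lines are comments,
# the other lines are one per-domain hit each
printf '# target name ...\nseq1 - AmoA PF05145 ...\nseq2 - AmoA PF05145 ...\n#\n' > toy.domtbl

# Drop the comment lines with grep -v, count the rest with wc
grep -v '^#' toy.domtbl | wc -l
```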

Next, we will use DIAMOND to query our Prochlorococcus database with blastp, using the predicted amino acid sequences as queries. We could also use the quality-trimmed reads and blastx to achieve similar results. Before using DIAMOND we need to create a database from the fasta file located in ~/Desktop/microbeco_course/data/DB:

cd ~/Desktop/microbeco_course/analysis/1-read-level-analysis/2-sequence-annotation

diamond makedb --in ~/Desktop/microbeco_course/data/DB/prochlo.fasta -d prochlo

The DIAMOND database, prochlo.dmnd, will be created in the current folder. Now we are ready to search the database using our amino acid sequences as query:

diamond blastp -e 1e-5 -d prochlo.dmnd -q ~/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction/OSD102.faa -a OSD102.prochlo

We set -e 1e-5 to report only the hits with an e-value lower than 1e-5. The output from the search (OSD102.prochlo.daa) is in the DIAMOND alignment archive format. We can transform the file into a more human-readable format like the BLAST tabular one:

diamond view -a OSD102.prochlo.daa -o OSD102.prochlo.txt

The columns for the new file are:

 1. query
 2. subject
 3. % id
 4. alignment length
 5. mismatches
 6. gap openings
 7. query start
 8. query end
 9. subject start
10. subject end
11. e-value
12. bit score
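Because the file is tab-separated, individual columns are easy to pull out with awk. A sketch on a single hypothetical result line (the query and subject names and all values below are invented):

```shell
# One made-up line of tabular output with the 12 columns listed above
line=$(printf 'read_42\tPRO_00123\t98.5\t73\t1\t0\t1\t73\t10\t82\t3.2e-30\t150')

# Extract the query (column 1), % id (column 3) and e-value (column 11)
printf '%s\n' "$line" | awk -F'\t' '{print $1, $3, $11}'
```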

From the results, we can get the best hit for each of our sequences using a bit of awk:

awk '!a[$1]++' OSD102.prochlo.txt > OSD102.prochlo.bh.txt
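The !a[$1]++ idiom is worth unpacking: the expression is true only the first time each query id (field 1) is seen, so only the first (i.e. top-ranked) hit per query is printed. A sketch on a hypothetical three-hit file (toy.hits.txt and its values are invented):

```shell
# Toy tabular file: two hits for q1 (best first, as in sorted output),
# one hit for q2
printf 'q1\ts1\t99\nq1\ts2\t80\nq2\ts3\t95\n' > toy.hits.txt

# a[$1]++ returns 0 (false) on first sight of a query id, so
# !a[$1]++ is true once per query and the first hit survives
awk '!a[$1]++' toy.hits.txt
```

This only yields the best hit if the hits for each query are already ordered best-first, as they are in the search output.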

Or we can be more strict and filter the best hits to keep only those with a lower e-value, let's say 1e-25:

awk '$11 <= 1e-25' OSD102.prochlo.bh.txt > OSD102.prochlo.bh.1e-25.txt
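The filter works because awk compares field 11 numerically when it looks like a number, so scientific notation is handled correctly. A sketch on two hypothetical 11-column lines (toy.bh.txt and all values below are invented):

```shell
# Two made-up best-hit lines; field 11 holds the e-value
printf 'q1\ts1\t99\t70\t0\t0\t1\t70\t1\t70\t1e-40\nq2\ts2\t80\t50\t5\t1\t1\t50\t1\t50\t1e-10\n' > toy.bh.txt

# Only lines whose field 11 is numerically <= 1e-25 are printed
awk '$11 <= 1e-25' toy.bh.txt
```

Here only the first line passes, since 1e-40 <= 1e-25 while 1e-10 is not.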

How many hits to Prochlorococcus proteins do we have? How many best hits? And how many of them have an e-value lower than 1e-25?