Skip to content

2. Read based analysis

Antonio Fernandez-Guerra edited this page Nov 9, 2015 · 15 revisions

Session 2: 14:00-15:30 1,30h

Gene prediction

One of the main tasks in metagenomics is to find which pontential functions are present in our samples. There are different approaches to achive this goal, for example the most starightforward one we can do, is to perform a translated similarity search (blastx) against a protein database using the reads, but this approach has some caveats, for example, our results will be biased to the content of the database; doing this search using all reads against a large database like the NCBI protein non-redundant (nr) is computationally really expensive; or we are losing the higher sensitivity and specificity power of using profile based methods for searching protein domain databases like PFAM. A good approach to avoid all this problems is to do an ab-initio gene prediction with tools designed specifically to deal with fragmented sequences like the ones we will find in short reads or assembled sequences. A good review about different gene prediction tools applied to short-reads and the particularities of this prediction can be found here

For our reads we will use FragGeneScan. FragGeneScan is used by MG-RAST and the EBI MG-Portal. Analysing OSD102 will take 32 min with only one core in our BioLinux. You already will find the pre-computed results of this sample, but we will try FragGeneScan with a smaller file we will create.

First we need to transform the merged sequences from fastq to fasta to be able to be processed by FragGeneScan. This is an easy task for one of the scripts present in BBtools:

cd ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged

reformat.sh in=OSD102.qc.fastq out=OSD102.qc.fasta

Next, we will extract 1000 reads and we will do the gene prediction:

cd ~/Desktop/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction

mkdir FGS-test

cd FGS-test

reformat.sh in=~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta out=OSD102.qc.1000.fasta reads=1000

Now we can run FragGeneScan on this 1000 merged reads in fasta format:

run_FragGeneScan.pl -genome OSD102.qc.1000.fasta -out OSD102.1000 -complete 0 -train illumina_5

In case we would like to do the gene prediction on the whole dataset we should do:

run_FragGeneScan.pl -genome ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta -out OSD102 -complete 0 -train illumina_5

We easily can get some statistics on the predicted sequences using one tiny utility from Sean Eddy in SQUID

cd ~/Desktop/Desktop/microbeco_course/analysis/1-read-level-analysis/0-gene-prediction

seqstat OSD102.faa

seqstat OSD102.ffn

We can count sequences easily using grep :

grep -c '>' OSD102.faa

How many genes does FGS predict? What are the mean lengths for nucleotide sequences? and for amino acid?

Identification of rDNA in metagenomic samples

In shotgun metagenomic sequences we also can identify rDNA fragments and perform a taxonomic classification in a similar fashion we do with amplicon data. There are different approaches to identify rDNAs in metagenomic samples, we can use profile methods approaches like rRNASelector, filtering algorithms like SortMeRNA,sequence similarity searches using blastn or more complex using more complex sequence aligners like PyNAST or SINA. All this methods rely on any of the specialised rRNA databases like RDP, GreenGenes or SILVA. In our example we will use SortMeRNA, it has a good combination between accuracy and speed using SILVA as reference database. Afterwards, we will perform the taxonomic annotation using SILVAngs.

To use SortMeRNA first we need to index the databases:

indexdb_rna --ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
	/usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
	/usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db

And then we will be ready to identify the reads that can be potential rDNAs:

sortmerna --reads ~/Desktop/microbeco_course/analysis/0-pre-processing/2-quality-trimming/merged/OSD102.qc.fasta\
    --ref /usr/local/bioinf/sortmerna/rRNA_databases/silva-bac-16s-id90.fasta,/usr/local/bioinf/sortmerna/index/silva-bac-16s-db:\
	     /usr/local/bioinf/sortmerna/rRNA_databases/silva-arc-16s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-arc-16s-db:\
	     /usr/local/bioinf/sortmerna/rRNA_databases/silva-euk-18s-id95.fasta,/usr/local/bioinf/sortmerna/index/silva-euk-18s-db \
    --fastx \
    --aligned OSD102_rDNA

Before submitting the sequences to SILVAngs we should check if we have duplicated headers. SILVAngs only uses the first field of the fasta header, if we have a look to our sequences we will see that we have duplicated:

cd /home/manager/Desktop/microbeco_course/analysis/1-read-level-analysis/1-rRNA-identification

 grep '>' OSD102_rRNA.fasta | cut -f1 -d ' '| sort | uniq -d | head -n 10

We can use some awk magic to remove the space on the fasta header:

awk '{$0 ~ /^>/ ? gsub(" ","_") : $0; print}' OSD102_rDNA.fasta OSD102_rDNA.silvangs.fasta

or we can follow a much easier approach using reformat.sh from BBtools:

reformat.sh in=OSD102_rDNA.fasta out=OSD102_rDNA.silvangs.fasta underscore=t

Now we will be ready to upload the file OSD102_rDNA.silvangs.fasta to SILVAngs. The sequences have been already analysed and you can get the results from:

wget "https://owncloud.mpi-bremen.de/index.php/s/eaB3ChiDG6i9C6M/download?path=%2Fcourses%2Fmetagenomics%2Fmicrobeco2015%2FSILVAngs&files=resultarchive-Microbeco_course_2015.zip -O Microbeco2015-SILVAngs.zip"

We will have a look at the Krona plots to see the different taxonomic classifications:

unzip Microbeco2015-SILVAngs.zip

firefox results/ssu/tax_breakdown/krona/microbeco_course_2015---ssu---krona----Total---sim_93---tax_silva---td_20.html &
Gene annotation

Once we identified the potential genes in our metagenomic samples we can start to look for the function they encode. Depending the questions we have we can use a more targeted approach, for example focusing on one single gene or taxonomic group, or we can do the annotation of the whole metagenomic sample. Depending of our computational resources we can do it using our own facilities or we can use any of the public resources like MG-RAST or EBI Metagenomics Portal. Usually this resources use a variety of biological databases, which we can use too in a local manner depending on what we want to ask to our metagenomic samples, some of those databases are PFAM, KEGG, eggNOG, COG....

OSD102 has been already analysed at EBI-MG portal. And also in MG-RAST (login details provided during the course)