trinotateR
is an R
package to summarize annotation reports from Trinotate. Use devtools to install the package.
library(devtools)
install_github("cstubben/trinotateR")
read_trinotate
loads a tab-delimited Trinotate annotation report into a data.table for faster data manipulation.
library(trinotateR)
x <- read_trinotate("Trinotate_report.xls")
# Read 75228 rows
summary_trinotate
returns the number of unique and total annotations in the table.
summary_trinotate(x)
unique total
gene_id 56144 75228
transcript_id 65130 75228
prot_id 59260 59260
prot_coords 26889 59260
TrEMBL_Top_BLASTX_hit 38788 47358
TrEMBL_Top_BLASTP_hit 35836 43048
Pfam 20897 25504
sprot_Top_BLASTX_hit 18586 23806
gene_ontology_blast 5996 23569
sprot_Top_BLASTP_hit 18267 22347
gene_ontology_pfam 1428 17254
eggnog 1456 15225
TmHMM 6236 7907
SignalP 100 5947
RNAMMER 20 129
transcript 0 0
peptide 0 0
Most of the annotations contain mutliple hits in a backtick-delimited list and each hit contains multiple fields in a caret-delimited list. For example, the second Pfam annotation below contains two hits and each hit contains a pfam id, symbol, name, alignment and e-value. The split_pfam
functions splits multiple hits and fields, so the second Pfam annotation is now printed in rows 2 and 3 below.
na.omit(x$Pfam)[1:2]
[1] "PF02586.9^DUF159^Uncharacterised ACR, COG2135^37-105^E:9.1e-20"
[2] "PF01386.14^Ribosomal_L25p^Ribosomal L25p family^50-139^E:3.8e-07`PF14693.1^Ribosomal_TL5_C^Ribosomal protein TL5, C-terminal domain^154-209^E:3.5e-09"
x1 <- split_pfam(x)
# 46040 Pfam annotations
head(x1,3)
gene transcript protein pfam symbol name align evalue
1: GG10000|c0_g1 GG10000|c0_g1_i1 m.81222 PF02586 DUF159 Uncharacterised ACR, COG2135 37-105 9.1e-20
2: GG10001|c2_g1 GG10001|c2_g1_i1 m.81232 PF01386 Ribosomal_L25p Ribosomal L25p family 50-139 3.8e-07
3: GG10001|c2_g1 GG10001|c2_g1_i1 m.81232 PF14693 Ribosomal_TL5_C Ribosomal protein TL5, C-terminal domain 154-209 3.5e-09
Finally, the summary_pfam
lists the 3278 unique Pfam identifiers and the total number of genes, transcripts and proteins with a Pfam annotation.
x2 <- summary_pfam(x1)
# 3278 rows
head(x2)
pfam symbol name genes transcripts proteins total
1: PF00069 Pkinase Protein kinase domain 655 953 999 1030
2: PF07714 Pkinase_Tyr Protein tyrosine kinase 619 909 952 989
3: PF00400 WD40 WD domain, G-beta repeat 344 431 445 953
4: PF13504 LRR_7 Leucine rich repeat 333 383 393 2263
5: PF00023 Ank Ankyrin repeat 299 367 404 1029
6: PF12796 Ank_2 Ankyrin repeats (3 copies) 255 321 363 739
The summary table also includes a count
attribute with the number of unique genes, transcripts and proteins with a Pfam annotation, as well as the total number of annotations. In this example, there are 19418 unique genes with a Pfam annotation (35% of all genes) and 29952 total annotations to genes (since genes may have more than one Pfam annotation).
attr(x2, "count")
unique annotations
Pfam 3278 46040
genes 19418 29952
transcripts 23700 37093
proteins 25504 38144
Load the DataTables library to browse and search within the table using RStudio. The screenshot below shows the Pfam DataTable while searching for "photos" - the full table with all 3278 rows is available here.
library(DT)
z <- data.frame(x2)
z$pfam <- paste0('<a href="http://pfam.xfam.org/family/', z$pfam, '">', z$pfam, '</a>')
datatable(z , escape=1, options = list( pageLength = 25 ) )
The trinotateR
packages includes similar functions to split BLAST, GO, and eggnog columns. Additional functions to plot eggnogs and map the UniRef90 hits in the TrEMBL_Top_BLASTP_hit column to cluster names, KEGG pathways and any other UniProt database cross-reference will be described on the wiki pages.