Skip to content

Latest commit

 

History

History
32 lines (26 loc) · 7.57 KB

Description_of_MeTAfisher.md

File metadata and controls

32 lines (26 loc) · 7.57 KB

Identification of TA systems

Process overview

MeTAfisher was developed with the aim of retrieving and identifying type II TA systems in a metagenomic context. It uses specific genomic features of TA systems of type II in order to discriminate them. The design of the algorithm was inspired by RASTA-Bacteria web based tool approach (SevinandBarloy-Hubler2007) which aims also to identify TA loci in prokaryotic genomes. However, the approach has been adapted to the nature of the data and with the new information collected on TA systems in the last years.

Genomic features used for discriminating TA systems

TA systems of Type II have specific features that can be used to identified them. They consist of at least two genes forming an operon (Yamaguchi, Y., (2011). The spacer rarely extend 150 nucleotides, and the two genes may also overlap until 20 nucleotides (GRAPH of summary model). The majority of toxin and antitoxin proteins are relatively small from 30 to 300 amino acids long (Figure 3 and Figure 4). Finally the third feature that helps to identify TA systems, is the presence of conserved functional domains specific to them. The detection of TA systems is therefore based on these three features: length, relative distance between two neighbor genes located on the same strand and the presence of conserved domains specific to TA systems.

Conserved domain verification

The algorithm starts to search TA systems in the set of predicted genes of the metagenome (Figure 2 ). As TA system genes have often a small size and may overlap, sometimes regular gene prediction may overlook a downstream ORF, that will be integrated in a longer gene. However predicted genes are the most plausible coding sequences and therefore are a solid base to start the analysis. To search for conserved TAdomains in the set of predicted genes,the hmmsearch programp art of the HMMER tool was used (Mistryetal.2013). It uses a method called profile hidden Markov models (profile HMMs) using probabilistic models. Hmmsearch searches a query sequence against a profile database of conserved domains. The tool RASTA-Bacteria did a compilation of all known conserved TA systems until 2007. To build an updated conserved TA domains database, I have added some newly discovered domains (reported in Makarova, Wolf, and Koonin 2009 and Sberro et al. 2013). The database is composed of 81 profile domains originating from Pfam (version 31.0 Finn et al. 2014) and EggNOG (version 4.5.1 Huerta-Cepas et al. 2016) database. This database aims to be the most complete and representative of TA systems. The program identifies with hmmsearch, conserved domains (E-value < 0.5) of the database among the predicted genes of the metagenomes. It allows multiple domain hits per sequence.

Gene length checking

Default case

By default when the optional argument --Resize is not provided, the genes with identified conserved domains go through a simple size checking step. If their length do not fit the length threshold (by default from 30aa to 500aa), they are simply remove of the analysis.

With the argument --Resize:

When the optional argument --Resize is provided, every predicted gene with at least one hit go through a size checking step. The program retrieves the nucleotide sequence of each gene and searches all possible start in the sequence giving a gene with a fiting length and with at least one intact domains. In this way gene with length bigger than the maximal threshold can be rescued as the program will consider other start positions of its sequence. The possible start positions are stored and used by the program later to determine adjacent genes and to calculate the best score. (add graph maybe?? )

Gene ‘pair organization’ checking

The remaining list of genes goes through a clustering process. The program determines if a gene is adjacent to another one by taking into account the distance threshold (by default -100 nucleotide to 150 nucleotides). If two genes with conserved domains are close enough to fulfill the threshold, they form a group. In the case where the optional argument --Resize is given, each genes may have more than one possibles start positions. This possible start positions are taken into account in the ‘pair organization’ checking step. For instance two genes are following but the second gene overlap too much with the first one, the program checks if the second gene has possible start that could fit and then allow the groupping.

###Rescuing lonely genes: optional argument --Rescue When this option is provied by the user, a rescue step is applied on lonely gene (gene harbouring a domain but with no adjacent gene). At this step, genes harboring conserved domains belong to two different categories: i) genes associated to one or more genes and forming groups and ii) lonely genes which do not have any close neighbor. The annotation of the Genome or Metagenome is not super trustable, the set of predicted genes on which hmmsearch has been processed, represent only the most likely genes in the contigs with TA loci, but as mentioned above those genes are small and thus may have been potentially missed by the initial prediction step. Consequently, to be sure that the lonely genes with a conserved domain are not a real TA system in which the associated gene has been missed by the prediction step, the algorithm retrieves potential adjacent genes and applies the domain search on them. To achieve that, all open reading frames (ORFs) adjacent to a lonely gene are found by the algorithm. Only the ORF with a proper length that fit within the thresholds are considered (if the ORF is too large, it is then resized by choosing an alternative start codon). To be selected the ORF has to start or end within the distance thresholds of the surrounding ends of a lonely gene. The ORF that meets this criterium undergoes the hmmsearch step to further identify potential conserved TA domains (Figure 2 ). The result of hmmsearch on these ORF is merged with the first result of hmmsearch on the predicted genes. All genes with at least one hit are then submited to a gene ‘pair organization’ checking process. When two or more genes are close (according to the distance threshold),they form a pair of genes and thus they are grouped and are considered as plausible TA systems. Every ORF that gets a hit through the hmmsearch step are naturally grouped with the adjacent gene that was previously alone.

###Scoring method The scoring method is based on the three genetic feature used to identify TA systems. The length of all the toxin and antitoxin from the database TADB have been computed and blballablabalb In the same way the distance between have been computed and blballablabalb The score of the hit by hmmsearch is used. Finally the score of the distance, of the length of the 2 genes and of the best domain of the 2 genes are used to build a final score.