MeTAfisher was developed with the aim of retrieving and identifying type II TA systems in a metagenomic context. It uses specific genomic features of TA systems of type II in order to discriminate them. The design of the algorithm was inspired by RASTA-Bacteria web based tool approach (SevinandBarloy-Hubler2007) which aims also to identify TA loci in prokaryotic genomes. However, the approach has been adapted to the nature of the data and with the new information collected on TA systems in the last years.
TA systems of Type II have specific features that can be used to identified them. They consist of at least two genes forming an operon (Yamaguchi, Y., (2011). The spacer rarely extend 150 nucleotides, and the two genes may also overlap until 20 nucleotides (GRAPH of summary model). The majority of toxin and antitoxin proteins are relatively small from 30 to 300 amino acids long (Figure 3 and Figure 4). Finally the third feature that helps to identify TA systems, is the presence of conserved functional domains specific to them. The detection of TA systems is therefore based on these three features: length, relative distance between two neighbor genes located on the same strand and the presence of conserved domains specific to TA systems.
The algorithm starts to search TA systems in the set of predicted genes of the metagenome (Figure 2 ). As TA system genes have often a small size and may overlap, sometimes regular gene prediction may overlook a downstream ORF, that will be integrated in a longer gene. However predicted genes are the most plausible coding sequences and therefore are a solid base to start the analysis. To search for conserved TAdomains in the set of predicted genes,the hmmsearch programp art of the HMMER tool was used (Mistryetal.2013). It uses a method called profile hidden Markov models (profile HMMs) using probabilistic models. Hmmsearch searches a query sequence against a profile database of conserved domains. The tool RASTA-Bacteria did a compilation of all known conserved TA systems until 2007. To build an updated conserved TA domains database, I have added some newly discovered domains (reported in Makarova, Wolf, and Koonin 2009 and Sberro et al. 2013). The database is composed of 81 profile domains originating from Pfam (version 31.0 Finn et al. 2014) and EggNOG (version 4.5.1 Huerta-Cepas et al. 2016) database. This database aims to be the most complete and representative of TA systems. The program identifies with hmmsearch, conserved domains (E-value < 0.5) of the database among the predicted genes of the metagenomes. It allows multiple domain hits per sequence.
By default when the optional argument --Resize is not provided, the genes with identified conserved domains go through a simple size checking step. If their length does not fit the length threshold (by default from 30aa to 500aa), they are simply remove of the analysis.
When the optional argument --Resize is provided, every predicted gene with at least one hit go through a size checking step. The program retrieves the nucleotide sequence of each gene and searches all possible start in the sequence giving a gene with a fiting length and with at least one intact domains. In this way gene with length bigger than the maximal threshold can be rescued as the program will consider other start positions of its sequence. (add graph maybe?? )
The remaining list of genes goes through a clustering process. The program determines if a gene is adjacent to another one by taking into account the distance threshold (by default -100 nucleotide to 150 nucleotides). If two genes with conserved domains are close enough to fulfill the threshold, they form a group. In the case where a gene is close with two genes (or more) then this gene and its adjacent genes form a single group. Consequently, groups can span more than 2 genes even if this situation does not happen frequently.
###Rescuing lonely genes : optional argument --Rescue When this option is provied by the user, a rescue step is applied on lonely gene (gene harbouring a domain but with no adjacent gene). At this step, genes harboring conserved domains belong to two different categories: i) genes associated to one or more genes and forming groups and ii) lonely genes which do not have any close neighbor. The annotation of the Genome or Metagenome is not super trustable, the set of predicted genes on which hmmsearch has been processed, represent only the most likely genes in the contigs with TA loci, but as mentioned above those genes are small and thus may have been potentially missed by the initial prediction step. Consequently, to be sure that the lonely genes with a conserved domain are not a real TA system in which the associated gene has been missed by the prediction step, the algorithm retrieves potential adjacent genes and applies the domain search on them. To achieve that, all open reading frames (ORFs) adjacent to a lonely gene are found by the algorithm. Only the ORF with a proper length that fit within the thresholds are considered (if the ORF is too large, it is then resized by choosing an alternative start codon). To be selected the ORF has to start or end within the distance thresholds of the surrounding ends of a lonely gene. The ORF that meets this criterium undergoes the hmmsearch step to further identify potential conserved TA domains (Figure 2 ). The result of hmmsearch on these ORF is merged with the first result of hmmsearch on the predicted genes. All genes with at least one hit are then submited to a gene ‘pair organization’ checking process. When two or more genes are close (according to the distance threshold),they form a pair of genes and thus they are grouped and are considered as plausible TA systems. Every ORF that gets a hit through the hmmsearch step are naturally grouped with the adjacent gene that was previously alone.