continuation

JeanMainguy · Nov 11, 2017 · 5ec1b3d
1 parent f78f7b0
commit 5ec1b3d
Showing 1 changed file with 13 additions and 4 deletions.
diff --git a/Description_of_MeTAfisher.md b/Description_of_MeTAfisher.md
@@ -9,14 +9,23 @@ TA systems of Type II have speciﬁc features that can be used to identiﬁed th
 The algorithm starts by searching for TA systems in the set of predicted genes of the metagenome (Figure 2). As TA system genes are often small and may overlap, regular gene prediction may sometimes overlook a downstream ORF, which then gets integrated into a longer gene. However, predicted genes are the most plausible coding sequences and are therefore a solid base to start the analysis. To search for conserved TA domains in the set of predicted genes, the hmmsearch program, part of the HMMER tool, was used (Mistry et al. 2013). It relies on profile hidden Markov models (profile HMMs), which are probabilistic models of conserved domains. Hmmsearch searches one or more profiles against a sequence database; here, the profiles of the conserved TA domains are searched against the predicted genes. The tool RASTA-Bacteria compiled all conserved TA systems known up to 2007. To build an updated database of conserved TA domains, I have added some newly discovered domains (reported in Makarova, Wolf, and Koonin 2009 and Sberro et al. 2013). The database is composed of 81 profile domains originating from the Pfam (version 31.0, Finn et al. 2014) and EggNOG (version 4.5.1, Huerta-Cepas et al. 2016) databases. This database aims to be the most complete and representative of TA systems. The program uses hmmsearch to identify conserved domains of the database (E-value < 0.5) among the predicted genes of the metagenomes. It allows multiple domain hits per sequence.
 ### Gene length checking
 #### Default case
-By default when the optional argument --Resize is not provided, the genes with identified conserved domains go through a simple size checking step. If their length does not fit the length threshold (by default from 30aa to 500aa), they are simply remove of the analysis.
+By default, when the optional argument --Resize is not provided, genes with identified conserved domains go through a simple size check. If their length does not fit the length thresholds (by default from 30 aa to 500 aa), they are simply removed from the analysis.
 #### With the argument --Resize:
-When the optional argument --Resize is provided, every predicted gene with at least one hit go through a size checking step. The program retrieves the nucleotide sequence of each gene and searches all possible start in the sequence giving a gene with a fiting length and with at least one intact domains. In this way gene with length bigger than the maximal threshold can be rescued as the program will consider other start positions of its sequence. (add graph maybe?? )
+When the optional argument --Resize is provided, every predicted gene with at least one hit goes through a size check. The program retrieves the nucleotide sequence of each gene and searches for all possible start positions in the sequence that give a gene with a fitting length and at least one intact domain. In this way, a gene longer than the maximal threshold can be rescued, as the program considers other start positions in its sequence. The possible start positions are stored and used later by the program to determine adjacent genes and to calculate the best score. (add graph maybe??)
 
 <!-- When a gene does not fit the length thresholds, the program tries to resize it. It finds all possible start codons in the sequence and chooses the first one that makes the sequence length fit the threshold. Then the hits of the resized gene are analyzed to check the integrity of the conserved domains. If a hit has lost more than 5% of its length due to the new start chosen, this hit is discarded, and if all the hits of a gene are discarded then the gene is not considered for the rest of the analyses. -->
 ### Gene ‘pair organization’ checking
-The remaining list of genes goes through a clustering process. The program determines if a gene is adjacent to another one by taking into account the distance threshold (by default -100 nucleotide to 150 nucleotides). If two genes with conserved domains are close enough to fulﬁll the threshold, they form a group. In the case where a gene is close with two genes (or more) then this gene and its adjacent genes form a single group. Consequently, groups can span more than 2 genes even if this situation does not happen frequently.
+The remaining list of genes goes through a clustering process. The program determines whether a gene is adjacent to another one by taking into account the distance thresholds (by default from -100 nucleotides to 150 nucleotides). If two genes with conserved domains are close enough to fulfill the thresholds, they form a group. When the optional argument --Resize is given, each gene may have more than one possible start position. These possible start positions are taken into account in the ‘pair organization’ checking step. For instance, if two genes follow each other but the second gene overlaps too much with the first one, the program checks whether the second gene has a possible start position that fits the thresholds, and if so allows the grouping.
+<!-- NOT COMPLETELY TRUE, TO BE IMPROVED IN THE PROGRAM
+ In the case where a gene is close to two genes (or more), this gene and its adjacent genes form a single group. Consequently, groups can span more than 2 genes, even if this situation does not happen frequently. -->
 
-###Rescuing lonely genes : optional argument --Rescue
+### Rescuing lonely genes: optional argument --Rescue
 When this option is provided by the user, a rescue step is applied to lonely genes (genes harboring a domain but with no adjacent gene).
 At this step, genes harboring conserved domains belong to two different categories: i) genes associated with one or more genes and forming groups, and ii) lonely genes, which do not have any close neighbor. The annotation of the genome or metagenome is not fully reliable: the set of predicted genes on which hmmsearch has been run represents only the most likely genes in the contigs with TA loci, but as mentioned above, TA genes are small and may therefore have been missed by the initial prediction step. Consequently, to make sure that a lonely gene with a conserved domain is not part of a real TA system whose associated gene has been missed by the prediction step, the algorithm retrieves potential adjacent genes and applies the domain search to them. To achieve that, the algorithm finds all open reading frames (ORFs) adjacent to a lonely gene. Only ORFs with a length that fits within the thresholds are considered (if an ORF is too large, it is resized by choosing an alternative start codon). To be selected, an ORF has to start or end within the distance thresholds of the surrounding ends of a lonely gene. ORFs that meet this criterion undergo the hmmsearch step to identify potential conserved TA domains (Figure 2). The result of hmmsearch on these ORFs is merged with the first result of hmmsearch on the predicted genes. All genes with at least one hit are then submitted to the gene ‘pair organization’ checking process. When two or more genes are close (according to the distance thresholds), they are grouped and considered as plausible TA systems. Every ORF that gets a hit through the hmmsearch step is naturally grouped with the adjacent gene that was previously alone.
+
+### Scoring method
+The scoring method is based on the three genetic features used to identify TA systems.
+The lengths of all the toxins and antitoxins from the TADB database have been computed and blballablabalb
+In the same way, the distances between the toxin and antitoxin genes have been computed and blballablabalb
+The score of the hmmsearch hit is used.
+Finally, the distance score, the length scores of the 2 genes and the best domain scores of the 2 genes are combined to build a final score.
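
As an illustration of the default length check and the ‘pair organization’ check described in the diff above, here is a minimal Python sketch. It is not the MeTAfisher implementation: the `Gene` record, the constant names and the functions `fits_length`, `distance`, `are_adjacent` and `pair_genes` are all hypothetical, and only the default thresholds (30 aa to 500 aa, -100 nt to 150 nt) come from the description. The sketch also assumes that all genes lie on the same contig, ignores strand, assumes coordinates include the stop codon, and only compares consecutive genes after sorting by start position.

```python
from dataclasses import dataclass

# Default thresholds quoted in the description; the constant names are hypothetical.
MIN_LEN_AA, MAX_LEN_AA = 30, 500
MIN_DIST_NT, MAX_DIST_NT = -100, 150


@dataclass
class Gene:
    """Hypothetical minimal gene record (1-based, inclusive coordinates)."""
    name: str
    start: int
    end: int

    @property
    def length_aa(self) -> int:
        # Assumption: the coordinates include the stop codon, so subtract it.
        return (self.end - self.start + 1) // 3 - 1


def fits_length(gene: Gene) -> bool:
    """Default case (no --Resize): keep genes within the length thresholds."""
    return MIN_LEN_AA <= gene.length_aa <= MAX_LEN_AA


def distance(upstream: Gene, downstream: Gene) -> int:
    """Gap in nucleotides between two genes; negative when they overlap."""
    return downstream.start - upstream.end - 1


def are_adjacent(a: Gene, b: Gene) -> bool:
    """True when the gap between the two genes fits the distance thresholds."""
    first, second = sorted((a, b), key=lambda g: g.start)
    return MIN_DIST_NT <= distance(first, second) <= MAX_DIST_NT


def pair_genes(genes: list[Gene]) -> list[tuple[str, str]]:
    """Drop genes failing the length check, then pair consecutive adjacent genes."""
    kept = sorted((g for g in genes if fits_length(g)), key=lambda g: g.start)
    return [(a.name, b.name) for a, b in zip(kept, kept[1:]) if are_adjacent(a, b)]
```

For example, with genes at 1..300, 290..589 and 1000..1299, `pair_genes` pairs the first two (they overlap by 11 nt, which is within the -100 nt limit) and leaves the third alone, since its 410 nt gap exceeds the 150 nt limit. Handling `--Resize` would additionally require trying each stored alternative start position of the downstream gene in `are_adjacent`.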