Skip to content

Commit

Permalink
manual writing
Browse files Browse the repository at this point in the history
  • Loading branch information
JeanMainguy committed Oct 27, 2017
1 parent 370e99a commit 6eb7a08
Show file tree
Hide file tree
Showing 2 changed files with 60 additions and 10 deletions.
22 changes: 22 additions & 0 deletions Description_of_MeTAfisher.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Identification of TA systems
## Process overview
MeTAfisher was developed with the aim of retrieving and identifying type II TA systems in a metagenomic context. It uses specific genomic features of TA systems of type II in order to discriminate them. The design of the algorithm was inspired by RASTA-Bacteria web based tool approach (SevinandBarloy-Hubler2007) which aims also to identify TA loci in prokaryotic genomes. However, the approach has been adapted to the nature of the data and with the new information collected on TA systems in the last years.

## Genomic features used for discriminating TA systems
TA systems of Type II have specific features that can be used to identified them. They consist of at least two genes forming an operon (Yamaguchi, Y., (2011). The spacer rarely extend 150 nucleotides, and the two genes may also overlap until 20 nucleotides (GRAPH of summary model). The majority of toxin and antitoxin proteins are relatively small from 30 to 300 amino acids long (Figure 3 and Figure 4). Finally the third feature that helps to identify TA systems, is the presence of conserved functional domains specific to them. The detection of TA systems is therefore based on these three features: length, relative distance between two neighbor genes located on the same strand and the presence of conserved domains specific to TA systems.

### Conserved domain verification
The algorithm starts to search TA systems in the set of predicted genes of the metagenome (Figure 2 ). As TA system genes have often a small size and may overlap, sometimes regular gene prediction may overlook a downstream ORF, that will be integrated in a longer gene. However predicted genes are the most plausible coding sequences and therefore are a solid base to start the analysis. To search for conserved TAdomains in the set of predicted genes,the hmmsearch programp art of the HMMER tool was used (Mistryetal.2013). It uses a method called profile hidden Markov models (profile HMMs) using probabilistic models. Hmmsearch searches a query sequence against a profile database of conserved domains. The tool RASTA-Bacteria did a compilation of all known conserved TA systems until 2007. To build an updated conserved TA domains database, I have added some newly discovered domains (reported in Makarova, Wolf, and Koonin 2009 and Sberro et al. 2013). The database is composed of 81 profile domains originating from Pfam (version 31.0 Finn et al. 2014) and EggNOG (version 4.5.1 Huerta-Cepas et al. 2016) database. This database aims to be the most complete and representative of TA systems. The program identifies with hmmsearch, conserved domains (E-value < 0.5) of the database among the predicted genes of the metagenomes. It allows multiple domain hits per sequence.
### Gene length checking
#### Default case
By default when the optional argument --Resize is not provided, the genes with identified conserved domains go through a simple size checking step. If their length does not fit the length threshold (by default from 30aa to 500aa), they are simply remove of the analysis.
#### With the argument --Resize:
When the optional argument --Resize is provided, every predicted gene with at least one hit go through a size checking step. The program retrieves the nucleotide sequence of each gene and searches all possible start in the sequence giving a gene with a fiting length and with at least one intact domains. In this way gene with length bigger than the maximal threshold can be rescued as the program will consider other start positions of its sequence. (add graph maybe?? )

<!-- When a gene does not fit the length thresholds, the program tries to resize it. It finds all possible start codons in the sequence and chooses the first one that makes the sequence length fit the threshold. Then the hits of the resized gene are analyzed to check the integrity of the conserved domains. If a hit has lost more than 5% of its length due to the new start chosen, this hit is discarded, and if all the hits of a gene are discarded then the gene is not considered for the rest of the analyses. -->
### Gene ‘pair organization’ checking
The remaining list of genes goes through a clustering process. The program determines if a gene is adjacent to another one by taking into account the distance threshold (by default -100 nucleotide to 150 nucleotides). If two genes with conserved domains are close enough to fulfill the threshold, they form a group. In the case where a gene is close with two genes (or more) then this gene and its adjacent genes form a single group. Consequently, groups can span more than 2 genes even if this situation does not happen frequently.

###Rescuing lonely genes : optional argument --Rescue
When this option is provied by the user, a rescue step is applied on lonely gene (gene harbouring a domain but with no adjacent gene).
At this step, genes harboring conserved domains belong to two different categories: i) genes associated to one or more genes and forming groups and ii) lonely genes which do not have any close neighbor. The annotation of the Genome or Metagenome is not super trustable, the set of predicted genes on which hmmsearch has been processed, represent only the most likely genes in the contigs with TA loci, but as mentioned above those genes are small and thus may have been potentially missed by the initial prediction step. Consequently, to be sure that the lonely genes with a conserved domain are not a real TA system in which the associated gene has been missed by the prediction step, the algorithm retrieves potential adjacent genes and applies the domain search on them. To achieve that, all open reading frames (ORFs) adjacent to a lonely gene are found by the algorithm. Only the ORF with a proper length that fit within the thresholds are considered (if the ORF is too large, it is then resized by choosing an alternative start codon). To be selected the ORF has to start or end within the distance thresholds of the surrounding ends of a lonely gene. The ORF that meets this criterium undergoes the hmmsearch step to further identify potential conserved TA domains (Figure 2 ). The result of hmmsearch on these ORF is merged with the first result of hmmsearch on the predicted genes. All genes with at least one hit are then submited to a gene ‘pair organization’ checking process. When two or more genes are close (according to the distance threshold),they form a pair of genes and thus they are grouped and are considered as plausible TA systems. Every ORF that gets a hit through the hmmsearch step are naturally grouped with the adjacent gene that was previously alone.
48 changes: 38 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,36 +1,55 @@
# MeTAfisher
Program to retrieve toxin antitoxin (TA) systems in metagenomes
Program to retrieve toxin antitoxin (TA) systems in metagenomes

[comment]: <> (General intro to the program)

## Program overview
MeTAfisher is written in Python 2.7. It is made of four files which need to be in the same folder.
* `main_MetaF.py` : It initialies the variable and launchs the different functions of the program inside a loop iterating the different contigs. This file takes argument to be able to correctly initilyse the variable
* `Function_MetaF.py` : All the general function of the program
* `Object_MetaF.py` : Classes and methods.
* `Orf_MetaF.py` : Fucntions specific to ORF process, used when the option `--Rescue` is called.


`main_MetaF.py` needs argument to run the program :
blballablabalb


## Dependence files
MeTAfisher requires specific files to work. These files need to be place in the same folder and the path to this folder given as an argument to the program.
The dependeces folder has to have:

* `distance.csv` and `length_TA.csv`: used to build the score
* `ALL_plus_MET_curatted` : the database of TA HMM profile.
* `domaines_METAfisher.csv` : a table with information about the profiles of the data base (i.e TA family, description, source..)

## Files format requierement
Metafisher take 4 different files as an input. The files need be in the same folder and to have the same name with only a specif extension for each.
### Fasta files
Metafisher requires 3 files in the fasta files format:
Metafisher requires 3 files in the fasta files format:
1. The chromosome or scaffold sequences with the extension **.fasta**
2. The protein sequences in amino acide with the extension **.faa**
3. The protein sequences in nucleotide with the extension **.fna**

For the chromosome or scaffold sequences the headers of the sequence need to start with the name of the chromosome or scaffold follow by a space or a new line.


```>CP000697.1 Acidiphilium cryptum JF-5, complete genome ``` here the name of the chromosome is `CP000697.1` and it is followed by a white space, the other information are ignored by the program. ```>ICM0007MP0313_1000001``` here the name of the metagenome is `ICM0007MP0313_1000001` and followed by a new line
`>CP000697.1 Acidiphilium cryptum JF-5, complete genome` here the name of the chromosome is `CP000697.1` and it is followed by a white space, the other information are ignored by the program. ```>ICM0007MP0313_1000001``` here the name of the metagenome is `ICM0007MP0313_1000001` and followed by a new line




For the protein sequences in amino acid and nucleotide (respectively .faa and .fna) there are currently two different possibile format for the header.


One typically found on NCBI:

```>lcl|CP000697.1_prot_Acry_0001_1 [gene=Acry_0001] [protein.....``` with the ```>lcl|``` followed by the name of the chromosome or scaffold and the id name of the gene (here ```_prot_Acry_0001_1```), in this id name there is the protein number 0001 wich is used by the program to identify the genes.
`>lcl|CP000697.1_prot_Acry_0001_1 [gene=Acry_0001] [protein.....` with the ```>lcl|``` followed by the name of the chromosome or scaffold and the id name of the gene (here ```_prot_Acry_0001_1```), in this id name there is the protein number 0001 wich is used by the program to identify the genes.



An other one simpler:

```>ICM0007MP0313_1000001|5 ``` with the name of the sequence analysed followed by an | and the protein number used by the program to identify the genes.
`>ICM0007MP0313_1000001|5` with the name of the sequence analysed followed by an | and the protein number used by the program to identify the genes.

### GFF file
> The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.
Expand All @@ -45,15 +64,24 @@ The columns used by MeTAfisher:
9. **attribute** - A semicolon-separated list of tag-value pairs, providing additional information about each feature.


The program looks only at the line with feature equal to `CDS`.
The program looks only at the line with feature equal to `CDS`.
MeTAfisher needs to retrieve the protein number to be able to match the protein hited by hmmsearch with the proper gff line. To do so the attribute section needs to follow to possible formats:
1. One with `ID=< chromosome or scaffold>|<protein number>`.
1. One with `ID=< chromosome or scaffold>|<protein number>`.
2. One other possible format is with ????

Example of correct gff line : ```ICM0007MP0313_1000310 GPF CDS 13787 14128 . - 0 ID=ICM0007MP0313_1000310|19;partial=00;sta.....```

CP000569.1 Genbank gene 13754 15256 . + . ID=gene12;Name=murE;gbkey=Gene;gene=murE;gene_biotype=protein_coding;locus_tag=APL_0013
CP000569.1 Genbank gene 13754 15256 . + . ID=gene12;Name=murE;gbkey=Gene;gene=murE;gene_biotype=protein_coding;locus_tag=APL_0013

## Output file
Different possible output files exist.
* Short output :
* Human readable file :
* ...

The program also provide a file gathering quantitative information about the analysis of the contigs/chromosomes analysed.



REF
REF
http://www.ensembl.org/info/website/upload/gff.html

0 comments on commit 6eb7a08

Please sign in to comment.