diff --git a/tools/proteinortho/proteinortho.xml b/tools/proteinortho/proteinortho.xml index 52ada4557a5..64ebdac86bc 100644 --- a/tools/proteinortho/proteinortho.xml +++ b/tools/proteinortho/proteinortho.xml @@ -99,13 +99,13 @@ 2> >(sed -E "s/.\[([0-9]{1,2}(;[0-9]{1,2})?)?[mGK]//g" 1>&2) #if $more_options.selfblast: && - mv result.blast-graph_clean result.blast-graph; + mv result.blast-graph_clean result.blast-graph #end if #if $synteny.synteny_options == "specified": && mv result.poff-graph result.proteinortho-graph && mv result.poff.tsv result.proteinortho.tsv && - mv result.poff.html result.proteinortho.html ; + mv result.poff.html result.proteinortho.html #end if ]]> @@ -115,6 +115,8 @@ + + @@ -126,7 +128,7 @@ - + @@ -137,7 +139,7 @@ - + @@ -177,7 +179,7 @@ - + @@ -187,6 +189,16 @@ + + + + + + + + + + @@ -251,12 +263,12 @@ - + - - + + - + @@ -285,8 +297,8 @@ Proteinortho is a tool to detect orthologous proteins/genes within different spe * **(ii) Cluster the RBH** - | Using two clustering algorithms, edges are removed that weakly connect two connected components to reduce false positive hits. - | The resulting connected components are outputted in orthology-groups / -pairs + | A spectral clustering algorithm is used to remove weak connections, reducing false positives. + | The connected components from this process are output as orthology groups or pairs. ---- @@ -322,31 +334,58 @@ Proteinortho is a tool to detect orthologous proteins/genes within different spe | The result of the (ii) step, the clustered reciprocal best hit graph or the orthology groups. | Every line corresponds to an orthology group. - | The first 3 columns characterize the general properties of that group: number of proteins, species, and algebraic connectivity. The higher the algebraic connectivity the more edges are there and the better the group is connected to itself in general. 
+ | The first 3 columns characterize the general properties of that group: number of proteins, species, and algebraic connectivity. The higher the algebraic connectivity, the more edges there are and the better connected the group is. | Then a column for each species follows containing the proteins of these species. | If a species contributes with more than one protein to a group of orthologs, then they are ordered by descending connectivity. | The '*' represents that this species does not contribute to the group. .. csv-table:: - Species,Genes,alg.-conn.,ecoli.faa,human.faa,snail.faa,wale.faa,ebola.faa + Species,Genes,alg.-conn.,ecoli.faa,human.faa,snail.faa,wale.faa,mouse.faa 5,5,0.715,C_10,C_10;test,E_10,L_10,M_10 4,6,0.115,*,C_12,E_315,L_313,M_313 4,5,0.167,*,C_63,E_19,L_19,M_19 4,4,0.816,*,C_64,E_18,L_18,M_18 +---- + + | The first group comprises 5 proteins from 5 species: 'C_10' of ecoli.faa, 'C_10;test' of human.faa, 'E_10' of snail.faa, 'L_10' of wale.faa, and 'M_10' of mouse.faa. + | The alg.-conn. (algebraic connectivity) of 0.715 indicates the connectivity of this group: the higher the value, the more edges connect these 5 proteins (at most 10 edges are possible, and at least 4 are needed). + | The second group contains 6 proteins distributed over 4 species. The star indicates the species where no protein was found (in this case ecoli.faa). + +.. csv-table:: + + seqidA,seqidB,evalue_ab,bitscore_ab,evalue_ba,bitscore_ba + # ecoli.faa,human.faa + # 1.91e-112,357.5,1.825e-113,360 + L_10,C_10;test,4.32e-151,447,4.30e-151,446 + L_11,C_11,1.17e-68,209,3.00e-69,210 + L_14,C_14,3.64e-139,422,1.19e-142,431 + L_15,C_15,3.51e-100,303,2.12e-102,308 + L_16,C_16,3.75e-49,157,7.06e-50,159 + L_17,C_17,2.96e-195,578,5.50e-196,579 + ---- * **orthology-pairs** - | The same as orthology-groups but every edge is printed one-by-one instead of the whole group. 
 The output is formatted the same as the RBH graph: + | Similar to orthology groups, but each edge is printed individually. + | The output is formatted the same as the RBH graph. + | For example, extracting all hits of the second group of the example orthology-group output ('4,6,0.115,*,C_12,E_315,L_313,M_313') using grep (-E, regular expression="(C_12|E_315|L_313|M_313).*(C_12|E_315|L_313|M_313)", input file=proteinortho-graph) would reveal all edges of this group: .. csv-table:: - - seqidA,seqidB,evalue_ab,bitscore_ab,evalue_ba,bitscore_ba + + seqidA,seqidB,evalue_ab,bitscore_ab,evalue_ba,bitscore_ba + M_313,C_12,1.18e-115,407,6.12e-116,407 + C_12,E_315,4.50e-127,445,4.09e-127,445 + L_313,M_313,0.00e+00,1368,0.00e+00,1368 + L_313,C_12,3.76e-114,402,1.94e-114,402 ---- + | L_313 and M_313 in particular are very similar, probably identical. + | The group contains 4 edges out of the 6 possible edges for a group of 4 proteins. The missing edges are M_313-E_315 as well as L_313-E_315. This means that E_315 is only connected to the other 3 proteins via C_12 and thus could be considered a weak link in the group. + **Proteinortho-Tools for downstream analysis** * `proteinortho grab proteins` : find gene(s)/protein(s) in a given fasta file and retrieve their sequence(s). You can also use a orthology-groups file or a subset (e.g. filter by Species>10). 
@@ -354,9 +393,11 @@ Proteinortho is a tool to detect orthologous proteins/genes within different spe More information can be found on github https://gitlab.com/paulklemm_PHD/proteinortho -**Citations:** - ]]> - + + 10.3389/fbinf.2023.1322477 + 10.1186/1471-2105-12-124 + 10.1371/journal.pone.0105015 + diff --git a/tools/proteinortho/proteinortho_grab_proteins.xml b/tools/proteinortho/proteinortho_grab_proteins.xml index eea9407a682..8eacae26fe1 100644 --- a/tools/proteinortho/proteinortho_grab_proteins.xml +++ b/tools/proteinortho/proteinortho_grab_proteins.xml @@ -112,5 +112,9 @@ proteinortho_grab_proteins : find gene(s)/protein(s) in a given fasta file and r More information can be found on github https://gitlab.com/paulklemm_PHD/proteinortho ]]> - + + 10.3389/fbinf.2023.1322477 + 10.1186/1471-2105-12-124 + 10.1371/journal.pone.0105015 + diff --git a/tools/proteinortho/proteinortho_macros.xml b/tools/proteinortho/proteinortho_macros.xml index c738b029b5e..38a9f110f73 100644 --- a/tools/proteinortho/proteinortho_macros.xml +++ b/tools/proteinortho/proteinortho_macros.xml @@ -1,15 +1,8 @@ - 6.3.1 - 0 - 22.05 - - - 10.1186/1471-2105-12-124 - 10.1371/journal.pone.0105015 - 10.3389/fbinf.2023.1322477 - - + 6.3.4 + 0 + 22.05 proteinortho @@ -22,6 +15,7 @@ blast ucsc-blat last + mmseqs2 diff --git a/tools/proteinortho/proteinortho_summary.xml b/tools/proteinortho/proteinortho_summary.xml index 98c330c529c..e0da675b1e3 100644 --- a/tools/proteinortho/proteinortho_summary.xml +++ b/tools/proteinortho/proteinortho_summary.xml @@ -120,5 +120,9 @@ Or given 2 orthology-pairs from the same set of fasta files with different param More information can be found on github https://gitlab.com/paulklemm_PHD/proteinortho ]]> - + + 10.3389/fbinf.2023.1322477 + 10.1186/1471-2105-12-124 + 10.1371/journal.pone.0105015 +