GeneRax
If this page does not help you, please write to us on our google group: https://groups.google.com/g/generaxusers
GeneRax is a parallel tool for maximum likelihood based gene tree inference under gene duplication, transfer, and loss. It infers reconciled gene trees from:
- a rooted bifurcating species tree
- (per-family) multiple sequence alignments
- the mapping between gene taxa and species taxa
SpeciesRax is a feature of GeneRax that infers a rooted species tree (or reroots an existing rooted species tree) from:
- an optional starting species tree
- a set of unrooted gene trees OR a set of multiple sequence alignments.
- the mapping between gene taxa and species taxa.
This section describes all the steps implemented in our pipeline. Note that all the steps are optional. Which steps to run depends on your input data (MSAs, gene trees and/or species trees) and on the results you are interested in (reconciled gene trees and/or rooted species tree).
- gene tree inference from the MSAs
- rooted species tree inference from the gene trees, or just species tree rooting
- gene tree correction
- species tree branch support (from the gene trees with speciation-driven quartet score)
- gene tree / species tree reconciliation
- species tree branch lengths estimation
To use parallelization, please call GeneRax with mpiexec (mpiexec -np NUMBER_OF_CORES build/bin/generax [arguments]).
General commands:
Command | Comment |
---|---|
-h, --help | Print the help message. |
-f, --families <value> | Families file to describe per-family MSA, gene tree etc. (syntax). You should also have a quick look at the different ways to map genes to species here. |
-s, --species-tree {filepath, random, MiniNJ} | Starting species tree. If the species tree is given as a file path, the corresponding file should contain one rooted bifurcating tree in newick format. random and MiniNJ only make sense to generate a starting tree for running species tree inference. random generates a random rooted species tree. MiniNJ infers a starting species tree from the gene trees with our distance method MiniNJ (see SpeciesRax paper). Note that MiniNJ does not infer a relevant root. |
-r, --rec-model {UndatedDL, UndatedDTL} | The probabilistic model used to compute the reconciliation likelihood. UndatedDL accounts for duplications and losses. UndatedDTL also accounts for horizontal gene transfers. Default is UndatedDTL. |
-p, --prefix <value> | Output directory. This directory will be created. If the directory already exists and corresponds to a previous run, GeneRax will try to continue the analysis where it stopped. We do not recommend running GeneRax twice if the input data or input command changed. If you run species tree inference, this step will be run again from scratch. |
--seed <value> | Random seed. Default is 123. |
--skip-family-filtering | Skip all safety checks performed before running GeneRax, such as checking that gene-species mappings are well defined. |
--per-family-rates | Optimize the DTL rates individually for each gene family (by default, DTL rates are shared among all families). |
--per-species-rates | Optimize the DTL rates individually for each species branch. This option is incompatible with --per-family-rates. SpeciesRax does not support this option yet. |
--prune-species-tree | Activate pruned species tree mode. Recommended for species tree inference in presence of missing data. |
--mad-rooting | Weight the gene tree root likelihoods using MAD. We multiply each root likelihood by (m + 1.0)^-2, where m is the deviation. Higher deviations penalize the root position likelihood more. Available from GeneRax 2.1.1. |
--enforce-gene-tree-root | Force GeneRax to use the root of the input gene trees. This is compatible with gene tree correction, but subtrees are not allowed to cross the root. This option is available from v2.1.2. |
--dtl-rates-opt NONE | Disable DTL rates optimization. You can use --dup-rate val, --loss-rate val, and --transfer-rate val to set the rate values. We do not recommend using this option, unless you want to study the impact of rates on the outcome. |
--no-dup | Disable duplications. This option is not compatible with the UndatedDL model (it would only allow losses), but can be used with the UndatedDTL model. |
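The --mad-rooting weighting described in the table above can be illustrated with a few lines of Python. This is only a sketch of the formula (m + 1.0)^-2 as stated in the table, not GeneRax code; the function name mad_root_weight and the sample deviations are made up for the illustration.

```python
def mad_root_weight(deviation: float) -> float:
    """Factor applied to a root likelihood: (m + 1.0)^-2, m being the deviation."""
    return (deviation + 1.0) ** -2


# Hypothetical deviations, chosen only to show that the weight decreases
# as the deviation grows.
for m in (0.0, 0.5, 1.0, 3.0):
    print(f"deviation={m:.1f} weight={mad_root_weight(m):.4f}")
```

A deviation of 0 leaves the root likelihood unchanged (weight 1), and larger deviations shrink the weight toward 0.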
Gene tree correction commands:
Command | Comment |
---|---|
--strategy {EVAL, SPR, SKIP} | Search mode: EVAL does not optimize the tree topology, and just evaluates the likelihood, the DTL rates and the reconciliation of the starting gene trees. SPR performs a tree search (with SPR moves). SKIP skips the gene trees optimization and the joint likelihood optimization (for instance if you just want to run SpeciesRax). Default is SPR. |
--max-spr-radius <value> | The maximum radius used for SPR moves in the tree search. Increasing this number makes the search more exhaustive, but also slower. Default (and recommended) value is 5. |
Species tree inference (SpeciesRax method) commands:
Command | Comment |
---|---|
--si-strategy {SKIP, EVAL, REROOT, HYBRID} | Species tree inference strategy. Set to SKIP by default. EVAL evaluates the reconciliation likelihood without optimizing the species tree. REROOT infers the species tree root without optimizing its topology. HYBRID enables species tree inference. When enabling species tree inference, please carefully read the SpeciesRax wiki page, in particular the recommendations section. |
--si-spr-radius <value> | SPR radius used for the local SPR search when optimizing the species tree. Set to 1 by default. |
--si-small-root-radius <value> | Radius used to search for a better species tree root position along the tree search. Set to 3 by default. |
--si-big-root-radius <value> | Radius used to search for a better species tree root position at the end of the tree search. Set to 5 by default. |
--si-estimate-bl | Enable species tree branch lengths estimation in unit of substitutions per site (given that the gene trees have branch lengths in the same unit). |
--si-quartet-support | Enable the computation of species-driven quartet scores (QPIC, EQPIC, quartet frequencies) on each branch of the species tree from the gene trees. |
--si-eqpic-radius <value> | Radius used to compute the EQPIC score. Set to 3 by default. |
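As a rough illustration of how the options listed in the tables above fit together, the following Python sketch assembles a GeneRax command line. It only uses flags documented on this page; the helper name build_generax_command and the file names (families.txt, species_tree.newick, output) are hypothetical placeholders, and the binary path build/bin/generax is the one used in the mpiexec example.

```python
import shlex


def build_generax_command(families, species_tree, prefix, cores=1,
                          rec_model="UndatedDTL", strategy="SPR",
                          si_strategy="SKIP", binary="build/bin/generax"):
    """Return the argument list for a GeneRax run (hypothetical helper)."""
    cmd = []
    if cores > 1:
        # Parallel runs go through mpiexec, as described above.
        cmd += ["mpiexec", "-np", str(cores)]
    cmd += [
        binary,
        "--families", families,
        "--species-tree", species_tree,
        "--rec-model", rec_model,
        "--prefix", prefix,
        "--strategy", strategy,
        "--si-strategy", si_strategy,
    ]
    return cmd


# Placeholder file names; replace them with your own inputs.
cmd = build_generax_command("families.txt", "species_tree.newick",
                            "output", cores=4)
print(shlex.join(cmd))
```

The defaults mirror the documented ones (UndatedDTL, SPR, SKIP), so passing them explicitly is redundant but makes the chosen pipeline steps visible in the printed command.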
- The families file allows you to specify per-family parameters.
- In a families file, everything after a # will be ignored.
- The file should start with the tag [FAMILIES].
- A family block starts with - and the family name. A family block contains the per-family entries illustrated in the example below (starting_gene_tree, alignment, mapping, subst_model).
Please note that a valid family should contain at least three sequences. Also, GeneRax only supports taxon names that raxml-ng supports: in particular, taxon labels with spaces, tabs, newlines, commas, colons, semicolons and parentheses are invalid.
The mapping file is optional: when no gene-to-species mapping file is given, GeneRax infers the mappings from the taxa names (see this page).
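The two constraints above (at least three sequences, no forbidden characters in taxon labels) can be checked before running GeneRax. The sketch below is a hypothetical pre-check, not part of GeneRax; it only covers the characters explicitly listed above, and raxml-ng may reject other labels as well.

```python
# Characters listed above as invalid in taxon labels.
INVALID_CHARS = set(" \t\n,:;()")


def invalid_labels(labels):
    """Return the labels that contain at least one forbidden character."""
    return [label for label in labels
            if any(char in INVALID_CHARS for char in label)]


def is_valid_family(labels):
    """A family needs at least three sequences and only valid labels."""
    return len(labels) >= 3 and not invalid_labels(labels)


# Made-up labels for illustration.
labels = ["human_gene1", "mouse gene2", "chicken_gene3", "frog:gene4"]
print(invalid_labels(labels))
print(is_valid_family(labels))
```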
Example:
[FAMILIES] # this is a comment
- family_1
starting_gene_tree = raxml_tree_1.newick
alignment = alignment_1.fasta
mapping = mapping_file_1.link
subst_model = GTR+G
- family_2
alignment = alignment_2.fasta
subst_model = GTR+G
- family_3
starting_gene_tree = raxml_tree_3.newick
alignment = alignment_3.fasta
mapping = mapping_file_3.link
subst_model = GTR+G
The gene-species mapping can either be specified with mapping files or will be inferred from the gene labels if no mapping file is specified. More information here.
We provide two python scripts to help you generate the family file from your input data:
- this script generates a family file from a set of directories containing the alignments (optional), the starting trees (optional) and the mapping files (optional). The substitution model is given as input and will be the same for all families.
- this other script generates a family file from a ParGenes run. ParGenes is a tool that runs modeltesting (with ModelTestNG) and gene tree inference (with RAxML-NG) on thousands of gene families in parallel, and is recommended to generate the starting trees for GeneRax. The script takes as input the path to the ParGenes output directory and the path to the family file to generate. If modeltesting was used during the ParGenes execution, each family will be associated with its best-fit substitution model in the family file. However, if you need per-family mapping files, you need to adapt the script to add the paths to those mapping files.
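If neither script fits your data layout, a families file is simple enough to generate by hand. The following Python sketch is not one of the two scripts mentioned above; it is a minimal, hypothetical helper (format_families) that follows the syntax of the example families file shown earlier, with made-up file names.

```python
# Keys used in the example families file, in the same order.
KEYS = ("starting_gene_tree", "alignment", "mapping", "subst_model")


def format_families(families):
    """Build the text of a families file from {family_name: {key: value}}."""
    lines = ["[FAMILIES]"]
    for name, entries in families.items():
        lines.append(f"- {name}")
        for key in KEYS:
            # Entries that are not provided are simply left out.
            if key in entries:
                lines.append(f"{key} = {entries[key]}")
    return "\n".join(lines) + "\n"


# Made-up input paths for illustration.
families = {
    "family_1": {
        "starting_gene_tree": "raxml_tree_1.newick",
        "alignment": "alignment_1.fasta",
        "mapping": "mapping_file_1.link",
        "subst_model": "GTR+G",
    },
    "family_2": {"alignment": "alignment_2.fasta", "subst_model": "GTR+G"},
}
print(format_families(families), end="")
```

Writing the returned string to a file and passing that file to --families is then enough to start a run.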
For each family, a starting gene tree can be given. If no starting gene tree is given, GeneRax generates a random starting gene tree. When starting from a random tree, GeneRax adds an additional step to optimize the gene tree based on the sequences only.
The substitution model describes under which model sequences evolved.
You can give as input the substitution model name (GTR, DAYHOFF etc.). Alternatively, you can give a file containing the model string. We use the same syntax as RAxML-NG. In particular, if you inferred gene trees with RAxML-NG, you can use the .best.model file as substitution model.
Recommendation: we observed (on simulations) that the gene tree quality increases significantly when using a substitution model with several gamma rate categories (GTR+G, LG+G instead of GTR or LG). We strongly encourage users to add +G to their substitution model.
General outputs:
- generax.log: log file (same content as the logs printed in the console).
- per_species_coverage.txt: for each terminal species, the percentage of gene families that contain at least one gene of this species.
- per_species_event_counts.txt: number of gene events per species branch.
- per_species_rates.txt: only generated when --per-species-rates is enabled. Contains the per-species DTL rates. WARNING: the order of the rates is D,L,T (or D,L if HGTs are disabled). The speciation rate is always 1.0.
- species_trees/starting_species_tree.newick: the starting species tree (user-given, randomly generated, or generated with MiniNJ).
- species_trees/inferred_species_tree.newick: the inferred species tree if SpeciesRax is run.
For a given family fam_xxx, GeneRax outputs:
- results/fam_xxx/geneTree.newick: the inferred gene tree with branch lengths (in terms of expected substitutions per unit of time).
- results/fam_xxx/stats.txt: a file with the phylogenetic and the reconciliation log-likelihoods in the first line, and the duplication, loss and transfer (in this order!) rates in the second line.
- reconciliations/fam_xxx_eventCounts.txt: a file with the number of inferred gene events (speciation, speciation+loss, duplication, transfer, transfer+loss, loss only (should be 0), terminal node).
- reconciliations/fam_xxx_reconciliated.nhx: the inferred gene tree reconciled with the species tree, in NHX format (can be opened for visualization with Notung).
- reconciliations/fam_xxx_reconciliated.xml: the inferred gene tree reconciled with the species tree, in RecPhyloxml format (can be opened for online visualization with ThirdKind or alternatively with recphylovisu).
- reconciliations/fam_xxx_speciesEventCounts.txt: number of events per species, in the following order: speciations (including leaves), duplications, losses, transfers. Note that a speciation-loss event will be counted as one speciation and one loss (same applies to transfer-loss).
- reconciliations/fam_xxx_transfers.txt: mainly for internal use. Contains, for each inferred transfer, the giving and the receiving species.
- reconciliations/fam_xxx_orthogroups.txt: please ignore this file, the corresponding feature is still under development.
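To post-process many families, the per-family stats.txt file described above can be read with a short script. The sketch below assumes, and this is an assumption rather than a documented format, that each line holds whitespace-separated numbers in the order given above (phylogenetic then reconciliation log-likelihood on the first line; duplication, loss, transfer rates on the second). Please check it against one of your own stats.txt files before relying on it. The function name parse_stats and the sample values are made up.

```python
RATE_NAMES = ("duplication", "loss", "transfer")


def parse_stats(text):
    """Parse the content of a stats.txt file (assumed layout, see above)."""
    rows = [line.split() for line in text.strip().splitlines()]
    likelihoods = [float(value) for value in rows[0]]
    rates = [float(value) for value in rows[1]]
    return {
        "phylogenetic_ll": likelihoods[0],
        "reconciliation_ll": likelihoods[1],
        # zip stops at the shortest input, so a line with only two rates
        # simply yields no "transfer" entry.
        "rates": dict(zip(RATE_NAMES, rates)),
    }


# Made-up content, for illustration only.
sample = "-1000.5 -50.25\n0.1 0.2 0.05\n"
print(parse_stats(sample))
```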
GeneRax assumes that genes evolve through gene duplication, gene loss, horizontal gene transfer, and speciation events. In practice, we cannot infer where exactly a gene loss occurred. We say that an ancestral gene goes extinct when none of its descendants is observed in the input alignment (either they got lost or they were not sampled). We say that an ancestral gene survives until the present if at least one of its descendants is present in the input alignment. GeneRax infers the following events:
- speciation (S): when a species A splits into two children species, the gene present in A splits into two children genes. Each of the two new gene lineages survives until the present.
- speciation-loss (SL): when a species A splits into two children species B and C, the gene present in A splits into two children genes. One of these children genes survives until the present, and the other one goes extinct.
- duplication (D): a gene undergoes a gene duplication and gives rise to two new gene lineages that both survive until the present.
- horizontal gene transfer (T): a gene present in a species A is transferred to a species B. Each gene copy (the one in A and the one in B) survives until the present.
Note that we do not infer TL (transfer-loss) events, and you should not pay attention to the TL count (it should be equal to 0 all the time).
Comment 1: it is not possible to infer the exact number of losses, for two reasons. First, if a gene is present in a species, but not sampled, GeneRax will interpret this as a loss (in practice, SL). Second, a gene can give rise to several lineages (through speciation, duplication, and transfers) which could all get lost later on. Since those multiple lineages cannot be observed, GeneRax will only infer one SL event. Alternatively, it can be interesting to compare the duplication, loss, and transfer rates, in order to estimate the gene event frequencies. Note that the speciation rate is always 1.0 and is not shown in the output files.
Comment 2: the best way to estimate the size of an ancestral genome is to count the number of S and SL events associated with this genome. Do not forget the SL events.
We recommend the following viewer to visualize GeneRax reconciled trees:
- ThirdKind can read one or several RecPhyloXML files. It provides many useful options (see their wiki and examples) and outputs SVG files.
Alternatively:
- RecPhyloVisu, using the RecPhyloXML format (.xml files). When the website does not work, an alternative downloadable version can be found here. Be aware that the viewer adds an unnecessary loss after each HGT.
- Notung, with the NHX format (.nhx files).
Please let me know if you know any better viewer for reconciled gene trees!