Large datasets

BenoitMorel edited this page Apr 24, 2023 · 4 revisions

This page discusses the options you have when dealing with a dataset that is too large to be analyzed in a reasonable amount of time with GeneRax. In short, you can:

  • Reduce the number of search steps with the --max-spr-radius parameter
  • Filter out gene families
  • Reduce the number of species
  • Run GeneRax on a large cluster

Your choice will depend on your dataset and on which compromises you can make. This page will help you make this choice.

Runtime estimation

Don't waste too much time analyzing a dataset that is way too large! There is no way to accurately predict the runtime, because GeneRax implements a search heuristic (which might or might not converge quickly). The best way to get a rough estimate is to drastically down-sample your dataset and to run the analysis on this smaller dataset. If such an analysis is already too slow, then GeneRax won't be able to handle the whole dataset. If GeneRax runs fast, try again with a larger subset of your dataset.
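The down-sampling step can be sketched as follows. This is only an illustration: the function name `downsample_families`, the fractions, and the placeholder family names are assumptions, and writing the selected families to a GeneRax families file is left out.

```python
import random


def downsample_families(family_names, fraction, seed=42):
    """Return a reproducible random subset of the given family names."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort first so that the result does not depend on the input order.
    names = sorted(family_names)
    keep = max(1, round(len(names) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(names, keep))


# Hypothetical dataset with 1000 families; the names are placeholders.
all_families = [f"family_{i}" for i in range(1000)]

# Pilot runs on growing subsets: 1%, then 5%, then 20% of the families.
pilot_subsets = [downsample_families(all_families, f) for f in (0.01, 0.05, 0.20)]
```

Timing GeneRax on the smallest subset first, and moving to the next one only if the run finishes quickly, follows the procedure described above.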

Dataset dimensions and runtime

The runtime depends on:

  • The number of species
  • The number of gene families
  • The number of sequences in each family ( = size of the gene trees)
  • The number of sites (columns) in your alignments
  • The reconciliation and substitution models

When correcting the gene trees, GeneRax treats each gene family independently, and most of the time is spent evaluating the joint likelihood (the product of the reconciliation and the phylogenetic likelihoods). When evaluating the species tree with SpeciesRax, most of the time is spent evaluating the reconciliation likelihood only.

The runtime depends on how many times the likelihoods have to be computed, and on the time required for one likelihood evaluation.

  • The number of times the likelihood is evaluated should be roughly linear in the number of sequences in the alignment (note that this is a very rough approximation!)
  • The time spent in one likelihood evaluation is split between the phylogenetic and the reconciliation likelihood scores.
    • Phylogenetic likelihood: its evaluation is linear in the number of sites (columns) times the number of sequences. It also depends on the substitution model. For instance, GTR+G should be about 4 times slower than GTR. PROTGTR is really not recommended (it is very slow and does not make much sense for a gene alignment).
    • Reconciliation likelihood: its evaluation is linear in the number of sequences ( = the size of the gene tree) times the size of the species tree. The reconciliation model also has an impact (UndatedDL should be faster than UndatedDTL).
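The scaling rules above can be turned into a rough comparison tool. The sketch below is not a runtime predictor: `relative_cost` and `model_factor` are hypothetical names, the returned values are in arbitrary units, and the two terms are not comparable with each other because their constant factors are unknown. It is only meant for comparing two versions of the same dataset.

```python
def relative_cost(num_species, num_sequences, num_sites, model_factor=1.0):
    """Return (phylogenetic, reconciliation) cost terms for one family.

    Units are arbitrary. model_factor is an assumed multiplier for the
    substitution model (e.g. 4.0 for GTR+G relative to GTR, as stated above).
    """
    # Number of likelihood evaluations: roughly linear in the number of sequences.
    evaluations = num_sequences
    # One phylogenetic evaluation: sites x sequences, scaled by the model.
    phylogenetic = evaluations * num_sites * num_sequences * model_factor
    # One reconciliation evaluation: sequences x species tree size.
    reconciliation = evaluations * num_sequences * num_species
    return phylogenetic, reconciliation


# Effect of doubling the number of sequences in a family (made-up dimensions).
small = relative_cost(num_species=20, num_sequences=50, num_sites=500)
large = relative_cost(num_species=20, num_sequences=100, num_sites=500)
phylo_ratio = large[0] / small[0]
rec_ratio = large[1] / small[1]
print(f"phylogenetic term x{phylo_ratio:.1f}, reconciliation term x{rec_ratio:.1f}")
```

Under these assumptions, doubling the number of sequences multiplies both terms by four, because both the number of evaluations and the cost of each evaluation grow with the number of sequences.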

Memory requirements

When correcting the gene trees and when estimating the per-family DTL rates (--per-family-rates), GeneRax treats each gene family independently, one after another. In this case, memory should not be a problem.

When estimating the species tree with SpeciesRax, and when estimating the global DTL model parameters (that is, without --per-family-rates), GeneRax treats all gene families simultaneously. If you have a large species tree and many gene families, memory might be an issue. For a given family, let n be the size of the species tree and m be the size of the gene tree. The memory requirement for this specific family should be linear in n*m. The total memory requirement is the sum of the memory requirements of all families.
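The n*m rule can be used to compare the memory footprint of two candidate datasets. In the sketch below, `relative_memory` is a hypothetical helper and the result is in arbitrary units (the constant factor is unknown), so only ratios between datasets are meaningful.

```python
def relative_memory(species_tree_size, gene_tree_sizes):
    """Sum of n*m over all families, in arbitrary units."""
    return sum(species_tree_size * m for m in gene_tree_sizes)


# Made-up example: 100 species and four families of very different sizes.
gene_tree_sizes = [30, 45, 120, 800]
full = relative_memory(100, gene_tree_sizes)
without_largest = relative_memory(100, sorted(gene_tree_sizes)[:-1])
print(f"dropping the largest family keeps {without_largest / full:.0%} of the memory")
```

In this made-up example a single large family dominates the total, which is a common reason to consider filtering the largest families.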

Changing the search radius

The maximum SPR radius (--max-spr-radius) affects the gene tree search step. In this step, for each subtree u of the gene tree, we prune u and try to regraft it at every branch v that is within a given radius around u. This procedure is repeated until no better tree is found. At the first search step, the radius is set to 1, and we only try to regraft at the 4 neighboring branches. Then, at each step, we increase the SPR radius by one, until we reach the maximum SPR radius.

When you reduce the maximum SPR radius, the search is less exhaustive. GeneRax might miss better gene trees, but the runtime will be reduced. It is hard to know in advance how much runtime you can save by reducing the radius: sometimes the first step with radius one is the slowest one, because there are many potential improvements, and sometimes the last step with radius 5 is the slowest because it has to try a lot of regraft branches.
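To get a feeling for why rounds with a larger radius have more regraft branches to try, here is a back-of-the-envelope bound. It assumes a fully binary unrooted gene tree in which each additional step of radius at most doubles the number of newly reachable branches, starting from the 4 neighboring branches at radius 1. The function name and the counting convention are assumptions; GeneRax may count or prune candidates differently.

```python
def max_regraft_branches(radius, num_branches=None):
    """Upper bound on the candidate regraft branches within `radius`.

    Assumes 4 branches at distance 1 and a doubling at each further step,
    giving 4 + 8 + ... + 4 * 2**(radius - 1) = 4 * (2**radius - 1).
    If num_branches is given, the bound is capped by the tree size.
    """
    if radius < 1:
        raise ValueError("radius must be >= 1")
    bound = 4 * (2 ** radius - 1)
    if num_branches is not None:
        bound = min(bound, num_branches)
    return bound


for r in range(1, 6):
    print(r, max_regraft_branches(r))
```

The bound grows geometrically with the radius, but it is capped by the number of branches in the gene tree, so the effect is strongest for large families.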

In my experience, most likelihood improvements are obtained in the first three rounds, and --max-spr-radius 3 performs quite well in terms of accuracy. But even --max-spr-radius 1 is much better than no tree correction at all.

Reducing the number of species

The number of species has a large impact on the runtime and memory requirements, because it affects both the size of the species tree and the size of the gene trees (if you remove a species from the analysis, you also have to remove its sequences). In theory, the runtime could grow cubically with the number of species (O(species^3)): the number of likelihood evaluations and the cost of one reconciliation likelihood evaluation both grow with the number of sequences, and the latter also grows with the size of the species tree. Reducing the number of species is the option that should yield the largest speedups.

Reducing the number of gene families

There are several ways of reducing the number of gene families:

  • Randomly filter out x% of the input gene families. If you have 10,000 families and you only keep 1,000 of them, the run should be around 10x faster.
  • Filter out the largest families. Larger families (with very long sequences or with many sequences) are slower to process than smaller families. The speedup you might get depends a lot on the distribution of the family dimensions. Note that this might bias the overall results: for instance, large families might be families with many duplications. By filtering them out, you might underestimate the overall number of duplications.
  • Filter out families with your own criteria (e.g., families with too many gaps).
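Filtering out the largest families can be sketched as follows. The function name, the thresholds, and the family dimensions are made up for illustration; how you obtain the number of sequences and sites per family depends on your own pipeline.

```python
def drop_large_families(family_dims, max_sequences=None, max_sites=None):
    """Keep only the families that are within the given size limits.

    family_dims maps a family name to (num_sequences, num_sites).
    A limit set to None is not applied.
    """
    kept = {}
    for name, (num_sequences, num_sites) in family_dims.items():
        if max_sequences is not None and num_sequences > max_sequences:
            continue
        if max_sites is not None and num_sites > max_sites:
            continue
        kept[name] = (num_sequences, num_sites)
    return kept


# Hypothetical family dimensions: (number of sequences, number of sites).
families = {
    "famA": (12, 900),
    "famB": (350, 1200),
    "famC": (40, 15000),
    "famD": (25, 600),
}
kept = drop_large_families(families, max_sequences=200, max_sites=10000)
dropped_fraction = 1 - len(kept) / len(families)
print(sorted(kept), f"dropped {dropped_fraction:.0%} of the families")
```

Keeping track of the dropped fraction is useful because, as noted above, removing large families can bias results such as the overall number of duplications.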

Parallelize computations

GeneRax is parallelized with MPI. This means that you can run it on a cluster. The parallel efficiency usually scales quite well with large datasets (with many families and/or with large gene trees), and using more cores should bring substantial speedups. Note that clusters usually limit a job to 24h or less. GeneRax implements a checkpoint system: if the run is interrupted, you can restart it from where it stopped by reusing the exact same command.