-
Notifications
You must be signed in to change notification settings - Fork 3
evomics2019
- Check input data:
cd /home/phylogenomics/workshop_materials/ng-tutorial/
ls
you should see roughly the following:
5.phy dna.map multi199.part multi5.fa prim2.part prim.part prim.phy.raxml.log prot140.phy prot21.fa.ckp prot21.fa.out prot21.part README.md terrace.part
bad.fa fusob.phy multi199.phy multi5.map prim.cons prim.phy prot140.part prot21.fa prot21.fa.log prot21.fa.tree rbcl.phy terrace.fa
- Run RAxML-NG binary without parameters
raxml-ng
this will show RAxML-NG version and short usage help:
RAxML-NG v. 0.8.0 BETA released on 11.01.2019 by The Exelixis Lab.
[...]
Usage: raxml-ng [OPTIONS]
Commands (mutually exclusive):
--help display help information
[...]
- It is always a good idea to run alignment sanity check before starting the actual analysis:
raxml-ng --check --msa prim.phy --model GTR+G
This alignment is clean, so you should see:
Alignment can be successfully read by RAxML-NG.
4*. Run --check
command again with bad.fa
and examine error and warning messages.
- Infer ML tree with default parameters (10 random + 10 parsimony starting trees):
raxml-ng --msa prim.phy --model GTR+G --prefix S1
Check log-likelihoods for all 20 resulting trees:
grep "logLikelihood:" S1.raxml.log
Result:
[00:00:00] ML tree search #1, logLikelihood: -5708.961164
[00:00:01] ML tree search #2, logLikelihood: -5709.001321
[00:00:02] ML tree search #3, logLikelihood: -5708.928444
[00:00:03] ML tree search #4, logLikelihood: -5708.958315
[00:00:03] ML tree search #5, logLikelihood: -5708.932260
[00:00:04] ML tree search #6, logLikelihood: -5708.941449
[00:00:05] ML tree search #7, logLikelihood: -5708.959505
[00:00:05] ML tree search #8, logLikelihood: -5708.951658
[00:00:06] ML tree search #9, logLikelihood: -5709.022061
[00:00:07] ML tree search #10, logLikelihood: -5708.926872
[00:00:08] ML tree search #11, logLikelihood: -5709.016549
[00:00:08] ML tree search #12, logLikelihood: -5709.022648
[00:00:09] ML tree search #13, logLikelihood: -5709.009746
[00:00:10] ML tree search #14, logLikelihood: -5709.012081
[00:00:10] ML tree search #15, logLikelihood: -5709.017948
[00:00:11] ML tree search #16, logLikelihood: -5709.017067
[00:00:11] ML tree search #17, logLikelihood: -5709.030238
[00:00:12] ML tree search #18, logLikelihood: -5709.014300
[00:00:13] ML tree search #19, logLikelihood: -5709.018029
[00:00:13] ML tree search #20, logLikelihood: -5709.017251
- Check topological distances between trees (so-called Robinson-Foulds or RF distance):
raxml-ng --rfdist --tree S1.raxml.mlTrees --prefix RF1
Result:
Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1
3*. Repeat steps 1 and 2 for fusob.phy
.
- Infer bootstrap replicate trees:
raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix B1
- Map bootstrap support values to the best ML trees:
raxml-ng --support --tree S1.raxml.bestTree --bs-trees B1.raxml.bootstraps --prefix B2
- Open
B2.raxml.support
in the tree viewer of your choice.
ML tree search, bootstrapping and branch support mapping can be performed with a single raxml-ng --all
invocation:
raxml-ng --all --msa prim.phy --model GTR+G --prefix A1
However, this is not always possible/efficient for large datasets!
The --evaluate
command computes the likelihood of a fixed tree topology after re-optimizing all model parameters and branch lengths. We will use it to compare different evolutionary models:
raxml-ng --evaluate --msa prim.phy --tree S1.raxml.bestTree --model GTR+G --prefix E_GTRG
-
Repeat the above command for
GTR+R4
,GTR
,JC
andJC+G
models. Don't forget to change the--prefix
! -
Check the log-likelihood scores:
grep "Final logLikelihood" E*.raxml.log
Which model yields the highest log-likelihood? Does it mean that this model should be preferred?
Parameter-rich models are more prone to overfitting the data. Thus, special criteria have been developed to compare models with different number of free parameters. In phylogenetics, AIC/AICc and BIC are most commonly used.
- Check AIC/BIC scores for our runs (lower values are better):
grep "AIC score" E*.raxml.log
Which model should be preferred according to information theoretical criteria ?
- Check online help
modeltest-ng --help
Important options are:
-i ALIGNMENT
-
-d DATATYPE
, whereDATATYPE
isnt
(default) oraa
- Run model selection for
prot21.fa
(protein alignment):
modeltest-ng -i prot21.fa -d aa
- Run tree inference with the best-scoring model determined by ModelTest-NG.
4A. Re-run ModelTest-NG including the LG4M
and LG4X
models which are not tested by default.
Partitioned model is defined in a file, e.g. --model prim2.part
.
cat prim2.part
GTR+G+FO, NADH4=1-504/3,2-504/3
JC+I, tRNA=505-656
GTR+R4+FC, NADH5=657-898
HKY, NADH4p3=3-504/3
- Re-run tree inference for
prim.phy
using partitioned model inprim2.part
. Compare the results (log-likelihood and tree topology) to the Exercise 1.
raxml-ng --msa prim.phy --model prim2.part --prefix S2
Likelihoods:
grep "Final logLikelihood" S{1,5}.raxml.log
RF distances:
cat S1.raxml.bestTree S5.raxml.bestTree > S1S5.trees
raxml-ng --rfdist --tree S1S5.trees --prefix RF5
Vital-IT/SIB kindly developed and maintain the RAxML-NG web server available here:
No registration or payment required. However, allocated computational resources are limited, so it is not a good option for large datasets.
If you have time, please feel free to play around with the web server (e.g., re-run Exercises 1 - 5).
Before analyzing a large dataset, it is highly recommended to convert alignment into RAxML binary format (RBA) using the --parse
command:
raxml-ng --parse --msa fusob.phy --model GTR+G --prefix fusob
Doing this pre-processing step has two advantages:
- binary alignments are faster to load than PHYLIP/FASTA
- you will get estimated memory requirements and the recommended number of threads for this dataset
Use --threads
option to re-run the same analysis with varying number of threads:
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 1 -prefix T1
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 2 -prefix T2
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 4 -prefix T4
Compare the runtimes:
grep "Elapsed time:" T*.raxml.log
Which number of threads yields the lowest runtime? Which number of threads is "optimal"?
Alternatively, we can parallelize across starting trees by running multiple RAxML-NG instances simultaneously:
for i in {1..2}; do (raxml-ng -msa fusob.raxml.rba -tree rand{5} -seed $i -threads 1 -prefix CT$i >CTlog$i &); done
This will start 2 RAxML-NG instances, each doing 5 independent tree searches on 1 CPU core. RAxML-NG will run in background, so use htop
to monitor progress (look at per-core CPU load!).
Once all searches have finished, check the runtimes and compare them to fine-grained parallelization with 4 threads:
grep "Elapsed time:" CT*.raxml.log T*.raxml.log
Is it worth to use coarse-grained parallelization on this dataset?
Check the likelihood scores and pick the best tree:
grep "Final LogLikelihood" CT*.raxml.log | sort -k 3
Does it differ topologically from the best-scoring tree we obtained in Exercise 2? Do the likelihood scores differ? Why?