-
Notifications
You must be signed in to change notification settings - Fork 3
evomics2024
This is the full solution script which includes all command lines and results.
For the shortened version which will be presented to the students, see evomics2024.md.
NOTE: Optional tasks are marked with an asterisk (*).
- Check input data:
cd /home/phylogenomics/workshop_materials/ng-tutorial/
ls
you should see roughly the following:
5.phy dna.map multi199.part multi5.fa prim2.part prim.part prim.phy.raxml.log prot140.phy prot21.fa.ckp prot21.fa.out prot21.part README.md terrace.part
bad.fa fusob.phy multi199.phy multi5.map prim.cons prim.phy prot140.part prot21.fa prot21.fa.log prot21.fa.tree rbcl.phy terrace.fa
- Run RAxML-NG binary without parameters
raxml-ng
this will show RAxML-NG version and short usage help:
RAxML-NG v. 1.2.1 released on 22.12.2023 by The Exelixis Lab.
[...]
System: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 4 cores, 23 GB RAM
Usage: raxml-ng [OPTIONS]
Commands (mutually exclusive):
--help display help information
[...]
- It is always a good idea to run alignment sanity check before starting the actual analysis:
raxml-ng --check --msa prim.phy --model GTR+G
This alignment is clean, so you should see:
Alignment can be successfully read by RAxML-NG.
4*. Run --check
command again with bad.fa
and examine error and warning messages.
- Infer ML tree with default parameters (10 random + 10 parsimony starting trees):
raxml-ng --msa prim.phy --model GTR+G --prefix S1
- Check log-likelihoods for all 20 resulting trees:
grep "logLikelihood:" S1.raxml.log
Result:
[00:00:00] [worker #0] ML tree search #1, logLikelihood: -5708.940514
[00:00:00] [worker #2] ML tree search #3, logLikelihood: -5709.367652
[00:00:00] [worker #3] ML tree search #4, logLikelihood: -5708.950769
[00:00:00] [worker #1] ML tree search #2, logLikelihood: -5708.981882
[00:00:00] [worker #0] ML tree search #5, logLikelihood: -5708.969503
[00:00:00] [worker #2] ML tree search #7, logLikelihood: -5708.949393
[00:00:00] [worker #1] ML tree search #6, logLikelihood: -5708.936930
[00:00:00] [worker #3] ML tree search #8, logLikelihood: -5709.023648
[00:00:01] [worker #0] ML tree search #9, logLikelihood: -5708.976056
[00:00:01] [worker #2] ML tree search #11, logLikelihood: -5709.009527
[00:00:01] [worker #1] ML tree search #10, logLikelihood: -5708.943575
[00:00:01] [worker #3] ML tree search #12, logLikelihood: -5709.015055
[00:00:01] [worker #0] ML tree search #13, logLikelihood: -5708.968725
[00:00:01] [worker #2] ML tree search #15, logLikelihood: -5709.020541
[00:00:01] [worker #1] ML tree search #14, logLikelihood: -5709.013903
[00:00:01] [worker #3] ML tree search #16, logLikelihood: -5709.012914
[00:00:01] [worker #0] ML tree search #17, logLikelihood: -5709.010325
[00:00:01] [worker #2] ML tree search #19, logLikelihood: -5709.021621
[00:00:01] [worker #1] ML tree search #18, logLikelihood: -5709.021075
[00:00:01] [worker #3] ML tree search #20, logLikelihood: -5709.006120
- Check topological distances between trees (so-called Robinson-Foulds or RF distance):
raxml-ng --rfdist --tree S1.raxml.mlTrees --prefix RF1
Result:
Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1
4*. Repeat steps 1, 2 and 3 for fusob.phy
, using 5 random + 5 parsimony starting trees.
raxml-ng --msa fusob.phy --model GTR+G --prefix S2
grep "logLikelihood:" S2.raxml.log
[00:00:02] [worker #2] ML tree search #3, logLikelihood: -9974.673091
[00:00:02] [worker #0] ML tree search #1, logLikelihood: -9974.665558
[00:00:02] [worker #1] ML tree search #2, logLikelihood: -9980.906638
[00:00:03] [worker #3] ML tree search #4, logLikelihood: -9974.663320
[00:00:04] [worker #2] ML tree search #7, logLikelihood: -9980.906494
[00:00:04] [worker #1] ML tree search #6, logLikelihood: -9974.670122
[00:00:04] [worker #3] ML tree search #8, logLikelihood: -9974.669716
[00:00:05] [worker #0] ML tree search #5, logLikelihood: -9974.671526
[00:00:06] [worker #1] ML tree search #10, logLikelihood: -9974.666947
[00:00:07] [worker #0] ML tree search #9, logLikelihood: -9981.909653
raxml-ng --rfdist --tree S2.raxml.mlTrees --prefix RF2
Loaded 10 trees with 38 taxa.
Average absolute RF distance in this tree set: 5.244444
Average relative RF distance in this tree set: 0.074921
Number of unique topologies in this tree set: 3
- Run "all-in-one" analysis (ML tree search + bootstrapping + branch support mapping):
raxml-ng --all --msa prim.phy --model GTR+G --prefix A1
This is convenient, but not always possible/efficient for large datasets! We will explore an alternative approach in Lab #2.
- Open
A1.raxml.support
in the tree viewer of your choice.
3*. Repeat bootstrapping using a fixed number of replicates (100). Did this change the support values?
The --evaluate
command computes the likelihood of a fixed tree topology after re-optimizing all model parameters and branch lengths. We will use it to compare different evolutionary models:
raxml-ng --evaluate --msa prim.phy --tree S1.raxml.bestTree --model GTR+G --prefix E_GTRG
-
Repeat the above command for
GTR+R4
,GTR
,JC
andJC+G
models. Don't forget to change the--prefix
! -
Check the log-likelihood scores:
grep "Final LogLikelihood:" E*.raxml.log
Which model yields the highest log-likelihood? Does it mean that this model should be preferred?
Parameter-rich models are more prone to overfitting the data. Thus, special criteria have been developed to compare models with different number of free parameters. In phylogenetics, AIC/AICc and BIC are most commonly used.
- Check AIC/BIC scores for our runs (lower values are better):
grep "AIC score" E*.raxml.log
Which model should be preferred according to information theoretical criteria ?
Results:
grep "Final LogLikelihood:" E*.raxml.log
E_GTR.raxml.log:Final LogLikelihood: -5934.159081
E_GTRG.raxml.log:Final LogLikelihood: -5709.005399
E_GTRR.raxml.log:Final LogLikelihood: -5706.012286
E_JC.raxml.log:Final LogLikelihood: -6424.203377
E_JCG.raxml.log:Final LogLikelihood: -6272.469065
$ grep "AIC score" E*.raxml.log
E_GTR.raxml.log:AIC score: 11926.318162 / AICc score: 11928.322770 / BIC score: 12065.523094
E_GTRG.raxml.log:AIC score: 11478.010798 / AICc score: 11480.156127 / BIC score: 11622.015900
E_GTRR.raxml.log:AIC score: 11482.024572 / AICc score: 11484.948006 / BIC score: 11650.030524
E_JC.raxml.log:AIC score: 12890.406755 / AICc score: 12891.461549 / BIC score: 12991.210326
E_JCG.raxml.log:AIC score: 12588.938129 / AICc score: 12590.094701 / BIC score: 12694.541871
- Check online help
modeltest-ng --help
Important options are:
-i ALIGNMENT
-
-d DATATYPE
, whereDATATYPE
isnt
(default) oraa
- Run model selection for
prot21.fa
(protein alignment):
modeltest-ng -i prot21.fa -d aa
- Run tree inference with the best-scoring model determined by ModelTest-NG.
4*. Re-run ModelTest-NG including the LG4M
and LG4X
models which are not tested by default.
Results:
modeltest-ng -i prot21.fa -d aa
Partition 1/1:
Model Score Weight
----------------------------------------------------------
BIC LG+G4 6005.4554 0.5062
AIC LG+I+G4 5893.6825 0.7923
AICc LG+G4 5941.3599 0.5402
$ raxml-ng --msa prot21.fa --model LG+G4 --prefix S6
Final LogLikelihood: -2872.979205```
Partitioned model is defined in a file, e.g. --model prim2.part
.
cat prim2.part
GTR+G+FO, NADH4=1-504/3,2-504/3
JC+I, tRNA=505-656
GTR+R4+FC, NADH5=657-898
HKY, NADH4p3=3-504/3
- Re-run tree inference for
prim.phy
using partitioned model inprim2.part
. - Compare the results (log-likelihood and tree topology) to the Exercise 1.
raxml-ng --msa prim.phy --model prim2.part --prefix P1
Likelihoods:
grep "Final LogLikelihood:" {S,P}1.raxml.log
S1.raxml.log:Final LogLikelihood: -5708.926872
P1.raxml.log:Final LogLikelihood: -5673.806570
RF distances:
raxml-ng --rf S1.raxml.bestTree,P1.raxml.bestTree
Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1
Before analyzing a large dataset, it is highly recommended to convert alignment into RAxML binary format (RBA) using the --parse
command. Doing this pre-processing step has two advantages:
- binary alignments are faster to load than PHYLIP/FASTA
- you will get estimated memory requirements and the recommended number of threads for this dataset
- Compress alignment file
fusob.phy
raxml-ng --parse --msa fusob.phy --model GTR+G --prefix fusob
Partition 0: noname
Model: GTR+FO+G4m
Alignment sites / patterns: 1602 / 635
Gaps: 10.13 %
Invariant sites: 67.79 %
NOTE: Binary MSA file created: fusob.raxml.rba
* Estimated memory requirements : 6 MB
* Recommended number of threads / MPI processes: 3
2*. Explore how resource estimates change depending on the selected --model
(e.g., GTR
, GTR+R8
)
- Run a tree search with automatic parallelization to establish a baseline (note
-seed 1
!):
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -prefix TXWX
Analysis options:
parallelization: coarse-grained (auto), PTHREADS (auto)
Parallelization scheme autoconfig: 4 worker(s) x 1 thread(s)
- Re-run the same tree search with different number of
--threads
and--workers
. Can you find a configuration which beats the default heuristic in terms of runtime?
HINT: don't forget to change the--prefix
for every run!
HINT: you can usegrep "Elapsed time:" T*.raxml.log
to quickly compare runtimes.
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -workers 1 -prefix TXW1
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -workers 2 -prefix TXW2
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 2 -workers 2 -prefix T2W2
Results:
T2W2.raxml.log:Elapsed time: 11.995 seconds
TXW1.raxml.log:Elapsed time: 9.752 seconds
TXW2.raxml.log:Elapsed time: 8.551 seconds
TXWX.raxml.log:Elapsed time: 10.398 seconds
3*. Try to oversubscribe CPU cores by using 9 or more threads. What do you observe?
raxml-ng -search -msa fusob.raxml.rba -threads 10 -prefix T10
parallelization: coarse-grained (auto), PTHREADS (10 threads), thread pinning: OFF
ERROR: CPU core oversubscription detected! RAxML-NG will terminate now to avoid wasting resources.
NOTE: Details: https://github.com/amkozlov/raxml-ng/wiki/Parallelization#core-oversubscription
NOTE: You can use '--force perf_threads' to disable this check, but ONLY if you are 200% sure this is a false alarm!
- For this exercise, we will be using an experimental version of RAxML-NG. Please make sure that you specify the correct binary,
raxml-ng-adaptive
:
raxml-ng-adaptive
RAxML-NG v. 1.2.1-adaptive released on 12.01.2024 by The Exelixis Lab.
- Run the adaptive tree search for the dataset from the Exercise 5 (
prim.phy
with partitioned model). Compare runtimes, likelihoods, and best-found ML tree topologies. Can you spot the differences in the log files?
raxml-ng-adaptive --adaptive --msa prim.phy --model prim2.part --prefix P1A
Adaptive:
Analysis options:
run mode: Adaptive ML tree search
[00:00:00] Predicted difficulty: 0.00
Starting ML tree search with 2 distinct starting trees
Final LogLikelihood: -5673.844106
Elapsed time: 1.635 seconds
Standard:
Analysis options:
run mode: ML tree search
Starting ML tree search with 20 distinct starting trees
Final LogLikelihood: -5673.760245
Elapsed time: 8.618 seconds
RF distance:
raxml-ng-adaptive --rf P1.raxml.bestTree,P1A.raxml.bestTree
Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1
3*. Run the adaptive tree search for fusob.phy
. Compare the results with Excercise 1.4.
- Infer bootstrap replicate trees:
raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix B1
- Compute and map bootstrap support values to the best ML tree:
raxml-ng --support --tree S1.raxml.bestTree --bs-trees B1.raxml.bootstraps --prefix B2
- Open
B2.raxml.support
in the tree viewer of your choice.
4*. Recompute support values using the transfer bootstrap algorithm (TBE). Did this change the support values?
HINT: you do NOT need to re-infer the bootstrap trees!
- Check ParGenes options
python ~/software/ParGenes/pargenes/pargenes.py --help
- Analyze all alignments in the
~/software/ParGenes/examples/data/small/fasta_files/
folder () using default ParGenes settings
python ~/software/ParGenes/pargenes/pargenes.py -a ~/software/ParGenes/examples/data/small/fasta_files/ -o parout -c 2 -m --scheduler openmp
2*. Run model testing and tree inference from 1 parsimony + 5 random starting trees for the prot21.fa
alignment.
python ~/software/ParGenes/pargenes/pargenes.py -a ~/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1
- Examine the results from both runs
Results 1:
########################
# PARGENES v1.0.1 #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/software/ParGenes/examples/data/small/fasta_files/ -o parout -c 2 -m --scheduler openmp
########################
# PARGENES v1.0.1 #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/software/ParGenes/examples/data/small/fasta_files/ -o parout -c 2 -m --scheduler openmp
[0:00:00] end of MSAs initializations
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/parse_run/parse_command.txt parout/parse_run 0
Logs will be redirected to parout/parse_run/logs.txt
[Warning] 2/9 commands failed
Average number of taxa: 9
Max number of taxa: 22
Average number of sites: 1711
Max number of sites: 6489
Recommended MAXIMUM number of cores: 1
[Warning] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../modeltest/bin/modeltest-ng parout/modeltest_run/modeltest_command.txt parout/modeltest_run 0
Logs will be redirected to parout/modeltest_run/logs.txt
[0:00:51] end of modeltest mpi-scheduler run
[0:00:51] end of parsing modeltest results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/parse_run/parse_command.txt parout/parse_run 0
Logs will be redirected to parout/parse_run/logs.txt
[Warning] 2/9 commands failed
Average number of taxa: 9
Max number of taxa: 22
Average number of sites: 1711
Max number of sites: 6489
Recommended MAXIMUM number of cores: 1
[Warning] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:51] end of the second parsing step
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout/mlsearch_run/mlsearch_command.txt parout/mlsearch_run 0
Logs will be redirected to parout/mlsearch_run/logs.txt
[0:00:54] end of mlsearch mpi-scheduler run
[Warning] Total number of jobs that failed: 3
[Warning] For a detailed list, see parout/failed_commands.txt
[0:00:54] END OF THE RUN OF pargenes.py
Results 2:
########################
# PARGENES v1.0.1 #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1
########################
# PARGENES v1.0.1 #
########################
ParGenes was called as follow:
/home/phylogenomics/software/ParGenes/pargenes/pargenes.py -a /home/phylogenomics/workshop_materials/ng-tutorial/ --msa-filter msa_filter -o parout2 -c 2 -m --scheduler openmp -d aa -s 5 -p 1
[0:00:00] end of MSAs initializations
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run 0
Logs will be redirected to parout2/parse_run/logs.txt
Average number of taxa: 21
Max number of taxa: 21
Average number of sites: 111
Max number of sites: 111
Recommended MAXIMUM number of cores: 1
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../modeltest/bin/modeltest-ng parout2/modeltest_run/modeltest_command.txt parout2/modeltest_run 0
Logs will be redirected to parout2/modeltest_run/logs.txt
[0:00:10] end of modeltest mpi-scheduler run
[0:00:10] end of parsing modeltest results
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run 0
Logs will be redirected to parout2/parse_run/logs.txt
Average number of taxa: 21
Max number of taxa: 21
Average number of sites: 111
Max number of sites: 111
Recommended MAXIMUM number of cores: 3
[0:00:11] end of the second parsing step
Calling mpi-scheduler: /home/phylogenomics/software/ParGenes/MPIScheduler/build/mpi-scheduler --openmp-scheduler /home/phylogenomics/software/ParGenes/pargenes/../raxml-ng/bin/raxml-ng parout2/mlsearch_run/mlsearch_command.txt parout2/mlsearch_run 0
Logs will be redirected to parout2/mlsearch_run/logs.txt
[0:00:22] end of mlsearch mpi-scheduler run
[0:00:23] end of selecting the best ML tree
[0:00:23] END OF THE RUN OF pargenes.py