Skip to content

evomics2024

Oleksiy Kozlov edited this page Jan 25, 2024 · 17 revisions

This is the full solution script which includes all command lines and results.
For the shortened version which will be presented to the students, see evomics2024.md.

NOTE: Optional tasks are marked with an asterisk (*).

Lab #1: RAxML-NG Basics

Exercise 0 : Getting ready

1. Check input data:

cd /home/phylogenomics/workshop_materials/ng-tutorial/
ls

you should see roughly the following:

5.phy   dna.map    multi199.part  multi5.fa   prim2.part  prim.part  prim.phy.raxml.log  prot140.phy  prot21.fa.ckp  prot21.fa.out   prot21.part  README.md   terrace.part
bad.fa  fusob.phy  multi199.phy   multi5.map  prim.cons   prim.phy   prot140.part        prot21.fa    prot21.fa.log  prot21.fa.tree  rbcl.phy     terrace.fa

2. Run RAxML-NG binary without parameters

raxml-ng

this will show RAxML-NG version and short usage help:

RAxML-NG v. 1.2.1 released on 22.12.2023 by The Exelixis Lab.

[...]
System: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 4 cores, 23 GB RAM

Usage: raxml-ng [OPTIONS]

Commands (mutually exclusive):
  --help                                     display help information
[...]

3. It is always a good idea to run alignment sanity check before starting the actual analysis:

raxml-ng --check --msa prim.phy --model GTR+G

This alignment is clean, so you should see:

Alignment can be successfully read by RAxML-NG.

4.* Run --check command again with bad.fa and examine error and warning messages.

Exercise 1 : Tree search

1. Infer ML tree with default parameters (10 random + 10 parsimony starting trees):

raxml-ng --msa prim.phy --model GTR+G --prefix S1

2. Check log-likelihoods for all 20 resulting trees:

grep "logLikelihood:" S1.raxml.log

Result:

[00:00:00] [worker #0] ML tree search #1, logLikelihood: -5708.940514
[00:00:00] [worker #2] ML tree search #3, logLikelihood: -5709.367652
[00:00:00] [worker #3] ML tree search #4, logLikelihood: -5708.950769
[00:00:00] [worker #1] ML tree search #2, logLikelihood: -5708.981882
[00:00:00] [worker #0] ML tree search #5, logLikelihood: -5708.969503
[00:00:00] [worker #2] ML tree search #7, logLikelihood: -5708.949393
[00:00:00] [worker #1] ML tree search #6, logLikelihood: -5708.936930
[00:00:00] [worker #3] ML tree search #8, logLikelihood: -5709.023648
[00:00:01] [worker #0] ML tree search #9, logLikelihood: -5708.976056
[00:00:01] [worker #2] ML tree search #11, logLikelihood: -5709.009527
[00:00:01] [worker #1] ML tree search #10, logLikelihood: -5708.943575
[00:00:01] [worker #3] ML tree search #12, logLikelihood: -5709.015055
[00:00:01] [worker #0] ML tree search #13, logLikelihood: -5708.968725
[00:00:01] [worker #2] ML tree search #15, logLikelihood: -5709.020541
[00:00:01] [worker #1] ML tree search #14, logLikelihood: -5709.013903
[00:00:01] [worker #3] ML tree search #16, logLikelihood: -5709.012914
[00:00:01] [worker #0] ML tree search #17, logLikelihood: -5709.010325
[00:00:01] [worker #2] ML tree search #19, logLikelihood: -5709.021621
[00:00:01] [worker #1] ML tree search #18, logLikelihood: -5709.021075
[00:00:01] [worker #3] ML tree search #20, logLikelihood: -5709.006120

3. Check topological distances between trees (so-called Robinson-Foulds or RF distance):

raxml-ng --rfdist --tree S1.raxml.mlTrees --prefix RF1

Result:

Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

4.* Repeat steps 1, 2 and 3 for fusob.phy, using 5 random + 5 parsimony starting trees.

raxml-ng --msa fusob.phy --model GTR+G --prefix S2  --tree pars{5},rand{5}
grep "logLikelihood:" S2.raxml.log
[00:00:02] [worker #2] ML tree search #3, logLikelihood: -9974.673091
[00:00:02] [worker #0] ML tree search #1, logLikelihood: -9974.665558
[00:00:02] [worker #1] ML tree search #2, logLikelihood: -9980.906638
[00:00:03] [worker #3] ML tree search #4, logLikelihood: -9974.663320
[00:00:04] [worker #2] ML tree search #7, logLikelihood: -9980.906494
[00:00:04] [worker #1] ML tree search #6, logLikelihood: -9974.670122
[00:00:04] [worker #3] ML tree search #8, logLikelihood: -9974.669716
[00:00:05] [worker #0] ML tree search #5, logLikelihood: -9974.671526
[00:00:06] [worker #1] ML tree search #10, logLikelihood: -9974.666947
[00:00:07] [worker #0] ML tree search #9, logLikelihood: -9981.909653
raxml-ng --rfdist --tree S2.raxml.mlTrees --prefix RF2
Loaded 10 trees with 38 taxa.

Average absolute RF distance in this tree set: 5.244444
Average relative RF distance in this tree set: 0.074921
Number of unique topologies in this tree set: 3

Exercise 2 : Bootstrapping / "all-in-one"

1. Run "all-in-one" analysis (ML tree search + bootstrapping + branch support mapping):

raxml-ng --all --msa prim.phy --model GTR+G --prefix A1

This is convenient, but not always possible/efficient for large datasets! We will explore an alternative approach in Lab #2.

2. Open A1.raxml.support in the tree viewer of your choice.

3.* Repeat step 1. using a fixed number of bootstrap replicates (100). Did this change the support values?

Exercise 3: Tree likelihood evaluation

The --evaluate command computes the likelihood of a fixed tree topology after re-optimizing all model parameters and branch lengths. We will use it to compare different evolutionary models:

raxml-ng --evaluate --msa prim.phy --tree S1.raxml.bestTree --model GTR+G --prefix E_GTRG

1. Repeat the above command for GTR+R4, GTR, JC and JC+G models. Don't forget to change the --prefix!

2. Check the log-likelihood scores:

grep "Final LogLikelihood:" E*.raxml.log

Which model yields the highest log-likelihood? Does it mean that this model should be preferred?

Parameter-rich models are more prone to overfitting the data. Thus, special criteria have been developed to compare models with different number of free parameters. In phylogenetics, AIC/AICc and BIC are most commonly used.

3. Check AIC/BIC scores for our runs (lower values are better):

grep "AIC score" E*.raxml.log

Which model should be preferred according to information theoretical criteria ?

Results:

grep "Final LogLikelihood:" E*.raxml.log

E_GTRG.raxml.log:Final LogLikelihood: -5715.693753
E_GTRR4.raxml.log:Final LogLikelihood: -5714.610468
E_GTR.raxml.log:Final LogLikelihood: -5934.558984
E_JCG.raxml.log:Final LogLikelihood: -6272.478819
E_JC.raxml.log:Final LogLikelihood: -6424.202453
grep "AIC score" E*.raxml.log

E_GTRG.raxml.log:AIC score: 11491.387506 / AICc score: 11493.532834 / BIC score: 11635.392608
E_GTRR4.raxml.log:AIC score: 11499.220936 / AICc score: 11502.144370 / BIC score: 11667.226889
E_GTR.raxml.log:AIC score: 11927.117968 / AICc score: 11929.122576 / BIC score: 12066.322900
E_JCG.raxml.log:AIC score: 12588.957638 / AICc score: 12590.114210 / BIC score: 12694.561380
E_JC.raxml.log:AIC score: 12890.404907 / AICc score: 12891.459701 / BIC score: 12991.208478

Exercise 4: Partitioned models

Partitioned model is defined in a file, e.g. --model prim2.part.

cat prim2.part 
GTR+G+FO, NADH4=1-504/3,2-504/3
JC+I, tRNA=505-656
GTR+R4+FC, NADH5=657-898
HKY, NADH4p3=3-504/3

1. Re-run tree inference for prim.phy using partitioned model in prim2.part.

2. Compare the results (log-likelihood and tree topology) to the Exercise 1.

raxml-ng --msa prim.phy --model prim2.part --prefix P1

Likelihoods:

grep "Final LogLikelihood:" {S,P}1.raxml.log
S1.raxml.log:Final LogLikelihood: -5708.926872
P1.raxml.log:Final LogLikelihood: -5673.806570

RF distances:

raxml-ng --rf S1.raxml.bestTree,P1.raxml.bestTree
Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

Exercise 5: Constrained tree search

1. Establish a baseline: run regular (unconstrained) tree search on prim.phy with 100 random + 100 parsimony starting trees:

raxml-ng --search --msa prim.phy --model GTR+G -tree rand{100},pars{100} --prefix C0 --seed 1

2. Re-run the same search with a topological constraint defined in prim1.cons:

raxml-ng --search --msa prim.phy --model GTR+G -tree rand{100},pars{100} --prefix C1 --seed 1 --tree-constraint prim1.cons

3. Re-run the same search again now with a diffent constraint, prim2.cons:

raxml-ng --search --msa prim.phy --model GTR+G -tree rand{100},pars{100} --prefix C2 --seed 1 --tree-constraint prim2.cons

4. Compare runtimes and likelihoods among all three runs. How can we interpret the results?

Results:

grep "Final LogLikelihood:" C*.raxml.log
C0.raxml.log:Final LogLikelihood: -5708.923405
C1.raxml.log:Final LogLikelihood: -5708.926239
C2.raxml.log:Final LogLikelihood: -5779.491583
grep "Elapsed time:" C*.raxml.log
C0.raxml.log:Elapsed time: 23.629 seconds
C1.raxml.log:Elapsed time: 17.553 seconds
C2.raxml.log:Elapsed time: 17.987 seconds

Exercise 6: Protein data and ModelTest-NG

1. Check online help

modeltest-ng-static --help

Important options are:

  • -i ALIGNMENT
  • -d DATATYPE, where DATATYPE is nt (default) or aa

2. Run model selection for prot21.fa (protein alignment):

modeltest-ng-static -i prot21.fa -d aa

3. Run tree inference with the best-scoring model determined by ModelTest-NG.

4.* Re-run ModelTest-NG including the LG4M and LG4X models which are not tested by default.

Results:

modeltest-ng -i prot21.fa -d aa

Partition 1/1:
                         Model         Score        Weight
----------------------------------------------------------
       BIC               LG+G4     6005.4633        0.5063
       AIC             LG+I+G4     5893.6905        0.7923
      AICc               LG+G4     5941.3678        0.5403
raxml-ng --msa prot21.fa --model LG+G4 --prefix S6
Final LogLikelihood: -2872.979205

Lab #2: Parallelization / Analyzing Large Datasets

Exercise 7: Alignment compression

Before analyzing a large dataset, it is highly recommended to convert alignment into RAxML binary format (RBA) using the --parse command. Doing this pre-processing step has two advantages:

  • binary alignments are faster to load than PHYLIP/FASTA
  • you will get estimated memory requirements and the recommended number of threads for this dataset

1. Compress alignment file fusob.phy

raxml-ng --parse --msa fusob.phy --model GTR+G --prefix fusob
Partition 0: noname
Model: GTR+FO+G4m
Alignment sites / patterns: 1602 / 635
Gaps: 10.13 %
Invariant sites: 67.79 %

NOTE: Binary MSA file created: fusob.raxml.rba

* Estimated memory requirements                : 6 MB

* Recommended number of threads / MPI processes: 3

2.* Explore how resource estimates change depending on the selected --model (e.g., GTR, GTR+R8)

Exercise 8: Automatic and manual parallelization

1. Run a tree search with automatic parallelization to establish a baseline (note -seed 1!):

raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -prefix TXWX
Analysis options:
  parallelization: coarse-grained (auto), PTHREADS (auto)

Parallelization scheme autoconfig: 4 worker(s) x 1 thread(s)

2. Re-run the same tree search with different number of --threads and --workers. Can you find a configuration which beats the default heuristic in terms of runtime?
HINT: don't forget to change the --prefix for every run!
HINT: you can use grep "Elapsed time:" T*.raxml.log to quickly compare runtimes.

raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -workers 1 -prefix TXW1
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -workers 2 -prefix TXW2
raxml-ng -search -msa fusob.raxml.rba -tree rand{10} -seed 1 -threads 2 -workers 2 -prefix T2W2

Results:

T2W2.raxml.log:Elapsed time: 11.995 seconds
TXW1.raxml.log:Elapsed time: 9.752 seconds
TXW2.raxml.log:Elapsed time: 8.551 seconds
TXWX.raxml.log:Elapsed time: 10.398 seconds

3.* Try to oversubscribe CPU cores by using 9 or more threads. What do you observe?

raxml-ng -search -msa fusob.raxml.rba -threads 10 -prefix T10
  parallelization: coarse-grained (auto), PTHREADS (10 threads), thread pinning: OFF

ERROR: CPU core oversubscription detected! RAxML-NG will terminate now to avoid wasting resources.
NOTE:  Details: https://github.com/amkozlov/raxml-ng/wiki/Parallelization#core-oversubscription
NOTE:  You can use '--force perf_threads' to disable this check, but ONLY if you are 200% sure this is a false alarm!

For more details on RAxML-NG parallelization, please read: https://github.com/amkozlov/raxml-ng/wiki/Parallelization

Exercise 9: Bootstrapping revisited

1. Infer 100 bootstrap replicate trees in two batches of 50:

raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix BS1 --seed $RANDOM --bs-trees 50
raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix BS2 --seed $RANDOM --bs-trees 50

2. Combine bootstrap trees froma all runs into a single file (HINT: use cat):

cat BS*.raxml.bootstraps > bstrees.txt

3. Compute and map bootstrap support values to the best ML tree (S1.raxml.bestTree):

raxml-ng --support --tree S1.raxml.bestTree --bs-trees bstrees.txt --prefix B2 

4. Open <PREFIX>.raxml.support in the tree viewer of your choice.

5.* Recompute support values using the Transfer Bootstrap Expectation (TBE) algorithm. Did this change the support values?
HINT: you do NOT need to re-infer the bootstrap trees!

raxml-ng --support --tree S1.raxml.bestTree --bs-trees bstrees.txt --bs-metric TBE --prefix B3

Exercise 10: Adaptive search

1. For this exercise, we will be using an experimental version of RAxML-NG. Please make sure that you specify the correct binary, raxml-ng-adaptive:

raxml-ng-adaptive
RAxML-NG v. 1.2.1-adaptive released on 12.01.2024 by The Exelixis Lab.

2. Run the adaptive tree search for the dataset from the Exercise 5 (prim.phy with partitioned model). Compare runtimes, likelihoods, and best-found ML tree topologies. Can you spot the differences in the log files?

raxml-ng-adaptive --adaptive --msa prim.phy --model prim2.part --prefix P1A

Adaptive:

Analysis options:
  run mode: Adaptive ML tree search

[00:00:00] Predicted difficulty: 0.00

Starting ML tree search with 2 distinct starting trees

Final LogLikelihood: -5673.844106

Elapsed time: 1.635 seconds

Standard:

Analysis options:
  run mode: ML tree search

Starting ML tree search with 20 distinct starting trees

Final LogLikelihood: -5673.760245

Elapsed time: 8.618 seconds

RF distance:

raxml-ng-adaptive --rf P1.raxml.bestTree,P1A.raxml.bestTree

Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

3.* Run the adaptive tree search for fusob.phy. Compare the results with Excercise 1.4.

raxml-ng-adaptive --adaptive --msa fusob.phy --model GTR+G --prefix S2A

Adaptive:

Analysis options:
  run mode: Adaptive ML tree search

[00:00:00] Predicted difficulty: 0.37

Starting ML tree search with 13 distinct starting trees

Final LogLikelihood: -9974.668668

Elapsed time: 7.681 seconds

Standard:

Analysis options:
  run mode: ML tree search

Starting ML tree search with 10 distinct starting trees

Final LogLikelihood: -9974.663320

Elapsed time: 7.531 seconds

RF distance:

raxml-ng-adaptive --rf S2.raxml.bestTree,S2A.raxml.bestTree

Average absolute RF distance in this tree set: 0.000000
Average relative RF distance in this tree set: 0.000000
Number of unique topologies in this tree set: 1

Exercise 11: ParGenes

0. Check ParGenes options

python ~/software/ParGenes/pargenes/pargenes.py  --help

1. Analyze all alignments in the ~/software/ParGenes/examples/data/small/fasta_files/ folder using default ParGenes settings

python ~/software/ParGenes/pargenes/pargenes.py  -a ~/software/ParGenes/examples/data/small/fasta_files/  -o parout -c 4 -m --scheduler fork

2.*. Run model testing and tree inference from 1 parsimony + 5 random starting trees for the prot21.fa alignment.

echo "prot21.fa" > msa_filter.txt

python ~/software/ParGenes/pargenes/pargenes.py  -a ~/workshop_materials/ng-tutorial/ --msa-filter msa_filter.txt -o parout2 -c 4 -m --scheduler fork -d aa -s 5 -p 1

3. Examine the results from both runs

Results 1:

########################
#    PARGENES v1.2.0   #
########################

2024-01-18 17:11:10.365173
ParGenes was called as follow:
/home/alex/hits/ParGenes/pargenes/pargenes.py -a /home/alex/hits/ParGenes/examples/data/small/fasta_files/ -o parout -c 4 -m --scheduler fork --scheduler fork

[0:00:00] end of MSAs initializations
Binaries directory: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout/parse_run/parse_command.txt parout/parse_run --threads-arg --threads
Logs will be redirected to parout/parse_run/logs.txt
  Number of families: 9
  Average number of taxa: 9
  Max number of taxa: 22
  Average number of sites: 1711
  Max number of sites: 6489
  Recommended MAXIMUM number of cores: 1
[Warning] [0:00:00] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/modeltest-ng parout/modeltest_run/modeltest_command.txt parout/modeltest_run --threads-arg -p
Logs will be redirected to parout/modeltest_run/logs.txt
[0:00:16] end of modeltest mpi-scheduler run
[0:00:16] end of parsing  modeltest results
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout/parse_run/parse_command.txt parout/parse_run --threads-arg --threads
Logs will be redirected to parout/parse_run/logs.txt
  Number of families: 9
  Average number of taxa: 9
  Max number of taxa: 22
  Average number of sites: 1711
  Max number of sites: 6489
  Recommended MAXIMUM number of cores: 1
[Warning] [0:00:16] Found 2 invalid MSAs (see parout/invalid_msas.txt)
[0:00:16] end of the second parsing step
TODO UPDATE CHECKPOINT
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout/mlsearch_run/mlsearch_command.txt parout/mlsearch_run --threads-arg --threads
Logs will be redirected to parout/mlsearch_run/logs.txt
[0:00:18] end of mlsearch mpi-scheduler run
[0:00:18] END OF THE RUN OF pargenescore.py

Results 2:

########################
#    PARGENES v1.2.0   #
########################

2024-01-18 17:18:33.982886
ParGenes was called as follow:
/home/alex/hits/ParGenes/pargenes/pargenes.py -a /home/alex/hits/ng-tutorial/ --msa-filter msa_filter.txt -o parout2 -c 4 -m --scheduler fork -d aa -s 5 -p 1 --scheduler fork

[0:00:00] end of MSAs initializations
Binaries directory: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run --threads-arg --threads
Logs will be redirected to parout2/parse_run/logs.txt
  Number of families: 1
  Average number of taxa: 21
  Max number of taxa: 21
  Average number of sites: 111
  Max number of sites: 111
  Recommended MAXIMUM number of cores: 1
[0:00:00] end of parsing mpi-scheduler run
[0:00:00] end of anlysing parsing results
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/modeltest-ng parout2/modeltest_run/modeltest_command.txt parout2/modeltest_run --threads-arg -p
Logs will be redirected to parout2/modeltest_run/logs.txt
[0:00:02] end of modeltest mpi-scheduler run
[0:00:02] end of parsing  modeltest results
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout2/parse_run/parse_command.txt parout2/parse_run --threads-arg --threads
Logs will be redirected to parout2/parse_run/logs.txt
  Number of families: 1
  Average number of taxa: 21
  Max number of taxa: 21
  Average number of sites: 111
  Max number of sites: 111
  Recommended MAXIMUM number of cores: 3
[0:00:02] end of the second parsing step
TODO UPDATE CHECKPOINT
Calling mpi-scheduler: /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/mpi-scheduler --fork-scheduler 4 /home/alex/hits/ParGenes/pargenes/pargenes_src/../pargenes_binaries/raxml-ng parout2/mlsearch_run/mlsearch_command.txt parout2/mlsearch_run --threads-arg --threads
Logs will be redirected to parout2/mlsearch_run/logs.txt
[0:00:07] end of mlsearch mpi-scheduler run
[0:00:07] end of selecting the best ML tree
[0:00:07] END OF THE RUN OF pargenescore.py

Further reading

RAxML-NG documentation: https://github.com/amkozlov/raxml-ng/wiki

Detailed RAxML-NG tutorial: https://github.com/amkozlov/raxml-ng/wiki/Tutorial