enhancement suggestion: use 'rapidnj' for starting trees #89

roblanf · 2020-05-07T11:36:40Z

This is just a suggestion based on some observations from estimating large trees (>10K sequences). I note that it takes raxml-ng a long time to estimate the starting parsimony tree (I let it run for an hour, then killed it). Of course, I could use a random tree, but that would probably make the later optimisation impractical.

On the same data, I was able to estimate a surprisingly good tree with rapidnj (https://github.com/johnlees/rapidnj) in ~3 minutes on one not-very-fast CPU. More details are here: https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md

So I thought I'd mention it here in case it is useful. Perhaps for datasets above a certain number of tips one could switch to rapidnj starting trees, using either 10 bootstrap trees, or the rapidnj tree plus nine others that are each one SPR move away from it, as the 10 starting trees.

And thanks for the excellent software!

amkozlov · 2020-05-07T15:25:05Z

Hi Rob, thanks for your suggestion. We will definitely consider adding NJ starting trees as an option! Parsimony is not parallelized in raxml-ng, which is one of the reasons it is quite slow on large datasets.

Just a side note: we are also working with this virus data (surprise :) ), so it was very interesting to read about your experiments. I'm wondering, however, why do you want to keep duplicate sequences in your analyses? After removing identical sequences (and doing some filtering), we end up with a dataset of <5K sequences, on which we can run full raxml-ng tree search in "just" a few hours.
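For readers who want to try the duplicate-removal step mentioned above, here is a minimal sketch in plain Python. It is only an illustration, not the filtering pipeline used by either participant: the function names `parse_fasta` and `deduplicate` are hypothetical, and it assumes "identical" means an exact string match after upper-casing (no special handling of gaps or ambiguity codes).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def deduplicate(records):
    """Keep the first record seen for each distinct sequence.

    Assumption: sequences are compared as exact strings after upper-casing.
    Returns (unique, duplicates), where `unique` is a list of
    (name, upper-cased sequence) pairs and `duplicates` maps each kept
    name to the names of the identical sequences that were dropped.
    """
    groups = {}  # upper-cased sequence -> names, in input order
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    unique = [(names[0], seq) for seq, names in groups.items()]
    duplicates = {names[0]: names[1:] for names in groups.values() if len(names) > 1}
    return unique, duplicates
```

Keeping the `duplicates` mapping makes it possible to add the dropped tips back next to their representative after the tree search, if they are needed downstream.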

roblanf · 2020-05-08T05:29:50Z

That's a good question. I only keep the identical sequences in IQ-TREE, because the code for removing them in IQ-TREE is not well optimised.

I don't keep them in raxml-ng though. Having said that I still end up with ~7K sequences (using the latest data from GISAID, and trimming as in my repo). What other filtering are you doing?

raxml-ng is running fine (using the fasttree tree, which I made into a bifurcating tree first, as the starting tree), but it is still fairly slow (7 hours so far, and on iteration 5).
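As an illustration of what "making a tree bifurcating" involves, the sketch below resolves multifurcations in a toy tree represented as nested tuples with string leaves. This is an assumption-laden example, not the method used in this thread: the function name `resolve_polytomies` and the `root_degree` parameter are hypothetical, polytomies are resolved in input order rather than randomly, and branch lengths are ignored (in a real Newick tree the newly created internal branches would typically get zero or near-zero length).

```python
def resolve_polytomies(tree, root_degree=3):
    """Resolve multifurcations in a nested-tuple tree.

    Leaves are strings; internal nodes are tuples of children.
    Internal nodes end up with at most 2 children. The root may keep up
    to `root_degree` children (3 is the usual convention for an unrooted
    tree; pass 2 for a fully binary rooted tree).
    """
    def resolve(node, limit):
        if isinstance(node, str):
            return node
        # Resolve the subtrees first, then this node.
        children = [resolve(child, 2) for child in node]
        # Repeatedly join the first two children under a new internal
        # node until the node has no more than `limit` children.
        while len(children) > limit:
            children = [(children[0], children[1])] + children[2:]
        return tuple(children)

    return resolve(tree, root_degree)
```

For example, `resolve_polytomies(("A", "B", "C", "D"), root_degree=2)` groups the tips into a ladder-shaped binary tree.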

My CPUs aren't great though (2.4 GHz, I think), so maybe that's part of the problem!

amkozlov added the enhancement label Mar 2, 2023
