This is just a suggestion based on some observations of estimating large (>10K sequences) trees. I note that it takes raxml-ng a long time to estimate the starting parsimony tree (I let it run for an hour then killed it). Of course, I could use a random tree but that probably makes the later optimisation impractical.
So I thought I'd mention it here in case it is useful. Perhaps for datasets above a certain number of tips one could switch to rapidnj starting trees: either 10 bootstrap trees, or the rapidnj tree plus nine others that are each one SPR move away from it, to make up the 10 starting trees.
And thanks for the excellent software!
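To make the suggestion concrete, here is a minimal sketch of how a wrapper script might pick the starting tree by dataset size. The 10,000-tip threshold, the function names, and the file names are all assumptions for illustration; none of this is existing raxml-ng behaviour. It only builds the command line (using the `--search`, `--msa`, `--model` and `--tree` options) and does not run anything.

```python
def starting_tree_arg(n_tips, nj_tree_path, threshold=10_000, n_pars=10):
    """Return a value for raxml-ng's --tree option.

    Above `threshold` tips (an arbitrary, hypothetical cutoff) use a
    precomputed NJ tree file; otherwise ask for `n_pars` parsimony trees.
    """
    if n_tips > threshold:
        return nj_tree_path
    return f"pars{{{n_pars}}}"


def raxml_ng_command(msa, model, n_tips, nj_tree_path):
    """Build (but do not execute) a raxml-ng tree search command."""
    return [
        "raxml-ng", "--search",
        "--msa", msa,
        "--model", model,
        "--tree", starting_tree_arg(n_tips, nj_tree_path),
    ]


if __name__ == "__main__":
    # Hypothetical file names; the NJ tree would be produced beforehand.
    print(" ".join(raxml_ng_command("aln.fasta", "GTR+G", 12_000, "nj.nwk")))
```

The returned list could be passed to `subprocess.run` once the NJ tree file exists.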
Hi Rob, thanks for your suggestion. We will definitely consider adding NJ starting trees as an option! Parsimony is not parallelized in raxml-ng, which is one of the reasons it is quite slow on large datasets.
Just a side note: we are also working with this virus data (surprise :) ), so it was very interesting to read about your experiments. I'm wondering, however, why do you want to keep duplicate sequences in your analyses? After removing identical sequences (and doing some filtering), we end up with a dataset of <5K sequences, on which we can run full raxml-ng tree search in "just" a few hours.
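For readers who want to try the same thing, a minimal sketch of collapsing exactly identical sequences in a FASTA alignment might look like the following. This is an illustration only, not the filtering pipeline used above: the function names are made up here, and "identical" is assumed to mean a case-insensitive exact string match.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def collapse_identical(records):
    """Group records by sequence and keep the first name as representative.

    Returns a list of (representative_name, sequence, duplicate_names),
    in order of first appearance.
    """
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return [(names[0], seq, names[1:]) for seq, names in groups.items()]


if __name__ == "__main__":
    fasta = [">a", "ACGT", ">b", "acgt", ">c", "ACGA"]
    for rep, seq, dups in collapse_identical(read_fasta(fasta)):
        print(rep, seq, dups)
```

Keeping the list of duplicate names makes it possible to add the removed tips back next to their representative after the tree search.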
That's a good question. I only keep the identical sequences in IQ-TREE, because the code for removing them in IQ-TREE is not well optimised.
I don't keep them in raxml-ng though. Having said that I still end up with ~7K sequences (using the latest data from GISAID, and trimming as in my repo). What other filtering are you doing?
raxml-ng is running fine (using the fasttree tree, which I made into a bifurcating tree first, as the starting tree), but still is fairly slow (7 hours so far, and on iteration 5).
My CPUs aren't great though (2.4 GHz I think), so maybe that's part of the problem!!
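As an aside on turning a multifurcating tree into a bifurcating one: the idea can be sketched on a toy representation where a tree is a nested tuple of tip names. This is only an illustration of arbitrary polytomy resolution, not the procedure actually used for the fasttree tree. It ignores branch lengths (a real tool would insert zero-length branches), it treats the tree as rooted, and a program that expects an unrooted tree may want a trifurcation kept at the top, which is not handled here.

```python
def resolve_polytomies(node):
    """Return a copy of `node` in which no internal node has more than two children.

    A tip is a string; an internal node is a tuple of child nodes.
    Extra children are merged pairwise from the left, in arbitrary order.
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    while len(children) > 2:
        # Join the first two children under a new internal node.
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


def is_binary(node):
    """True if every internal node has exactly two children."""
    if isinstance(node, str):
        return True
    return len(node) == 2 and all(is_binary(child) for child in node)


if __name__ == "__main__":
    tree = (("A", "B", "C"), "D", "E")
    resolved = resolve_polytomies(tree)
    print(resolved, is_binary(resolved))
```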
On the same data, I was able to estimate a surprisingly good tree with rapidnj (https://github.com/johnlees/rapidnj) in ~3 minutes on one not-very-fast CPU. More details are here: https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md