enhancement suggestion: use 'rapidnj' for starting trees #89

Open
roblanf opened this issue May 7, 2020 · 2 comments

roblanf commented May 7, 2020

This is just a suggestion based on some observations of estimating large (>10K sequences) trees. I note that it takes raxml-ng a long time to estimate the starting parsimony tree (I let it run for an hour then killed it). Of course, I could use a random tree but that probably makes the later optimisation impractical.

On the same data, I was able to estimate a surprisingly good tree with rapidnj (https://github.com/johnlees/rapidnj) in ~3 minutes on one not-very-fast CPU. More details are here: https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md

So I thought I'd mention it here in case it is useful. Perhaps for datasets above a certain number of tips one could switch to rapidnj starting trees. The 10 starting trees could then be either 10 bootstrap trees, or the rapidnj tree plus nine others that are each one SPR move away from it.

And thanks for the excellent software!

Owner

amkozlov commented May 7, 2020

Hi Rob, thanks for your suggestion. We will definitely consider adding NJ starting trees as an option! Parsimony is not parallelized in raxml-ng, which is one of the reasons it is quite slow on large datasets.

Just a side note: we are also working with this virus data (surprise :) ), so it was very interesting to read about your experiments. I'm wondering, however, why do you want to keep duplicate sequences in your analyses? After removing identical sequences (and doing some filtering), we end up with a dataset of <5K sequences, on which we can run full raxml-ng tree search in "just" a few hours.
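As an illustration of the deduplication step mentioned above, here is a minimal Python sketch that collapses identical sequences in a FASTA alignment. It is not how raxml-ng or IQ-TREE do it internally; the function names `parse_fasta` and `drop_identical` are hypothetical, and the sketch assumes that "identical" means an exact, case-insensitive string match (ambiguity codes and gaps are not treated specially).

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def drop_identical(records):
    """Keep the first record for each distinct sequence.

    Returns (unique, duplicates): `unique` is a list of (name, sequence)
    with sequences upper-cased, and `duplicates` maps each kept name to
    the names of the records that were collapsed into it.
    """
    names_by_seq = {}
    for name, seq in records:
        # dicts preserve insertion order (Python 3.7+), so the first
        # occurrence of each sequence determines the output order.
        names_by_seq.setdefault(seq.upper(), []).append(name)
    unique = [(names[0], seq) for seq, names in names_by_seq.items()]
    duplicates = {
        names[0]: names[1:] for names in names_by_seq.values() if len(names) > 1
    }
    return unique, duplicates
```

Keeping the `duplicates` mapping makes it possible to add the removed tips back to the final tree as zero-length siblings of their representative, if they are needed downstream.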

Author

roblanf commented May 8, 2020

That's a good question. I only keep the identical sequences in IQ-TREE, because the code for removing them in IQ-TREE is not well optimised.

I don't keep them in raxml-ng though. Having said that I still end up with ~7K sequences (using the latest data from GISAID, and trimming as in my repo). What other filtering are you doing?

raxml-ng is running fine (using the fasttree tree, which I made into a bifurcating tree first, as the starting tree), but it is still fairly slow (7 hours so far, and on iteration 5).

My CPUs aren't great though (2.4 GHz I think) so maybe that's part of the problem!!
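For readers wondering what "made into a bifurcating tree" involves: the rough idea is to break every node with more than two children into a chain of two-child nodes. The sketch below is only an illustration of that idea, not the tool or script used in this thread. It assumes a toy representation in which a tree is a nested tuple of tip names (no Newick parsing, no branch lengths), treats the tree as rooted, and resolves each polytomy in an arbitrary left-to-right order. The function name `resolve_polytomies` is hypothetical.

```python
def resolve_polytomies(node):
    """Return a strictly bifurcating copy of a nested-tuple tree.

    A tip is a string; an internal node is a tuple of child nodes.
    Nodes with more than two children are resolved arbitrarily by
    grouping children left to right into a ladder of binary nodes.
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    if len(children) == 1:
        # Collapse a redundant single-child node.
        return children[0]
    resolved = children[0]
    for child in children[1:]:
        # Each step joins the subtree built so far with the next child.
        resolved = (resolved, child)
    return resolved
```

In a real tree with branch lengths, the new internal branches created this way would normally be given zero (or near-zero) length, so the resolved tree has the same likelihood as the original before optimisation starts.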
