Runtime? #58

ilante · 2021-08-24T10:37:09Z

ilante
Aug 24, 2021

A few practical questions:

What is the runtime of vgp-assembly/pipeline/? I was unable to find it in the documentation.
Approximately how long would it take to run the entire vgp-assembly workflow for a species with a ~300 Mb genome?
Roughly what would the computing cost be?

Any rough estimates or pointers to the documentation would be highly appreciated!


Arkarachai · 2021-08-25T17:07:43Z

Arkarachai
Aug 25, 2021
Collaborator

It varies a lot and depends on where you run it (machine specs, cloud vs. HPC), the species of interest (repeat content, genome size), and the data (coverage, read length).
Cost and run time are highly correlated. Roughly in this order: Falcon + Falcon-Unzip > Arrow polishing > short-read polishing > the rest. purge_dups, 10x scaffolding, Bionano scaffolding, and Hi-C scaffolding are all pretty cheap. A 300 Mb genome is pretty small, so a budget of about $5k should cover all the compute cost. It will depend on where you run this and how efficient your computing setup is, though.
Hope this helps a bit.

1 reply

ilante Aug 25, 2021
Author

Dear Arkarachai,

Thanks a lot, that is already a good summary and gives me a rough idea!

I was wondering if you had the time complexity for the different components of the pipeline as well?

Thanks in advance!

Arkarachai · 2021-08-25T17:40:44Z

Arkarachai
Aug 25, 2021
Collaborator

Hi Ilante,
Only roughly. These are based on the amount of data (n).
Falcon (and most OLC-based assemblers): O(n^2)
Arrow polishing, Freebayes polishing, and most mappers: O(n). Note that the main reason these two polishing steps are expensive is their memory requirement.
The rest are not expensive, so I have never looked at them in detail. I would guess they are O(n) too.

The actual computing cost is very hard to predict for genome assembly. Repeat content matters the most. For example, maize (2G) is actually more expensive than a mammalian genome (3G). All the genomes I have ever worked with that are bigger than 6G have been repetitive. However, some insect or algal genomes can be repetitive as well, even at just 1G.
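
To make the scaling above concrete, here is a minimal Python sketch of how relative cost would grow with the amount of input data if each step followed the rough complexity quoted in this thread. The step names and the `STEP_EXPONENT` table are hypothetical labels made up for illustration; they are not part of the vgp-assembly pipeline. The sketch deliberately ignores repeat content, memory limits, and hardware, which the thread says dominate real cost, so treat the numbers as growth factors only, not as runtime or price estimates.

```python
# Rough growth-factor sketch. Exponents come from the thread's estimates:
# O(n^2) for Falcon / OLC assembly, O(n) for polishing and mapping.
# Step names are hypothetical labels, not pipeline identifiers.
STEP_EXPONENT = {
    "falcon_assembly": 2,
    "arrow_polishing": 1,
    "freebayes_polishing": 1,
    "read_mapping": 1,
}


def relative_cost(data_ratio: float, exponent: int) -> float:
    """Return how many times more work an O(n^exponent) step needs
    when the amount of input data grows by a factor of data_ratio."""
    if data_ratio <= 0:
        raise ValueError("data_ratio must be positive")
    return data_ratio ** exponent


def scaling_table(data_ratio: float) -> dict:
    """Growth factor per step for a given increase in input data."""
    return {
        step: relative_cost(data_ratio, exponent)
        for step, exponent in STEP_EXPONENT.items()
    }


if __name__ == "__main__":
    # Example: going from a 300 Mb genome to a 3 Gb genome at the same
    # coverage means roughly 10x more data.
    for step, factor in scaling_table(10.0).items():
        print(f"{step}: about {factor:g}x the work")
```

Under these assumptions, ten times more data means about ten times the work for the linear steps but about a hundred times for an O(n^2) assembler, which fits the ordering given earlier in the thread, where Falcon + Falcon-Unzip is the most expensive part.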

2 replies

ilante Aug 25, 2021
Author

Thanks a lot for these insights!

Just to be sure I got it right: when you write G, you mean gigabases (Gb)?

Arkarachai Aug 25, 2021
Collaborator

Yes, that is correct.
