As indicated in the DAG, numerous steps in the pipeline can be trivially parallelized, allowing for horizontal scaling. However, the current implementation is based on the `scattergather` compute model, which needs to be told ahead of time how many parallel processes are going to be involved. The number of processes is specified in the config file on the basis of the number of distinct taxonomic families in the input set (e.g. the order Primates has 17 families, which is entered in the config file under `nfamilies` and from there ends up in the Snakefile). This is an awkward setup that users tend to get wrong, so a dynamic solution in which the pipeline learns the parallelization strategy from the input data would be better. However, there are some complications:
- using the `dynamic` construct in recent versions of Snakemake appears to interfere with the ability to generate a DAG, which is one of the requirements for submission to WorkflowHub
- the number of families must be learned from the input data in combination with the applicable marker gene, i.e. a simple `cut | sort | uniq | wc -l` (or similar) approach will be error-prone
- within the full data set, there is a small set of families (<10) whose size may exceed the capacity of the subtree inference step, meaning that those families may have to be partitioned at subfamily or genus level, increasing the number of parallel processes
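The second and third complications together suggest what a dynamic replacement would have to do: filter the input by marker gene, count sequences per family, and split oversized families down to genus level. A minimal sketch of that counting logic (record fields, the `scatter_units` helper, and the capacity limit are all hypothetical, chosen here only for illustration):

```python
from collections import defaultdict

# Stand-in for the capacity limit of the subtree inference step;
# the real threshold would come from the pipeline's configuration.
MAX_FAMILY_SIZE = 3

def scatter_units(records, marker, max_size=MAX_FAMILY_SIZE):
    """Derive the scatter units for one marker gene: one unit per
    family, except that families exceeding the capacity limit are
    partitioned into one unit per genus."""
    by_family = defaultdict(list)
    for rec in records:
        if rec["marker"] == marker:  # count only records for this marker
            by_family[rec["family"]].append(rec)
    units = []
    for family, recs in sorted(by_family.items()):
        if len(recs) <= max_size:
            units.append((family,))
        else:
            # oversized family: partition at genus level
            genera = sorted({r["genus"] for r in recs})
            units.extend((family, genus) for genus in genera)
    return units

# Toy input: Hominidae exceeds the limit and is split into 4 genera;
# the matK record is ignored because it belongs to another marker.
records = [
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Homo"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Pan"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Gorilla"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Pongo"},
    {"marker": "COI-5P", "family": "Cebidae", "genus": "Cebus"},
    {"marker": "matK", "family": "Cebidae", "genus": "Cebus"},
]

print(len(scatter_units(records, "COI-5P")))  # → 5 (4 Hominidae genera + Cebidae)
```

The number returned by such a helper is what would replace the hand-entered `nfamilies` value.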
This is considered 'done' when users no longer have to define scatter/gather parameters.
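For reference, the static setup described above amounts to something like the following Snakefile fragment (a minimal sketch; the rule and file names are illustrative, only the `scattergather` directive and the `nfamilies` config key come from this issue):

```python
# config.yaml (hypothetical):
#   nfamilies: 17

# Snakefile: the scatter width is fixed up front from the config
scattergather:
    split=config["nfamilies"]

rule split_families:
    input:
        "results/sequences.tsv"
    output:
        # one output per scatter item, i.e. per family
        scatter.split("results/families/family_{scatteritem}.tsv")

rule graft_subtrees:
    input:
        # gather the per-family subtrees back together
        gather.split("results/subtrees/subtree_{scatteritem}.nwk")
    output:
        "results/grafted_tree.nwk"
```

Because `split=config["nfamilies"]` is evaluated before any input data is read, a mismatch between the configured value and the actual number of families in the input silently breaks the scatter, which is the failure mode this issue aims to eliminate.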