As indicated in the DAG, numerous steps in the pipeline can be trivially parallelized, allowing for horizontal scaling. However, the current implementation is based on the `scattergather` compute model, which needs to be told ahead of time how many parallel processes are going to be involved. The number of processes is specified in the config file on the basis of the number of distinct taxonomic families in the input set (e.g. the order Primates has 17 families, which is entered in the config file under `nfamilies` and from there ends up in the Snakefile). This is an awkward setup that users tend to get wrong, so a dynamic solution in which the pipeline learns the parallelization strategy from the input data would be better. However, there are some complications:
- using the `dynamic` construct in recent versions of Snakemake appears to interfere with the ability to generate a DAG, which is one of the requirements for submission to WorkflowHub
- the number of families must be learned from the input data in combination with the applicable marker gene, i.e. a simple `cut | sort | uniq | wc -l` (or similar) approach will be error-prone
- within the full data set, there is a small set of families (<10) whose size may exceed the capacity of the subtree inference step, meaning that those families may have to be partitioned at subfamily or genus level, increasing the number of parallel processes
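The second and third complications together suggest what a dynamic replacement would have to do: filter the input by marker gene, count sequences per family, and split oversized families down to genus level. A minimal sketch of that counting logic (record fields, the `scatter_units` helper, and the capacity limit are all hypothetical, chosen here only for illustration):

```python
from collections import defaultdict

# Stand-in for the capacity limit of the subtree inference step;
# the real threshold would come from the pipeline's configuration.
MAX_FAMILY_SIZE = 3

def scatter_units(records, marker, max_size=MAX_FAMILY_SIZE):
    """Derive the scatter units for one marker gene: one unit per
    family, except that families exceeding the capacity limit are
    partitioned into one unit per genus."""
    by_family = defaultdict(list)
    for rec in records:
        if rec["marker"] == marker:  # count only records for this marker
            by_family[rec["family"]].append(rec)
    units = []
    for family, recs in sorted(by_family.items()):
        if len(recs) <= max_size:
            units.append((family,))
        else:
            # oversized family: partition at genus level
            genera = sorted({r["genus"] for r in recs})
            units.extend((family, genus) for genus in genera)
    return units

# Toy input: Hominidae exceeds the limit and is split into 4 genera;
# the matK record is ignored because it belongs to another marker.
records = [
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Homo"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Pan"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Gorilla"},
    {"marker": "COI-5P", "family": "Hominidae", "genus": "Pongo"},
    {"marker": "COI-5P", "family": "Cebidae", "genus": "Cebus"},
    {"marker": "matK", "family": "Cebidae", "genus": "Cebus"},
]

print(len(scatter_units(records, "COI-5P")))  # → 5 (4 Hominidae genera + Cebidae)
```

The number returned by such a helper is what would replace the hand-entered `nfamilies` value.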
This is considered 'done' when users no longer have to define scatter/gather parameters.
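For reference, the static setup described above amounts to something like the following Snakefile fragment (a minimal sketch; the rule and file names are illustrative, only the `scattergather` directive and the `nfamilies` config key come from this issue):

```python
# config.yaml (hypothetical):
#   nfamilies: 17

# Snakefile: the scatter width is fixed up front from the config
scattergather:
    split=config["nfamilies"]

rule split_families:
    input:
        "results/sequences.tsv"
    output:
        # one output per scatter item, i.e. per family
        scatter.split("results/families/family_{scatteritem}.tsv")

rule graft_subtrees:
    input:
        # gather the per-family subtrees back together
        gather.split("results/subtrees/subtree_{scatteritem}.nwk")
    output:
        "results/grafted_tree.nwk"
```

Because `split=config["nfamilies"]` is evaluated before any input data is read, a mismatch between the configured value and the actual number of families in the input silently breaks the scatter, which is the failure mode this issue aims to eliminate.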