
Campaign Requirements



  1. 24 users want to run 100k uncorrelated tasks, each requiring between half a core-hour and 100 core-hours, on different types of clusters. Different queues provide different priorities between tasks. Up to a million tasks per hour.
  2. Up to 1k workflows at any given point in time.
  3. Add new resources.

They have priority queues; tasks are labeled as high-, medium-, or low-priority. For every workflow that needs to execute fast enough, a new queue is created. There is enormous value in running at scale without investing human time: users push the workflow and do not care about it for some time (a task requires 12k core-hours and cannot scale above 300 cores). Force-field fitting burns a million core-hours to fix the parameters. Most of the data is used for ML targets (canonical ML).
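
A minimal sketch of how the labeling could work, assuming a heap-based priority queue per workflow; the `LabeledQueue` class, the numeric ranks, and the task names are illustrative, not the actual implementation:

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Assumed numeric encoding of the three labels mentioned in the notes.
PRIORITY = {"high": 0, "medium": 1, "low": 2}

@dataclass(order=True)
class QueuedTask:
    rank: int                        # lower rank = higher priority
    seq: int                         # tie-breaker: FIFO within a label
    name: str = field(compare=False)

class LabeledQueue:
    """One priority queue per workflow; tasks are labeled high/medium/low."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def push(self, name, label):
        heapq.heappush(self._heap,
                       QueuedTask(PRIORITY[label], next(self._seq), name))

    def pop(self):
        return heapq.heappop(self._heap).name

# "For every workflow that needs to execute fast enough, a new queue is created."
queues = {"workflow-42": LabeledQueue()}            # hypothetical workflow id
queues["workflow-42"].push("md_step_001", "high")   # hypothetical task names
queues["workflow-42"].push("analysis_001", "low")
assert queues["workflow-42"].pop() == "md_step_001"
```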

Small tasks in huge numbers: 22 million independent tasks of ~10 minutes each, run across 7 different resources. Collections index the task outputs.
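
A sketch of that pattern under stated assumptions: independent tasks sharded round-robin across seven resources, with a collection indexing each task's output location. The resource names, the `shard` helper, and the output paths are all hypothetical:

```python
from collections import defaultdict

# Hypothetical resource pool; the notes do not name the 7 resources.
RESOURCES = [f"cluster-{i}" for i in range(7)]

def shard(task_ids, resources):
    """Round-robin a bag of independent tasks across the resources."""
    shards = defaultdict(list)
    for i, tid in enumerate(task_ids):
        shards[resources[i % len(resources)]].append(tid)
    return shards

# 22 million ~10-minute tasks in the real case; a tiny slice shown here.
shards = shard(range(100), RESOURCES)

# "Collections index task outputs": map task id -> output location.
collections = defaultdict(dict)
for resource, tids in shards.items():
    for tid in tids:
        collections["campaign-A"][tid] = f"{resource}:/outputs/{tid}.json"
```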

The main part of the tool is structured data, and the parallelization is derived from the dataset.

Quantum chemistry is usually a workload, or a pipeline as a workflow. Force-field fitting executes workflows and can adaptively change them by ±30%.

Each runtime system takes tasks equal to its capacity plus 20%. High-priority tasks may get stuck behind low-priority tasks. What can you do when a task takes 90% of a pilot's walltime? Priority is defined by the task generator based on the user or the workflow, i.e., user-based priorities.
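
A sketch of the capacity-plus-20% pull, assuming a simple list kept sorted by priority; `pull_tasks`, the task dictionaries, and the core counts are illustrative:

```python
OVERSUBSCRIBE = 1.2  # "takes tasks equal to the capacity plus 20%"

def pull_tasks(queue, free_cores):
    """Pull tasks, highest priority first, up to 120% of current capacity.

    Note the head-of-line blocking mentioned above: a large task at the
    front of the queue stops smaller ones behind it from being pulled.
    """
    budget = int(free_cores * OVERSUBSCRIBE)
    pulled = []
    while queue and budget > 0:
        if queue[0]["cores"] > budget:
            break                    # would exceed the 120% budget
        task = queue.pop(0)
        budget -= task["cores"]
        pulled.append(task)
    return pulled

# Hypothetical tasks, already sorted by the generator-assigned priority.
q = [{"name": "t1", "cores": 32}, {"name": "t2", "cores": 64}]
print(pull_tasks(q, 64))  # int(64 * 1.2) = 76 cores of budget -> pulls t1 only
```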

Workflow priority may propagate to the tasks. This is a basic scheduling problem. The deadline of the campaign may change.

Users may cancel workflows or tasks.

It would be good for high-priority workflows to go faster.

Check how to introduce runtime changes from the user. The executables can utilize many different types of resources.

Property Estimator (Chodera): CHECK

If the runtime system is not stateful, it is not usable.

The biggest challenge is campus clusters, due to their heterogeneity and the difficulty of setting up runtime systems on them.

How do we classify a bad resource and shut it down? How do we classify errors so we know how to treat a resource? Use information about the types of errors to identify when errors are caused by the resource, and exclude it.
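
One way the classification could look; the error taxonomy, the threshold, and the `ResourceHealth` class are assumptions layered on the questions above:

```python
from collections import Counter

# Assumed taxonomy: only some error types implicate the resource itself;
# application errors (segfaults, bad inputs) do not count against it.
RESOURCE_ERRORS = {"node_failure", "filesystem_error", "launch_timeout"}

class ResourceHealth:
    """Count resource-caused errors and exclude a resource past a threshold."""

    def __init__(self, threshold=5):      # threshold is an assumption
        self.threshold = threshold
        self.errors = Counter()
        self.excluded = set()

    def report(self, resource, error_type):
        if error_type not in RESOURCE_ERRORS:
            return                        # not the resource's fault
        self.errors[resource] += 1
        if self.errors[resource] >= self.threshold:
            self.excluded.add(resource)   # classify as bad, shut it down

    def usable(self, resource):
        return resource not in self.excluded

# Hypothetical usage.
health = ResourceHealth(threshold=2)
health.report("campus-gpu-03", "node_failure")
health.report("campus-gpu-03", "filesystem_error")
assert not health.usable("campus-gpu-03")
```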


  1. Horizontal task grouping or queueing from workflows.

  2. Tagging workflows to resources.

  3. Priority, tag, and then data locality (see the sketch after this list).

  4. Non-optimal on TTS (time-to-solution).

  5. All tasks are single node.

  6. Dynamic resource acquisition.

  7. Map tasks to resources based on the software stack available on the resources.

  8. When users bring their own program, they have to register it.

    • Build a component ecosystem. Come up with a schema that represents the high-level scientific problem.
    • Spin up different programs based on the price-performance ratio.
    • Compare outputs from different task executors and use the one that fits best, based on what the user can see.
  9. 10k workflows, 20 million tasks, a couple of million core-hours.

  10. Extremely heterogeneous resources: campus clusters, Kubernetes; XSEDE is used in the single-digit percentages.

  11. People run on XSEDE.

  12. People stay with their tool unless they want performance. Dask composability is great, but performance is a different matter.

  13. Dask for composing workflows, Parsl is nice for single-node execution, RP for extreme performance.

  14. KNIME and Galaxy in Europe.

  15. Galaxy composes workflows for researchers who do not know the MD domain.

  16. QCEngine runs functions, libraries, and executables.

  17. People with large allocations want to drain them.

  18. Run a thousand workflows concurrently.

  19. Workflow structures:

    • Fan-out workflows that can be reduced to PST
    • Workflows with loose decision trees
    • Convergence-criteria loops with complex workflows in every loop
    • Tasks have small I/O

  20. People want to be able to include workflows and bag of tasks in their campaigns.

  21. Tasks are labeled.

  22. The service always has a workload to run.

  23. A campaign manager that can handle tasks with heterogeneous requirements.

  24. A campaign manager that can decide whether a task of a workflow can still run to the end within the pilot.

  25. Two categories:

    • Campaign managers that take workflows and emit workloads
    • Campaign managers that work at the workflow level
  26. There are MD cases that produce 10 GB to 1 TB of data, and the generated data needs to be analyzed.

  27. Be able to delete intermediate data from the LFS at the level of the pipeline.
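
As flagged in item 3, here is a sketch of that ordering, combined with item 5's single-node constraint; the field names (`priority`, `tag`, `data_at`) and the greedy placement are assumptions, not an actual scheduler:

```python
def placement_key(task, resource):
    """Order candidates by priority, then tag match, then data locality."""
    tag_match = 0 if task.get("tag") == resource["name"] else 1
    local = 0 if task.get("data_at") == resource["name"] else 1
    return (task["priority"], tag_match, local)

def schedule(tasks, resources):
    """Greedy placement; every task fits on one node (item 5)."""
    plan = []
    for task in sorted(tasks, key=lambda t: t["priority"]):
        best = min(resources, key=lambda r: placement_key(task, r))
        plan.append((task["name"], best["name"]))
    return plan

# Hypothetical tasks and resources.
tasks = [
    {"name": "t1", "priority": 0, "tag": "bridges", "data_at": "bridges"},
    {"name": "t2", "priority": 1, "tag": None, "data_at": "comet"},
]
resources = [{"name": "bridges"}, {"name": "comet"}]
print(schedule(tasks, resources))  # [('t1', 'bridges'), ('t2', 'comet')]
```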
