[WFRunner] Handle resource limits and CPU better #1532

benedikt-voelkel
Contributor

@benedikt-voelkel benedikt-voelkel commented Mar 14, 2024

  • Account for relative CPU factor in case of sampling
    Studies have shown that being able to backfill tasks can make a
    difference for CPU efficiency, especially for transport.
    This is particularly important for high-efficiency jobs such as
    high-interaction-rate pp simulations.

  • Abort by default if estimated resources exceed limits

  • Run anyway, if --optimistic-resources is passed

    • Fix: Actually reset the overestimated resources to the limits as
      otherwise the runner would silently quit when nothing else can be
      done.
  • In case of dynamically sampled resources and if a corresponding task
    has been run already:
    Reset the assigned resources to the limits if they exceed the
    boundaries.
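
The abort-by-default and `--optimistic-resources` behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `Resources`, `ResourceLimitExceeded` and `apply_limits` are hypothetical names, and the only behaviour taken from the description is "abort if the estimate exceeds the limits, unless optimistic, in which case reset the estimate to the limits".

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical container for estimated or available resources."""
    cpu: float     # number of CPU cores
    mem_mb: float  # memory in MB


class ResourceLimitExceeded(RuntimeError):
    """Hypothetical error raised when an estimate exceeds the limits."""


def apply_limits(estimate, limits, optimistic=False):
    """Return resources that are guaranteed to fit within the limits.

    Assumed behaviour (sketch only):
    - estimate within limits: returned unchanged
    - estimate too large, not optimistic: abort with an exception
    - estimate too large, optimistic: reset the estimate to the limits
    """
    exceeds = estimate.cpu > limits.cpu or estimate.mem_mb > limits.mem_mb
    if not exceeds:
        return estimate
    if not optimistic:
        raise ResourceLimitExceeded(
            f"estimate {estimate} exceeds limits {limits}"
        )
    # Clamp each component so that the task can still be scheduled.
    return Resources(
        cpu=min(estimate.cpu, limits.cpu),
        mem_mb=min(estimate.mem_mb, limits.mem_mb),
    )
```

With an assumed local limit of 8000 MB and a task estimated at 16000 MB, `apply_limits(estimate, limits)` raises, while `apply_limits(estimate, limits, optimistic=True)` returns an estimate capped at 8000 MB, so the task remains schedulable instead of leaving the runner with nothing it can start.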


REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2022-pp-apass4
async-2023-pbpb-apass
async-2023-pp-apass1
async-data
async-mc
async-2022-pp-apass6

@benedikt-voelkel
Contributor Author

@sawenzel do you have any comments here? If not, I would go ahead and merge.

@sawenzel
Contributor

The commit message sounds reasonable. But is this necessary to test the dynamic system (just wondering how time-critical it is)?

@benedikt-voelkel
Contributor Author

It's not necessary for testing right now.
It is mostly to get that fix in. Before, when one passed --optimistic-resources (e.g. to run locally anyway, although TPCDigi might have an initial estimate of 16 GB while the local RAM is lower), the overestimated resources were not actually reset to the limits, so the runner would silently quit when nothing else could be done.
To consolidate that further, I also encapsulated the checking and limiting of resource boundaries.

Tested and works just fine.
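
For illustration, encapsulating the boundary checks and applying them to dynamically sampled resources might look like the sketch below. `ResourceBoundaries` and `assign_from_samples` are hypothetical names, not the runner's API, and taking the maximum over earlier samples is an assumption made only for this example. How the relative CPU factor enters the sampling is not described in this thread, so it is left out.

```python
class ResourceBoundaries:
    """Hypothetical helper bundling limit checks and clamping in one place."""

    def __init__(self, cpu_limit, mem_limit):
        self.cpu_limit = cpu_limit
        self.mem_limit = mem_limit

    def is_within(self, cpu, mem):
        """True if both values respect the configured limits."""
        return cpu <= self.cpu_limit and mem <= self.mem_limit

    def clamp(self, cpu, mem):
        """Reset values that exceed the boundaries to the limits."""
        return min(cpu, self.cpu_limit), min(mem, self.mem_limit)


def assign_from_samples(samples, boundaries):
    """Derive resources for a task from earlier runs of a corresponding task.

    `samples` is a list of (cpu, mem) tuples measured in earlier runs.
    Returns None if no corresponding task has been run yet.
    Assumption: the largest observed values are used as the new estimate.
    """
    if not samples:
        return None
    cpu = max(sample[0] for sample in samples)
    mem = max(sample[1] for sample in samples)
    # Sampled values are never allowed to exceed the boundaries.
    return boundaries.clamp(cpu, mem)
```

Keeping the comparison and the clamping in one object means the initial estimates and the sampled values go through the same code path, which is the kind of consolidation mentioned above.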

@benedikt-voelkel benedikt-voelkel force-pushed the dynamic-resources-limits branch from c4787ea to c01d0ad Compare March 17, 2024 10:50
@benedikt-voelkel benedikt-voelkel changed the title [WFRunner] Better treatment of resource limits [WFRunner] Handle resource limits and CPU better Mar 17, 2024
@benedikt-voelkel benedikt-voelkel added async-2022-pp-apass4 async-2023-pbpb-apass4 Request porting to async-2023-pbpb-apass4 mc labels Mar 17, 2024
@benedikt-voelkel
Contributor Author

@chiarazampolli
When ported, first, we also need to take

Anyway, first of all to be discussed with @sawenzel

This is important to keep the CPU efficiency where it is in case it is already high by default. I will explain in an email.

@benedikt-voelkel benedikt-voelkel merged commit ce0ec6e into AliceO2Group:master Mar 18, 2024
6 checks passed
@benedikt-voelkel benedikt-voelkel deleted the dynamic-resources-limits branch March 20, 2024 08:36
fcatalan92 pushed a commit to fcatalan92/O2DPG that referenced this pull request Apr 9, 2024

Co-authored-by: Benedikt Volkel <[email protected]>
noferini pushed a commit that referenced this pull request Apr 12, 2024
@benedikt-voelkel benedikt-voelkel removed the async-2023-pbpb-apass4 Request porting to async-2023-pbpb-apass4 label Apr 24, 2024
benedikt-voelkel added a commit that referenced this pull request Apr 26, 2024
benedikt-voelkel added a commit that referenced this pull request Apr 26, 2024