[data] add backpressure reason #48009

Jay-ju · 2024-10-14T07:00:38Z

Background

The concurrency of Ray Data operators is mostly set manually, because automatic tuning often fails to make full use of the available resources. However, manually set concurrency easily triggers backpressure (rate limiting), and it is currently difficult to find out why an operator is being throttled.

Backpressure trigger conditions

There are four trigger conditions in total:

Reserved resources: DataContext.get_current().op_resource_reservation_enabled = True (default). Each time a new task is submitted, Ray Data checks whether the resources required by the task can still be met. The overall resources are divided into a reserved part and a shared part.
Unreserved resources: DataContext.get_current().op_resource_reservation_enabled = False. Resources are managed as a single pool.
Backpressure policy: the open-source build only supports ConcurrencyCapBackpressurePolicy, which checks whether the number of running tasks exceeds the configured concurrency.
Whether the operator can accept more input: for an actor pool, whether any free slots remain, which mainly depends on the difference between the number of tasks running on the actors and the default limit.
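
The four conditions above can be sketched as a single check that returns the first blocking reason. This is a minimal illustration, not the code in this PR: `OpSnapshot`, its fields, the reason strings, and the exact comparisons are all assumptions made up for the example, and resources are reduced to CPUs only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpSnapshot:
    """Hypothetical per-operator view; every field here is illustrative."""
    reservation_enabled: bool        # mirrors op_resource_reservation_enabled
    cpus_needed: float               # CPUs required by the next task
    reserved_cpus_free: float        # free CPUs in this operator's reserved share
    shared_cpus_free: float          # free CPUs in the shared (or single) pool
    running_tasks: int
    concurrency_cap: Optional[int]   # None when no cap is configured
    actor_free_slots: Optional[int]  # None when the operator is not an actor pool


def backpressure_reason(op: OpSnapshot) -> Optional[str]:
    """Return the first reason that blocks task submission, or None."""
    if op.reservation_enabled:
        # Reserved mode: the operator may use its reserved share plus the shared part.
        if op.cpus_needed > op.reserved_cpus_free + op.shared_cpus_free:
            return "reserved resources exhausted"
    else:
        # Unreserved mode: all operators draw from one pool.
        if op.cpus_needed > op.shared_cpus_free:
            return "unreserved resources exhausted"
    # Concurrency cap: assumed to block once running tasks reach the cap.
    if op.concurrency_cap is not None and op.running_tasks >= op.concurrency_cap:
        return "concurrency cap reached"
    # Actor pool: blocked when no actor has a free slot.
    if op.actor_free_slots is not None and op.actor_free_slots <= 0:
        return "no free actor slots"
    return None
```

Returning a string instead of a bare boolean is the whole point of the change: the caller can surface the reason to the user rather than only knowing that submission was blocked.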

Backpressure enhancement

Backpressure observability enhancement

Before enhancement:
After enhancement
- Concurrent backpressure policy
- Backpressure in non-reserved mode
- Backpressure in reserved mode
- Insufficient free slots of actor

Why are these changes needed?

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

raulchen · 2024-10-15T18:30:09Z

Can you add a PR description? Ideally, also add some screenshots. Thanks.
My main concern is that this can add too much content to the progress bar, causing it to be truncated.
cc @scottjlee @omatthew98, who are actively improving the progress bars, for review.

scottjlee · 2024-10-21T18:37:37Z

The reason string content generally makes sense to me. My main concern is that the progress bar output will become very verbose and difficult to read. Also, the G and D notation, shown without explanation, seems like it would confuse users.

To address both of these points, I think we should make this an advanced feature that is disabled by default. The user should opt in to seeing the backpressure reason through DataContext, e.g. DataContext.get_current().show_backpressure_reason = True. We should also add docs that explain how to interpret the reason output -- you can add this to the Ray Data progress bar section.

(Ray Data doesn't explicitly truncate the stats outputs for each operator, only truncates the operator name if it is too long).
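
As a rough sketch of the opt-in behavior suggested above: the reason is appended to an operator's progress line only when the flag is set. `ProgressOptions` is a stand-in for the relevant DataContext setting, not the Ray class, and `format_op_line` and the bracketed output format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProgressOptions:
    """Stand-in for the DataContext setting; not the real Ray class."""
    show_backpressure_reason: bool = False  # disabled by default, as proposed


def format_op_line(name: str, stats: str, reason: str, opts: ProgressOptions) -> str:
    """Build one operator progress line; add the reason only when enabled."""
    line = f"{name}: {stats}"
    if opts.show_backpressure_reason and reason:
        line += f" [backpressured: {reason}]"
    return line


# Default: the progress line stays short.
print(format_op_line("Map", "3 tasks", "concurrency cap", ProgressOptions()))
# Opt-in: the reason is shown.
print(format_op_line("Map", "3 tasks", "concurrency cap", ProgressOptions(True)))
```

Keeping the default off leaves existing progress bar output unchanged for users who do not need the extra detail.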

Jay-ju · 2024-11-05T02:53:12Z

Thank you for your review. I have made the changes according to your suggestions.

alexeykudinkin · 2024-11-08T20:32:07Z

python/ray/data/_internal/execution/interfaces/physical_operator.py

        self._in_task_submission_backpressure = False
+        self._in_task_submission_backpressure_reason = ""
        self._in_task_output_backpressure = False
+        self._in_task_output_backpressure_reason = ""


Let's wrap this into new structure like:

    self._tasks_state = TaskState(...)

    class TaskState:
        submission_throttled: bool
        submission_throttled_reason: str = ...
        # ...
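
One possible shape for such a structure, as a sketch only. The `submission_throttled*` names come from the suggestion above; the `output_throttled*` fields and the two helper methods are assumptions added to mirror the `_in_task_output_backpressure` fields in the diff.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Groups backpressure flags with their reasons (illustrative sketch)."""
    submission_throttled: bool = False
    submission_throttled_reason: str = ""
    # Assumed counterparts for output backpressure; not named in the review.
    output_throttled: bool = False
    output_throttled_reason: str = ""

    def throttle_submission(self, reason: str) -> None:
        """Mark task submission as throttled and record why."""
        self.submission_throttled = True
        self.submission_throttled_reason = reason

    def clear_submission(self) -> None:
        """Reset the submission flag and its reason together."""
        self.submission_throttled = False
        self.submission_throttled_reason = ""


state = TaskState()
state.throttle_submission("concurrency cap reached")
```

Updating the flag and its reason through one method keeps the two fields from drifting apart, which is the main benefit over four loose attributes on the operator.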

Signed-off-by: jukejian <[email protected]>

Jay-ju requested review from scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners October 14, 2024 07:00

Jay-ju force-pushed the backpressure_reason branch 12 times, most recently from a0e9072 to 9f8f680 Compare October 15, 2024 13:27

Jay-ju force-pushed the backpressure_reason branch 2 times, most recently from d7fbffb to 557d43f Compare October 16, 2024 05:02

anyscalesam self-assigned this Oct 16, 2024

anyscalesam added the triage (Needs triage, e.g. priority, bug/not-bug, and owning component) and data (Ray Data-related issues) labels Oct 16, 2024

Jay-ju requested review from a team, edoakes, zcin, GeneDer and akshay-anyscale as code owners November 5, 2024 02:48

Jay-ju requested review from richardliaw, aslonnie, hongchaodeng and pcmoritz as code owners November 5, 2024 02:48

Jay-ju force-pushed the backpressure_reason branch from 37730b1 to 55e6abc Compare November 5, 2024 02:50

aslonnie removed request for a team, edoakes, zcin, hongpeng-guo, SongGuyang, woshiyyya, aslonnie and akshay-anyscale November 5, 2024 04:37

Jay-ju requested a review from srinathk10 as a code owner November 6, 2024 06:54

Jay-ju force-pushed the backpressure_reason branch 3 times, most recently from 7720ed9 to a198e13 Compare November 7, 2024 01:56

scottjlee assigned scottjlee and unassigned anyscalesam Nov 8, 2024

alexeykudinkin reviewed Nov 8, 2024

Jay-ju force-pushed the backpressure_reason branch from e281f8d to 5e17527 Compare November 9, 2024 08:57

Jay-ju force-pushed the backpressure_reason branch 4 times, most recently from 1827029 to 0191e5e Compare November 26, 2024 13:56

[data] add backpressure reason

1a86bbd

Signed-off-by: jukejian <[email protected]>

Jay-ju force-pushed the backpressure_reason branch from 9fd90d5 to 1a86bbd Compare November 28, 2024 01:20
