[data] add backpressure reason #48009

Open
wants to merge 1 commit into
base: master

@Jay-ju Jay-ju commented Oct 14, 2024

Background

The concurrency of Ray Data operators is mostly set manually, because automatic tuning often fails to achieve good resource utilization. However, manually set operator concurrency easily triggers backpressure (rate limiting), and it is currently difficult to find out why the backpressure was triggered.

Backpressure trigger conditions

There are four trigger conditions:

  • Reserved resources: DataContext.get_current().op_resource_reservation_enabled = True (default). Each time a new task is submitted, Ray Data checks whether the resources the task requires are still available. The total resources are split into a reserved portion and a shared portion.
  • Unreserved resources: DataContext.get_current().op_resource_reservation_enabled = False. Resources are managed as a single pool.
  • Backpressure policy: In open-source Ray, only ConcurrencyCapBackpressurePolicy is supported. It checks whether the number of running tasks exceeds the configured concurrency cap.
  • Whether the operator can accept more input: For an actor pool, whether any free slots remain, which mainly depends on the difference between the number of tasks running on each actor and the default limit.
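
The four conditions above can be sketched as one function that returns the first reason a task submission is held back. This is a minimal, self-contained illustration and not the code in this PR: `OpSnapshot`, its fields, the reason strings, and `backpressure_reason` are all hypothetical names, and resources are reduced to CPUs for brevity. The real checks are spread across Ray Data's resource manager, backpressure policies, and actor pool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpSnapshot:
    """Hypothetical, simplified view of one operator's scheduling state."""

    reservation_enabled: bool   # mirrors op_resource_reservation_enabled
    task_cpus: float            # CPUs one new task would need
    reserved_free_cpus: float   # free CPUs in this operator's reserved share
    shared_free_cpus: float     # free CPUs in the shared (or single) pool
    running_tasks: int
    concurrency_cap: int
    free_actor_slots: Optional[int] = None  # None if not an actor pool


def backpressure_reason(op: OpSnapshot) -> Optional[str]:
    """Return why a new task cannot be submitted, or None if it can."""
    if op.reservation_enabled:
        # Reserved mode: the task may draw on the reserved and shared parts.
        budget = op.reserved_free_cpus + op.shared_free_cpus
        if op.task_cpus > budget:
            return "reserved-mode resource budget exceeded"
    elif op.task_cpus > op.shared_free_cpus:
        # Unreserved mode: a single pool is checked.
        return "global resource limit exceeded"
    if op.running_tasks >= op.concurrency_cap:
        return "concurrency cap reached"
    if op.free_actor_slots is not None and op.free_actor_slots <= 0:
        return "no free actor slots"
    return None
```

Returning a string instead of a bare boolean is what makes the cause visible: the caller can still treat `None` as "no backpressure" while passing any other value on to the progress bar.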

Backpressure enhancement

Backpressure observability enhancement

  • Before enhancement:
    image

  • After enhancement

    • Concurrent backpressure policy
      image

    • Backpressure in non-reserved mode
      image

    • Backpressure in reserved mode
      image

    • Insufficient free slots of actor
      image

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@raulchen
Contributor

Can you add a PR description? Ideally also add some screenshots. Thanks.
My main concern is that this can add too much content to the progress bar, making it truncated.
cc @scottjlee @omatthew98 who are actively improving progress bars for review.

@Jay-ju Jay-ju force-pushed the backpressure_reason branch 2 times, most recently from d7fbffb to 557d43f Compare October 16, 2024 05:02
@anyscalesam anyscalesam self-assigned this Oct 16, 2024
@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) data Ray Data-related issues labels Oct 16, 2024
@scottjlee
Contributor

The reason string content generally makes sense to me. My main concern is that the progress bar output will become very verbose and difficult to read. Also, the G and D notation, without explanation, would likely confuse users.

To address both of these points, I think we should configure this as an advanced feature and disable it by default. The user should enable the backpressure reason through DataContext, e.g. DataContext.get_current().show_backpressure_reason = True. We should also add docs to explain how to interpret the reasoning output -- you can add this to the Ray Data progress bar section.

(Ray Data doesn't explicitly truncate the stats outputs for each operator, only truncates the operator name if it is too long).
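
As a rough sketch of the opt-in behavior suggested here, the reason could be appended to the progress-bar text only when the flag is set. Only the flag name `show_backpressure_reason` comes from this comment. `FakeDataContext`, `progress_suffix`, and the bracketed output format are hypothetical stand-ins so the example runs without Ray installed.

```python
from dataclasses import dataclass


@dataclass
class FakeDataContext:
    """Stand-in for Ray Data's DataContext; only the proposed flag is modeled."""

    show_backpressure_reason: bool = False  # proposed flag, off by default


def progress_suffix(ctx: FakeDataContext, in_backpressure: bool, reason: str) -> str:
    """Build the backpressure part of an operator's progress-bar text."""
    if not in_backpressure:
        return ""
    if ctx.show_backpressure_reason and reason:
        # Verbose form, shown only when the user opted in.
        return f" [backpressured: {reason}]"
    # Short default form keeps the progress bar compact.
    return " [backpressured]"
```

With the default context the output stays as short as before, so users who never set the flag see no change in progress-bar length.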

@Jay-ju
Author

Jay-ju commented Nov 5, 2024

The reason string content generally makes sense to me. My main concern is that the progress bar output will become very verbose and difficult to read. Also, the G and D notation, without explanation, would likely confuse users.

To address both of these points, I think we should configure this as an advanced feature and disable it by default. The user should enable the backpressure reason through DataContext, e.g. DataContext.get_current().show_backpressure_reason = True. We should also add docs to explain how to interpret the reasoning output -- you can add this to the Ray Data progress bar section.

(Ray Data doesn't explicitly truncate the stats outputs for each operator, only truncates the operator name if it is too long).

Thank you for your review. I have made the changes according to your suggestions.

Comment on lines 191 to 194
self._in_task_submission_backpressure = False
self._in_task_submission_backpressure_reason = ""
self._in_task_output_backpressure = False
self._in_task_output_backpressure_reason = ""
Contributor

Let's wrap this into a new structure like:

self._tasks_state = TaskState(...)

class TaskState:
  submission_throttled: bool
  submission_throttled_reason: str = ...

  # ...
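
One way the suggested structure might look as a runnable dataclass is sketched below. The two `submission_*` fields come from the suggestion above. The `output_*` fields and the helper methods are assumptions added for illustration, mirroring the `_in_task_output_backpressure` attributes in the commented lines.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Groups an operator's backpressure flags with their reasons."""

    submission_throttled: bool = False
    submission_throttled_reason: str = ""
    # Assumed counterpart of the _in_task_output_backpressure attributes.
    output_throttled: bool = False
    output_throttled_reason: str = ""

    def throttle_submission(self, reason: str) -> None:
        """Mark task submission as throttled and record why (hypothetical helper)."""
        self.submission_throttled = True
        self.submission_throttled_reason = reason

    def clear_submission(self) -> None:
        """Reset the flag and its reason together (hypothetical helper)."""
        self.submission_throttled = False
        self.submission_throttled_reason = ""
```

Updating the flag and its reason through one method keeps the two from drifting apart, which is harder to guarantee with four separate attributes.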

Author

done

@Jay-ju Jay-ju force-pushed the backpressure_reason branch 4 times, most recently from 1827029 to 0191e5e Compare November 26, 2024 13:56
Labels
data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)
5 participants