Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from ray-project:master #2625

Merged
merged 185 commits into from
Nov 19, 2024
Merged

Conversation

pull[bot]
Copy link

@pull pull bot commented Oct 29, 2024

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

zcin and others added 14 commits October 29, 2024 06:25
## Why are these changes needed?

Move `test_deployment_scheduler.py` from small to medium.


Signed-off-by: Cindy Zhang <[email protected]>
If args is not a list, an exception will be thrown in the above if statement.
## Why are these changes needed?

Implement full recursive cancellation at Serve layer.

Currently, at the time that a parent request is cancelled, if:
- The child request has not been assigned to a replica, _meaning_ the
replica scheduler in the router is still processing that request
- AND the parent request is not directly blocking and waiting on the
result of the child request

Then the child request will not be correctly cancelled. Note that if the
second bullet point were actually flipped, i.e. the parent request _was_
waiting on the result at the time of cancellation, then an
`asyncio.CancelledError` will be sent directly to the replica scheduler
and the child request will get successfully cancelled. However obviously
this cannot be guaranteed to be true.

This PR adds a cache for asyncio tasks that are launched in the router
to assign a request to a replica. When a parent request is cancelled, if
there are any "pending assignment requests" in the cache, they are
cancelled.



Signed-off-by: Cindy Zhang <[email protected]>
Stacked on: #48223

Reimplements the `.bind()` codepath to avoid using the Ray DAG codepath
which adds a lot of complexity for little benefit.

The DAG traversal code is much simpler, around 100 lines of
self-contained code in `build_app.py`. I've also added unit tests for
it.

There should be no behavior change here.

---------

Signed-off-by: Edward Oakes <[email protected]>
…joco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3)."" (#48308)
Previously the error message was injected directly as the HTTP status
message

Signed-off-by: Richo Healey <[email protected]>
## Why are these changes needed?

fix router.py

Signed-off-by: Cindy Zhang <[email protected]>
…g compilation (#47757)

Leaf nodes are nodes that are not output nodes and have no downstream nodes. If a leaf node raises an exception, it will not be propagated to the driver. Therefore, this PR raises an exception if a leaf node is found during compilation.

Another solution: implicitly add leaf node to MultiOutputNode
Currently, the function execute can return multiple CompiledDAGRefs. The UX we want to provide is to implicitly add leaf nodes to the MultiOutputNode but not return the references of the leaf nodes. For example, a MultiOutputNode is containing 3 DAG nodes (2 normal DAG nodes + 1 leaf node).

x, y = compiled_dag.execute(input_vals) # We don't return the ref for the leaf node.
However, the ref for leaf node will be GC(ed) in execute, and CompiledDAGRef’s del will call get if it was never called which makes execute to be a sync instead of an async operation which is not acceptable.
…#48358)

Some tests are failing when we duplicate the original release tests. Marking them as unstable to be non-blocking.
Will mark as stable once they are truly stable.

Signed-off-by: Rui Qiao <[email protected]>
GcsServer manages several IO Contexts and their threads, and hard codes
which class uses which IO Context. This is not flexible in that we can't
easily declare another IO Context and adapt some classes without a
pervasive code change.

Introduces an IoContextProvider that manages IO Contexts based on a
static, compile-time Policy, whose author can determine class-IO Context
mapping. This way we can change the mappings in a central place without
big chunk of changes everywhere.

Signed-off-by: Ruiyang Wang <[email protected]>
@pull pull bot added the ⤵️ pull label Oct 29, 2024
sven1977 and others added 15 commits October 29, 2024 22:40
…ls.cache decorators on methods can lead to memory leaks (#48186)

While running the pre-commit hook of flake8, the following error occurs
if Python version is 3.12. It's because the version of flake8 is too
old.


![image](https://github.com/user-attachments/assets/bf56370f-04d7-4b9a-aea9-e1c578547ed4)


version:
- python: 3.12.7
- flake8: 7.1.1 
- flake8-bugbear: 24.8.19


Signed-off-by: win5923 <[email protected]>
…3FileSystem` (#48216)

## Why are these changes needed?

When PyArrow does not have support for S3, accessing `pyarrow.fs.S3FileSystem` raises an exception.
Fix by checking if the `S3FileSystem` is importable before doing this typecheck.

---------

Signed-off-by: Charles Coulombe <[email protected]>
Signed-off-by: Charles Coulombe <[email protected]>
…oing_requests` (#47681) (#48274)

## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->
This PR modifies the actor_options used when deploying replicas.
Deployment will use the configured `max_ongoing_requests` attribute of
the deployment config as the replica's `max_concurrency` if
the concurrency is not explicitly set. This is to prevent replica's
`max_concurrency` from capping
`max_ongoing_requests`.

## Related issue number

<!-- For example: "Closes #1234" -->
Closes #47681



Signed-off-by: akyang-anyscale <[email protected]>
## Why are these changes needed?

<!-- Please give a short summary of the change and the problem this
solves. -->
This PR moves `ProxyStatus` out of the `_private` directory, allowing it
to be included in the API docs. This is the final attribute of
`ServeStatus` that needs to be included in the documentation.

## Related issue number

<!-- For example: "Closes #1234" -->
Closes #43394

---------

Signed-off-by: akyang-anyscale <[email protected]>
…10.1, mujoco 3.2.4, pettingzoo 1.24.3 supersuit 3.9.3).""" (#48435)

Reverts #48308
Pyarrow 18 causes errors on the MacOS build. Since Pyarrow 18 is not officially supported on Ray Data yet, we can just pin it for now.

Signed-off-by: Scott Lee <[email protected]>
#48415)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

I was initially confused that I couldn't join another paused task while
a debugger was in "continue" mode.

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Philipp Moritz <[email protected]>
Co-authored-by: bhuang <[email protected]>
Co-authored-by: angelinalg <[email protected]>
…nccl_dag (#48433)

Reduce the actors needed for test_mocked_nccl_dag test.

Closes #48288.

Signed-off-by: Stephanie Wang <[email protected]>
Adds a metric to monitor event loop lag for `instrumented_io_context`.
This is by default applied to IO Contexts in GCS only.

By default every 250ms we post an async task and use
std::chrono::steady_clock to record the lag. The metric is a GAUGE with
the thread name recorded.

I tested this on my laptop and it works - it's a ~0 on an idle cluster,
and 1000ms if I use
`RAY_testing_asio_delay_us=event_loop_lag_probe=1000000:1000000`.

Signed-off-by: Ruiyang Wang <[email protected]>
CoreWorkerMemoryStore has GetAsync method which calls callbacks when the
object is ready, immediately or on a Put() call. However despite the
name, the call is not `Async`: it just calls on the GetAsync() or Put()
call stack. If a mutex is held by any call frame in the callstack, it
remains held in the callback, ripe for deadlocks.

Adds another ctor argument io_context and posts any callbacks to the
context to avoid those deadlocks. Note many cpp tests creates
CoreWorkerMemoryStore without a io_context so I kept the option of
creating an owned io_context with thread.

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Gives clangd the ability to fully index ray's c++ code after you run
```bazel run :refresh_compile_commands```. Gives you lsp support +
clang-tidy linting in ide to catch bad cpp practices in files based on
rules already setup in .clang-tidy.
https://github.com/ray-project/ray/blob/master/.clang-tidy A lot of
these don't seem to be followed though and seem contrary to some of the
style of existing code, but some are critical for not giving up
performance gains off of small things. Trimming down to more important
lints could be helpful.

Signed-off-by: dayshah <[email protected]>
Co-authored-by: Ruiyang Wang <[email protected]>
…8447)

Adds dependency injection to create a custom `DeploymentHandle` type in
`build_app`. This will be used to create a special type of handle for
local testing mode.

This also allowed me to remove the mocking/patching happening in the
unit tests.

---------

Signed-off-by: Edward Oakes <[email protected]>
bveeramani and others added 29 commits November 13, 2024 18:09
…" (#48686)

This reverts commit 4a9c424.

---------

Signed-off-by: Balaji Veeramani <[email protected]>
https://github.com/ray-project/ray/pull/48693/files#diff-81d7373cd5567e997c002b244c491e6b3498d206b12c093f4dc4d30e9b5848af
added a test that uses tensorflow. We currently need to skip all
tensorflow-related tests for python 3.12 since we don't support
tensorflow for python 3.12.

Also this is a test for the deprecation of tensorflow ;)

Test:
- CI

Signed-off-by: can <[email protected]>
…tensors > 2Gb) (#48629)

Enabling V2 Arrow Tensor extension type by default (allowing tensors >
2Gb)

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
…Error (#48636)

In a recent investigation, we found that when we call the
`ray._private.internal_api.free()` from a task the same time as a Raylet
is gracefully shutting down, the task might fail with application level
Broken pipe IOError. This resulted in job failure without any task
retries.

However, as the Broken pipe happens because the unhealthiness of the
local Raylet, the error should be a system level error and should be
retried automatically.

Updated changes in commit
[01f5f11](01f5f11):
This PR add the logic for the above bahvior:
* When IOError is received in the `CoreWorker::Delete`, throw a system
error exception so that the task can retry

Why not add the exception check in the `free_objects` function?
* It is better to add the logic in the `CoreWorker::Delete` because it
can cover the case for other languages as well.
* The `CoreWorker::Delete` function is intended to be open to all
languages to call and is not called in other ray internal code paths.

Why not crash the worker when IOError is encountered in the
`WriteMessage` function?
* `QuickExit()` function will directly exit the process without
executing any shutdown logic for the worker. Directly calling the
function in the task execution might potentially causing resource leak
* At the same time, the write message function is called also on the
graceful shutdown scenario and it is possible during the graceful
shutdown process that the local Raylet is unreachable. Therefore, in the
graceful shutdown scenario, we shouldn't exit early but let the shutdown
logic finish.
* At the same time, it is not clear in the code regarding the behavior
of the graceful vs force shutdown. We might need some effort to make
them clear. The todo is added in the PR.

Updated changes in commit
[2029d36](2029d36):
> This PR add the logic for the above behavior:
> * When IOError is received in the `free_objects()` function, throw a
system error exception so that the task can retry

Changes in commit
([9d57b29](9d57b29))
:
> This PR add the logic for the above behavior:
> * Today, the internal `free` API deletes the objects from the local
Raylet object store by writing a message through a socket
> * When the write failed because the local Raylet is terminated, there
is already logic to quick exit the task
> * However, the current termination check didn't cover the case where
the local Raylet process is a Zombie process and IOError happens during
write messages.
> * This fix update the check criteria and fail the task when the Raylet
process is terminated or the write message function returns an IOError~


Signed-off-by: Mengjin Yan <[email protected]>
## Why are these changes needed?
While we calling `xxx.map_groups(..., batch_format="...")`, we may
invoke sort function and creating empty blocks which still uses pyarrow
by default. And, when we invoke another sort call on top of it, we will
hit `AttributeError: 'DataFrame' object has no attribute 'num_rows'`
since we uses first block type. (However, we may have different blocks).
See more details in #46748

## Related issue number

Close #46748

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Xingyu Long <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Created by release automation bot.

Update with commit e393a71

Signed-off-by: Jiajun Yao <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
…es (#48478)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

When writing blocks to parquet, there might be blocks with fields that
differ ONLY in nullability - by default, this would be rejected since
some blocks might have a different schema than the ParquetWriter.
However, we could potentially allow it to happen by tweaking the schema.

This PR goes through all blocks before writing them to parquet, and
merge schemas that differ only in nullability of the fields.
It also casts the table to the newly merged schema so that the write
could happen.

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

Closes #48102

---------

Signed-off-by: rickyx <[email protected]>
…48697)

## Why are these changes needed?

This makes SortAggregate more consistent by unifying the API on the
SortKey object, similar to how SortTaskSpec is implemented.


## Related issue number

This is related to #42776 and
#42142


Signed-off-by: Richard Liaw <[email protected]>
ray.util.state.get_actor currently have a type annotation specifying
that the function should return an Optional[Dict] but it actually
returns
[ActorState](https://github.com/ray-project/ray/blob/3141dfe4031cc715515b365278cd1d6b8955154e/python/ray/util/state/common.py#L416)
(as the Docstring speficies). This pull request simply changes this type
annotation.


Signed-off-by: Miguel Teixeira <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?

We currently report `iter_total_blocked_seconds` and `iter_user_seconds`
as **Gauge** while we tracking them as counters, i.e.:
- For each iteration, we had a timer that sums locally for each
iteration into an aggregated value (which is the sum of total blocked
seconds)
- When the iteration ends or the iterator GCed, the gauge metric value
is currently set to 0.
- This creates confusion for users as a counter value (total time
blocked on a dataset) should not be going back to 0, generating charts
like below.

---------

Signed-off-by: rickyx <[email protected]>
Some C++ improvements to our codebase.

---------

Signed-off-by: dentiny <[email protected]>
## Why are these changes needed?

Adds a Sentinel value for making it possible to sort.

Fixes #42142 

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <[email protected]>
… batches first, instead of newest, BUT drop oldest batches if queue full). (#48702)
Signed-off-by: Aziz Belaweid <[email protected]>
…ard (#48745)

## Why are these changes needed?

Currently, there are some cases where the `Rows Outputted` value on the
Ray Job page's `Ray Data Overview` section says "0", even after the
dataset execution completes. The root cause of the bug is that we clear
iteration/execution metrics after the dataset completes. This was
previously used to "reset" the metrics to 0 after dataset completion, so
that the last emitted value would not persist on the dashboard, even
after the job finishes. Now that we display rates on the dashboard, this
hack is no longer needed, and we can skip the metrics clearing.

Fixed result:
<img width="1860" alt="Screenshot at Nov 14 12-11-24"
src="https://github.com/user-attachments/assets/35061b3f-9359-412b-8ab2-f4bcce412994">

## Related issue number

Closes #44635

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Scott Lee <[email protected]>
<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?
Fixed typo

<!-- Please give a short summary of the change and the problem this
solves. -->

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: mohitjain2504 <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Gene Der Su <[email protected]>
too ancient; and there are no tests.

this should remove the outdated paramiko package from the ray image

Signed-off-by: Lonnie Liu <[email protected]>
I randomly find lancedb issue:
lancedb/lancedb#1480
which discloses a high-severity CVE

Considering as lancedb, ray only has one use case for `retry` package, I
took the same approach as lancedb/lancedb#1749,
which names all variables better with unit and default value.

---------

Signed-off-by: dentiny <[email protected]>
nightly is inherently unstable, and latest is just last stable release.
neither of them should block releases

Signed-off-by: Lonnie Liu <[email protected]>
fix NCCL_BFLOAT16 typo in TORCH_NCCL_DTYPE_MAP
also uses 1.3 syntax and heredoc for multiline commands

Signed-off-by: Lonnie Liu <[email protected]>
Support read from Hudi table into Ray dataset.

---------

Signed-off-by: Shiyan Xu <[email protected]>
@pull pull bot merged commit acc1b07 into miqdigital:master Nov 19, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.