Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Refactors WorkerPool with Prestarts. #48677

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

rynewang
Copy link
Contributor

@rynewang rynewang commented Nov 10, 2024

Refactors WorkerPool and adds an extra PrestartWorkers API.

Changes:

  1. PopWorkerRequest now has a new field to let the started idle worker to keep-alive for a duration. After the duration, or after it's assigned a task, the keep-alive is lifted and it can be idle-killed.
  2. Make the WorkerPool process more clear: now we have distinct PopWorker -> StartNewWorker -> StartWorkerProcess with their differences documented.
  3. Adds a NodeManagerService.PrestartWorkers gRPC method. Callers can ask to prestart num_workers workers, capped by RAY_restart_workers_api_max_num_workers, with runtime_env and job_id.
  4. Changes idle-killing logic in worker_pool. Previously we kept a worker "idle since" timestamp and compare it with now + idle_worker_killing_time_threshold_ms. In this PR we change to keep a worker "keep alive until" timestamp, set to idle time + idle_worker_killing_time_threshold_ms or create time + keep_alive_duration, and compare with now.

Signed-off-by: Ruiyang Wang <[email protected]>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Nov 12, 2024
src/ray/common/ray_config_def.h Outdated Show resolved Hide resolved
src/ray/core_worker/core_worker.cc Show resolved Hide resolved
src/ray/core_worker/core_worker.h Outdated Show resolved Hide resolved
"RAY_restart_workers_api_max_num_workers";
}

auto pop_worker_request = std::make_shared<PopWorkerRequest>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PopWorkerRequest is supposed to be a private thing inside worker_pool. Let's just make StartNewWorkers to accept needed parameters to construct PopWorkerRequest inside worker pool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StartNewWorker(PopWorkerRequest) is also used by PopWorker(PopWorkerRequest) internally. The Request does not have anything private so I guess we can just expose it and allow node_manager.cc to use it? It will be much easier. If we just expose another method with the scattered 10 arguments it doesn't have any added values any way.

src/ray/raylet/worker.h Outdated Show resolved Hide resolved
//
// Note: NONE of these methods guarantee that pop_worker_request.callback will be called
// with the started worker. It may be called with any fitting workers.
void StartNewWorker(const std::shared_ptr<PopWorkerRequest> &pop_worker_request);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PrestartWorker?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is also used internally by PopWorker

src/ray/protobuf/node_manager.proto Outdated Show resolved Hide resolved
src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved
src/ray/raylet/worker_pool.h Outdated Show resolved Hide resolved
Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

src/ray/raylet/node_manager.cc Outdated Show resolved Hide resolved
const auto &job_id = idle_worker->GetAssignedJobId();
if (finished_jobs_.contains(job_id)) {
const auto &job_id = it->worker->GetAssignedJobId();
if (finished_jobs_.count(job_id) > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel contains() is more clear than count() > 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests

Comment on lines 1338 to 1343
WorkerID reused_worker_id = PopWorker(pop_worker_request);
if (!reused_worker_id.IsNil()) {
RAY_LOG(DEBUG).WithField(task_spec.TaskId()).WithField(reused_worker_id)
<< "Re-using worker for task.";
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you had this for debugging. Should we change back to void PopWorker(pop_worker_request).

If we want to log, we can log inside PopWorker()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the thought process was: PopWorker(pop_worker_request) does not know about task id, because it's abstracted out into an opaque callback. Now, I just removed the debug log.

@@ -506,46 +565,10 @@ class WorkerPool : public WorkerPoolInterface, public IOWorkerPoolInterface {
rpc::RuntimeEnvInfo runtime_env_info;
/// The dynamic_options.
std::vector<std::string> dynamic_options;
/// The duration to keep the worker alive even if it's idle.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this comment more specific: it's only used by prestart workers and only when the first time it's idle (before running any tasks).

ideally we should also rename the variable name to reflect that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to worker_startup_keep_alive_duration

Signed-off-by: Ruiyang Wang <[email protected]>
Signed-off-by: Ruiyang Wang <[email protected]>
@rynewang rynewang requested a review from a team as a code owner November 27, 2024 23:34
Signed-off-by: Ruiyang Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants