[core] Refactors WorkerPool with Prestarts. #48677
Signed-off-by: Ruiyang Wang <[email protected]>
"RAY_restart_workers_api_max_num_workers";
}

auto pop_worker_request = std::make_shared<PopWorkerRequest>(
PopWorkerRequest is supposed to be a private thing inside worker_pool. Let's just make StartNewWorkers accept the parameters needed to construct PopWorkerRequest inside the worker pool?
ok
StartNewWorker(PopWorkerRequest) is also used by PopWorker(PopWorkerRequest) internally. The request does not have anything private, so I guess we can just expose it and allow node_manager.cc to use it? It will be much easier. If we just expose another method with the 10 scattered arguments, it doesn't add any value anyway.
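To make the trade-off in this thread concrete, here is a minimal sketch of the "expose the request object" option. Everything in it is hypothetical: `SketchPopWorkerRequest`, `SketchWorkerPool`, and their fields are trimmed-down stand-ins, not the real `PopWorkerRequest` or `WorkerPool`, and the worker start is synchronous here, whereas a real worker process starts asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down stand-in for the real PopWorkerRequest.
struct SketchPopWorkerRequest {
  std::string language;
  int job_id = 0;
  std::vector<std::string> dynamic_options;
  // Called with whichever worker ends up serving the request.
  std::function<void(const std::string &worker_id)> callback;
};

// Hypothetical pool: both entry points accept the same request object,
// so an outside caller builds one struct instead of passing many arguments.
class SketchWorkerPool {
 public:
  // Serves the request from an idle worker if one exists, else starts a new one.
  void PopWorker(std::shared_ptr<SketchPopWorkerRequest> request) {
    assert(request != nullptr);
    if (!idle_workers_.empty()) {
      std::string worker_id = idle_workers_.back();
      idle_workers_.pop_back();
      request->callback(worker_id);
      return;
    }
    StartNewWorker(std::move(request));
  }

  // Also callable directly, which is what a prestart path would do.
  void StartNewWorker(std::shared_ptr<SketchPopWorkerRequest> request) {
    assert(request != nullptr);
    // Assumption: the start completes immediately in this sketch.
    std::string worker_id = "worker-" + std::to_string(next_id_++);
    request->callback(worker_id);
  }

  void AddIdleWorker(std::string worker_id) {
    idle_workers_.push_back(std::move(worker_id));
  }

 private:
  std::vector<std::string> idle_workers_;
  int next_id_ = 0;
};
```

With this shape, a caller outside the pool only needs to fill in one struct, and the internal PopWorker path reuses the same entry point without a second overload.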
//
// Note: NONE of these methods guarantee that pop_worker_request.callback will be called
// with the started worker. It may be called with any fitting workers.
void StartNewWorker(const std::shared_ptr<PopWorkerRequest> &pop_worker_request);
PrestartWorker?
Because this is also used internally by PopWorker
LG
src/ray/raylet/worker_pool.cc
Outdated
const auto &job_id = idle_worker->GetAssignedJobId();
if (finished_jobs_.contains(job_id)) {
const auto &job_id = it->worker->GetAssignedJobId();
if (finished_jobs_.count(job_id) > 0) {
nit: I feel contains() is clearer than count() > 0
done
Add some tests
src/ray/raylet/worker_pool.cc
Outdated
WorkerID reused_worker_id = PopWorker(pop_worker_request);
if (!reused_worker_id.IsNil()) {
  RAY_LOG(DEBUG).WithField(task_spec.TaskId()).WithField(reused_worker_id)
      << "Re-using worker for task.";
}
}
I assume you had this for debugging. Should we change back to void PopWorker(pop_worker_request)? If we want to log, we can log inside PopWorker()?
done
The thought process was: PopWorker(pop_worker_request) does not know about the task id, because it's abstracted away into an opaque callback. For now, I just removed the debug log.
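For readers following this thread, here is a small sketch of why the pool cannot log the task id itself, and of one possible alternative: logging inside the callback, where the id is still in scope. This is not what the PR does (the PR simply removes the debug log). `SketchPopWorker`, `SketchPopWorkerCallback`, and `MakeLoggingCallback` are hypothetical names, and the log is a plain vector of strings rather than `RAY_LOG`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: the pool only ever sees this opaque callback type.
using SketchPopWorkerCallback = std::function<void(const std::string &worker_id)>;

// Pool side: no task id is visible here, only the callback.
void SketchPopWorker(const SketchPopWorkerCallback &callback) {
  const std::string worker_id = "worker-1";  // pretend an idle worker matched
  callback(worker_id);
}

// Caller side: the task id is known, so it can be captured into the callback
// and logged when the pool hands over a worker.
SketchPopWorkerCallback MakeLoggingCallback(std::string task_id,
                                            std::vector<std::string> *log) {
  assert(log != nullptr);
  return [task_id = std::move(task_id), log](const std::string &worker_id) {
    log->push_back("task " + task_id + " got " + worker_id);
  };
}
```

The sketch keeps PopWorker returning void, as the reviewer suggested, while any task-specific logging stays with the caller that owns the task spec.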
src/ray/raylet/worker_pool.h
Outdated
@@ -506,46 +565,10 @@ class WorkerPool : public WorkerPoolInterface, public IOWorkerPoolInterface {
rpc::RuntimeEnvInfo runtime_env_info;
/// The dynamic_options.
std::vector<std::string> dynamic_options;
/// The duration to keep the worker alive even if it's idle.
I think we should make this comment more specific: it's only used by prestarted workers, and only the first time a worker is idle (before running any tasks).
Ideally we should also rename the variable to reflect that.
renamed to worker_startup_keep_alive_duration
Refactors WorkerPool and adds an extra PrestartWorkers API.
Changes:
- PopWorker -> StartNewWorker -> StartWorkerProcess, with their differences documented.
- NodeManagerService.PrestartWorkers gRPC method. Callers can ask to prestart num_workers workers, capped by RAY_restart_workers_api_max_num_workers, with runtime_env and job_id.
- Idle worker killing: the old check used now + idle_worker_killing_time_threshold_ms. In this PR we change to keep a per-worker "keep alive until" timestamp, set to idle time + idle_worker_killing_time_threshold_ms or create time + keep_alive_duration, and compare it with now.
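The "keep alive until" timestamp and the prestart cap described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: all `Sketch*` names and the helper functions are hypothetical, the choice between the two deadlines is assumed to depend on whether the worker was prestarted and has not yet run a task, and treating `now == keep_alive_until` as expired is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

using SketchClock = std::chrono::steady_clock;
using SketchTimePoint = SketchClock::time_point;
using SketchDuration = std::chrono::milliseconds;

// Hypothetical per-worker record; the real pool tracks much more state.
struct SketchIdleWorker {
  SketchTimePoint keep_alive_until;
};

// A worker that became idle after running a task is kept for the idle
// threshold (idle time + idle_worker_killing_time_threshold_ms).
SketchTimePoint KeepAliveAfterIdle(SketchTimePoint idle_time,
                                   SketchDuration idle_threshold) {
  return idle_time + idle_threshold;
}

// A prestarted worker that has not run a task yet is kept for the requested
// duration, counted from its creation (create time + keep_alive_duration).
SketchTimePoint KeepAliveAfterStartup(SketchTimePoint create_time,
                                      SketchDuration keep_alive_duration) {
  return create_time + keep_alive_duration;
}

// The killing check compares only the stored deadline with `now`.
// Assumption: a worker is eligible as soon as the deadline is reached.
bool MayBeKilled(const SketchIdleWorker &worker, SketchTimePoint now) {
  return now >= worker.keep_alive_until;
}

// Caps a PrestartWorkers request; `max_num_workers` stands in for the value
// of RAY_restart_workers_api_max_num_workers.
int64_t CapPrestartCount(int64_t requested, int64_t max_num_workers) {
  assert(max_num_workers >= 0);
  return std::clamp<int64_t>(requested, 0, max_num_workers);
}
```

Storing the deadline rather than the idle time means the killing loop needs no knowledge of why a worker is being kept alive; it only compares one timestamp per worker against the current time.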