fix the condition #617
Signed-off-by: wang-mask <[email protected]>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
But we only create the launcher Job suspended. Are you saying that the problem arises when you unsuspend the job and suddenly all the pods start? Not sure if it's worth always creating the Job and only unsuspending if the workers are ready.
I think the launcher pod should be created after all the workers are ready when the
I think that when the MPIJob is suspended, workers aren't created by the mpi-operator.
In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".
My understanding of The current code is designed to create a launcher, but it suspends it when the MPI job is suspended, regardless of the .launcherCreationPolicy. Does this meet expectations?
In this situation, who does create the workers?
The operation of unsuspending an MPI job triggers the creation of workers.
With this implementation: what happens if the job is running, then it is suspended and unsuspended? Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.
Currently the controller deletes the worker pods and suspends the launcher when the MPIJob is suspended. Maybe it could also delete the launcher when the MPIJob is suspended and If this implementation is reasonable, I would be happy to modify this PR.
Or you could still create the Job object but not flip the suspend flag until the workers are ready.
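To make the last suggestion concrete, here is a minimal sketch of how the launcher Job's suspend flag could be derived. It is not the operator's actual code: the function name `launcherShouldBeSuspended` and all of its parameters are hypothetical, and `desiredWorkers` is assumed to come from the MPIJob spec rather than from the number of worker pods that currently exist.

```go
package main

// launcherShouldBeSuspended is a hypothetical helper that returns the value
// the launcher Job's suspend flag should have on a given reconcile pass.
//
//   - mpiJobSuspended: whether the MPIJob itself is suspended.
//   - waitForWorkers:  whether the launcher creation policy is
//     "WaitForWorkersReady".
//   - desiredWorkers:  worker replicas requested in the spec (assumption:
//     taken from the spec, not from the pods that currently exist).
//   - readyWorkers:    worker pods currently reporting ready.
func launcherShouldBeSuspended(mpiJobSuspended, waitForWorkers bool, desiredWorkers, readyWorkers int) bool {
	if mpiJobSuspended {
		// A suspended MPIJob always keeps its launcher suspended.
		return true
	}
	if waitForWorkers {
		// Keep the launcher suspended until every desired worker is ready.
		return readyWorkers < desiredWorkers
	}
	// Any other policy: the launcher may run as soon as the job is unsuspended.
	return false
}
```

Because the comparison uses the desired worker count, the flag stays set right after a second unsuspend, while the workers are being recreated, which is the scenario raised in the question above.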
fixes #615
The original logic is:
If the MPIJob is suspended, `len(worker)` and `c.countReadyWorkerPods(worker)` would both be `0`. The condition is then met, and the launcher pod is created.
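The failure mode described above can be reproduced in isolation. The sketch below is an illustration only: `workersReadyOriginal` and `workersReadyGuarded` are hypothetical stand-ins, not functions from the mpi-operator, and the guard shown is one possible fix rather than the change this PR necessarily makes.

```go
package main

// workersReadyOriginal mirrors the comparison described above: the workers
// count as ready when the number of ready pods equals the number of pods.
// With no worker pods at all, 0 == 0, so the check passes vacuously.
func workersReadyOriginal(workerPods, readyWorkers int) bool {
	return readyWorkers == workerPods
}

// workersReadyGuarded is a hypothetical variant that never reports the
// workers as ready while the MPIJob is suspended, closing the 0 == 0 case.
func workersReadyGuarded(suspended bool, workerPods, readyWorkers int) bool {
	if suspended {
		return false
	}
	return readyWorkers == workerPods
}
```

For a suspended job with no worker pods, `workersReadyOriginal(0, 0)` returns `true` while `workersReadyGuarded(true, 0, 0)` returns `false`. Note that the guarded variant still passes vacuously if it is called on an unsuspended job before any worker pods exist, so comparing against the desired replica count may be the more robust choice.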