Add DGL Operator proposal to kubeflow community #512
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ryantd

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
proposals/dgl-operator-proposal.md
Outdated
- The `cleanPodPolicy` can be optionally configured as `Always` / `Never` / `OnFailure` / `OnCompletion`, indicating whether to delete the pods when the task is terminated. `Never` is preferred for debugging and `OnCompletion` for production.
- The content of `Launcher` and `Worker` follows the `PodTemplateSpec`. Users are free to add more native key-values according to the spec.
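For illustration, the two quoted bullets above might combine into a spec like the following sketch; the `kind`, the `dglReplicaSpecs` field name, and the images are assumptions for illustration, not the final CRD.

```yaml
# Sketch only: cleanPodPolicy plus Launcher/Worker PodTemplateSpec usage.
# kind, dglReplicaSpecs, and the image names are illustrative assumptions.
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: OnCompletion      # Always / Never / OnFailure / OnCompletion
  dglReplicaSpecs:
    Launcher:
      replicas: 1
      template:                     # a standard PodTemplateSpec
        spec:
          containers:
            - name: dgl-launcher
              image: dgloperator/graphsage:v0.1.0   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: dgl-worker
              image: dgloperator/graphsage:v0.1.0   # placeholder image
```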
Looks like we can reuse the common spec in https://github.com/kubeflow/common
Ongoing development is based on pure Kubebuilder v2 and does not yet use the Kubeflow common spec definitions or job controller, but we would be glad to reuse them.
proposals/dgl-operator-proposal.md
Outdated
### Resulting Launcher

The resulting launcher closely resembles the one in MPI Operator. For example, `kubectl-delivery` makes sure all the worker pods are ready and downloads `kubectl` (we only made minor changes to the Env and IP config processing).
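As a rough sketch of that pattern (not the operator's actual generated manifest), the launcher pod runs a delivery init container that places `kubectl` into a volume shared with the main container; the container names, image tags, and mount path below are assumptions.

```yaml
# Sketch of the launcher-pod pattern described above; names, image tags,
# and mount path are illustrative assumptions, not the generated manifest.
apiVersion: v1
kind: Pod
metadata:
  name: dgl-graphsage-launcher
spec:
  initContainers:
    - name: kubectl-delivery
      image: mpioperator/kubectl-delivery:latest   # waits for workers, then delivers kubectl
      volumeMounts:
        - name: kube-delivery
          mountPath: /opt/kube
  containers:
    - name: dgl-launcher
      image: dgloperator/graphsage:v0.1.0          # placeholder training image
      volumeMounts:
        - name: kube-delivery
          mountPath: /opt/kube                     # launcher uses the delivered kubectl from here
  volumes:
    - name: kube-delivery
      emptyDir: {}
```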
Perhaps we can consider moving the common parts from MPI Operator (kubectl-delivery controller) to kubeflow/common so DGL operator can use it as well?
Agreed, sounds great. The proposal has been updated.
Great proposal, @ryantd. @gaocegege, I would prefer to reuse the mpi-operator and add some extensions.
The CRD YAML example is as follows:

```yaml
apiVersion: qihoo.net/v1alpha1
```
I assume apiVersion will be changed to kubeflow fashion?
Yes, apiVersion will be changed to kubeflow fashion
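For illustration only, a kubeflow-style header might look like the sketch below; the exact group, version, and kind are assumptions rather than anything decided in this thread.

```yaml
# Assumed kubeflow-style group/version and kind; the final names may differ.
apiVersion: kubeflow.org/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
```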
/hold Looks like there's a way to support this in MPI-Operator. If that's the case, I think the preferred option is to check the possibility and the effort needed in mpi-operator to support this case. @ryantd If you would like to add more details on this, that would be great.
Sure Jeffwan, let me put more details on this.

Basic requirements
@Bobgy Do we have a process to determine if we should accept external contributions?
Yes, @theadactyl will soon send out a PR describing the process for how we decided on the paddle operator. And I will use the paddle operator to evaluate the time needed for adding a repo.
@ryantd Do you have an idea about next steps? Do you think supporting this case in mpi-operator is a reasonable direction to go? If not, are there any technical blockers?
@Jeffwan Sorry for the late reply
Summary: after this period of consideration, I think I should present 3 options to explain my thinking, and let the community and the WG decide.

### Graph Partitioning

The most important thing is graph partitioning. DGL has 2 approaches to partitioning: one is single-thread (DGL built-in API), the other is multi-process (ParMETIS, not in DGL). In the K8s + Operator context, DGL Operator wants to apply these and support 3 partition modes: SSP, SMP, and MSP.

As you can see, SSP and SMP need a single Partitioner Pod, while MSP may not need the Partitioner Pod at all and can partition in parallel in the Worker Pods. So SSP and SMP cannot apply the MPI manner[1], while MSP can.

### Training

Each DGL distributed training workload has 3 components: DGL-Server, DGL-Sampler, and DGL-Trainer. DGL-Trainer does the allreduce work, but only uses PyTorch's DDP for dense parameter updates; sparse parameter updates are DGL's own implementation. Currently, there is a fixed set of one DGL-Server, one DGL-Sampler, and one DGL-Trainer in every training workload (every Worker Pod). But in the near future, DGL-Server, DGL-Sampler, and DGL-Trainer may each get their own native workloads (Pods). So, as long as the training workload is not upgraded to support an arbitrary number of these tri-workloads, the training can fully apply the MPI manner.

### Conclusion

In view of the above discussion, I can offer 3 options.

#### Option 1

In order to merge entirely into MPI Operator and fully apply the MPI manner, we would only support MSP and a fixed 1-1-1 training workload, dropping SSP, SMP, and the advanced training tri-workloads.

#### Option 2

In order to support roughly 90% of user cases while merging into MPI Operator as much as possible, DGL Operator would be an upper layer on top of MPI Operator that only handles the SSP and SMP part, while MPI Operator does the same things as in Option 1. Something like Wang Zhang's idea.

#### Option 3

Be an independent Operator. DGL Operator may need to be looked at as a GNN-focused Operator; TF, PyTorch, and Horovod are intended to solve CNN/RNN problems.

### Appendix

[1]: Putting the Partitioner logic manually into MPI Operator could make SSP and SMP fit the MPI Operator manner, but I think this would break the specificity of MPI Operator.
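To make the SSP/SMP structure concrete, a job using a single Partitioner Pod might be declared roughly as below; the replica-spec field names, kind, and images are assumptions for illustration only.

```yaml
# Hypothetical SSP/SMP layout: a single Partitioner pod runs the partitioning
# step, then the Launcher/Worker pods run the MPI-style training phase.
# All field and image names here are illustrative assumptions.
apiVersion: qihoo.net/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
spec:
  dglReplicaSpecs:
    Partitioner:
      replicas: 1          # SSP/SMP need exactly one Partitioner pod
      template:
        spec:
          containers:
            - name: dgl-partitioner
              image: dgloperator/graphsage:v0.1.0
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: dgl-launcher
              image: dgloperator/graphsage:v0.1.0
    Worker:
      replicas: 3          # in MSP these workers would partition in parallel instead
      template:
        spec:
          containers:
            - name: dgl-worker
              image: dgloperator/graphsage:v0.1.0
```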
How thrilling to have a GNN training operator in the Kubeflow community!

Regarding the three partition modes:

- The SSP mode is more like the single-client mode in TensorFlow PS/Worker training if we can merge …
- The SMP mode, which also needs a single Partitioner Pod, can follow the same fashion mentioned above. However, I'm concerned about how the output of …
- The MSP mode is a regular MPIJob case, just like Xiaoyu mentioned.

If this suggestion is workable, option 1 should be feasible for all user cases. If I missed something somewhere, please ping me at any time.
Sure, we will keep the community updated after the scheduled discussion this week.
> On Sun, Jul 25, 2021, Yuan Tang wrote:
>
> @zw0610 @ryantd Given that you have discussed this separately offline, could either of you provide a summary of your discussions here as well so that everyone will be on the same page?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
Hi Xiaoyu and community :) I wonder whether it would be better to use the runPolicy field https://github.com/kubeflow/common/blob/master/pkg/apis/common/v1/types.go#L176 as other operators do, instead of using cleanPodPolicy directly?
For example, TFJob https://github.com/kubeflow/training-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L55 and PyTorchJob https://github.com/kubeflow/training-operator/blob/master/pkg/apis/pytorch/v1/types.go#L53.
For a standalone operator this works fine, but in the all-in-one training-operator all controllers use the same base job controller, whose cleanup logic reads runPolicy.cleanPodPolicy. As with mpi-operator migrating into training-operator, mismatched fields will become a headache.
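For comparison, a minimal sketch of the nesting being suggested; the group and kind here are assumptions, and only the `runPolicy.cleanPodPolicy` shape follows kubeflow/common.

```yaml
# Sketch: cleanPodPolicy nested under runPolicy, as in TFJob/PyTorchJob.
# apiVersion group and kind are assumptions; only the runPolicy shape follows kubeflow/common.
apiVersion: kubeflow.org/v1alpha1
kind: DGLJob
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  runPolicy:
    cleanPodPolicy: Running   # instead of a top-level spec.cleanPodPolicy field
```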
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is Xiaoyu Zhai from Qihoo 360 AI Infra. Currently, our team is working on DGL Operator to make DGL distributed training easier on Kubernetes, and I am glad to introduce our proposal.
Looking forward to any feedback from you all.