
Add DGL Operator proposal to kubeflow community #512

Open
wants to merge 2 commits into master
Conversation

@ryantd (Member) commented Apr 23, 2021

This is Xiaoyu Zhai from Qihoo 360 AI Infra. Our team is currently working on DGL Operator to make DGL distributed training easier on Kubernetes, and I am glad to introduce our proposal.

Looking forward to any feedback from you all.

@google-oss-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ryantd
To complete the pull request process, please assign theadactyl after the PR has been reviewed.
You can assign the PR to them by writing /assign @theadactyl in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan (Member) left a comment

Great to see this! cc'ing the Kubeflow Training WG leads to review this.

/cc @kubeflow/wg-training-leads
Also /cc @zw0610 @carmark

Comment on lines 72 to 93
- The `cleanPodPolicy` can be optionally configured as `Always` / `Never` / `OnFailure` / `OnCompletion`, indicating whether to delete the pod when the task is terminated. `Never` is preferred for debugging and `OnCompletion` for production.

- The content of `Launcher` and `Worker` follows the `PodTemplateSpec`. Users are free to add more native key-values according to the spec.
Member

Looks like we can reuse the common spec in https://github.com/kubeflow/common

Member Author

Ongoing development is based purely on Kubebuilder v2 and is not yet engaged with the Kubeflow common spec definitions or job controller, but we would be glad to reuse them.
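
For reference, a minimal sketch of how the current DGLJob spec might look, assembled from the fragments quoted in this PR; the kind and the dglReplicaSpecs field name are assumptions, and the images are placeholders:

```yaml
apiVersion: qihoo.net/v1alpha1
kind: DGLJob                      # assumed kind name
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Never           # Always / Never / OnFailure / OnCompletion
  dglReplicaSpecs:                # assumed field name, in the MPI Operator style
    Launcher:
      replicas: 1
      template:                   # follows PodTemplateSpec
        spec:
          containers:
            - name: dgl-launcher
              image: dgl/graphsage:example   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: dgl-worker
              image: dgl/graphsage:example   # placeholder image
```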


### Resulting Launcher

The resulting launcher closely resembles the one in MPI Operator. For example, `kubectl-delivery` makes sure all the worker pods are ready and downloads `kubectl` (we made only minor changes to the Env and IP config processing).
Member

Perhaps we can consider moving the common parts from MPI Operator (kubectl-delivery controller) to kubeflow/common so DGL operator can use it as well?

Member Author

Agreed, sounds great

@ryantd (Member Author) commented Apr 26, 2021

The proposal has been updated:

  1. add a standalone partition mode
  2. illustrate the differences in implementation between standalone and distributed partitioning
  3. fix typos

@gaocegege (Member)

/cc @carmark @zw0610 @Jeffwan

The proposal LGTM; it's clean and elegant. One concern is whether we should maintain a new operator or support this in mpi-operator.

@carmark (Member) commented Apr 27, 2021

Great proposal, @ryantd.

@gaocegege, I do prefer reusing the mpi-operator and adding some extensions.

The CRD YAML example is as follows:

```yaml
apiVersion: qihoo.net/v1alpha1
```

Member

I assume apiVersion will be changed to kubeflow fashion?

Member Author

Yes, apiVersion will be changed to kubeflow fashion
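
For clarity, the change being discussed is roughly the following; the exact group and version under the kubeflow fashion are assumptions:

```yaml
# current proposal
apiVersion: qihoo.net/v1alpha1
---
# after moving to the kubeflow fashion (assumed group/version)
apiVersion: kubeflow.org/v1alpha1
```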

@Jeffwan (Member) commented Apr 27, 2021

/hold

Looks like there's a way to support this in MPI-Operator. If that's the case, I think the preferred option is to check the feasibility and the effort needed in mpi-operator to support this case. @ryantd If you'd like to add more details on this, that would be great.

@ryantd (Member Author) commented Apr 27, 2021

> @ryantd If you'd like to add more details on this, that would be great.

Sure Jeffwan, let me put more details on this.

Basic requirements

  1. custom hostfile generation format. DGL Operator's ipconfig format is `[pod ip] [dgl server port] [dgl server num] [host/pod name] [gpu num]` (see the example after this list)
  2. custom ContainerPort
  3. custom Env
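
For illustration, an ipconfig file in this format might look like the following (the IPs, port, and pod names are made up):

```
10.244.1.12 30050 1 dgl-graphsage-worker-0 1
10.244.2.15 30050 1 dgl-graphsage-worker-1 1
```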

ParMETIS partitioning can be handled by MPI Operator smoothly, but DGL-API partitioning needs more customization.

DGL-API and ParMETIS partitioning need

  1. A Partitioner Pod.
    1. ideas:
      1. Worker-0 is treated as the Partitioner and is terminated after it finishes the dispatching job
      2. Worker-1 to Worker-N work on the training later
      3. An env variable is used to recognize which pod is the Partitioner, so dglrun can do the actual partitioning (see the sketch after this list)
  2. A watcher-loop-partitioner initContainer, waiting for the Partitioner Pod to finish.
    1. idea: moving the common parts from MPI Operator (kubectl-delivery controller) to kubeflow/common
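
As a rough sketch of the env-based recognition idea above (the variable name is purely hypothetical):

```yaml
# Hypothetical: injected only into Worker-0 so dglrun knows it is the Partitioner
env:
  - name: DGL_OPERATOR_ROLE   # made-up variable name
    value: partitioner
```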

[diagram: dgloperator-operator-1]

More details can be found in the proposal.

So far, these are the questions I have thought of. I would be glad for DGL Operator to merge into MPI Operator, but I don't want MPI Operator to have to deal with increasingly divergent issues and grow bigger and bigger. Furthermore, DGL Operator is intended to support DGL-KE and DGL-LifeSci, which are high-level packages built on DGL.

@gaocegege (Member)

@Bobgy Do we have a process to determine if we should accept external contributions?

@Bobgy (Contributor) commented Jun 22, 2021

Yes, @theadactyl will soon send out a PR for the process, based on how we decided for the paddle operator.

And I will use the paddle operator to evaluate the time needed for adding a repo.

@Jeffwan (Member) commented Jul 13, 2021

@ryantd Do you have an idea about next steps? Do you think supporting this case in mpi-operator is a reasonable direction to go? If not, are there any technical blockers?

@ryantd (Member Author) commented Jul 19, 2021

@Jeffwan Sorry for the late reply.

> Do you think supporting this case in mpi-operator is a reasonable direction to go?

Summary: after a period of consideration, I think I should present 3 options to lay out my thinking, and let the community and WG decide.


Graph Partitioning

The most important thing is graph partitioning. DGL has 2 approaches to partitioning: one is single-thread (the DGL built-in API), the other is multi-process (ParMETIS, not part of DGL).

In the K8s + Operator context, DGL Operator wants to apply these and support 3 partitionModes:

  • Single Pod, single-thread partitioning, called SSP below
  • Single Pod, multi-process partitioning (native ParMETIS version), called SMP below
  • Multi Pod, single-thread partitioning (may need some adaptations in ParMETIS, still under research), called MSP below

You can see that SSP and SMP need a single Partitioner Pod, while MSP may not need the Partitioner Pod and can partition in parallel across the Worker Pods. So SSP and SMP cannot follow the MPI manner[1], while MSP can.
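
To make the three modes concrete, the DGLJob spec could select one of them with a field like the following (a sketch; the field name and values are assumptions):

```yaml
spec:
  partitionMode: SSP   # assumed values: SSP / SMP / MSP
```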

Training

Each DGL distributed training workload has 3 components: DGL-Server, DGL-Sampler, and DGL-Trainer. DGL-Trainer does the allreduce, but only uses PyTorch's DDP for dense parameter updates, while sparse parameter updates use DGL's own implementation. Currently, every training workload (every Worker Pod) has a fixed set of one DGL-Server, one DGL-Sampler, and one DGL-Trainer. But in the near future, DGL-Server, DGL-Sampler, and DGL-Trainer may each get their own native workloads (Pods).

So, as long as the training workload is not upgraded to support an arbitrary number of these three components, the training can fully follow the MPI manner.

Conclusion

In view of the above discussion, I can give 3 options:

  • Option 1: Entirely merge into MPI Operator
  • Option 2: Make DGL training a two-stage Operator mechanism (supports ~90% of user cases)
  • Option 3: Be an independent Operator (fully supports all user cases)

Option 1

In order to merge entirely into MPI Operator and fully follow the MPI manner, we should only support MSP and a fixed 1-1-1 training workload, dropping SSP, SMP, and the advanced tri-workload training.

Option 2

In order to support ~90% of user cases while merging into MPI Operator as much as possible, DGL Operator would be an upper layer on top of MPI Operator and only handle the SSP and SMP parts, while MPI Operator would do the same things as in Option 1.

Something like Wang Zhang's idea

Option 3

Be an independent Operator. DGL Operator may then be seen as a GNN-focused Operator, whereas the TF, PyTorch, and Horovod operators are intended for CNN/RNN problems.

Appendix

[1]: Putting the Partitioner logic manually into MPI Operator could make SSP and SMP follow the MPI Operator manner, but I think this would break the specificity of MPI Operator.

@zw0610 (Member) commented Jul 23, 2021

How thrilling to have a GNN training operator in the Kubeflow community!

Regarding the three partition modes, the SSP mode is more like the single-client mode in TensorFlow PS/Worker training if we can merge the Launcher and the Partitioner. From my understanding, the main container in the Partitioner can be converted into an initContainer in the Launcher pod in place of watcher-loop-partitioner, as long as the partitioner quits before the workers start training. Please do correct me if I'm wrong.
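
Roughly, this suggestion would make the Launcher pod template look like the sketch below (images and commands are placeholders):

```yaml
# Sketch only: the partitioner runs as an initContainer of the Launcher pod,
# replacing watcher-loop-partitioner; it must exit before the workers start training.
spec:
  initContainers:
    - name: partitioner
      image: dgl-partitioner:placeholder     # placeholder image
      command: ["dglrun", "--partition"]     # placeholder command
  containers:
    - name: launcher
      image: dgl-launcher:placeholder        # placeholder image
```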

The SMP mode, which also needs a single Partitioner Pod, can follow the same fashion mentioned above. However, I'm concerned about how the output of pm_dglpart should be distributed and synchronized to the worker pods. But with kubectl installed in the Launcher pod, the worst case is to use `kubectl cp`.

The MSP mode is a regular MPIJob case, just like Xiaoyu mentioned.

If this suggestion is workable, Option 1 should be feasible for all user cases. If I missed something somewhere, please ping me at any time.

@terrytangyuan (Member)

@zw0610 @ryantd Given that you have discussed this separately offline, could either of you provide a summary of your discussions here as well so that everyone will be on the same page?

@zw0610 (Member) commented Jul 24, 2021 via email

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
```yaml
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
```
Member

Hi Xiaoyu and community :) I wonder whether it would be better to use the runPolicy field (https://github.com/kubeflow/common/blob/master/pkg/apis/common/v1/types.go#L176), as other operators do, instead of using cleanPodPolicy directly?

For example, tfjob: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L55;
pytorchjob: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/pytorch/v1/types.go#L53.

For a standalone operator, this works fine; but in the all-in-one training-operator, all of the controllers use the same base job controller, whose clean-up logic tries to read runPolicy.cleanPodPolicy. As with mpi-operator migrating into training-operator, the mismatched fields will be a headache.
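
In other words, the suggestion is to nest the clean-up setting under runPolicy, as the other job types do; a sketch, with the field names taken from the quoted example and kubeflow/common:

```yaml
# current proposal
spec:
  cleanPodPolicy: Running
---
# suggested shape, following the common RunPolicy used by tfjob/pytorchjob
spec:
  runPolicy:
    cleanPodPolicy: Running
```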

@stale stale bot removed the lifecycle/stale label Mar 22, 2022
stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
