
Add DGL Operator proposal to kubeflow community #512

Open
wants to merge 2 commits into master
Conversation

@ryantd (Member) commented Apr 23, 2021

This is Xiaoyu Zhai from Qihoo 360 AI Infra. Our team is currently working on DGL Operator to make DGL distributed training easier on Kubernetes, and I am glad to introduce our proposal.

Looking forward to any feedback from you all.

@google-oss-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ryantd
To complete the pull request process, please assign theadactyl after the PR has been reviewed.
You can assign the PR to them by writing /assign @theadactyl in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@terrytangyuan (Member) left a comment

Great to see this! cc'ing the Kubeflow Training WG leads to review this.

/cc @kubeflow/wg-training-leads
Also /cc @zw0610 @carmark

Comment on lines 72 to 93
- The `cleanPodPolicy` can be optionally configured as `Always` / `Never` / `OnFailure` / `OnCompletion`, indicating whether to delete the pod when the task is terminated. `Never` is preferred for debugging and `OnCompletion` for production.

- The content of `Launcher` and `Worker` follows the `PodTemplateSpec`. Users are free to add more native key-values according to the spec.
Member

Looks like we can reuse the common spec in https://github.com/kubeflow/common

Member Author

Ongoing development is based purely on Kubebuilder v2 and is not yet engaged with the Kubeflow common spec definitions or job controller, but we would be glad to reuse them.
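
For reference, a minimal sketch of how the current DGLJob spec might look, assembled from the fragments quoted in this PR; the kind and the dglReplicaSpecs field name are assumptions, and the images are placeholders:

```yaml
apiVersion: qihoo.net/v1alpha1
kind: DGLJob                      # assumed kind name
metadata:
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Never           # Always / Never / OnFailure / OnCompletion
  dglReplicaSpecs:                # assumed field name, in the MPI Operator style
    Launcher:
      replicas: 1
      template:                   # follows PodTemplateSpec
        spec:
          containers:
            - name: dgl-launcher
              image: dgl/graphsage:example   # placeholder image
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: dgl-worker
              image: dgl/graphsage:example   # placeholder image
```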


### Resulting Launcher

The resulting launcher closely resembles the one in MPI Operator. For example, `kubectl-delivery` makes sure all the worker pods are ready and downloads `kubectl` (we made only minor changes to the Env and IP config processing).
Member

Perhaps we can consider moving the common parts from MPI Operator (kubectl-delivery controller) to kubeflow/common so DGL operator can use it as well?

Member Author

Agreed, sounds great

@ryantd (Member Author) commented Apr 26, 2021

The proposal has been updated:

  1. add a standalone partition mode
  2. illustrate the differences in implementation between standalone and distributed partitioning
  3. fix typos

@gaocegege (Member)

/cc @carmark @zw0610 @Jeffwan

The proposal LGTM; it's clean and elegant. One concern is whether we should maintain a new operator or support this in mpi-operator.

@carmark (Member) commented Apr 27, 2021

Great proposal, @ryantd.

@gaocegege, I do prefer reusing the mpi-operator and adding some extensions.

The CRD YAML example is as follows:

```yaml
apiVersion: qihoo.net/v1alpha1
```

Member

I assume apiVersion will be changed to kubeflow fashion?

Member Author

Yes, apiVersion will be changed to kubeflow fashion
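
For clarity, the change being discussed is roughly the following; the exact group and version under the kubeflow fashion are assumptions:

```yaml
# current proposal
apiVersion: qihoo.net/v1alpha1
---
# after moving to the kubeflow fashion (assumed group/version)
apiVersion: kubeflow.org/v1alpha1
```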

@Jeffwan (Member) commented Apr 27, 2021

/hold

Looks like there's a way to support this in MPI-Operator. If that's the case, I think the preferred option is to check the feasibility and the effort needed in mpi-operator to support this case. @ryantd If you'd like to add more details on this, that would be great.

@ryantd (Member Author) commented Apr 27, 2021

> @ryantd If you'd like to add more details on this, that would be great.

Sure Jeffwan, let me put more details on this.

Basic requirements

  1. custom hostfile generation format. DGL Operator's ipconfig format is `[pod ip] [dgl server port] [dgl server num] [host/pod name] [gpu num]` (see the example after this list)
  2. custom ContainerPort
  3. custom Env
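
For illustration, an ipconfig file in this format might look like the following (the IPs, port, and pod names are made up):

```
10.244.1.12 30050 1 dgl-graphsage-worker-0 1
10.244.2.15 30050 1 dgl-graphsage-worker-1 1
```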

ParMETIS partitioning can be handled by MPI Operator smoothly, but DGL-API partitioning needs more customization.

DGL-API and ParMETIS partitioning need

  1. A Partitioner Pod.
    1. ideas:
      1. Worker-0 is treated as the Partitioner and is terminated after it finishes the dispatching job
      2. Worker-1 to Worker-N work on the training later
      3. An env variable is used to recognize which pod is the Partitioner, so dglrun can do the actual partitioning (see the sketch after this list)
  2. A watcher-loop-partitioner initContainer, waiting for the Partitioner Pod to finish.
    1. idea: moving the common parts from MPI Operator (kubectl-delivery controller) to kubeflow/common
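
As a rough sketch of the env-based recognition idea above (the variable name is purely hypothetical):

```yaml
# Hypothetical: injected only into Worker-0 so dglrun knows it is the Partitioner
env:
  - name: DGL_OPERATOR_ROLE   # made-up variable name
    value: partitioner
```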

[diagram: dgloperator-operator-1]

More details can be found in the proposal.

So far, these are the questions I have thought of. I would be glad for DGL Operator to merge into MPI Operator, but I don't want MPI Operator to have to deal with increasingly divergent issues and grow bigger and bigger. Furthermore, DGL Operator is intended to support DGL-KE and DGL-LifeSci, which are high-level packages built on DGL.

@gaocegege (Member)

@Bobgy Do we have a process to determine if we should accept external contributions?

@Bobgy (Contributor) commented Jun 22, 2021

Yes, @theadactyl will soon send out a PR for the process, based on how we decided for the paddle operator.

And I will use the paddle operator to evaluate the time needed for adding a repo.

@Jeffwan (Member) commented Jul 13, 2021

@ryantd Do you have an idea about next steps? Do you think supporting this case in mpi-operator is a reasonable direction to go? If not, are there any technical blockers?

@ryantd (Member Author) commented Jul 19, 2021

@Jeffwan Sorry for the late reply.

> Do you think supporting this case in mpi-operator is a reasonable direction to go?

Summary: after a period of consideration, I think I should present 3 options to lay out my thinking, and let the community and WG decide.


Graph Partitioning

The most important thing is graph partitioning. DGL has 2 approaches to partitioning: one is single-thread (the DGL built-in API), the other is multi-process (ParMETIS, not part of DGL).

In the K8s + Operator context, DGL Operator wants to apply these and support 3 partitionModes:

  • Single Pod, single-thread partitioning, called SSP below
  • Single Pod, multi-process partitioning (native ParMETIS version), called SMP below
  • Multi Pod, single-thread partitioning (may need some adaptations in ParMETIS, still under research), called MSP below

You can see that SSP and SMP need a single Partitioner Pod, while MSP may not need the Partitioner Pod and can partition in parallel across the Worker Pods. So SSP and SMP cannot follow the MPI manner[1], while MSP can.
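
To make the three modes concrete, the DGLJob spec could select one of them with a field like the following (a sketch; the field name and values are assumptions):

```yaml
spec:
  partitionMode: SSP   # assumed values: SSP / SMP / MSP
```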

Training

Each DGL distributed training workload has 3 components: DGL-Server, DGL-Sampler, and DGL-Trainer. DGL-Trainer does the allreduce, but only uses PyTorch's DDP for dense parameter updates, while sparse parameter updates use DGL's own implementation. Currently, every training workload (every Worker Pod) has a fixed set of one DGL-Server, one DGL-Sampler, and one DGL-Trainer. But in the near future, DGL-Server, DGL-Sampler, and DGL-Trainer may each get their own native workloads (Pods).

So, as long as the training workload is not upgraded to support an arbitrary number of these three components, the training can fully follow the MPI manner.

Conclusion

In view of the above discussion, I can give 3 options:

  • Option 1: Entirely merge into MPI Operator
  • Option 2: Make DGL training a two-stage Operator mechanism (supports ~90% of user cases)
  • Option 3: Be an independent Operator (fully supports all user cases)

Option 1

In order to merge entirely into MPI Operator and fully follow the MPI manner, we should only support MSP and a fixed 1-1-1 training workload, dropping SSP, SMP, and the advanced tri-workload training.

Option 2

In order to support ~90% of user cases while merging into MPI Operator as much as possible, DGL Operator would be an upper layer on top of MPI Operator and only handle the SSP and SMP parts, while MPI Operator would do the same things as in Option 1.

Something like Wang Zhang's idea

Option 3

Be an independent Operator. DGL Operator may then be seen as a GNN-focused Operator, whereas the TF, PyTorch, and Horovod operators are intended for CNN/RNN problems.

Appendix

[1]: Putting the Partitioner logic manually into MPI Operator could make SSP and SMP follow the MPI Operator manner, but I think this would break the specificity of MPI Operator.

@zw0610 (Member) commented Jul 23, 2021

How thrilling to have a GNN training operator in the Kubeflow community!

Regarding the three partition modes, the SSP mode is more like the single-client mode in TensorFlow PS/Worker training if we can merge the Launcher and the Partitioner. From my understanding, the main container in the Partitioner can be converted into an initContainer in the Launcher pod in place of watcher-loop-partitioner, as long as the partitioner quits before the workers start training. Please do correct me if I'm wrong.
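
Roughly, this suggestion would make the Launcher pod template look like the sketch below (images and commands are placeholders):

```yaml
# Sketch only: the partitioner runs as an initContainer of the Launcher pod,
# replacing watcher-loop-partitioner; it must exit before the workers start training.
spec:
  initContainers:
    - name: partitioner
      image: dgl-partitioner:placeholder     # placeholder image
      command: ["dglrun", "--partition"]     # placeholder command
  containers:
    - name: launcher
      image: dgl-launcher:placeholder        # placeholder image
```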

The SMP mode, which also needs a single Partitioner Pod, can follow the same fashion mentioned above. However, I'm concerned about how the output of pm_dglpart should be distributed and synchronized to the worker pods. But with kubectl installed in the Launcher pod, the worst case is to use `kubectl cp`.

The MSP mode is a regular MPIJob case, just like Xiaoyu mentioned.

If this suggestion is workable, Option 1 should be feasible for all user cases. If I missed something somewhere, please ping me at any time.

@terrytangyuan (Member)

@zw0610 @ryantd Given that you have discussed this separately offline, could either of you provide a summary of your discussions here as well so that everyone will be on the same page?

@zw0610 (Member) commented Jul 24, 2021 via email

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
```yaml
  name: dgl-graphsage
  namespace: dgl-operator
spec:
  cleanPodPolicy: Running
```
Member

Hi Xiaoyu and community :) I wonder whether it would be better to use the runPolicy field (https://github.com/kubeflow/common/blob/master/pkg/apis/common/v1/types.go#L176), as other operators do, instead of using cleanPodPolicy directly?

For example, tfjob: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/tensorflow/v1/types.go#L55;
pytorchjob: https://github.com/kubeflow/training-operator/blob/master/pkg/apis/pytorch/v1/types.go#L53.

For a standalone operator, this works fine; but in the all-in-one training-operator, all of the controllers use the same base job controller, whose clean-up logic tries to read runPolicy.cleanPodPolicy. As with mpi-operator migrating into training-operator, the mismatched fields will be a headache.
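
In other words, the suggestion is to nest the clean-up setting under runPolicy, as the other job types do; a sketch, with the field names taken from the quoted example and kubeflow/common:

```yaml
# current proposal
spec:
  cleanPodPolicy: Running
---
# suggested shape, following the common RunPolicy used by tfjob/pytorchjob
spec:
  runPolicy:
    cleanPodPolicy: Running
```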

@stale stale bot removed the lifecycle/stale label Mar 22, 2022
stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
