This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Allow 0 worker in pytorch plugins & Add objectMeta to PyTorchJob #348

Open
ByronHsu wants to merge 1 commit into master from oss/pytorch-fix

Conversation

@ByronHsu (Contributor) commented May 9, 2023

TL;DR

  1. Allow the Kubeflow PyTorch plugin to be configured with 0 workers. In this case, the training job runs on a single machine (the master node) without any additional worker nodes (see the sketch after this list).
  2. Pass objectMeta to PyTorchJob
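
To make the 0-worker case concrete, here is a minimal Go sketch of how the plugin could assemble the replica specs: the Worker entry is added only when at least one worker is requested, so with 0 workers the job consists of the master replica alone. The import paths, constant names, and the masterSpec/workerSpec arguments are assumptions for this sketch, not necessarily the plugin's actual code.

package pytorch

import (
    commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
    kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
)

// buildJobSpec sketches the 0-worker case: when workers == 0 the Worker
// replica spec is simply omitted and training runs on the master pod alone.
func buildJobSpec(workers int32, masterSpec, workerSpec *commonv1.ReplicaSpec) kubeflowv1.PyTorchJobSpec {
    replicaSpecs := map[commonv1.ReplicaType]*commonv1.ReplicaSpec{
        kubeflowv1.PyTorchJobReplicaTypeMaster: masterSpec,
    }
    if workers > 0 {
        workerSpec.Replicas = &workers
        replicaSpecs[kubeflowv1.PyTorchJobReplicaTypeWorker] = workerSpec
    }
    return kubeflowv1.PyTorchJobSpec{PyTorchReplicaSpecs: replicaSpecs}
}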

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

@ByronHsu requested review from fg91 and igorvalko on May 9, 2023 05:14
@ByronHsu force-pushed the oss/pytorch-fix branch from e7f3888 to 2065a96 on May 9, 2023 05:18
@codecov (bot) commented May 9, 2023

Codecov Report

Merging #348 (dce4e46) into master (76a80ec) will increase coverage by 1.30%.
The diff coverage is 100.00%.

❗ Current head dce4e46 differs from pull request most recent head 2065a96. Consider uploading reports for the commit 2065a96 to get more accurate results.

@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   62.76%   64.06%   +1.30%     
==========================================
  Files         148      148              
  Lines       12444    10080    -2364     
==========================================
- Hits         7810     6458    -1352     
+ Misses       4038     3026    -1012     
  Partials      596      596              
Flag        Coverage Δ
unittests   ?

Flags with carried forward coverage won't be shown.

Impacted Files                                           Coverage Δ
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go    79.80% <100.00%> (+0.28%) ⬆️

... and 130 files with indirect coverage changes

@pingsutw (Member) left a comment

Is there any benefit to using only one master node in the pytorch CRD? People can just run pytorch in a regular python task for single-node training, right?

@fg91 (Member) commented May 9, 2023

> Is there any benefit to using only one master node in the pytorch CRD? People can just run pytorch in a regular python task for single-node training, right?

I can only imagine the env vars set by the operator, like world size, rank, ...? In the torch elastic task we opted to run single-worker trainings in a normal python task/single pod so that users don't need the training operator.

@ByronHsu (Contributor, Author) commented May 9, 2023

In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.

Although a python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.
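
To make the argument concrete, here is a minimal sketch of the relaxed validation being proposed; the function name and error message are assumptions, not the plugin's actual code:

package main

import "fmt"

// validateWorkers rejects only negative worker counts; zero is accepted and
// simply means the job runs on the master replica alone.
func validateWorkers(workers int32) error {
    if workers < 0 {
        return fmt.Errorf("number of worker replicas must be non-negative, got %d", workers)
    }
    return nil
}

func main() {
    fmt.Println(validateWorkers(0)) // prints <nil>: zero workers is allowed
}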

job := &kubeflowv1.PyTorchJob{
    TypeMeta: metav1.TypeMeta{
        Kind:       kubeflowv1.PytorchJobKind,
        APIVersion: kubeflowv1.SchemeGroupVersion.String(),
    },
    Spec:       jobSpec,
    ObjectMeta: *objectMeta,
}
A Contributor left a review comment on this diff:

Is this just to set labels / annotations on the CR the same as the replicas? If this is the route we want to go, this should probably be done for all kf operator plugins.
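
For context, a sketch of how the snippet above could be fed, with the CR-level metadata carrying the task's labels and annotations; the metav1 import ("k8s.io/apimachinery/pkg/apis/meta/v1") and the label/annotation values here are purely illustrative, not what the plugin actually sets:

// Assumes jobSpec, kubeflowv1, and metav1 are as in the snippet above;
// the label and annotation values below are illustrative only.
objectMeta := &metav1.ObjectMeta{
    Labels:      map[string]string{"execution-id": "abc123"},
    Annotations: map[string]string{"owner": "flytepropeller"},
}
job := &kubeflowv1.PyTorchJob{
    TypeMeta: metav1.TypeMeta{
        Kind:       kubeflowv1.PytorchJobKind,
        APIVersion: kubeflowv1.SchemeGroupVersion.String(),
    },
    ObjectMeta: *objectMeta,
    Spec:       jobSpec,
}

If this is indeed just for labels/annotations on the CR, the same pattern would presumably apply to the other kf operator plugins, as the reviewer notes.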

@fg91 (Member) commented May 9, 2023

> In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers.
>
> Although a python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because the PyTorch operator allows it.

Fair point. Users can still switch to a python task if they prefer by removing the task_config=Pytorch(....

Only thing I wonder: in the new pytorch elastic task we decided that with nnodes=1 (meaning only a single worker/pod) we will use a normal python task/pod so that users can run this without the operator (see here). Should we maybe change this so that elastic and non-elastic pytorch dist training don't have different behaviour? @kumare3 argued that it would be nice to allow users to use torchrun without the operator.

@kumare3 (Contributor) commented May 10, 2023

I actually want to get rid of the required dependency on PytorchOperator for simple single-node training, which can suffice in many cases. This actually makes scaling really nice: you start with one node and simply scale to more nodes - to scale you may need to deploy the operator?

This is why when nnodes=1, we just change the task type itself. WDYT? @fg91 and @ByronHsu

@ByronHsu (Contributor, Author) commented:

Could you elaborate on the drawback of using the PyTorch Operator for the single-node training case?

@kumare3 (Contributor) commented May 10, 2023

@ByronHsu - FlytePropeller is far more efficient at allocating resources, retrying, and completing sooner. Also, for single-node training this is faster: it runs without needing an operator and does not require a CRD to be created.

@ByronHsu (Contributor, Author) commented:

@kumare3 skipping the CRD part can definitely be faster. Thanks, I will raise a corresponding PR in flytekit.

@ByronHsu (Contributor, Author) commented:

Will merge with #345 and do an integration test.

@fg91 (Member) commented Aug 16, 2023

Is this still supposed to be pushed over the finish line, or shall we close it?
