
[WIP] Add e2e test for tune api with LLM hyperparameter optimization #2420

Open · wants to merge 48 commits into base: master

Conversation

helenxie-bit
Contributor

What this PR does / why we need it:
This PR adds an e2e test for the tune API, specifically for the scenario of importing external models and datasets for LLM hyperparameter optimization, as sketched below.
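
For context, here is a minimal sketch of the kind of tune call this test exercises, importing an external model and dataset from Hugging Face. It assumes the LLM hyperparameter optimization interface introduced in #2339; the class names (HuggingFaceModelParams, HuggingFaceDatasetParams, HuggingFaceTrainerParams) and parameters are drawn from that interface and may differ in detail from the final API:

  # Minimal sketch, assuming the tune API from #2339; names may differ in detail.
  import kubeflow.katib as katib
  from kubeflow.katib import KatibClient
  from kubeflow.storage_initializer.hugging_face import (
      HuggingFaceDatasetParams,
      HuggingFaceModelParams,
      HuggingFaceTrainerParams,
  )
  from peft import LoraConfig
  from transformers import AutoModelForSequenceClassification, TrainingArguments

  cl = KatibClient(namespace="default")
  cl.tune(
      name="tune-llm-e2e",
      # External model pulled from Hugging Face Hub.
      model_provider_parameters=HuggingFaceModelParams(
          model_uri="hf://google-bert/bert-base-cased",
          transformer_type=AutoModelForSequenceClassification,
      ),
      # External dataset pulled from Hugging Face Hub.
      dataset_provider_parameters=HuggingFaceDatasetParams(
          repo_id="yelp_review_full",
          split="train[:100]",
      ),
      # Hyperparameters to optimize are declared with katib.search.
      trainer_parameters=HuggingFaceTrainerParams(
          training_parameters=TrainingArguments(
              output_dir="results",
              save_strategy="no",
              learning_rate=katib.search.double(min=1e-05, max=5e-05),
              num_train_epochs=1,
          ),
          lora_config=LoraConfig(
              r=katib.search.int(min=8, max=32),
              lora_alpha=8,
              lora_dropout=0.1,
          ),
      ),
      objective_metric_name="train_loss",
      objective_type="minimize",
      algorithm_name="random",
      max_trial_count=1,
      parallel_trial_count=1,
  )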

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: helenxie-bit <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@helenxie-bit
Contributor Author

/area gsoc

@helenxie-bit
Contributor Author

Ref: #2339

@helenxie-bit helenxie-bit changed the title [GSoC] Add e2e test for tune api with LLM hyperparameter optimization [WIP] Add e2e test for tune api with LLM hyperparameter optimization Sep 3, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Sep 3, 2024
@helenxie-bit
Contributor Author

helenxie-bit commented Sep 24, 2024

The e2e test for the tune API has been consistently failing with a "Timeout Error," and I have been investigating the root cause. I set the retain_trials parameter to True and retrieved the logs from the trial pod of the Experiment. They revealed that both the pytorch container and the metrics-logger-and-collector container exited with error code 137, i.e. 128 + 9, meaning the containers were killed with SIGKILL.
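
(For reference, since the trials were retained, the per-container logs can also be pulled programmatically; here is a minimal sketch using the official kubernetes Python client, with the pod name left as a placeholder:)

  # Sketch: fetch logs for every container of a retained trial pod.
  # POD_NAME is a placeholder for the actual trial pod name.
  from kubernetes import client, config

  config.load_kube_config()
  v1 = client.CoreV1Api()

  pod_name = "POD_NAME"
  pod = v1.read_namespaced_pod(name=pod_name, namespace="default")
  for container in pod.spec.containers:  # pytorch, metrics-logger-and-collector
      logs = v1.read_namespaced_pod_log(
          name=pod_name, namespace="default", container=container.name
      )
      print(f"--- {container.name} ---\n{logs}")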

When I ran kubectl describe pod $POD_NAME -n default, I noticed the following events. One specific event, "SandboxChanged," stood out as potentially problematic:

Events:
  Type    Reason          Age                    From               Message
  ----    ------          ----                   ----               -------
  ...
  Normal  SandboxChanged  3m (x2 over 3m43s)     kubelet            Pod sandbox changed, it will be killed and re-created.
  ...

However, when I checked the pod logs using kubectl logs $POD_NAME -n default --all-containers, everything appeared normal, and the logs confirmed that "Training is complete."

I also examined the kubelet and container runtime logs. While the kubelet logs provided no additional insights, the container runtime logs displayed the following error, which I believe may be related to the issue:

Sep 29 19:59:04 fv-az1986-610 dockerd[3342]: time="2024-09-29T19:59:04.631799544Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=bfff1b5f24d7ebcdc51d0dabe807e391053c4a4065a404203e266c5341bbfbbe spanID=6c2dd21dc1394346 traceID=738eda3fc86a653490d5534deb664c93

This suggests the containers did not exit within 30 seconds of receiving SIGTERM (signal 15) and were then force-killed by the container runtime, which is consistent with the 137 exit codes above.

@andreyvelich @tenzen-y Do you have any thoughts on how to resolve this issue?
