Katib Trials are failing (Unable to read file /var/log/katib/6.pid) #391

Open
basti-j opened this issue Nov 25, 2021 · 3 comments


basti-j commented Nov 25, 2021

Environment
KF: 1.3 (on prem, via ArgoFlow with rook ceph; without Rok)
Kale: 0.7.0
Katib: 0.11

Katib-Controller: gcr.io/arrikto/katib-controller/v0.11.1-8-gfab7fb06
Katib chocolate service: gcr.io/arrikto/suggestion-chocolate:v0.11.1-8-gfab7fb06

Set up a PodDefault for accessing KF Pipelines (as described here: https://www.kubeflow.org/docs/components/pipelines/sdk/connect-api/#multi-user-mode). I see examples that use KF_PIPELINES_SA_TOKEN_PATH and others that use ML_PIPELINE_SA_TOKEN_PATH. Is there a difference, and should I set ML_PIPELINE_SA_TOKEN_PATH instead?

Set up the pipeline-runner SA
Set up a RoleBinding for workflows
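
To narrow down the token-path question above, a quick check from inside a notebook or trial pod can show which of the two variables is actually set and whether the file it points to exists. This is only a sketch: the two variable names come from this issue, the helper name `check_token_env` is made up for illustration, and which variable a given KFP client build reads is an assumption to verify against that build.

```python
import os
from pathlib import Path

# Both names appear in this issue. Which one the installed KFP client
# actually reads is an assumption that has to be checked separately.
CANDIDATE_VARS = ("KF_PIPELINES_SA_TOKEN_PATH", "ML_PIPELINE_SA_TOKEN_PATH")


def check_token_env(environ=None):
    """Report, per variable, whether it is set and points to a non-empty file."""
    environ = os.environ if environ is None else environ
    report = {}
    for name in CANDIDATE_VARS:
        path = environ.get(name)
        if not path:
            report[name] = "unset"
        elif not Path(path).is_file():
            report[name] = "set, but file missing"
        elif Path(path).stat().st_size == 0:
            report[name] = "set, but file empty"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_token_env().items():
        print(f"{name}: {status}")
```

Running this in both the notebook pod (where pipelines work) and a trial pod (where they fail) may show whether the PodDefault is applied to the trial pods at all.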

Current Status
I can successfully compile and run pipelines from notebooks using the Kale base image, and the collected metric appears in the run output.

Problem/Bug
I can also trigger a Katib job for the same pipeline. The Katib experiment and its trials are created, but the trials always fail. Based on the trial logs, the metrics collector seems unable to read the metrics. Or is this a permission issue?

Logs
kale.log (looks fine)

2021-11-25 12:35:14 _client:452 [[INFO]] Creating experiment kale-gpu-xime0.
2021-11-25 12:35:14 run:83 [[DEBUG]] [TID=kwjdea8w22] [] Decoding ctx of RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:95 [[DEBUG]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:104 [[DEBUG]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:113 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'create_katib_experiment(pipeline_id=4721f911-0468-49c9-b91d-421ff444280b, version_id=8d05817c-b447-4b16-bc80-d4957e2d6f18, pipeline_metadata={'autosnapshot': False, 'docker_image': 'karstenpufahl/crtx-tensorflow-full-gpu-kale:0.0.1', 'experiment': {'id': 'new', 'name': 'kale-gpu'}, 'experiment_name': 'kale-gpu-xime0', 'katib_metadata': {'parameters': [{'feasibleSpace': {'list': ['0.001', '0.006', '0.0001']}, 'name': 'LR', 'parameterType': 'categorical'}], 'objective': {'additionalMetricNames': [], 'goal': 90, 'objectiveMetricName': 'test-accuracy-resnet', 'type': 'maximize'}, 'algorithm': {'algorithmName': 'grid'}, 'maxTrialCount': 3, 'maxFailedTrialCount': 3, 'parallelTrialCount': 1}, 'katib_run': True, 'pipeline_description': '', 'pipeline_name': 'dog-breed-gpu-resnet', 'snapshot_volumes': False, 'steps_defaults': ['label:access-ml-pipeline:true'], 'volume_access_mode': 'rwm', 'volumes': [{'annotations': [], 'mount_point': '/home/jovyan/data', 'name': 'kale-example-data-rwm', 'size': 1, 'size_type': 'Gi', 'snapshot': False, 'snapshot_name': '', 'type': 'pvc'}]}, output_path=/home/jovyan/data/examples/dog-breed-classification)'
2021-11-25 12:35:14 katibutils:347 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Discovering Katib version...
2021-11-25 12:35:14 katibutils:337 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Listing Katib v1beta1 Experiment n namespace 'kubeflow-user'...
2021-11-25 12:35:14 katibutils:341 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Successfully retrieved 4 v1beta1 Experiments
2021-11-25 12:35:14 katibutils:364 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Found Katib version v1beta1
2021-11-25 12:35:14 katib:118 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Saving Katib experiment definition at /home/jovyan/data/examples/dog-breed-classification/kale-gpu-xime0.katib.yaml
2021-11-25 12:35:14 katibutils:324 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Creating Katib Experiment 'kubeflow-user/kale-gpu-xime0'...
2021-11-25 12:35:14 katibutils:330 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Successfully created Katib Experiment!
2021-11-25 12:35:14 run:83 [[DEBUG]] [TID=39ipo212kb] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:95 [[DEBUG]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:104 [[DEBUG]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:113 [[INFO]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:18 run:83 [[DEBUG]] [TID=j94xbrcpom] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:95 [[DEBUG]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:104 [[DEBUG]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:113 [[INFO]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:20 run:83 [[DEBUG]] [TID=qn4ydo9fxp] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:95 [[DEBUG]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:104 [[DEBUG]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:113 [[INFO]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:24 run:83 [[DEBUG]] [TID=j0x0w4155o] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:95 [[DEBUG]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:104 [[DEBUG]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:113 [[INFO]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:26 run:83 [[DEBUG]] [TID=r82jur4749] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:95 [[DEBUG]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:104 [[DEBUG]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:113 [[INFO]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:30 run:83 [[DEBUG]] [TID=4m71ofcwuk] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:95 [[DEBUG]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:104 [[DEBUG]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:113 [[INFO]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:32 run:83 [[DEBUG]] [TID=j9774vwosv] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:95 [[DEBUG]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:104 [[DEBUG]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:113 [[INFO]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'

katib runner logs (looks fine):

DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call

DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 0
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.001
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials


INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call

DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 1
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.006
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials


INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call

DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 2
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.0001
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------

INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials

Failing logs of one trial:

I1125 12:35:32.655194      12 main.go:342] Trial Name: kale-gpu-xime0-08d9371f
I1125 12:35:33.100939      12 main.go:136] 2021-11-25 12:35:33 Kale kfputils:176         [INFO]     Creating KFP experiment 'kale-gpu-xime0'...
I1125 12:35:33.101570      12 main.go:136] Traceback (most recent call last):
I1125 12:35:33.101575      12 main.go:136]   File "<string>", line 1, in <module>
I1125 12:35:33.101583      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kale/common/katibutils.py", line 152, in create_and_wait_kfp_run
I1125 12:35:33.101781      12 main.go:136]     **kwargs)
I1125 12:35:33.101796      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kale/common/kfputils.py", line 177, in run_pipeline
I1125 12:35:33.101813      12 main.go:136]     experiment = client.create_experiment(experiment_name)
I1125 12:35:33.101817      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp/_client.py", line 342, in create_experiment
I1125 12:35:33.101835      12 main.go:136]     experiment = self._experiment_api.create_experiment(body=experiment)
I1125 12:35:33.101839      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api/experiment_service_api.py", line 187, in create_experiment
I1125 12:35:33.101949      12 main.go:136]     return self.create_experiment_with_http_info(body, **kwargs)  # noqa: E501
I1125 12:35:33.101964      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api/experiment_service_api.py", line 285, in create_experiment_with_http_info
I1125 12:35:33.101978      12 main.go:136]     collection_formats=collection_formats)
I1125 12:35:33.101982      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 369, in call_api
I1125 12:35:33.102078      12 main.go:136]     _preload_content, _request_timeout, _host)
I1125 12:35:33.102085      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 185, in __call_api
I1125 12:35:33.102095      12 main.go:136]     _request_timeout=_request_timeout)
I1125 12:35:33.102098      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 413, in request
I1125 12:35:33.102211      12 main.go:136]     body=body)
I1125 12:35:33.102217      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/rest.py", line 271, in POST
I1125 12:35:33.102288      12 main.go:136]     body=body)
I1125 12:35:33.102293      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/kfp_server_api/rest.py", line 168, in request
I1125 12:35:33.102348      12 main.go:136]     headers=headers)
I1125 12:35:33.102354      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 79, in request
I1125 12:35:33.102402      12 main.go:136]     method, url, fields=fields, headers=headers, **urlopen_kw
I1125 12:35:33.102409      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 170, in request_encode_body
I1125 12:35:33.102490      12 main.go:136]     return self.urlopen(method, url, **extra_kw)
I1125 12:35:33.102497      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/poolmanager.py", line 375, in urlopen
I1125 12:35:33.102569      12 main.go:136]     response = conn.urlopen(method, u.request_uri, **kw)
I1125 12:35:33.102573      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
I1125 12:35:33.102741      12 main.go:136]     chunked=chunked,
I1125 12:35:33.102754      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
I1125 12:35:33.102765      12 main.go:136]     conn.request(method, url, **httplib_request_kw)
I1125 12:35:33.102771      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
I1125 12:35:33.102854      12 main.go:136]     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
I1125 12:35:33.102862      12 main.go:136]   File "/usr/local/lib/python3.6/http/client.py", line 1287, in request
I1125 12:35:33.103073      12 main.go:136]     self._send_request(method, url, body, headers, encode_chunked)
I1125 12:35:33.103081      12 main.go:136]   File "/usr/local/lib/python3.6/http/client.py", line 1328, in _send_request
I1125 12:35:33.103295      12 main.go:136]     self.putheader(hdr, value)
I1125 12:35:33.103302      12 main.go:136]   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 219, in putheader
I1125 12:35:33.103365      12 main.go:136]     _HTTPConnection.putheader(self, header, *values)
I1125 12:35:33.103373      12 main.go:136]   File "/usr/local/lib/python3.6/http/client.py", line 1264, in putheader
I1125 12:35:33.103606      12 main.go:136]     if _is_illegal_header_value(values[i]):
I1125 12:35:33.103616      12 main.go:136] TypeError: expected string or bytes-like object
F1125 12:35:33.669121      12 main.go:365] Failed to wait for worker container: Training container is failed. Unable to read file /var/log/katib/6.pid for pid 6, error: open /var/log/katib/6.pid: no such file or directory
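
The traceback above ends in `http.client`'s `putheader`, which raises this `TypeError` when a header value is neither a string, bytes, nor an int (for example `None`). The sketch below reproduces that behaviour with the standard library only. The header name `Authorization` and the helper `header_value_error` are assumptions for illustration; the log does not show which header carried the bad value.

```python
import http.client


def header_value_error(value):
    """Return the exception raised when `value` is sent as a header value, or None."""
    # Creating the connection object does not open a socket.
    conn = http.client.HTTPConnection("localhost")
    # putrequest only buffers the request line; nothing is sent yet.
    conn.putrequest("POST", "/")
    try:
        # "Authorization" is an assumed header name, chosen for illustration.
        conn.putheader("Authorization", value)
    except (TypeError, ValueError) as exc:
        return exc
    return None


if __name__ == "__main__":
    # A missing token (None) triggers the same TypeError class as in the trial log.
    print(type(header_value_error(None)).__name__)
    # A plain string value is accepted.
    print(header_value_error("Bearer dummy-token"))
```

If the client builds a header from a token or cookie that was never loaded, this is one way the error in the trial log could arise, and the later "Unable to read file /var/log/katib/6.pid" message would then be a follow-on symptom of the crashed training container rather than the root cause.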

Thanks a lot :)!


drawesomenic commented Jan 5, 2022

@basti-j same issue here. Were you able to fix it?
EDIT: I was able to solve it by adding the additional env variable and binding pipeline-runner to the correct cluster role, see kubeflow/katib#1454 (comment)


amouu commented Apr 25, 2022

> @basti-j same issue here. Were you able to fix it? EDIT: I was able to solve it by adding the additional env variable and binding pipeline-runner to the correct cluster role, see kubeflow/katib#1454 (comment)

Hi, I have already added the additional env variables, created a new ServiceAccount named "pipeline-runner", and bound the "kubeflow-edit" cluster role to it.
But I still get the error "TypeError: expected string or bytes-like object".
Could you help me? Thanks a lot.
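
For reference, the binding described in this comment could be written out roughly as below. This is a sketch, not a confirmed fix: the names `pipeline-runner` and `kubeflow-edit` come from this thread, the namespace `kubeflow-user` comes from the kale.log above, and the helper `role_binding` and the generated binding name are hypothetical. Whether a namespaced RoleBinding or a ClusterRoleBinding is needed, and whether `kubeflow-edit` is the right role, is not established here.

```python
import json


def role_binding(namespace, service_account, cluster_role):
    """Build a RoleBinding manifest that grants a ClusterRole to a ServiceAccount."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            # Hypothetical binding name, derived from the two inputs.
            "name": f"{service_account}-{cluster_role}",
            "namespace": namespace,
        },
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {
                "kind": "ServiceAccount",
                "name": service_account,
                "namespace": namespace,
            }
        ],
    }


if __name__ == "__main__":
    manifest = role_binding("kubeflow-user", "pipeline-runner", "kubeflow-edit")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be reviewed and then passed to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.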

@alibahramian

@basti-j I had the same issue and solved it by increasing ephemeral-storage. Check the events of your failed job:

Warning  FailedCreate      31m                  job-controller  Error creating: pods "random-j5cntq8j-nbb5z" is forbidden: exceeded quota: default, requested: limits.ephemeral-storage=5220Mi, used: limits.ephemeral-storage=10340Mi, limited: limits.ephemeral-storage=15Gi
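
The numbers in that event can be checked by hand: 10340Mi already used plus 5220Mi requested is 15560Mi, which is 200Mi over the 15Gi (15360Mi) limit. A small sketch of the same arithmetic follows; the helper names are made up for illustration, and only the binary suffixes that appear in the event (plus Ki) are handled.

```python
# Binary suffixes as used in Kubernetes quantities; only these are handled here.
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_bytes(quantity):
    """Convert a quantity such as '15Gi' or '5220Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count without a suffix


def quota_headroom_mib(requested, used, limit):
    """Return the MiB left after the request; negative means the quota is exceeded."""
    remaining = to_bytes(limit) - to_bytes(used) - to_bytes(requested)
    return remaining // 2**20


if __name__ == "__main__":
    # Values taken from the FailedCreate event above.
    print(quota_headroom_mib("5220Mi", "10340Mi", "15Gi"))  # prints -200
```

A negative result means the pod cannot be admitted until the quota is raised or other pods release their ephemeral-storage limits.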
