Setup pipeline-runner SA
Setup Rolebinding for workflows
Current Status
I can successfully compile and run pipelines from notebooks via the Kale base image, and the collected metrics appear in the run output.
Problem/Bug
I can trigger a Katib job for the same pipeline. The Katib experiment and its trials get created, but the trials always fail. Based on the trial logs, the metrics collector seems unable to read the metrics. Or is this a permission issue?
Logs kale.log (looks fine)
2021-11-25 12:35:14 _client:452 [[INFO]] Creating experiment kale-gpu-xime0.
2021-11-25 12:35:14 run:83 [[DEBUG]] [TID=kwjdea8w22] [] Decoding ctx of RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:95 [[DEBUG]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:104 [[DEBUG]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.create_katib_experiment'
2021-11-25 12:35:14 run:113 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'create_katib_experiment(pipeline_id=4721f911-0468-49c9-b91d-421ff444280b, version_id=8d05817c-b447-4b16-bc80-d4957e2d6f18, pipeline_metadata={'autosnapshot': False, 'docker_image': 'karstenpufahl/crtx-tensorflow-full-gpu-kale:0.0.1', 'experiment': {'id': 'new', 'name': 'kale-gpu'}, 'experiment_name': 'kale-gpu-xime0', 'katib_metadata': {'parameters': [{'feasibleSpace': {'list': ['0.001', '0.006', '0.0001']}, 'name': 'LR', 'parameterType': 'categorical'}], 'objective': {'additionalMetricNames': [], 'goal': 90, 'objectiveMetricName': 'test-accuracy-resnet', 'type': 'maximize'}, 'algorithm': {'algorithmName': 'grid'}, 'maxTrialCount': 3, 'maxFailedTrialCount': 3, 'parallelTrialCount': 1}, 'katib_run': True, 'pipeline_description': '', 'pipeline_name': 'dog-breed-gpu-resnet', 'snapshot_volumes': False, 'steps_defaults': ['label:access-ml-pipeline:true'], 'volume_access_mode': 'rwm', 'volumes': [{'annotations': [], 'mount_point': '/home/jovyan/data', 'name': 'kale-example-data-rwm', 'size': 1, 'size_type': 'Gi', 'snapshot': False, 'snapshot_name': '', 'type': 'pvc'}]}, output_path=/home/jovyan/data/examples/dog-breed-classification)'
2021-11-25 12:35:14 katibutils:347 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Discovering Katib version...
2021-11-25 12:35:14 katibutils:337 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Listing Katib v1beta1 Experiment n namespace 'kubeflow-user'...
2021-11-25 12:35:14 katibutils:341 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Successfully retrieved 4 v1beta1 Experiments
2021-11-25 12:35:14 katibutils:364 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Found Katib version v1beta1
2021-11-25 12:35:14 katib:118 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Saving Katib experiment definition at /home/jovyan/data/examples/dog-breed-classification/kale-gpu-xime0.katib.yaml
2021-11-25 12:35:14 katibutils:324 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Creating Katib Experiment 'kubeflow-user/kale-gpu-xime0'...
2021-11-25 12:35:14 katibutils:330 [[INFO]] [TID=kwjdea8w22] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Successfully created Katib Experiment!
2021-11-25 12:35:14 run:83 [[DEBUG]] [TID=39ipo212kb] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:95 [[DEBUG]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:104 [[DEBUG]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:14 run:113 [[INFO]] [TID=39ipo212kb] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:18 run:83 [[DEBUG]] [TID=j94xbrcpom] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:95 [[DEBUG]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:104 [[DEBUG]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:18 run:113 [[INFO]] [TID=j94xbrcpom] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:20 run:83 [[DEBUG]] [TID=qn4ydo9fxp] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:95 [[DEBUG]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:104 [[DEBUG]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:20 run:113 [[INFO]] [TID=qn4ydo9fxp] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:24 run:83 [[DEBUG]] [TID=j0x0w4155o] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:95 [[DEBUG]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:104 [[DEBUG]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:24 run:113 [[INFO]] [TID=j0x0w4155o] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:26 run:83 [[DEBUG]] [TID=r82jur4749] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:95 [[DEBUG]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:104 [[DEBUG]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:26 run:113 [[INFO]] [TID=r82jur4749] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
2021-11-25 12:35:30 run:83 [[DEBUG]] [TID=4m71ofcwuk] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:95 [[DEBUG]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:104 [[DEBUG]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:30 run:113 [[INFO]] [TID=4m71ofcwuk] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-sfexz, namespace=kubeflow-user)'
2021-11-25 12:35:32 run:83 [[DEBUG]] [TID=j9774vwosv] [] Decoding ctx of RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:95 [[DEBUG]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Decoding kwargs of RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:104 [[DEBUG]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Importing RPC function 'katib.get_experiment'
2021-11-25 12:35:32 run:113 [[INFO]] [TID=j9774vwosv] [/home/jovyan/data/examples/dog-breed-classification/dog-breed-v2-gpu-katib.ipynb] Executing RPC function 'get_experiment(experiment=kale-gpu-xime0, namespace=kubeflow-user)'
katib runner logs (looks fine):
DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call
DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 0
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.001
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials
INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call
DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 1
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.006
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials
INFO:pkg.suggestion.v1beta1.chocolate.base_service:----------------------------------------------------------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New GetSuggestions call
DEBUG:filelock:Attempting to acquire lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 acquired on my_db.db?check_same_thread=False.lock
DEBUG:filelock:Attempting to release lock 140324854106040 on my_db.db?check_same_thread=False.lock
INFO:filelock:Lock 140324854106040 released on my_db.db?check_same_thread=False.lock
INFO:pkg.suggestion.v1beta1.chocolate.base_service:New suggested parameters for Trial with chocolate_id: 2
INFO:pkg.suggestion.v1beta1.chocolate.base_service:Name = LR, Value = 0.0001
INFO:pkg.suggestion.v1beta1.chocolate.base_service:--------------------------------------------------
INFO:pkg.suggestion.v1beta1.chocolate.base_service:GetSuggestions returns 1 new Trials
Failing logs of one trial:
I1125 12:35:32.655194 12 main.go:342] Trial Name: kale-gpu-xime0-08d9371f
I1125 12:35:33.100939 12 main.go:136] 2021-11-25 12:35:33 Kale kfputils:176 [INFO] Creating KFP experiment 'kale-gpu-xime0'...
I1125 12:35:33.101570 12 main.go:136] Traceback (most recent call last):
I1125 12:35:33.101575 12 main.go:136] File "<string>", line 1, in <module>
I1125 12:35:33.101583 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kale/common/katibutils.py", line 152, in create_and_wait_kfp_run
I1125 12:35:33.101781 12 main.go:136] **kwargs)
I1125 12:35:33.101796 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kale/common/kfputils.py", line 177, in run_pipeline
I1125 12:35:33.101813 12 main.go:136] experiment = client.create_experiment(experiment_name)
I1125 12:35:33.101817 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp/_client.py", line 342, in create_experiment
I1125 12:35:33.101835 12 main.go:136] experiment = self._experiment_api.create_experiment(body=experiment)
I1125 12:35:33.101839 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api/experiment_service_api.py", line 187, in create_experiment
I1125 12:35:33.101949 12 main.go:136] return self.create_experiment_with_http_info(body, **kwargs) # noqa: E501
I1125 12:35:33.101964 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api/experiment_service_api.py", line 285, in create_experiment_with_http_info
I1125 12:35:33.101978 12 main.go:136] collection_formats=collection_formats)
I1125 12:35:33.101982 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 369, in call_api
I1125 12:35:33.102078 12 main.go:136] _preload_content, _request_timeout, _host)
I1125 12:35:33.102085 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 185, in __call_api
I1125 12:35:33.102095 12 main.go:136] _request_timeout=_request_timeout)
I1125 12:35:33.102098 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/api_client.py", line 413, in request
I1125 12:35:33.102211 12 main.go:136] body=body)
I1125 12:35:33.102217 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/rest.py", line 271, in POST
I1125 12:35:33.102288 12 main.go:136] body=body)
I1125 12:35:33.102293 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/kfp_server_api/rest.py", line 168, in request
I1125 12:35:33.102348 12 main.go:136] headers=headers)
I1125 12:35:33.102354 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 79, in request
I1125 12:35:33.102402 12 main.go:136] method, url, fields=fields, headers=headers, **urlopen_kw
I1125 12:35:33.102409 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/request.py", line 170, in request_encode_body
I1125 12:35:33.102490 12 main.go:136] return self.urlopen(method, url, **extra_kw)
I1125 12:35:33.102497 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/poolmanager.py", line 375, in urlopen
I1125 12:35:33.102569 12 main.go:136] response = conn.urlopen(method, u.request_uri, **kw)
I1125 12:35:33.102573 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
I1125 12:35:33.102741 12 main.go:136] chunked=chunked,
I1125 12:35:33.102754 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 394, in _make_request
I1125 12:35:33.102765 12 main.go:136] conn.request(method, url, **httplib_request_kw)
I1125 12:35:33.102771 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 234, in request
I1125 12:35:33.102854 12 main.go:136] super(HTTPConnection, self).request(method, url, body=body, headers=headers)
I1125 12:35:33.102862 12 main.go:136] File "/usr/local/lib/python3.6/http/client.py", line 1287, in request
I1125 12:35:33.103073 12 main.go:136] self._send_request(method, url, body, headers, encode_chunked)
I1125 12:35:33.103081 12 main.go:136] File "/usr/local/lib/python3.6/http/client.py", line 1328, in _send_request
I1125 12:35:33.103295 12 main.go:136] self.putheader(hdr, value)
I1125 12:35:33.103302 12 main.go:136] File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 219, in putheader
I1125 12:35:33.103365 12 main.go:136] _HTTPConnection.putheader(self, header, *values)
I1125 12:35:33.103373 12 main.go:136] File "/usr/local/lib/python3.6/http/client.py", line 1264, in putheader
I1125 12:35:33.103606 12 main.go:136] if _is_illegal_header_value(values[i]):
I1125 12:35:33.103616 12 main.go:136] TypeError: expected string or bytes-like object
F1125 12:35:33.669121 12 main.go:365] Failed to wait for worker container: Training container is failed. Unable to read file /var/log/katib/6.pid for pid 6, error: open /var/log/katib/6.pid: no such file or directory
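One plausible reading of the traceback above: `http.client` raises this `TypeError` from `putheader` when a header value is neither a string nor bytes, for example `None`. Which header is affected in the trial pod is not visible in the log; an unset authorization token is only an assumption here. The helper name `header_value_error` is hypothetical. A minimal sketch that reproduces the message with the standard library alone, without opening a network connection:

```python
import http.client
from typing import Optional


def header_value_error(value) -> Optional[str]:
    """Return the TypeError message http.client raises for this header value, or None."""
    # Creating the connection object and calling putrequest() only buffers
    # data; no socket is opened until endheaders() or getresponse() is called.
    conn = http.client.HTTPConnection("localhost")
    conn.putrequest("POST", "/")
    try:
        # "Authorization" is an illustrative header name (assumption).
        conn.putheader("Authorization", value)
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(header_value_error("Bearer abc"))  # a string value is accepted
    print(header_value_error(None))          # a None value triggers the TypeError
```

If this matches what happens in the trial container, the thing to look for is a header (possibly a token) that ends up as `None` before the request is sent.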
Thanks a lot :)!
@basti-j same issue here. Were you able to fix it? EDIT: I was able to solve it by adding the additional env variable and binding pipeline-runner to the correct cluster role, see kubeflow/katib#1454 (comment)
Hi, I have already added the additional env variables, created a new ServiceAccount named "pipeline-runner", and bound the "kubeflow-edit" cluster role to it.
But I still get the error "TypeError: expected string or bytes-like object".
Could you help me? Thanks a lot.
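A small check that might help narrow this down when run inside the failing pod: for each of the two variable names mentioned in this thread, it reports whether the variable is set and whether it points to a non-empty file. It makes no claim about which name a given Kale or KFP version actually reads; the function name `check_token_env` is hypothetical.

```python
import os
from typing import Dict, Mapping

# Variable names taken from this thread; which one a given SDK version
# actually reads is not confirmed here.
TOKEN_ENV_NAMES = ("KF_PIPELINES_SA_TOKEN_PATH", "ML_PIPELINE_SA_TOKEN_PATH")


def check_token_env(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Map each candidate variable name to a short status string."""
    report = {}
    for name in TOKEN_ENV_NAMES:
        path = environ.get(name)
        if not path:
            report[name] = "unset"
        elif not os.path.isfile(path):
            report[name] = "set, but file is missing"
        elif os.path.getsize(path) == 0:
            report[name] = "set, but file is empty"
        else:
            report[name] = "ok"
    return report


if __name__ == "__main__":
    for name, status in check_token_env().items():
        print(f"{name}: {status}")
```

Running it in both the notebook pod (where pipelines work) and a trial pod would show whether the two environments differ.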
Environment
KF: 1.3 (on prem, via ArgoFlow with rook ceph; without Rok)
Kale: 0.7.0
Katib: 0.11
Katib-Controller: gcr.io/arrikto/katib-controller/v0.11.1-8-gfab7fb06
Katib chocolate service: gcr.io/arrikto/suggestion-chocolate:v0.11.1-8-gfab7fb06
Setup PodDefault for accessing KF Pipelines (like here: https://www.kubeflow.org/docs/components/pipelines/sdk/connect-api/#multi-user-mode). I see examples with KF_PIPELINES_SA_TOKEN_PATH and others with ML_PIPELINE_SA_TOKEN_PATH. Is there a difference, and should I set ML_PIPELINE_SA_TOKEN_PATH instead?