Sidecars in Worker Template Cause Failure #378

Closed
ehenry2 opened this issue Dec 15, 2021 · 5 comments


ehenry2 commented Dec 15, 2021

What happened:
Cluster failed to create when using a worker template that contains multiple containers (e.g. sidecar pattern).

The error:

kubernetes_asyncio.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 15 Dec 2021 20:09:31 GMT', 'Content-Length': '212')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"a container name must be specified for pod dask-root-46c95e02-0vgngv, choose one of: [dask model]","reason":"BadRequest","code":400}

I traced this back to the logs() function in the Pod class in core.py. It makes a call to read_namespaced_pod_log, which, when the pod has multiple containers, needs a container= argument with the name of the container to read from.
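For illustration, something along these lines (the helper name and wiring are made up for this sketch, not the actual core.py code; only the kubernetes_asyncio call itself is real):

from kubernetes_asyncio import client

async def read_worker_logs(core_api: client.CoreV1Api, pod_name: str, namespace: str) -> str:
    # Without container=..., the API server returns the 400 above whenever the
    # pod has more than one container; naming the worker container avoids it.
    return await core_api.read_namespaced_pod_log(
        pod_name, namespace, container="dask"  # worker container name from the spec below
    )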

What you expected to happen:
The cluster to be created correctly. I expected the logs() function to be smart enough to know which container is the dask container, or to iterate through the containers until it recognized the logs it was looking for.

Minimal Complete Verifiable Example:

from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml(spec_path)
cluster.scale(4)

And the worker-spec.yaml file:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: dev.local/batch-app:0.0.4
    imagePullPolicy: IfNotPresent
    args: [/app/venv/bin/dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: "1"
        memory: 2G
  - name: model
    image: dev.local/ml-model-prototype:0.0.1
    env:
    - name: SERVER_HOST
      value: "0.0.0.0"
    - name: SERVER_PORT
      value: "8989"

Anything else we need to know?:

Environment:

  • Dask version: 2021.12.0
  • Python version: 3.9.5
  • Operating System: Linux
  • Install method (conda, pip, source): pip
jacobtomlinson (Member) commented

Thanks for raising this!

By default we name the worker container dask-worker, so perhaps setting a default of container="dask-worker" would be good, as it wouldn't break backward compatibility. That way, if you want to have sidecar containers you just need to ensure the worker container is named correctly.
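Roughly, the change might look something like this (a sketch only; the attribute names here are assumptions, not the real core.py code):

async def logs(self, container="dask-worker"):
    # With a default of "dask-worker", existing single-container pods keep
    # working, and sidecar setups only need to name the worker container this.
    return await self.core_api.read_namespaced_pod_log(
        self.name, self.namespace, container=container
    )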

Do you have any interest in raising a PR to implement this?


ehenry2 commented Dec 16, 2021

Hey, thanks for taking a look at this @jacobtomlinson. I'm definitely interested in raising a PR. The other approach I was thinking of was to always select the first container, which would be 100% backwards compatible because no one can currently run a cluster with more than one container; that way, if people overrode the container name in their worker-spec.yaml, their deployments would continue to work. I can do it either way, just let me know.

The only complication is that I have a Mac with the M1 chip (ARM), so the test suite doesn't appear to be working (the dask dev docker container won't run on ARM, it seems), and I might have some issues with the regression testing, but I'll see how far I can get.
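As a sketch of that alternative (function and variable names are just for illustration; the container name is looked up from the pod spec, so single-container worker specs keep working unchanged):

from kubernetes_asyncio import client

async def first_container_logs(core_api: client.CoreV1Api, pod_name: str, namespace: str) -> str:
    pod = await core_api.read_namespaced_pod(pod_name, namespace)
    # Pick whichever container is listed first in the spec, regardless of its name.
    first_container = pod.spec.containers[0].name
    return await core_api.read_namespaced_pod_log(
        pod_name, namespace, container=first_container
    )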

jacobtomlinson (Member) commented

That would work too, as long as the Dask container is always the first one. I have no preference really, as long as it's documented.

Ah yeah I also have an M1 Mac and Docker is just not possible. I tend to SSH to an amd64 machine. The CI should run the tests when you push up a PR so feel free to open a draft PR and keep pushing commits to it to trigger the CI. I'll squash on merge anyway.

I am curious what your use case is for running multiple containers inside one pod with dask-kubernetes. This is the first time that request has come up.


ehenry2 commented Dec 16, 2021

Perfect, thanks for the advice - that should work! Our use case here is to run machine learning models in batch. We build the ML models as a Docker container and use Apache Arrow Flight as the external interface, so we can use the same container efficiently for both real-time and batch use cases as a sidecar. It's kind of a more developed version of what I posted here: https://github.com/ehenry2/xgbatch

jacobtomlinson (Member) commented

The classic KubeCluster was removed in #890. All users will need to migrate to the Dask Operator. Closing.

jacobtomlinson closed this as not planned on Apr 30, 2024