Sidecars in Worker Template Cause Failure #378

Closed
ehenry2 opened this issue Dec 15, 2021 · 5 comments


ehenry2 commented Dec 15, 2021

What happened:
Cluster failed to create when using a worker template that contains multiple containers (e.g. sidecar pattern).

The error:

kubernetes_asyncio.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: <CIMultiDictProxy('Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Wed, 15 Dec 2021 20:09:31 GMT', 'Content-Length': '212')>
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"a container name must be specified for pod dask-root-46c95e02-0vgngv, choose one of: [dask model]","reason":"BadRequest","code":400}

I traced this back to the logs() function in the Pod class in core.py. It makes a call to read_namespaced_pod_log, which, when the pod has multiple containers, needs a container= argument with the name of the container to read from.
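For illustration, something along these lines (the helper name and wiring are made up for this sketch, not the actual core.py code; only the kubernetes_asyncio call itself is real):

from kubernetes_asyncio import client

async def read_worker_logs(core_api: client.CoreV1Api, pod_name: str, namespace: str) -> str:
    # Without container=..., the API server returns the 400 above whenever the
    # pod has more than one container; naming the worker container avoids it.
    return await core_api.read_namespaced_pod_log(
        pod_name, namespace, container="dask"  # worker container name from the spec below
    )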

What you expected to happen:
The cluster to be created correctly. I expected the logs() function to be smart enough to know which container is the dask container, or to iterate through the containers until it recognized the logs it was looking for.

Minimal Complete Verifiable Example:

from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_yaml(spec_path)
cluster.scale(4)

And the worker-spec.yaml file:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: dev.local/batch-app:0.0.4
    imagePullPolicy: IfNotPresent
    args: [/app/venv/bin/dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
    name: dask
    resources:
      limits:
        cpu: "1"
        memory: 2G
      requests:
        cpu: "1"
        memory: 2G
  - name: model
    image: dev.local/ml-model-prototype:0.0.1
    env:
    - name: SERVER_HOST
      value: "0.0.0.0"
    - name: SERVER_PORT
      value: "8989"

Anything else we need to know?:

Environment:

  • Dask version: 2021.12.0
  • Python version: 3.9.5
  • Operating System: Linux
  • Install method (conda, pip, source): pip
jacobtomlinson (Member) commented

Thanks for raising this!

By default we name the worker container dask-worker, so perhaps setting a default of container="dask-worker" would be good, as it wouldn't break backward compatibility. That way, if you want to have sidecar containers you just need to ensure the worker container is named correctly.
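Roughly, the change might look something like this (a sketch only; the attribute names here are assumptions, not the real core.py code):

async def logs(self, container="dask-worker"):
    # With a default of "dask-worker", existing single-container pods keep
    # working, and sidecar setups only need to name the worker container this.
    return await self.core_api.read_namespaced_pod_log(
        self.name, self.namespace, container=container
    )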

Do you have any interest in raising a PR to implement this?


ehenry2 commented Dec 16, 2021

Hey, thanks for taking a look at this @jacobtomlinson. I'm definitely interested in raising a PR. The other approach I was thinking of was to always select the first container, which would be 100% backwards compatible because no one can currently run a cluster with more than one container; that way, if people overrode the container name in their worker-spec.yaml, their deployments would continue to work. I can do it either way, just let me know.

The only complication is that I have a Mac with the M1 chip (ARM), so the test suite doesn't appear to be working (the dask dev docker container won't run on ARM, it seems), and I might have some issues with the regression testing, but I'll see how far I can get.
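As a sketch of that alternative (function and variable names are just for illustration; the container name is looked up from the pod spec, so single-container worker specs keep working unchanged):

from kubernetes_asyncio import client

async def first_container_logs(core_api: client.CoreV1Api, pod_name: str, namespace: str) -> str:
    pod = await core_api.read_namespaced_pod(pod_name, namespace)
    # Pick whichever container is listed first in the spec, regardless of its name.
    first_container = pod.spec.containers[0].name
    return await core_api.read_namespaced_pod_log(
        pod_name, namespace, container=first_container
    )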

jacobtomlinson (Member) commented

That would work too, as long as the Dask container is always the first one. I have no preference really, as long as it's documented.

Ah yeah I also have an M1 Mac and Docker is just not possible. I tend to SSH to an amd64 machine. The CI should run the tests when you push up a PR so feel free to open a draft PR and keep pushing commits to it to trigger the CI. I'll squash on merge anyway.

I am curious what your use case is for running multiple containers inside one pod with dask-kubernetes. This is the first time that request has come up.


ehenry2 commented Dec 16, 2021

Perfect, thanks for the advice - that should work! Our use case here is to run machine learning models in batch. We build the ML models as a Docker container and use Apache Arrow Flight as the external interface, so we can use the same container efficiently for both real-time and batch use cases as a sidecar. It's kind of a more developed version of what I posted here: https://github.com/ehenry2/xgbatch

jacobtomlinson (Member) commented

The classic KubeCluster was removed in #890. All users will need to migrate to the Dask Operator. Closing.

jacobtomlinson closed this as not planned on Apr 30, 2024