
DaskKubernetesEnvironment not setting imagePullSecrets #3356

Closed

mgsnuno opened this issue Sep 21, 2020 · 13 comments


mgsnuno commented Sep 21, 2020

Description

When using DaskKubernetesEnvironment with Docker storage in a private registry on Azure, the deployment of the Prefect job/flow run container fails with:

Failed to pull image "<image_url>": rpc error: code = Unknown desc = Error response from daemon: Get <image_url>: unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information.

Looking into the YAML of the job submitted to the cluster (Pods -> Actions -> Edit), it is clear that the following field is missing:

imagePullSecrets:
    - name: regcred

I've created an image pull secret (regcred) and tried setting it both in the custom scheduler/worker YAML files and as an argument to DaskKubernetesEnvironment; both fail.
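
For reference, the regcred secret is a standard docker-registry secret; its shape looks roughly like this (a sketch, credentials elided):

apiVersion: v1
kind: Secret
metadata:
  name: regcred
type: kubernetes.io/dockerconfigjson
data:
  # base64-encoded Docker config with the registry credentials
  .dockerconfigjson: <base64-encoded Docker config>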

Expected Behavior

imagePullSecrets should be part of the YAML submitted by Prefect to run the flow.

Reproduction

from prefect import Flow, task
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import Docker


@task
def just_sleep():
    from time import sleep

    sleep(10)


with Flow("flow-test") as flow:
    _ = just_sleep()

flow.storage = Docker(
    registry_url="<azure_name>.azurecr.io",
    image_name="flow-test",
)
flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=1,
    image_pull_secret="regcred",
)

flow.register(project_name="pipelines")

Environment

{
  "config_overrides": {
    "backend": true
  },
  "env_vars": [],
  "system_information": {
    "platform": "Linux-5.4.0-1025-azure-x86_64-with-glibc2.10",
    "prefect_backend": "server",
    "prefect_version": "0.13.7",
    "python_version": "3.8.5"
  }
}

joshmeek commented Sep 21, 2020

@mgsnuno I think what might be happening here is the first prefect job (which deserializes / loads the environment) is not able to start up because it also does not have the image pull secrets. The Kubernetes agent loads the image pull secrets for its pods off of an environment variable which you can set like this:

prefect agent start kubernetes --env IMAGE_PULL_SECRETS=regcred
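
If the agent is deployed in-cluster instead, the equivalent would be an env entry on the agent's Deployment; a sketch of just that fragment (not the full generated manifest):

env:
  - name: IMAGE_PULL_SECRETS
    value: regcred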


mgsnuno commented Sep 21, 2020

@joshmeek exactly, it's that first prefect job that is not getting the image pull secrets. I did what you suggested but it still fails: as you can see in the YAML submitted by Prefect below (k8s dashboard -> Pods -> Actions -> Edit), there is still no imagePullSecrets:

containers:
    - name: flow
      image: <image_url_private_repo>
      command:
        - /bin/sh
        - '-c'
      args:
        - prefect execute flow-run
      env:
        - name: PREFECT__CLOUD__API
          value: 'http://localhost:4200'
        - name: PREFECT__CLOUD__AUTH_TOKEN
        - name: PREFECT__CONTEXT__FLOW_RUN_ID
          value: 2931d2f6-0746-4cd3-9781-0ca0f7a1c0a1
        - name: PREFECT__CONTEXT__FLOW_ID
          value: 17c7c87b-e952-4c94-9882-53915ebebc22
        - name: PREFECT__CONTEXT__NAMESPACE
          value: default
        - name: PREFECT__CLOUD__AGENT__LABELS
          value: '[]'
        - name: PREFECT__LOGGING__LOG_TO_CLOUD
          value: 'true'
        - name: PREFECT__LOGGING__LEVEL
          value: INFO
        - name: PREFECT__CLOUD__USE_LOCAL_SECRETS
          value: 'false'
        - name: PREFECT__ENGINE__FLOW_RUNNER__DEFAULT_CLASS
          value: prefect.engine.cloud.CloudFlowRunner
        - name: PREFECT__ENGINE__TASK_RUNNER__DEFAULT_CLASS
          value: prefect.engine.cloud.CloudTaskRunner
        - name: IMAGE_PULL_SECRETS
          value: regcred
      resources:
        limits:
          cpu: 100m
        requests:
          cpu: 100m
      volumeMounts:
        - name: default-token-7j4sl
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
  restartPolicy: Never
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  serviceAccountName: default
  serviceAccount: default

@joshmeek

Hmm, there could be some confusion here; I actually think setting that --env only works when installing the agent. Could you try doing this:

export IMAGE_PULL_SECRETS=regcred
prefect agent start kubernetes


mgsnuno commented Sep 22, 2020

@joshmeek that works for my locally running Prefect Kubernetes agent: the job pod is created successfully, but then fails later on with something I believe is related to RBAC permissions (see below).

This is the locally running Kubernetes agent command:

export IMAGE_PULL_SECRETS=regcred
prefect agent start kubernetes --api http://<prefect_server_url>:4200

Since having the Kubernetes agent installed in Kubernetes itself is a much more elegant solution, and can also solve the RBAC issue via the --rbac flag, I tried the following:

prefect agent install kubernetes --api http://<prefect_server_url>:4200 --rbac --image-pull-secrets=regcred --namespace pipelines | kubectl apply --namespace=pipelines -f -

The job pod gets created successfully and pulling the image from the private repo works, but then I still get the following:

[2020-09-22 09:42:15] CRITICAL - prefect.DaskKubernetesEnvironment | Failed to create Kubernetes job: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '86de4f49-3216-4c8e-b8de-8f9fe3be7577', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 22 Sep 2020 09:42:15 GMT', 'Content-Length': '285'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod in version \"v1\" cannot be handled as a Job: no kind \"Pod\" is registered for version \"batch/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"","reason":"BadRequest","code":400}
(400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '86de4f49-3216-4c8e-b8de-8f9fe3be7577', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 22 Sep 2020 09:42:15 GMT', 'Content-Length': '285'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod in version \"v1\" cannot be handled as a Job: no kind \"Pod\" is registered for version \"batch/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"","reason":"BadRequest","code":400}
Traceback (most recent call last):
  File "/opt/conda/bin/prefect", line 10, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/prefect/cli/execute.py", line 33, in flow_run
    return _execute_flow_run()
  File "/opt/conda/lib/python3.8/site-packages/prefect/cli/execute.py", line 93, in _execute_flow_run
    raise exc
  File "/opt/conda/lib/python3.8/site-packages/prefect/cli/execute.py", line 87, in _execute_flow_run
    environment.execute(flow)
  File "/home/nuno/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/environments/execution/dask/k8s.py", line 220, in execute
  File "/home/nuno/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/environments/execution/dask/k8s.py", line 215, in execute
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/api/batch_v1_api.py", line 58, in create_namespaced_job
    (data) = self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/api/batch_v1_api.py", line 135, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 172, in __call_api
    response_data = self.request(
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 382, in request
    return self.rest_client.POST(url,
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/rest.py", line 272, in POST
    return self.request("POST", url,
  File "/opt/conda/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '86de4f49-3216-4c8e-b8de-8f9fe3be7577', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 22 Sep 2020 09:42:15 GMT', 'Content-Length': '285'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod in version \"v1\" cannot be handled as a Job: no kind \"Pod\" is registered for version \"batch/v1\" in scheme \"k8s.io/kubernetes/pkg/api/legacyscheme/scheme.go:30\"","reason":"BadRequest","code":400}

Any ideas on how to proceed?

@joshmeek

@mgsnuno Haven't seen that one before! Looks like some weird pod <-> job mismatch that could be due to formatting. Could you try taking image_pull_secret="regcred" out of your DaskKubernetesEnvironment and see what happens? This will probably still fail, but I want to see if it fails with the same error.


mgsnuno commented Sep 22, 2020

I think I found something: since yesterday I had been experimenting with a DaskKubernetesEnvironment as follows:

flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=1,
    scheduler_spec_file=os.path.join(dirname, "pod_scheduler.yaml"),
    worker_spec_file=os.path.join(dirname, "pod_worker.yaml"),
    image_pull_secret="regcred",
)
where `pod_scheduler.yaml` is:
kind: Pod
metadata:
  labels:
    app: dask
spec:
  template:
    metadata:
      labels:
        app: prefect-dask-scheduler
    spec:
      replicas: 1
      restartPolicy: Always
      containers:
        - image: <private-repo-name>.azurecr.io/pipelines-dask:latest
          imagePullPolicy: Always
          name: dask-scheduler
          args: [dask-scheduler]
          resources:
            limits:
              cpu: "3"
              memory: 12G
            requests:
              cpu: "3"
              memory: 12G
      imagePullSecrets:
        - name: regcred
and `pod_worker.yaml` is:
kind: Pod
metadata:
  labels:
    app: dask
spec:
  replicas: 1
  restartPolicy: Always
  containers:
    - image: <private-repo-name>.azurecr.io/pipelines-dask:latest
      imagePullPolicy: Always
      name: dask-worker
      args:
        [
          dask-worker,
          --nthreads,
          "3",
          --memory-limit,
          12G,
          --death-timeout,
          "60",
        ]
      resources:
        limits:
          cpu: "3"
          memory: 12G
        requests:
          cpu: "3"
          memory: 12G
  imagePullSecrets:
    - name: regcred

This was causing the errors I posted above. I based these YAML files on what I saw in https://docs.prefect.io/orchestration/execution/dask_k8s_environment.html#examples

I removed those files, so I ran:

flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=1,
    image_pull_secret="regcred",
)

Initially it failed with a cannot find path /home/nuno/miniconda3/envs/pipelines/lib/python3.8/site-packages/prefect/environments/execution/dask/job.yaml error, presumably because the locally resolved default path gets serialized with the flow and doesn't exist inside the container.

So I copied the default job.yaml and worker_pod.yaml to the same folder where flow_test.py is and redeployed the flow as follows:

flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=1,
    scheduler_spec_file=os.path.join(dirname, "job.yaml"),
    worker_spec_file=os.path.join(dirname, "worker_pod.yaml"),
    image_pull_secret="regcred",
)

And it worked!

So, any pointers as to why the custom YAML files do not work? Thank you.


mgsnuno commented Sep 22, 2020

Comparing job.yaml with pod_scheduler.yaml, I found my errors; the scheduler spec needs:

apiVersion: batch/v1
kind: Job
....
    restartPolicy: Never

Now it goes further, until it stops again with the prefect-dask-job-.... job (the scheduler, I believe), which is spawned from prefect-job-...., failing to pull the image from the private repo because image_pull_secret gets lost.

Since I'm writing the YAML files myself, I can place imagePullSecrets in them, but what I think happens is that prefect-job-.... fails to forward either the IMAGE_PULL_SECRETS env var or imagePullSecrets to prefect-dask-job-.....

@joshmeek

Ah, that makes sense as to why it would fail: the scheduler spec is expected to be a Job, not a Pod. Basing your scheduler spec off the job.yaml in the repo is a good idea. Image pull secrets set via the env var are not forwarded to custom specs; that is intentional by design. You'll have to add imagePullSecrets directly to the spec.
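
For a Job-shaped spec, that means nesting it under spec.template.spec, alongside containers; a minimal sketch (container details elided):

apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: dask-scheduler
          # image, command, resources, etc.
      imagePullSecrets:
        - name: regcred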


mgsnuno commented Sep 22, 2020

OK, thanks a lot. I also found that in order for it to work, the custom scheduler command/args have to be:

command: ["/bin/sh", "-c"]
args:
  [
    'python -c "import prefect; prefect.environments.execution.load_and_run_flow()"',
  ]

I was trying args: [dask-scheduler] without success, presumably because Prefect's entrypoint is what starts the Dask cluster and runs the flow.

Which brings me to this pod_scheduler.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: prefect-dask-scheduler
  labels:
    app: dask
spec:
  template:
    metadata:
      labels:
        app: prefect-dask-scheduler
    spec:
      replicas: 1
      restartPolicy: Never
      containers:
        - image: <private_repo_name>.azurecr.io/pipelines-dask:latest
          imagePullPolicy: Always
          name: dask-scheduler
          command: ["/bin/sh", "-c"]
          args:
            [
              'python -c "import prefect; prefect.environments.execution.load_and_run_flow()"',
            ]
          resources:
            limits:
              cpu: "3"
              memory: 12G
            requests:
              cpu: "3"
              memory: 12G
      imagePullSecrets:
        - name: regcred

Another question: how can I expose the scheduler address in order to have access to the dask dashboard?

When spawning the cluster myself using dask-kubernetes I was setting dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})


mgsnuno commented Sep 23, 2020

@joshmeek it would be great if you could help with the following:

Another question: how can I expose the scheduler address in order to have access to the dask dashboard?

When spawning the cluster myself using dask-kubernetes I was setting dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})


joshmeek commented Sep 23, 2020

@mgsnuno I haven't attempted that with this environment before, so I can't say for certain how to do it. Looking at the config you set, I wonder if you can set an env var like this on your scheduler pod:

env:
  - name: DASK_KUBERNETES__SCHEDULER_SERVICE_TYPE
    value: LoadBalancer


mgsnuno commented Sep 30, 2020

@joshmeek I tried the following:

env:
  - name: DASK_KUBERNETES__SCHEDULER_SERVICE_TYPE
    value: LoadBalancer
  - name: DASK__DISTRIBUTED__COMM__TIMEOUTS__CONNECT
    value: "200"
  - name: DASK__KUBERNETES__DEPLOY_MODE
    value: remote

It didn't work: no LoadBalancer service gets created. See dask/dask-kubernetes#259 (comment) for reference.

I can open a separate enhancement issue for this, to expose the dask dashboard.
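
In the meantime, a manually created Service might work as a workaround. An untested sketch, assuming the scheduler pod keeps the app: prefect-dask-scheduler label from my spec and the dashboard listens on Dask's default port 8787 (the Service name is just a placeholder):

apiVersion: v1
kind: Service
metadata:
  name: prefect-dask-scheduler-dashboard
spec:
  type: LoadBalancer
  # must match the labels on the scheduler pod template
  selector:
    app: prefect-dask-scheduler
  ports:
    - name: dashboard
      port: 8787
      targetPort: 8787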


mgsnuno commented Sep 30, 2020

Working solution:

  1. Install the Kubernetes agent as follows (note the pipe into kubectl):
prefect agent install kubernetes \
--api http://<remote_server_url_OR_localhost>:4200 \
--rbac \
--image-pull-secrets=regcred \
--namespace pipelines |
kubectl apply --namespace=pipelines -f -
  2. Set the flow environment with custom scheduler/worker YAML files:
flow.environment = DaskKubernetesEnvironment(
    min_workers=1,
    max_workers=1,
    scheduler_spec_file="pod_scheduler.yaml",
    worker_spec_file="pod_worker.yaml",
    image_pull_secret="regcred",
)
  3. Templates of the YAML files to use (image is not necessary to include, as it will be added automatically by Prefect when deploying the flow):
pod_scheduler.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: prefect-dask-scheduler
  labels:
    app: prefect
spec:
  template:
    metadata:
      labels:
        app: prefect-dask-scheduler
    spec:
      restartPolicy: Never
      containers:
        - name: prefect-dask-scheduler
          imagePullPolicy: Always
          command: ["/bin/sh", "-c"]
          args:
            [
              'python -c "import prefect; prefect.environments.execution.load_and_run_flow()"',
            ]
          resources:
            limits:
              cpu: "3"
              memory: 12G
            requests:
              cpu: "3"
              memory: 12G
      imagePullSecrets:
        - name: regcred
pod_worker.yaml
kind: Pod
metadata:
  labels:
    app: prefect
spec:
  restartPolicy: Never
  containers:
    - name: prefect-dask-worker
      imagePullPolicy: Always
      args:
        [
          dask-worker,
          --nthreads,
          "3",
          --memory-limit,
          12G,
          --death-timeout,
          "60",
        ]
      resources:
        limits:
          cpu: "3"
          memory: 12G
        requests:
          cpu: "3"
          memory: 12G
  imagePullSecrets:
    - name: regcred

mgsnuno closed this as completed Sep 30, 2020