kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name cvmfs.csi.cern.ch not found in the list of registered CSI drivers #496

Open
Truongphikt opened this issue Aug 11, 2024 · 4 comments


Truongphikt commented Aug 11, 2024

Hi galaxy-helm team,

My goal is to deploy Galaxy on GKE. After working through #493, I ran into a problem that appears to be related to CSI drivers.

Behaviour

All storage is fine, but some workloads are stuck in the Pending stage and others show the error "Does not have minimum availability" indefinitely.

(screenshots: Workloads and Storage views)

In every failing workload, the pod is stuck at the Init stage.

(screenshot: workload detail view)

Looking into a failing pod

I then described one of the failing pods:

$ kubectl describe pods my-galaxy-release-web-8df9fc56b-wfh8q
Name:             my-galaxy-release-web-8df9fc56b-wfh8q
Namespace:        default
Priority:         0
Service Account:  my-galaxy-release
Node:             gke-galaxy-cluster-default-pool-aada09c7-ktvt/10.150.0.12
Start Time:       Sun, 11 Aug 2024 03:25:36 +0000
Labels:           app.kubernetes.io/component=galaxy-web-handler
                  app.kubernetes.io/instance=my-galaxy-release
                  app.kubernetes.io/name=galaxy
                  pod-template-hash=8df9fc56b
Annotations:      checksum/galaxy_conf: 28bf33924622f4c62fc23e4cb0579231c491df17f7bc086b532a6a9fc0648859
                  checksum/galaxy_extras: 1cb6d207de441e5ed402124756fa900166791afb5986fd986c6056baa44e26ca
                  checksum/galaxy_rules: 4e8361a62fb4b616e92fadf2fb8be8147d402a66bbfdc760913519f96e2cbe5c
                  cloud.google.com/cluster_autoscaler_unhelpable_since: 2024-08-11T03:25:08+0000
                  cloud.google.com/cluster_autoscaler_unhelpable_until: Inf
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    ReplicaSet/my-galaxy-release-web-8df9fc56b
Init Containers:
  galaxy-wait-db:
    Container ID:  
    Image:         quay.io/galaxyproject/galaxy-min:24.1.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      sh
      -c
      until [ -f /galaxy/server/config/mutable/db_init_done_1 ]; do echo "waiting for DB initialization"; sleep 1; done; until timeout 1 bash -c "echo > /dev/tcp/my-galaxy-release-rabbitmq-server/5672"; do echo "waiting for rabbitmq service"; sleep 1; done; until [ -f /galaxy/server/config/mutable/init_mounts_done_1 ]; do echo "waiting for copying onto NFS"; sleep 1; done; until [ -f /galaxy/server/config/mutable/init_clone_done_1 ]; do echo "waiting for refdata copying"; sleep 1; done; echo "Initialization waits complete"; sleep 0;
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /galaxy/server/config/mutable/ from galaxy-data (rw,path="config")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ptpfx (ro)
Containers:
  galaxy-web:
    Container ID:  
    Image:         quay.io/galaxyproject/galaxy-min:24.1.1
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Args:
      sh
      -c
      /galaxy/server/.venv/bin/gunicorn "galaxy.webapps.galaxy.fast_factory:factory()" --timeout 300 --pythonpath /galaxy/server/lib -k galaxy.webapps.galaxy.workers.Worker -b 0.0.0.0:8080 --workers=1 --config python:galaxy.web_stack.gunicorn_config --preload 
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                3
      ephemeral-storage:  10Gi
      memory:             7G
    Requests:
      cpu:                100m
      ephemeral-storage:  1Gi
      memory:             1G
    Liveness:             http-get http://:8080/galaxy/api/version delay=0s timeout=30s period=10s #success=1 #failure=30
    Readiness:            http-get http://:8080/galaxy/api/version delay=0s timeout=12s period=10s #success=1 #failure=12
    Startup:              http-get http://:8080/galaxy/api/version delay=30s timeout=80s period=5s #success=1 #failure=80
    Environment:
      GALAXY_DB_USER_PASSWORD:                          <set to the key 'password' in secret 'galaxydbuser.galaxy-my-galaxy-release-postgres.credentials.postgresql.acid.zalan.do'>  Optional: false
      GALAXY_CONFIG_OVERRIDE_DATABASE_CONNECTION:       postgresql://galaxydbuser:$(GALAXY_DB_USER_PASSWORD)@galaxy-my-galaxy-release-postgres/galaxy?sslmode=require
      GALAXY_CONFIG_OVERRIDE_ID_SECRET:                 <set to the key 'galaxy-config-id-secret' in secret 'my-galaxy-release-galaxy-secrets'>  Optional: false
      PYTHONPATH:                                       /galaxy/server/lib
      GALAXY_CONFIG_FILE:                               /galaxy/server/config/galaxy.yml
      GALAXY_RABBITMQ_USERNAME:                         <set to the key 'username' in secret 'my-galaxy-release-rabbitmq-server-default-user'>  Optional: false
      GALAXY_RABBITMQ_PASSWORD:                         <set to the key 'password' in secret 'my-galaxy-release-rabbitmq-server-default-user'>  Optional: false
      GALAXY_CONFIG_OVERRIDE_AMQP_INTERNAL_CONNECTION:  amqp://$(GALAXY_RABBITMQ_USERNAME):$(GALAXY_RABBITMQ_PASSWORD)@my-galaxy-release-rabbitmq-server:5672
    Mounts:
      /cvmfs/cloud.galaxyproject.org from galaxy-data (rw,path="cvmfsclone")
      /cvmfs/data.galaxyproject.org from refdata-gxy (rw,path="data.galaxyproject.org")
      /galaxy/server/config/build_sites.yml from galaxy-conf-files (rw,path="build_sites.yml")
      /galaxy/server/config/container_resolvers_conf.xml from galaxy-conf-files (rw,path="container_resolvers_conf.xml")
      /galaxy/server/config/galaxy.yml from galaxy-conf-files (rw,path="galaxy.yml")
      /galaxy/server/config/integrated_tool_panel.xml from galaxy-conf-files (rw,path="integrated_tool_panel.xml")
      /galaxy/server/config/job_conf.yml from galaxy-conf-files (rw,path="job_conf.yml")
      /galaxy/server/config/mutable/ from galaxy-data (rw,path="config")
      /galaxy/server/config/sanitize_allowlist.txt from galaxy-conf-files (rw,path="sanitize_allowlist.txt")
      /galaxy/server/config/tool_conf.xml from galaxy-conf-files (rw,path="tool_conf.xml")
      /galaxy/server/config/workflow_schedulers_conf.xml from galaxy-conf-files (rw,path="workflow_schedulers_conf.xml")
      /galaxy/server/database from galaxy-data (rw)
      /galaxy/server/lib/galaxy/jobs/rules/tpv_rules_local.yml from galaxy-job-rules (rw,path="tpv_rules_local.yml")
      /galaxy/server/static/welcome.html from extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8 (rw,path="welcome.html")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ptpfx (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  galaxy-conf-files:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-configs
    Optional:  false
  extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-extra-welcomehtml-ee3410714399628f55d8b0fbdbcc0b1ab19c965ad38e8
    Optional:  false
  galaxy-job-rules:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      my-galaxy-release-job-rules
    Optional:  false
  galaxy-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  my-galaxy-release-galaxy-pvc
    ReadOnly:   false
  refdata-gxy:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  my-galaxy-release-refdata-gxy-pvc
    ReadOnly:   false
  kube-api-access-ptpfx:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Normal   NotTriggerScaleUp  55m                cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   55m (x5 over 55m)  default-scheduler   0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Normal   Scheduled          55m                default-scheduler   Successfully assigned default/my-galaxy-release-web-8df9fc56b-wfh8q to gke-galaxy-cluster-default-pool-aada09c7-ktvt
  Warning  FailedMount        21s (x3 over 55m)  kubelet             MountVolume.MountDevice failed for volume "pvc-8179672f-3188-4fde-a9e7-d22c9e2719d7" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name cvmfs.csi.cern.ch not found in the list of registered CSI drivers

As far as I can see, the actual error is attacher.MountDevice failed to create newCsiDriverClient: driver name cvmfs.csi.cern.ch not found in the list of registered CSI drivers.

I also checked and confirmed that the CVMFS CSI driver is not registered on my cluster.
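For reference, the registered drivers can be listed with standard kubectl commands (nothing project-specific here):

$ kubectl get csidrivers
$ kubectl get csinodes -o custom-columns=NODE:.metadata.name,DRIVERS:.spec.drivers[*].name

cvmfs.csi.cern.ch does not appear in either list.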

My question

So my question is: based on the information above, is the missing CSI driver the cause of this error? If so, how do I install the CSI driver "properly"? Thank you so much for the amazing platform and the enthusiastic support.

Truongphikt changed the title MountVolume.MountDevice failed for volume "pvc-8179672f-3188-4fde-a9e7-d22c9e2719d7" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name cvmfs.csi.cern.ch not found in the list of registered CSI drivers kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name cvmfs.csi.cern.ch not found in the list of registered CSI drivers Aug 11, 2024

ksuderman commented Aug 11, 2024

The only thing you should have to do to install the CVMFS CSI driver is set the following in your values.yaml file:

cvmfs:
  enabled: true
  deploy: true

I assume you have that as it looks like CVMFS has been deployed.
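After changing values.yaml, the release needs to be re-applied. A minimal sketch, assuming the chart was installed from the cloudve chart repository per the README (the release name is taken from your pod names above):

$ helm repo update
$ helm upgrade my-galaxy-release cloudve/galaxy -f values.yaml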

However, I see that none of the cvmfscsi-nodeplugin pods are in the Ready state, and I suspect it is a problem with the name of the alien cache. Can you look in the logs for the cvmfscsi-nodeplugin and see what it is complaining about? If the logs mention that it can't find the alien cache, you can add the following to your values.yaml file:

cvmfs:
  enabled: true
  deploy: true
  cvmfscsi:
    cache:
      alien:
        pvc:
          name: cvmfs-alien-cache

See: #437
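If you are not sure where those pods live, a hedged sketch for locating them and pulling their logs (the grep pattern is an assumption, not an exact resource name):

$ kubectl get pods -A | grep nodeplugin
$ kubectl logs -n <namespace> <nodeplugin-pod> --all-containers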

Truongphikt commented:

@ksuderman Thanks for the information. I checked the cvmfscsi-nodeplugin workload, but there are no pods running there, so I can't provide its logs.

(screenshots: cvmfscsi-nodeplugin workload with no pods)

ksuderman commented:

Is there anything when you click the container logs link? Since the status shows as OK, I assume the startupProbe and livenessProbe are passing and just the readinessProbe is failing, resulting in 0/3 pods being ready.

Did you try setting the alien cache name?
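A quick way to see which probe is failing is to pull the pod's events (standard kubectl; the pod name is a placeholder):

$ kubectl get events -n <namespace> --field-selector involvedObject.name=<nodeplugin-pod>
$ kubectl describe pod <nodeplugin-pod> -n <namespace> | grep -A3 Readiness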


Truongphikt commented Aug 11, 2024

After re-deploying Galaxy and updating values.yml as you recommended, the error unfortunately remains. I also checked the container logs and they are empty.

(screenshots: Container logs and Audit logs, both empty)

Status after setting the `alien cache name` (I downscaled the node count from 3 to 2):

(screenshots: Workload and Storage views)

values.zip
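For completeness, this is how I checked whether the nodeplugin DaemonSet had scheduled any pods (the DaemonSet name is my guess from the workload view; adjust as needed):

$ kubectl get daemonsets -A | grep cvmfs
$ kubectl describe daemonset -n default my-galaxy-release-cvmfscsi-nodeplugin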
