
Helm chart 4.0.0-beta.2 cannot find existing pvc for user home dir. #3536

Closed
dragz opened this issue Oct 3, 2024 · 8 comments · Fixed by #3537

dragz commented Oct 3, 2024

Bug description

Just ran an upgrade from version 3.3.7 and found that spawning fails because it cannot find the user's home directory PVC.

How to reproduce

  1. Set up a cluster with 3.3.7 and get some users into it.
  2. Update to helm chart v4.0.0-beta.2
  3. Log in as a user
  4. Spawn fails with this message:
Spawn failed: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!

Event log
Server requested
2024-10-03T10:50:38.416759Z [Warning] 0/1 nodes are available: persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
2024-10-03T10:50:47Z [Normal] pod didn't trigger scale-up: 2 persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found
Spawn failed: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!

Expected behaviour

It should look for a PVC named claim-abc000-40uit-2eno, which was created before the upgrade.

Actual behaviour

It tries to find a PVC named claim-abc000-uit-no---1ea1e26d, which does not exist.

Your personal set up

Rather straightforward setup with Azure Kubernetes Service using zero-to-jupyterhub, with autoscaling and EntraID for authentication.

Configuration
# jupyterhub_config.py
Logs
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
Loading extra config: checkvolumes
[I 2024-10-03 10:50:13.022 JupyterHub app:3352] Running JupyterHub version 5.2.0
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Authenticator: oauthenticator.azuread.AzureAdOAuthenticator-17.0.0
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Spawner: kubespawner.spawner.KubeSpawner-7.0.0b2
[I 2024-10-03 10:50:13.023 JupyterHub app:3382] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-5.2.0
[I 2024-10-03 10:50:13.052 JupyterHub dbutil:129] Upgrading sqlite:///jupyterhub.sqlite
[I 2024-10-03 10:50:13.052 JupyterHub dbutil:100] Backing up jupyterhub.sqlite => jupyterhub.sqlite.2024-10-03-105013
[I 2024-10-03 10:50:14.117 alembic.runtime.migration migration:215] Context impl SQLiteImpl.
[I 2024-10-03 10:50:14.117 alembic.runtime.migration migration:218] Will assume non-transactional DDL.
[I 2024-10-03 10:50:14.127 alembic.runtime.migration migration:623] Running upgrade 0eee8c825d24 -> 3c2384c5aae1, Add from_config column to the services table
[I 2024-10-03 10:50:14.174 alembic.runtime.migration migration:623] Running upgrade 3c2384c5aae1 -> 4621fec11365, manage roles
[I 2024-10-03 10:50:14.586 JupyterHub roles:210] Role attribute admin.scopes has been changed
[I 2024-10-03 10:50:14.696 JupyterHub app:2925] Creating service jupyterhub-idle-culler without oauth.
[I 2024-10-03 10:50:14.738 JupyterHub roles:281] Adding role admin for User: [email protected]
[I 2024-10-03 10:50:14.786 JupyterHub app:3422] Initialized 0 spawners in 0.014 seconds
[I 2024-10-03 10:50:14.793 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2024-10-03 10:50:14.794 JupyterHub metrics:373] Found 0 active users in the last ActiveUserPeriods.seven_days
[I 2024-10-03 10:50:14.795 JupyterHub metrics:373] Found 1 active users in the last ActiveUserPeriods.thirty_days
[I 2024-10-03 10:50:14.795 JupyterHub app:3702] Not starting proxy
[I 2024-10-03 10:50:14.804 JupyterHub app:3738] Hub API listening on http://:8081/hub/
[I 2024-10-03 10:50:14.804 JupyterHub app:3740] Private Hub API connect url http://hub:8081/hub/
[I 2024-10-03 10:50:14.805 JupyterHub app:3621] Starting managed service jupyterhub-idle-culler
[I 2024-10-03 10:50:14.805 JupyterHub service:423] Starting service 'jupyterhub-idle-culler': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10']
[I 2024-10-03 10:50:14.807 JupyterHub service:136] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10
[I 2024-10-03 10:50:14.810 JupyterHub proxy:477] Adding route for Hub: / => http://hub:8081
[I 2024-10-03 10:50:14.817 JupyterHub app:3781] JupyterHub is now running, internal Hub API at http://hub:8081/hub/
[I 2024-10-03 10:50:15.166 JupyterHub log:192] 200 GET /hub/api/ (jupyterhub-idle-culler@::1) 27.87ms
[I 2024-10-03 10:50:15.188 JupyterHub log:192] 200 GET /hub/api/users?state=[secret] (jupyterhub-idle-culler@::1) 18.79ms
[I 2024-10-03 10:50:26.307 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.87ms
[I 2024-10-03 10:50:28.890 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.81ms
[I 2024-10-03 10:50:30.204 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 1.00ms
[W 2024-10-03 10:50:30.305 JupyterHub spawner:2546] Loading state for [email protected]/ from unknown prior version of kubespawner (likely 6.x), will attempt to upgrade.
[I 2024-10-03 10:50:30.306 JupyterHub log:192] 302 GET /hub/ -> /hub/spawn ([email protected]@::ffff:10.244.1.1) 40.00ms
[I 2024-10-03 10:50:30.413 JupyterHub _xsrf_utils:125] Setting new xsrf cookie for b'KVMkVzPaApnZAJdX5_Z_TUdiGiiyYJiT3q211Kuclgw=:cf609962c22f43ed8cb6ad6bafc27d0c' {'path': '/hub/'}
[I 2024-10-03 10:50:30.469 JupyterHub log:192] 200 GET /hub/spawn ([email protected]@::ffff:10.244.1.1) 87.21ms
[I 2024-10-03 10:50:38.131 JupyterHub provider:661] Creating oauth client jupyterhub-user-abc000%40uit.no
[I 2024-10-03 10:50:38.196 JupyterHub <string>:7] check_pvcs: Doing a PVC check...
[I 2024-10-03 10:50:38.198 JupyterHub log:192] 302 POST /hub/spawn?_xsrf=[secret] -> /hub/spawn-pending/[email protected]?_xsrf=[secret] ([email protected]@::ffff:10.244.1.1) 254.35ms
[I 2024-10-03 10:50:38.248 JupyterHub <string>:11] check_pvcs: existing_pvc-names ['claim-bto004-40uit-2eno', 'claim-rba004-40uit-2eno', 'claim-abc000-40uit-2eno', 'hub-db-dir', 'private-datadir-abc000-40uit-2eno']
[I 2024-10-03 10:50:38.249 JupyterHub <string>:16] check_pvcs: claimname claim-abc000-uit-no---1ea1e26d
[I 2024-10-03 10:50:38.249 JupyterHub <string>:17] check_pvcs: volume {'name': 'volume-{username}{servername}', 'persistentVolumeClaim': {'claimName': 'claim-{username}{servername}'}}
[I 2024-10-03 10:50:38.250 JupyterHub <string>:16] check_pvcs: claimname private-datadir-abc000-uit-no---1ea1e26d
[I 2024-10-03 10:50:38.250 JupyterHub <string>:17] check_pvcs: volume {'name': 'user-datadir', 'persistentVolumeClaim': {'claimName': 'private-datadir-{username}'}}
[I 2024-10-03 10:50:38.250 JupyterHub <string>:20] check_pvcs: Removing nonexistant PVC {'name': 'user-datadir', 'persistentVolumeClaim': {'claimName': 'private-datadir-{username}'}}
[I 2024-10-03 10:50:38.251 JupyterHub <string>:29] check_pvcs: Nonexisting PVC, removing mountpath {'mountPath': '/home/datadir', 'name': 'user-datadir'}
[I 2024-10-03 10:50:38.263 JupyterHub pages:397] [email protected] is pending spawn
[I 2024-10-03 10:50:38.268 JupyterHub log:192] 200 GET /hub/spawn-pending/[email protected]?_xsrf=[secret] ([email protected]@::ffff:10.244.1.1) 11.19ms
[I 2024-10-03 10:50:38.270 JupyterHub reflector:297] watching for pods with label selector='component=singleuser-server' in namespace jupyter
[I 2024-10-03 10:50:38.304 JupyterHub reflector:297] watching for events with field selector='involvedObject.kind=Pod' in namespace jupyter
[W 2024-10-03 10:50:38.346 JupyterHub spawner:3138] Using legacy pvc claim-abc000-40uit-2eno for [email protected]
[I 2024-10-03 10:50:38.347 JupyterHub spawner:2940] Attempting to create pvc claim-abc000-40uit-2eno, with timeout 3
[I 2024-10-03 10:50:38.367 JupyterHub spawner:2956] PVC claim-abc000-40uit-2eno already exists, so did not create new pvc.
[I 2024-10-03 10:50:38.370 JupyterHub spawner:2899] Attempting to create pod jupyter-abc000-uit-no---1ea1e26d, with timeout 3
[I 2024-10-03 10:50:56.307 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.88ms
[I 2024-10-03 10:50:58.892 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.87ms
[I 2024-10-03 10:51:26.306 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.70ms
[I 2024-10-03 10:51:28.891 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:51:56.313 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.80ms
[I 2024-10-03 10:51:58.892 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 1.12ms
[I 2024-10-03 10:52:26.311 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.72ms
[I 2024-10-03 10:52:28.890 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.71ms
[I 2024-10-03 10:52:56.311 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.91ms
[I 2024-10-03 10:52:58.897 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.85ms
[I 2024-10-03 10:53:26.314 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.91ms
[I 2024-10-03 10:53:28.891 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.81ms
[I 2024-10-03 10:53:56.316 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.84ms
[I 2024-10-03 10:53:58.893 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:54:26.319 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.95ms
[I 2024-10-03 10:54:28.894 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.75ms
[I 2024-10-03 10:54:56.318 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.74ms
[I 2024-10-03 10:54:58.903 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.82ms
[W 2024-10-03 10:55:25.628 JupyterHub user:999] [email protected]'s server failed to start in 300 seconds, giving up.
    
    Common causes of this timeout, and debugging tips:
    
    1. Everything is working, but it took too long.
       To fix: increase `Spawner.start_timeout` configuration
       to a number of seconds that is enough for spawners to finish starting.
    2. The server didn't finish starting,
       or it crashed due to a configuration issue.
       Check the single-user server's logs for hints at what needs fixing.
    
[I 2024-10-03 10:55:25.628 JupyterHub spawner:3291] Deleting pod jupyter/jupyter-abc000-uit-no---1ea1e26d
[E 2024-10-03 10:55:25.768 JupyterHub gen:629] Exception in Future <Task finished name='Task-74' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.12/site-packages/jupyterhub/handlers/base.py:1115> exception=TimeoutError('pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.12/site-packages/tornado/gen.py", line 624, in error_callback
        future.result()
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/handlers/base.py", line 1122, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/user.py", line 1021, in spawn
        raise e
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/user.py", line 920, in spawn
        await asyncio.wait_for(f, timeout=spawner.start_timeout)
      File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
        return await fut
               ^^^^^^^^^
      File "/usr/local/lib/python3.12/site-packages/kubespawner/spawner.py", line 3243, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.12/site-packages/jupyterhub/utils.py", line 265, in exponential_backoff
        raise asyncio.TimeoutError(fail_message)
    TimeoutError: pod jupyter/jupyter-abc000-uit-no---1ea1e26d did not start in 300 seconds!
    
[I 2024-10-03 10:55:25.775 JupyterHub log:192] 200 GET /hub/api/users/[email protected]/server/progress?_xsrf=[secret] ([email protected]@::ffff:10.244.1.1) 287317.52ms
[I 2024-10-03 10:55:26.320 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.76ms
[I 2024-10-03 10:55:28.906 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.72ms
[I 2024-10-03 10:55:56.328 JupyterHub log:192] 302 GET / -> /hub/ (@::ffff:10.244.1.1) 0.77ms

dragz added the bug label Oct 3, 2024
consideRatio changed the title from "Helm chart 4 beta cannot find existing pvc for user home dir." to "Helm chart 4.0.0-beta.2 cannot find existing pvc for user home dir." on Oct 3, 2024

consideRatio commented Oct 3, 2024

Thank you @dragz for trying out the beta release and reporting this back!!!

The key part of this report is that kubespawner 7.0.0b2 in z2jh 4.0.0-beta.2 resulted in...

[W 2024-10-03 10:50:38.346 JupyterHub spawner:3138] Using legacy pvc claim-abc000-40uit-2eno for [email protected]
[I 2024-10-03 10:50:38.347 JupyterHub spawner:2940] Attempting to create pvc claim-abc000-40uit-2eno, with timeout 3
[I 2024-10-03 10:50:38.367 JupyterHub spawner:2956] PVC claim-abc000-40uit-2eno already exists, so did not create new pvc.
2024-10-03T10:50:38.416759Z [Warning] 0/1 nodes are available: persistentvolumeclaim "claim-abc000-uit-no---1ea1e26d" not found.

@minrk it seems from this that kubespawner was smart enough to conclude it should re-use the old PVC, but at the same time failed to actually do so.

consideRatio commented Oct 3, 2024

Note, btw @minrk, that the default values of z2jh include the following, so we have explicitly set values overriding the kubespawner defaults.

  storage:
    type: dynamic
    extraLabels: {}
    extraVolumes: []
    extraVolumeMounts: []
    static:
      pvcName:
      subPath: "{username}"
    capacity: 10Gi
    homeMountPath: /home/jovyan
    dynamic:
      storageClass:
      pvcNameTemplate: claim-{username}{servername}
      volumeNameTemplate: volume-{username}{servername}
      storageAccessModes: [ReadWriteOnce]
      subPath:
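
For orientation, these values end up as KubeSpawner configuration roughly like the sketch below. This is an illustration only (not the chart's literal generated config), but it matches the volume dict visible in the check_pvcs log output above.

# Approximate KubeSpawner equivalent of the z2jh storage defaults above.
# Illustration only; the chart's generated jupyterhub_config may differ in detail.
c.KubeSpawner.pvc_name_template = "claim-{username}{servername}"
c.KubeSpawner.storage_capacity = "10Gi"
c.KubeSpawner.storage_access_modes = ["ReadWriteOnce"]
c.KubeSpawner.volumes = [
    {
        "name": "volume-{username}{servername}",
        "persistentVolumeClaim": {"claimName": "claim-{username}{servername}"},
    }
]
c.KubeSpawner.volume_mounts = [
    {"name": "volume-{username}{servername}", "mountPath": "/home/jovyan"}
]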

minrk commented Oct 3, 2024

pvcNameTemplate should probably be removed, since that's already the default. volumeNameTemplate doesn't match the kubespawner default, so it should still be defined. But wherever we have {username}{servername}, we should replace it with {user_server} (which uses an identical escape scheme).

I'll investigate the tests to see why they don't cover this case; I thought they did.
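
For illustration, a minimal sketch of that substitution at the KubeSpawner level (assuming config equivalent to the chart defaults quoted above; {user_server} escapes identically to {username}{servername}):

# Hedged sketch of the suggested rename: use {user_server} wherever the
# templates previously used {username}{servername} (identical escape scheme).
c.KubeSpawner.pvc_name_template = "claim-{user_server}"
c.KubeSpawner.volumes = [
    {
        "name": "volume-{user_server}",
        "persistentVolumeClaim": {"claimName": "claim-{user_server}"},
    }
]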

minrk commented Oct 3, 2024

(Unless pvcNameTemplate is consumed in the chart, which I suspect it actually is; in that case, it should stay and be updated to {user_server}.)

minrk commented Oct 3, 2024

I think I understand it now: the chart actually defines the volume:PVC mount, so even though the PVC itself is correct, the volume mount points to a PVC name that doesn't exist. Fortunately, the updated templates address this, because there is a {pvc_name} field that resolves directly to spawner.pvc_name, however that may be resolved.
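
As an illustration of that point, a hedged sketch of a volume definition that references {pvc_name} instead of re-deriving the claim name from the username template (KubeSpawner-level config, same assumptions as the earlier sketches):

# Illustration only: {pvc_name} resolves directly to spawner.pvc_name, so the
# volume mount follows whatever PVC name was chosen, including the legacy
# claim-abc000-40uit-2eno picked by the "Using legacy pvc" code path in the logs.
c.KubeSpawner.volumes = [
    {
        "name": "volume-{user_server}",
        "persistentVolumeClaim": {"claimName": "{pvc_name}"},
    }
]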

consideRatio commented Oct 3, 2024

Thank you @minrk!! @dragz, could you check whether this resolves things for you by using the latest development release of the chart?

It will be listed soon at https://hub.jupyter.org/helm-chart/#development-releases-jupyterhub

It is version 4.0.0-beta.2.git.6792.hc3e818e9

dragz commented Oct 3, 2024

Yes, this fixed the issue. Home dir is now found and mounted correctly.

(I'm impressed with the speed here)

consideRatio commented Oct 3, 2024

Thank you for testing, reporting, and verifying the attempted fix @dragz!!

4.0.0-beta.3 has been released with this fix, btw.
