Refine docs in this repo and upstream about server and kernel culling #185

Open
consideRatio opened this issue May 23, 2023 · 0 comments
Labels
Documentation A change to our documentation.

Comments

@consideRatio
Member

consideRatio commented May 23, 2023

2i2c jupyterhub setups come with two systems, both set up by default, to handle inactivity. In this issue I'm summarizing what I think can be used to update the docs we provide about server culling and kernel culling.

A jupyter kernel culling system

If a jupyter notebook file is opened, a "kernel" is started. The kernel retains state (variables' values etc.) based on the code that has run in it (executed notebook cells). The kernel culling system terminates kernels that have been "idle" for one hour or more.

In practice, this means that if you run a long-running job from a jupyter notebook and want the kernel's state to still be there after the notebook execution completes, then the kernel culling system should be disabled.

Disabling it

This can be disabled for individual users by providing a ~/.jupyter/jupyter_server_config.json file like:

{
  "MappingKernelManager": {
    "cull_idle_timeout": 0
  }
}
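
A user who already has a ~/.jupyter/jupyter_server_config.json file probably doesn't want to overwrite it. As a minimal sketch, the setting above could be merged into an existing file like this. The helper name disable_kernel_culling is made up for illustration and isn't part of jupyter or of any 2i2c tooling:

```python
import json
from pathlib import Path


def disable_kernel_culling(config_path: Path) -> dict:
    """Set MappingKernelManager.cull_idle_timeout to 0, keeping other settings.

    Hypothetical helper for illustration, not part of jupyter itself.
    """
    config = {}
    if config_path.exists():
        # Keep whatever the user has already configured in this file.
        config = json.loads(config_path.read_text())
    # Create the MappingKernelManager section if it's missing, then set the value.
    config.setdefault("MappingKernelManager", {})["cull_idle_timeout"] = 0
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

It would be called as disable_kernel_culling(Path.home() / ".jupyter" / "jupyter_server_config.json"). My understanding is that the config is read when the user server starts, so the server would need a restart before the change takes effect.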

To disable it for an entire hub, a 2i2c engineer can re-configure the file we inject in user servers via the basehub chart:

jupyterhub:
  singleuser:
    extraFiles:
    # ensure kernel culling is disabled so the in-memory state of a long running
    # job is retained after it completes
    jupyter_server_config.json:
      data:
        MappingKernelManager:
          cull_idle_timeout: 0

A jupyter server culling system

When a user server is started by jupyterhub, jupyterhub registers to get information about "activity" from the user server. If the user server hasn't been accessed via the network recently (for example by a user's browser making requests), and the server reports no activity in the last hour, then it's shut down.

A big drawback of this system is that it fails to recognize all activity. For example, if a user starts a user server, runs a command in a terminal, and comes back a week later to check on it, the server could have been terminated for a lack of perceived activity. Something was running in a terminal, but it likely didn't register as server activity to this system. Not even a busy kernel registers as server activity by itself, only if it does something like regularly writing a status message.
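
To make the drawback concrete, here is a simplified model of the decision described above. It's only a sketch and not the actual jupyterhub-idle-culler code: the function name should_cull and the timestamps are made up, and the real system has more inputs than a single last-activity timestamp.

```python
from datetime import datetime, timedelta, timezone


def should_cull(last_activity: datetime, now: datetime, timeout: timedelta) -> bool:
    """Return True if a server has looked idle for longer than `timeout`.

    Simplified, hypothetical model of the server culling decision.
    """
    return now - last_activity > timeout


now = datetime(2023, 5, 23, 12, 0, tzinfo=timezone.utc)
# Hypothetical case: a terminal job was started three hours ago and is still
# running, but nothing since then has registered as activity.
last_activity = now - timedelta(hours=3)

print(should_cull(last_activity, now, timedelta(hours=1)))   # True
print(should_cull(last_activity, now, timedelta(hours=24)))  # False
```

With the one hour limit the server is culled even though the job is still working, while a 24 hour limit lets this particular job survive.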

A big upside of this system is that it protects users who forget to shut down a powerful server, which can be costly.

I suggest three strategies to protect long running jobs:

  1. We disable the server culling system for everyone
  2. We increase the inactivity duration from 1 hour to 24 hours or more
  3. Individual users adopt a workaround when needed by manually running a "keep alive" script in a notebook, see the comment in jupyterhub/jupyterhub-idle-culler#55: Additions to how it works, and a simple "keep alive" strategy
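
To give an idea of what the third strategy could look like, here is a rough sketch. It is not the script from the linked comment. It assumes that the user server's environment has JUPYTERHUB_ACTIVITY_URL and JUPYTERHUB_API_TOKEN set, that the token is allowed to post activity for its own user, and that jupyterhub accepts the payload shape used below on its users/{name}/activity endpoint. The function names are made up for illustration.

```python
import json
import os
import time
import urllib.request
from datetime import datetime, timezone


def build_activity_payload(server_name: str, when: datetime) -> dict:
    """Build the JSON body reporting activity for one user server."""
    timestamp = when.isoformat()
    return {
        "last_activity": timestamp,
        "servers": {server_name: {"last_activity": timestamp}},
    }


def report_activity() -> int:
    """POST "the server was active just now" to jupyterhub, return the HTTP status.

    Assumption: these environment variables are set in the user server.
    """
    url = os.environ["JUPYTERHUB_ACTIVITY_URL"]
    token = os.environ["JUPYTERHUB_API_TOKEN"]
    # The default (unnamed) server has an empty name.
    server_name = os.environ.get("JUPYTERHUB_SERVER_NAME", "")
    payload = build_activity_payload(server_name, datetime.now(timezone.utc))
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def keep_alive(hours: float, interval_seconds: int = 300) -> None:
    """Report activity every `interval_seconds` until `hours` have passed."""
    deadline = time.monotonic() + hours * 3600
    while time.monotonic() < deadline:
        report_activity()
        time.sleep(interval_seconds)
```

Running keep_alive(12) in a separate notebook would block that notebook's kernel for up to 12 hours while the real job runs elsewhere. The explicit time limit is on purpose: once it expires the server culling system works as usual again, so a forgotten server still gets shut down eventually.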

Note that you can track user servers' activity, as understood by jupyterhub, and their status by visiting https://jupyter.quantifiedcarbon.com/hub/admin. If the server culling system is disabled, it may be relevant to check in there from time to time to avoid having a large server running without a user attending to it.

Disabling it

It's a basehub chart configuration of the dependency chart jupyterhub:

jupyterhub:
  # ensure user server culling is disabled so servers perceived as inactive
  # (includes busy kernels that emit nothing while computing) aren't shut down
  cull:
    enabled: false

Related

@consideRatio consideRatio changed the title Refine docs here, and upstream, about server and kernel culling Refine docs in this repo and upstream about server and kernel culling May 23, 2023
@consideRatio consideRatio added the Documentation A change to our documentation. label May 23, 2023
@consideRatio consideRatio moved this from Needs Shaping / Refinement to Ready to work in DEPRECATED Engineering and Product Backlog May 23, 2023