Add "connect to job" functionality, use that for CondorSpawner #200
base: main
Conversation
Sounds like it would partially solve #169, right?
It's surely an ingredient, but the focus here is different: it allows connecting to workers spawned by JupyterHub via a batch system command. In fact, if you spawn batch system workers in arbitrary locations and the batch system picks them up, you can also run notebooks there. That's also one of the ways we'd like to use it: Start a HTCondor
The necessary change in JupyterHub's batchspawner is that establishing connectivity can be delegated to an additional command (and in the case of HTCondor, condor_ssh_to_job can fill that role). So in short, yes, it can be seen as an ingredient also for the issue you linked 😉.
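As a concrete illustration of that delegation, here is a minimal configuration sketch. The trait name connect_to_job_cmd and the {port}/{rport} placeholders come from this thread; the plain-SSH command line and the {host} placeholder are assumptions for illustration only, not this PR's actual defaults.

```python
# jupyterhub_config.py -- illustrative sketch only, not this PR's defaults.
# connect_to_job_cmd and the {port}/{rport} placeholders are taken from this
# thread; the ssh command line and {host} are assumptions for illustration
# (an HTCondor setup would use condor_ssh_to_job instead of plain ssh).
c = get_config()  # noqa: F821 -- provided by JupyterHub when loading the file

c.JupyterHub.spawner_class = "batchspawner.CondorSpawner"

# Delegate establishing connectivity to an additional command run on the hub,
# e.g. a generic SSH tunnel from a local hub port to the remote notebook port:
c.CondorSpawner.connect_to_job_cmd = "ssh -N -L {port}:localhost:{rport} {host}"
```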
I pushed two more commits here which ensure no port collision happens for the forwarded port. Again, this is implemented in a generalizable manner, i.e. it is optional and could also be used for other SSH tunneling approaches; however, I only added the
Rebased to current main.
I pushed three more commits with a small improvement, and two more changes:
I see that this branch now has conflicts after 1decdf2, which caused a lot of (useful) reformatting all over the project. I am not (yet) resolving these, but am certainly willing to do so once somebody steps up to start a review (to minimize effort, I'd prefer to resolve conflicts only once).
This adds the possibility to start a "connect_to_job" background task on the hub on job start, which establishes connectivity to the actual single user server. An example for this can be "condor_ssh_to_job" for HTCondor batch systems. Additionally, the background tasks are monitored:
- for successful startup: the background task is given some time to successfully establish connectivity;
- in poll() during job runtime: if they fail, the job is terminated.
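To make the described monitoring pattern concrete, here is a hedged sketch in generic asyncio; the method and attribute names below are hypothetical, not the identifiers used by this PR.

```python
# Hypothetical sketch of the "background task + monitoring" pattern described
# above; not this PR's actual code.
import asyncio


class ConnectToJobSketch:
    connect_proc = None  # handle of the background connect-to-job process

    async def start_connect_to_job(self, cmd: str):
        """Launch the connect-to-job command and give it time to come up."""
        self.connect_proc = await asyncio.create_subprocess_shell(cmd)
        await asyncio.sleep(5)  # grace period to establish connectivity
        if self.connect_proc.returncode is not None:
            raise RuntimeError("connect_to_job command exited during startup")

    def connect_to_job_alive(self) -> bool:
        """Checked from poll(): if the background command died, kill the job."""
        return self.connect_proc is not None and self.connect_proc.returncode is None
```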
This leverages condor_ssh_to_job to forward the port of the single user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.
This allows using {rport} inside the connect_to_job_cmd. If this is done, {port} is set to a random port chosen locally on the Hub, and {rport} is set to the original remote port. This is useful e.g. for SSH port forwarding. If this is used, the {rport} of the notebook is saved into a new class variable in case it is needed again.
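As an illustration of the substitution described above (a sketch only; batchspawner's actual implementation may differ, and the {host} placeholder is an assumption):

```python
# Sketch of the {port}/{rport} substitution; not this PR's actual code.
import socket


def pick_random_port() -> int:
    """Pick a free local port on the hub by binding to port 0."""
    with socket.socket() as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


rport = 8888               # remote port the single user server listens on
port = pick_random_port()  # local hub port the forward will listen on

connect_to_job_cmd = "ssh -N -L {port}:localhost:{rport} {host}"
cmd = connect_to_job_cmd.format(port=port, rport=rport, host="worker.example.org")
print(cmd)
```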
This uses that functionality to forward the notebook port to a random local port on the hub. It ensures no port collisions occur between different forwarded notebooks.
Wrapping the coroutines into futures first allows directly checking the state of the futures.
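A minimal, generic asyncio illustration of that point (not this PR's code):

```python
# Wrapping a coroutine into a future/task makes its state directly inspectable.
import asyncio


async def connect():
    await asyncio.sleep(0.1)  # stand-in for establishing connectivity


async def main():
    fut = asyncio.ensure_future(connect())
    await asyncio.sleep(0)              # let the task start running
    print(fut.done(), fut.cancelled())  # state can be checked at any time
    await fut
    print(fut.done(), fut.exception())  # done, and no exception was raised


asyncio.run(main())
```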
For implementations other than condor_ssh_to_job, it might be useful to use the hostname on which the notebook was spawned to proxy it.
If connect_to_job_cmd is explicitly set to an empty string, CondorSpawner will not override the hostname with localhost, allowing a revert to the old behaviour (assuming direct connectivity).
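For example (the trait name is the one used in this thread):

```python
# Explicitly disable connect_to_job and fall back to direct connectivity:
c.CondorSpawner.connect_to_job_cmd = ""
```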
…ing. With this, the first database commit will already contain the forwarded port if connect_to_job is used, and the log will show the correct port number.
This is required since the data may change after the connect_to_job function has run.
The background command cannot cleanly be simulated to stay running.
Since 1.5 years have passed, I've taken the opportunity to:
After this, the PR applies cleanly on main. Hope this helps; we are still using this in production quite successfully 👍.
Thanks for this, @olifre. It is helping me to run an external JupyterHub connecting via ssh to a SLURM cluster. One thing I noticed is that the connect_to_job_cmd, and background commands in general, are not robust to the hub restarting. In general, when the hub restarts, users should be able to reattach to their servers once it's back. But here, when the hub restarts, the connect_to_job_cmd ssh processes are lost, and the hub is no longer able to connect to the servers. Is there an easy way around this, or would we need to store the commands in the database and then rerun them when the hub restarts?
Thanks a lot for your nice feedback! I was really wondering whether this would be of use to others, and it's great to hear it is 👍. So I am also happy I implemented it as generically as possible (since we only use it with HTCondor at the moment).
This is true. I don't have a good workaround for this; I believe it would really be necessary to persist this information into the database (which is not implemented at the moment). We have not tackled this yet as our user count usually drops to zero at least once every day, which eases restarts.
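Not part of this PR, but as a rough illustration of what persisting that information could look like: JupyterHub spawners already expose get_state()/load_state() hooks whose contents are stored in the hub database; everything else in this sketch (names, re-launch logic) is hypothetical.

```python
# Hypothetical sketch only -- persistence across hub restarts is NOT
# implemented in this PR. get_state()/load_state() are standard JupyterHub
# Spawner hooks; the attribute names are made up for illustration.
class PersistentConnectMixin:
    connect_to_job_rendered_cmd = None  # fully substituted command line

    def get_state(self):
        """Persist the rendered connect command into the hub database."""
        state = super().get_state()
        if self.connect_to_job_rendered_cmd:
            state["connect_to_job_cmd"] = self.connect_to_job_rendered_cmd
        return state

    def load_state(self, state):
        """Restore the command on hub restart (re-launching it is left open)."""
        super().load_state(state)
        self.connect_to_job_rendered_cmd = state.get("connect_to_job_cmd")
```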
This adds the possibility to start a connect_to_job background task on the hub on job start, which establishes connectivity to the actual single user server. An example for this is condor_ssh_to_job for HTCondor batch systems.

Additionally, the background tasks are monitored:
- for successful startup: the background task is given some time to successfully establish connectivity;
- in poll() during job runtime: if they fail, the job is terminated.

For the CondorSpawner, this leverages condor_ssh_to_job to forward the port of the single user server to the hub, removing the need for direct connectivity from the hub to the execute nodes.

Notably, it allows using worker node setups where the workers do not allow direct inbound connectivity (e.g. NATted setups), requiring only outbound connectivity from worker to hub and delegating the other direction to the batch system.