
HPC integration option 1 - ssh #15

Open
1 of 9 tasks
rlskoeser opened this issue Oct 3, 2024 · 1 comment
rlskoeser commented Oct 3, 2024

The first and simpler approach for HPC integration is to use SSH access and SSH keys so our app user can log in to the cluster as individual users and start the Slurm job as them.

Note that CAS integration (included in #10) is a prerequisite for this.

implementation details

  • request access to SSH without Duo from the test VM to the HPC machine
  • ensure SSH access from the VM to the HPC machine (may require a PUL .lib domain firewall change)
  • add a vaulted SSH key to the deploy and write instructions for adding it to authorized_keys on the HPC machine
  • write remote equivalents of the GPU Celery tasks to kick off training jobs: export the needed data/model, use scp/rsync to transfer the files, SSH in as the current user, and start the Slurm job
  • modify eScriptorium to call our remote version of the task instead of running it locally (think about how to make this configurable, but this version doesn't have to be elegant)
  • implement a method to check the status of the remote Slurm job
  • modify eScriptorium task monitoring to handle the remote Slurm job
  • when the job completes, load the refined model back into eScriptorium and report on status
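
The transfer-and-submit step above could be sketched roughly as follows, assuming we shell out to the system `rsync`/`ssh` binaries. The hostname, paths, and function names here are hypothetical placeholders, not decisions:

```python
import subprocess

HPC_HOST = "hpc.example.edu"  # placeholder hostname, not the real cluster


def parse_job_id(sbatch_stdout: str) -> str:
    """sbatch --parsable prints 'jobid' or 'jobid;cluster'; keep the id."""
    return sbatch_stdout.strip().split(";")[0]


def submit_remote_training(username: str, local_dir: str,
                           remote_dir: str, sbatch_script: str) -> str:
    """Copy exported data/model to the cluster and submit a Slurm job
    as the given user; returns the Slurm job id."""
    # transfer exported data/model into the user's directory on the cluster
    subprocess.run(
        ["rsync", "-az", local_dir + "/", f"{username}@{HPC_HOST}:{remote_dir}/"],
        check=True,
    )
    # submit over ssh; --parsable makes sbatch print only the job id
    result = subprocess.run(
        ["ssh", f"{username}@{HPC_HOST}", "sbatch", "--parsable", sbatch_script],
        check=True, capture_output=True, text=True,
    )
    return parse_job_id(result.stdout)
```

This relies on the vaulted key already being in the user's authorized_keys, so no password or Duo prompt interrupts the non-interactive `ssh`/`rsync` calls.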
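
For the status-check item, one option is to poll `sacct` over the same SSH channel. A minimal sketch, again with hypothetical names; the Celery-side monitoring integration would wrap something like this:

```python
import subprocess


def parse_state(sacct_output: str) -> str:
    """sacct may append a reason, e.g. 'CANCELLED by 1234'; keep the
    first token of the first non-empty line."""
    lines = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    return lines[0].split()[0] if lines else "UNKNOWN"


def remote_job_state(username: str, host: str, job_id: str) -> str:
    """Return the Slurm state (PENDING, RUNNING, COMPLETED, FAILED, ...)
    of a job, queried as the submitting user over ssh."""
    # -X limits output to the allocation, -n drops the header,
    # --parsable2 gives clean pipe-free single-column output
    result = subprocess.run(
        ["ssh", f"{username}@{host}", "sacct", "-j", job_id,
         "-X", "-n", "--parsable2", "--format=State"],
        check=True, capture_output=True, text=True,
    )
    return parse_state(result.stdout)
```

The monitoring task could poll this on a schedule and, on `COMPLETED`, trigger the step that rsyncs the refined model back and loads it into eScriptorium.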

linear bot commented Oct 3, 2024

RSE-100 HPC integration option 1 - ssh


Labels
None yet
Projects
Status: IceBox