
HPC integration option 1 - ssh #15

Open
1 of 9 tasks
rlskoeser opened this issue Oct 3, 2024 · 1 comment
rlskoeser commented Oct 3, 2024

The first and simpler approach for HPC integration is to use SSH access and SSH keys so our app user can log in to the cluster as individual users and start the Slurm job as them.

Note that CAS integration (included in #10) is a prerequisite for this.

implementation details

  • request access to SSH without Duo from the test VM to the HPC machine
  • ensure SSH access from the VM to the HPC machine (may require a PUL .lib domain firewall change)
  • add a vaulted SSH key to the deploy and write instructions for adding it to authorized_keys on the HPC machine
  • write remote equivalents of the GPU Celery tasks to kick off training jobs: export the needed data/model, use scp/rsync to transfer the files, SSH in as the current user, and start the Slurm job
  • modify eScriptorium to call our remote version of the task instead of running it locally (think about how to make this configurable, but this version doesn't have to be elegant)
  • implement a method to check the status of the remote Slurm job
  • modify eScriptorium task monitoring to handle the remote Slurm job
  • when the job completes, load the refined model back into eScriptorium and report on status
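
The transfer-and-submit step above could be sketched roughly as follows, assuming we shell out to the system `rsync`/`ssh` binaries. The hostname, paths, and function names here are hypothetical placeholders, not decisions:

```python
import subprocess

HPC_HOST = "hpc.example.edu"  # placeholder hostname, not the real cluster


def parse_job_id(sbatch_stdout: str) -> str:
    """sbatch --parsable prints 'jobid' or 'jobid;cluster'; keep the id."""
    return sbatch_stdout.strip().split(";")[0]


def submit_remote_training(username: str, local_dir: str,
                           remote_dir: str, sbatch_script: str) -> str:
    """Copy exported data/model to the cluster and submit a Slurm job
    as the given user; returns the Slurm job id."""
    # transfer exported data/model into the user's directory on the cluster
    subprocess.run(
        ["rsync", "-az", local_dir + "/", f"{username}@{HPC_HOST}:{remote_dir}/"],
        check=True,
    )
    # submit over ssh; --parsable makes sbatch print only the job id
    result = subprocess.run(
        ["ssh", f"{username}@{HPC_HOST}", "sbatch", "--parsable", sbatch_script],
        check=True, capture_output=True, text=True,
    )
    return parse_job_id(result.stdout)
```

This relies on the vaulted key already being in the user's authorized_keys, so no password or Duo prompt interrupts the non-interactive `ssh`/`rsync` calls.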
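
For the status-check item, one option is to poll `sacct` over the same SSH channel. A minimal sketch, again with hypothetical names; the Celery-side monitoring integration would wrap something like this:

```python
import subprocess


def parse_state(sacct_output: str) -> str:
    """sacct may append a reason, e.g. 'CANCELLED by 1234'; keep the
    first token of the first non-empty line."""
    lines = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    return lines[0].split()[0] if lines else "UNKNOWN"


def remote_job_state(username: str, host: str, job_id: str) -> str:
    """Return the Slurm state (PENDING, RUNNING, COMPLETED, FAILED, ...)
    of a job, queried as the submitting user over ssh."""
    # -X limits output to the allocation, -n drops the header,
    # --parsable2 gives clean pipe-free single-column output
    result = subprocess.run(
        ["ssh", f"{username}@{host}", "sacct", "-j", job_id,
         "-X", "-n", "--parsable2", "--format=State"],
        check=True, capture_output=True, text=True,
    )
    return parse_state(result.stdout)
```

The monitoring task could poll this on a schedule and, on `COMPLETED`, trigger the step that rsyncs the refined model back and loads it into eScriptorium.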

linear bot commented Oct 3, 2024

RSE-100 HPC integration option 1 - ssh


Labels
None yet
Projects
Status: IceBox