
Help: slurm cluster example #12

Open
kkmann opened this issue Jun 10, 2024 · 9 comments
Labels
documentation Improvements or additions to documentation help wanted Extra attention is needed
Milestone

Comments

@kkmann

kkmann commented Jun 10, 2024

Hello,

{future.mirai} is a dream :) Any chance we could get a minimal working example for getting this to work on a slurm cluster? I am struggling to connect the dots.

Do I need to set up the daemons manually, as described at https://shikokuchuo.net/mirai/reference/daemons.html?

Thanks to all authors for making this happen :)

@HenrikBengtsson HenrikBengtsson added the help wanted Extra attention is needed label Jun 10, 2024
@HenrikBengtsson
Collaborator

HenrikBengtsson commented Jun 11, 2024

@shikokuchuo , what's the most direct way of launching mirai workers on a set of hosts over SSH when we have a vector of local hostnames?

The gist is that with Slurm you can submit a job requesting say 50 tasks (="workers") that Slurm may reserve slots for across multiple hosts, e.g.

sbatch --ntasks=50 my_script.sh

This will result in my_script.sh being launched on one host, with environment variables indicating which the other hosts are and how many slots each provides. That information is parsed by, and available via, hostnames <- parallelly::availableWorkers(). From here, the challenge is to launch length(hostnames) mirai workers, one per entry. Would it be something like:

hostnames <- parallelly::availableWorkers()
library(mirai)
daemons(
  url = host_url(),
  remote = ssh_config(remotes = paste0("ssh://", hostnames))
)

If that works, then:

plan(future.mirai::mirai_cluster)

should make the Futureverse resolve futures via that cluster of mirai workers.

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Jun 11, 2024
@HenrikBengtsson HenrikBengtsson added the documentation Improvements or additions to documentation label Jun 11, 2024
@michaelmayer2

The code below uses the slurmR package to flexibly create a pool of compute resources, which are then used in the mirai::daemons() call. Once done, a simple plan(mirai_cluster) enables parallel futures as expected.

A couple of points/questions:

  • I initially played around with mirai::make_cluster() in the hope that I could use the resulting cluster object much like plan(cluster, workers = cl_slurm), but it does not seem to work that way.
  • The prevalent "ssh'ing" by both slurmR and mirai to create their processes and connections is, in my opinion, not well received on HPC clusters. Without countermeasures it circumvents the resource manager (e.g. Slurm), and users can consume more resources than they were allocated. That is why many HPC clusters categorically prohibit remote ssh access to compute nodes, or limit it to users that have a job running on the particular node. For Slurm there is https://slurm.schedmd.com/pam_slurm_adopt.html, which adopts the ssh connection into the running job so that any process launched lives within the proper resource allocation.
  • For users trying the code below on a cloud-based HPC: there can be transient "permission denied" errors on newly spun-up compute nodes; rerunning the same makeSlurmCluster() call will then successfully spin up the cluster.
library(future.mirai)
library(mirai)
library(furrr)
library(tictoc)
library(dplyr)
library(slurmR)
opts_slurmR$set_opts(mem = "1024m")

# we'd like to run on 10 cores
compute_cores <- 10

# allocate compute nodes via slurmR
cl_slurm <- makeSlurmCluster(compute_cores)

# wrapper to convert hostnames into a mirai-compatible string
get_nodes <- function(cl) {
  paste0("ssh://", sapply(seq_along(cl), function(x) cl[[x]]$host))
}

mirai::daemons(
  compute_cores,
  url = host_url(tls = TRUE),
  remote = ssh_config(
    remotes = get_nodes(cl_slurm),
    timeout = 1,
    rscript = paste0(Sys.getenv("R_HOME"), "/bin/Rscript")
  )
)

# let's use mirai_cluster
plan(mirai_cluster)
tic()
nothingness <- future_map(rep.int(2, 10), ~ Sys.sleep(.x))
toc()

# let's use sequential
plan(sequential)
tic()
nothingness <- future_map(rep.int(2, 10), ~ Sys.sleep(.x))
toc()

stopCluster(cl_slurm)

@michaelmayer2

Making a bit more progress... While the above statements about ssh usage on HPC still hold, I went back to first principles and figured out why remote_config() was not working with Slurm's srun: because the Rscript -e ... command is wrapped in double quotes, srun treats the whole command as a single binary and then fails to find it (e.g. srun "echo 30" will fail).
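The ssh-vs-srun difference can be simulated locally without a cluster. In this illustration (my own, not mirai code), env stands in for srun, since both exec their argument vector directly without an intervening shell:

```shell
# A shell re-parses the command string, as happens on the remote end of ssh:
sh -c "echo hello"                              # prints: hello

# Direct exec of a single argv element, as srun does via execve():
# the whole string "echo hello" is looked up as one binary name and fails.
env "echo hello" 2>/dev/null || echo "exec failed"
```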

I experimented with removing shQuote() from the relevant bits of the code; while this worked for some use cases, it did not work in general. So I finally decided to change mirai's behaviour a bit more than expected: instead of dynamically creating strings containing R code to be interpreted by Rscript -e as an expression, I opted to save the R code to a temporary file and call it via Rscript. The changes (see the patch below) work for all use cases I checked, including TLS on/off, classic ssh_config, etc. The only gap at the moment is that the temp files are not cleaned up.

So, using the patch in the mirai package, I get

library(future.mirai)
library(mirai)
library(furrr)
library(dplyr)
library(microbenchmark)

# launch mirai daemons
# 
# please note the specification of SLURM resource requirements as args
compute_cores <- 4
mirai::daemons(
  compute_cores,
  url = host_url(ws = TRUE, tls = TRUE),
  remote = remote_config(
    command = "srun",
    args = c("--mem 512", "-n 1", "."),
    rscript = paste0(Sys.getenv("R_HOME"), "/bin/Rscript")
  ),
  dispatcher = TRUE
)

# start mirai_cluster future 
plan(mirai_cluster)

microbenchmark(
  res <- future_map_dbl(1:500, function(x) {
    mean(runif(180000))
  }, .options = furrr_options(seed = TRUE)),
  times = 10
)

Maybe this is something that @shikokuchuo could integrate into mirai? I have to admit I really don't like the idea of creating temporary files, but both the size and number of files are very small, so execution speed is practically unaffected.

mirai.patch

@shikokuchuo
Contributor

@michaelmayer2 thanks for investigating. I'll take a closer look at the shell quoting behaviour of remote_config(). As you rightly point out, writing temporary files is probably not the way to go.

@shikokuchuo
Contributor

Michael, in build 9001 (39ce672) the shell quoting is updated so the argument passed to Rscript is wrapped in single rather than double quotes. This used to be the case in mirai, but changed for some reason in the interim.

You may test with the R-Universe dev build:

install.packages("mirai", repos = "https://shikokuchuo.r-universe.dev")

I hope this helps with SLURM, but even if not, I believe it is safer to shell quote in this way, as it may avoid other corner cases. If it doesn't solve the SLURM issue, I have a couple of other ideas, although from the man page for srun it does seem like it should just work.
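The difference between the two quoting styles is easy to see with plain echo (a local illustration; the daemon call is a placeholder):

```shell
# Single outer quotes: contents are literal, inner double quotes need no escaping
echo 'mirai::daemon("tcp://host:5555")'

# Double outer quotes: inner quotes must be backslash-escaped, and $ and
# backticks would still be expanded by the shell - more corner cases to go wrong
echo "mirai::daemon(\"tcp://host:5555\")"
```

Both lines print the same string, but the single-quoted form passes the R expression through untouched.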

@michaelmayer2

@shikokuchuo - thanks so much for looking into this, Charlie!

I tried the latest changes, but I am sorry to report that it is still not working... The crucial bit really seems to be the shQuote() at https://github.com/shikokuchuo/mirai/blob/39ce672609dfbffc0dfd1982a9b12641fea8754d/R/launchers.R#L151

To better demonstrate what is going on, I replaced this line with a system() command.

system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)

I then checked the following use cases

  1. classic ssh_config mirai worker
    a. with shQuote() enabled
    b. without shQuote()
  2. remote_config mirai worker
    a. with shQuote() enabled
    b. without shQuote()

See the detailed results below. The gist is that 1a and 2b work while 1b and 2a fail, which is caused by the different behaviour of ssh and srun when dealing with double quotes.

While srun echo hello works, srun "echo hello" fails: srun cannot find a binary named "echo hello" because it treats the whole command, including parameters, as a single executable.

posit0001@interactive-st-rstudio-1:~/mirai$ srun  "echo hello"
slurmstepd: error: execve(): echo hello: No such file or directory
srun: error: interactive-dy-rstudio-1: task 0: Exited with exit code 2
posit0001@interactive-st-rstudio-1:~/mirai$ srun echo hello
hello

I am not sure how to proceed from here. Happy to supply more information as needed. Maybe we could make the problematic shQuote() optional via an argument?
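One possible direction, sketched here as an assumption rather than a tested fix: give srun an explicit shell to do the re-parsing that ssh gets for free, e.g. srun --mem 512 -n 1 sh -c "<full Rscript command>". The re-parsing step itself, simulated locally without srun:

```shell
# srun would exec 'sh' (a real binary on $PATH), and sh then re-parses the
# quoted command string - the same job the remote shell performs for ssh:
sh -c "echo 'daemon would launch here'"
```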


Case 1a - classic ssh_config mirai worker with shQuote() enabled

Browse[2]> paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" ")
[1] "ssh -o ConnectTimeout=1 -fTp 22 localhost \"/opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\\\"tcp://interactive-st-rstudio-1:42097\\\",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'\""
Browse[2]> system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Browse[2]> daemons()
$connections
[1] 1

$daemons
                                     i online instance assigned complete
tcp://interactive-st-rstudio-1:42097 1      1        1        0        0

Case 1b classic ssh_config mirai worker without shQuote()

Browse[2]> paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" ")
[1] "ssh -o ConnectTimeout=1 -fTp 22 localhost /opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\"tcp://interactive-st-rstudio-1:42097\",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'"
Browse[2]> system(paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" "),wait=FALSE)
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/opt/R/4.3.2/lib/R/bin/Rscript -e mirai::daemon("tcp://interactive-st-rstudio-1:42097",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'

Case 2a remote_config mirai worker with shQuote() enabled

Browse[1]> paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" ")
[1] "srun --mem 512 -n 1 -o slurm.loo \"/opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\\\"tcp://interactive-st-rstudio-1:38907\\\",rs=c(10407,1413271533,1529776586,-351430461,-2090321112,-1063229687,-860424394))'\""
Browse[1]> system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)
srun: error: interactive-dy-rstudio-1: task 0: Exited with exit code 2

Case 2b remote_config mirai worker without shQuote()

Browse[1]> paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" ")
[1] "srun --mem 512 -n 1 -o slurm.loo /opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\"tcp://interactive-st-rstudio-1:38907\",rs=c(10407,1413271533,1529776586,-351430461,-2090321112,-1063229687,-860424394))'"
Browse[1]> system(paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" "),wait=FALSE)
Browse[1]> daemons()
$connections
[1] 1

$daemons
                                     i online instance assigned complete
tcp://interactive-st-rstudio-1:38907 1      1        1        0        0

@HenrikBengtsson
Collaborator

Thank you both. I've gone through quite a few of these quote-or-not-to-quote and nested-quoting issues in the parallelly package. It grew out of different needs to launch parallel R workers locally, remotely, in Linux containers, over SSH, over qrsh (similar to srun), from and to different operating systems, etc. Have a look at https://parallelly.futureverse.org/reference/makeClusterPSOCK.html and the arguments rshcmd, rshopts, rscript, and rscript_args. Look also at the different examples. Note how both rshcmd and rscript are vectors, and how the first element is treated specially. FWIW, my constraint was to also remain backward compatible with the parallel package, so some solutions might not be the ones you would pick if starting from scratch. @shikokuchuo, I suspect you might have to do something similar in order to support the different types of uses that will be thrown at remote_config() and ssh_config().

PS. @michaelmayer2, the canonical way to get the location of the current Rscript is file.path(R.home("bin"), "Rscript").

@HenrikBengtsson
Collaborator

I just checked the parallelly code; it suffers from the same problem. I'll see if there's a workaround/hack or if I have to update the package.

@HenrikBengtsson HenrikBengtsson modified the milestones: 0.2.2, Next release Jul 3, 2024
@michaelmayer2

A little more progress on comparing future.mirai to the cluster backend: https://pub.current.posit.team/public/future_mirai/. It is really amazing how much more scalable mirai is compared to the good old (monolithic) PSOCK cluster.
