pll_id conflicts when submitting many jobs simultaneously (LSF arrays) #113
Comments
Code that works:
It seems like 2 separate processes see that an ID is available, and then next both try to get it, but only one gets it. We could have a combined "find available pllid AND hold it" command in
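To illustrate the "find available AND hold it" idea in general terms, here is a minimal sketch in Python. It is not how `parallel` is implemented; the lock directory, the `id_N.lock` file naming, and the function names `claim_free_id` and `release_id` are all hypothetical. The point is only that the check and the claim happen in one atomic operation, so two processes can never both conclude that the same ID is free.

```python
import os


def claim_free_id(lock_dir, max_ids=1000):
    """Find a free ID and hold it in a single atomic step (hypothetical helper)."""
    os.makedirs(lock_dir, exist_ok=True)
    for candidate in range(1, max_ids + 1):
        path = os.path.join(lock_dir, f"id_{candidate}.lock")
        try:
            # O_CREAT | O_EXCL makes the call fail if the file already exists,
            # so "is it free?" and "take it" cannot be interleaved.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # another process holds this ID; try the next one
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        return candidate, path
    raise RuntimeError(f"no free ID among the first {max_ids}")


def release_id(path):
    """Give the ID back by removing its lock file."""
    os.remove(path)
```

On a cluster, the lock directory would have to live on storage shared by all jobs, and whether exclusive creation is truly atomic there depends on the file system, so this would need testing before being relied on.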
Preliminaries
Before submitting an issue, please check (with `x` in brackets) that you:
Expected behavior and actual behavior
Context
I am using LSF as a job scheduler to submit an array of jobs to our cluster. Each job is assigned 4 cores, and I am using `parallel sim` to divide the simulation among the 4 cores.
Desired behavior
I would like `parallel` to assign a unique `pll_id` to each job.
Actual behavior
LSF may send many jobs simultaneously, i.e., within the same second. This means that `parallel` assigns the same `pll_id` to each job, causing a conflict and errors for all but one of the jobs that arrive within the same second.
Failed solution attempted
Because each job has a unique `seed`, I tried the `randtype("current")` option, but this was not effective.
Solution (workaround)
What solved the problem was to create a `while` loop such that, if `cap parallel sim` returns an error, we wait a random number of seconds (1-16, although this is arbitrary) and try again. This was successful, although somewhat to my surprise some jobs needed to go through the loop 10 or more times. I have appended a sketch of the code.
Steps to reproduce the problem
This would be pretty tough because I think it depends on the specifics of our cluster, our job scheduler, etc.
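For readers who want the shape of the retry workaround described under "Solution (workaround)", the following is a rough Python sketch of the same pattern: run a task, and on failure wait a random 1 to 16 seconds before trying again. It is not the original Stata code. The name `run_with_retry`, the cap of 50 attempts, and the use of `RuntimeError` as a stand-in for `cap parallel sim` returning an error are all assumptions made for illustration.

```python
import random
import time


def run_with_retry(task, max_attempts=50, min_wait=1.0, max_wait=16.0,
                   sleep=time.sleep, rng=random):
    """Call task() until it succeeds, waiting a random interval after each failure.

    Returns (result, attempts_used). RuntimeError stands in for the error
    raised when an ID conflict occurs; the last failure is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up instead of looping forever
            # A random wait spreads colliding jobs apart in time.
            sleep(rng.uniform(min_wait, max_wait))
```

The cap on attempts is a design choice added here so that a job with a persistent, unrelated error does not loop indefinitely. When many jobs draw their waits from the same short window, repeated collisions are plausible, which may be why some jobs needed 10 or more passes through the loop.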
System information
Some relevant information
Output from `creturn list`: