Experiment version race condition error when using slurm #45

Open
williamFalcon opened this issue Nov 30, 2018 · 2 comments
Comments

@williamFalcon
Owner

williamFalcon commented Nov 30, 2018

Sometimes test-tube tries to create an experiment version that already exists. We need to add a small delay to avoid the race condition.

@williamFalcon williamFalcon changed the title Experiment version runtime error when using slurm Experiment version race condition error when using slurm Nov 30, 2018
@artyompal

A small delay would not be a proper fix for a race condition.

@oscmansan
Contributor

oscmansan commented Aug 17, 2019

I ran into the same problem. The workaround I found is to set the Experiment.version attribute to the value of the --hpc_exp_number argument that SlurmCluster.optimize_parallel_cluster_gpu() passes to the script when it launches it. Since next_trial_version is read by a single process before the sbatch scripts are enqueued to run in parallel, the race condition never occurs.

So, for example, in the pytorch_hpc_example, I'd add between lines 41-42:

parser.add_argument('--hpc_exp_number', type=int)

And then, between lines 18-19:

version=hparams.hpc_exp_number
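Put together, a minimal self-contained sketch of the workaround looks like the snippet below. The actual Experiment constructor call is stubbed out in a comment, since the exact test-tube signature in your version may differ; the key idea is just parsing --hpc_exp_number and pinning it as the version:

```python
# Sketch of the workaround: pin the experiment version to the trial
# number that SlurmCluster assigns before jobs run in parallel.
# (The Experiment(...) call is shown as a comment; test-tube's exact
# constructor signature is an assumption here.)
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # SlurmCluster.optimize_parallel_cluster_gpu() appends
    # --hpc_exp_number to each generated sbatch command, so the
    # script must accept it.
    parser.add_argument('--hpc_exp_number', type=int, default=None)
    return parser


def main(argv=None):
    hparams = build_parser().parse_args(argv)
    # Because the trial number was allocated by a single process
    # before the parallel jobs started, no two jobs can claim the
    # same auto-incremented version. In the real script this would be:
    #   exp = Experiment(name='my_exp', version=hparams.hpc_exp_number)
    return hparams.hpc_exp_number
```

Running the script as `python train.py --hpc_exp_number 3` would then pin version 3 for that trial instead of letting test-tube auto-increment it.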

There's probably a better way that handles this automatically, but in the meantime this is the solution I found. I'll open a PR if I find a better way to do it. What do you think @williamFalcon?

Anyway, I hope this helps!
