Failed to start a parallel pool #2
OK.
With how tempname works in MATLAB, the only input (I believe) is the path
you want the temporary directory to be created under. So by calling
tempname('cache'), the temp directory is made within the local folder
cache, which appears to be what is happening.
I added the extra cases / warnings around parpool starting. Hopefully it
will be more explicit when that's what fails.
I also added the && break loop around the singularity call. Hopefully that
will stem most of the bad starts. There aren't many new ones, and most
appear to be done.
I'll watch this moving forward. It would be good to standardize how we
handle this across apps at some point.
Brent
On Sat, Nov 3, 2018 at 11:33 AM Soichi Hayashi wrote:
We are seeing a lot of failed jobs due to "Failed to start a parallel
pool".
A couple of things we could try:
1.
Right now, this App uses tempname() to generate the temp path for
JobStorageLocation, which I believe puts it under /tmp.
I think we should create it under the current working directory instead,
in case use of /tmp is somehow causing the issue.
% Use a separate profile directory so that multiple jobs don't share the same directory and crash
profile_dir = './profile';
if ~exist(profile_dir, 'dir')
    mkdir(profile_dir);
end
c = parcluster();
c.JobStorageLocation = profile_dir;
pool = parpool(c, config.workers);
2.
Right now, this App skips setting JobStorageLocation if mkdir(tmpdir)
fails.
% check and set cachedir location
if OK
% set local storage for parpool
clust.JobStorageLocation = tmpdir;
end
I suggest removing this check and letting the App fail if it cannot create
the tmpdir (or at least adding a log message inside the block so we know
that JobStorageLocation is being set).
3.
I have seen a similar parpool startup failure / random MATLAB crash
before. I worked around it by simply rerunning the code a few times when
it fails.
https://github.com/brain-life/app-dp-modelfit/blob/master/fit_model.sh#L39
It's ugly, but it's a very simple thing to try, and for the DP App it has
cured the occasional hiccups.
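For illustration, the rerun-until-it-works idea can be sketched in shell roughly as below. This is a minimal sketch and not the actual contents of fit_model.sh: the helper name run_with_retries, the attempt count, and the command in the usage comment are all assumptions made up for the example.

```shell
# Hypothetical helper (not from the App): run a command up to
# max_attempts times and stop at the first attempt that succeeds.
run_with_retries() {
    max_attempts=$1
    shift
    attempt=1
    status=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # "&& break" leaves the loop as soon as the command exits with 0
        "$@" && status=0 && break
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    # 0 if any attempt succeeded, 1 if every attempt failed
    return "$status"
}

# Usage (hypothetical command name):
#   run_with_retries 3 ./start_matlab_job.sh
```

The helper returns non-zero when every attempt fails, so the job still fails visibly instead of continuing after repeated bad starts.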