launcher for stampede2 #47
Comments
I tried using the system module instead, which seems to be a more recent version, and that works better: the jobs are running now. I'm still getting a couple of errors, but I'm not sure whether they affect anything.
The first issue is related to a change in netcat, which was noticed on LS5 and is now the case on S2. I believe the current master branch has this resolved. For the second error, I'd suggest submitting a TACC ticket. I'm not at TACC anymore and don't currently have access to the systems to diagnose.
Those last two errors are from if statements that expect a variable called LAUNCHER_BIND to be non-null. They look harmless, but it also wouldn't be hard to rewrite them more defensively.
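For what it's worth, here is a minimal sketch of what a more defensive test could look like. This is not launcher's actual code: the helper name `bind_requested` is made up for illustration, and the assumption that LAUNCHER_BIND is compared against `1` is mine. The point is only that quoting the expansion and giving it a default keeps the test well-formed when the variable is unset or empty.

```shell
#!/bin/bash
# Hypothetical helper, not taken from launcher's source.
# Assumption: binding is requested when LAUNCHER_BIND equals 1.
bind_requested() {
  # ${LAUNCHER_BIND:-0} falls back to 0 when the variable is unset or empty,
  # and the quotes keep the test a valid two-operand comparison for any value.
  [ "${LAUNCHER_BIND:-0}" = "1" ]
}

if bind_requested; then
  echo "binding requested"
else
  echo "binding not requested"
fi
```

With this form, an unset LAUNCHER_BIND simply takes the "not requested" branch instead of producing a test error on stderr.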
Should the environment variables be set up differently for Stampede2? Supposedly each node has 63 cores. Normally I define LAUNCHER_PPN to be the number of processes to run simultaneously on a node, but I'm seeing weird behavior. I run with LAUNCHER_PPN=8, connect to the node and run
Now if I set LAUNCHER_PPN=40, then I have 40
It shouldn't be an I/O thing because the files that If I run a single
That looks suspiciously like an On Stampede2, the normal queue has Intel Xeon Phi processors with 68 cores. The
I tried on the Skylake nodes and it works as expected, with 8 parallel processes each using 400% CPU. So the issue does seem specific to the KNL nodes.
Also tried
I'm porting my applications from TACC's Stampede to the Stampede2 system. I'm using launcher 3.0.1 and getting these errors on stderr:
and stdout seems to indicate a problem talking to the task server