launcher for stampede2 #47

Open
schristley opened this issue Jan 16, 2018 · 7 comments

Comments

@schristley

I'm porting my applications from TACC's Stampede system to Stampede2. I'm using launcher 3.0.1 and getting these errors on stderr:

Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.
Ncat: Invalid -d delay "c405-132" (must be greater than 0). QUITTING.

and stdout seems to indicate a problem talking to the task server:

------------- SUMMARY ---------------
   Number of hosts:    1
   Working directory:  /scratch/01114/vdj/vdj/job-59884011666018791-242ac11c-0001-007-igblast_test
   Processes per host: 3
   Total processes:    3
   Total jobs:         3
   Scheduling method:  dynamic

-------------------------------------
Launcher: Starting parallel tasks...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...
WARNING: No response from dynamic task server. Retrying...

@schristley
Author

I tried using the system module instead, which seems to be a more recent version, and that is working better: the jobs are running now. I'm still getting a couple of errors, but I'm not sure whether they affect anything.

/opt/apps/launcher/launcher-3.1/paramrun: line 171: [: -eq: unary operator expected
/opt/apps/launcher/launcher-3.1/paramrun: line 211: [: -eq: unary operator expected

@lwilson
Contributor

lwilson commented Jan 17, 2018

The first issue is related to a change in netcat, which was noticed on LS5 and is now the case on S2. I believe the current master branch has this resolved.

For the second error, I'd suggest submitting a TACC ticket. I'm not at TACC anymore and don't currently have access to the systems to diagnose.
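
For context on the netcat change: `Ncat: Invalid -d delay "c405-132"` is what Ncat prints when the word after `-d` is not a number, because Ncat's `-d` takes a delay value, while in BSD-style `nc` the `-d` flag takes no argument and means "do not read from stdin". Below is a minimal sketch of telling the two apart before choosing flags. The function names are hypothetical, the banner is assumed to come from something like `nc --version 2>&1`, and treating `--recv-only` as the Ncat counterpart of BSD's `-d` is an assumption (the two are close, not identical). This is not necessarily how the master branch resolves it.

```shell
# Hypothetical helpers; names are illustrative and not part of launcher.

# Classify a netcat banner, e.g. text captured from `nc --version 2>&1`.
nc_flavor() {
  case "$1" in
    *Ncat*) printf '%s\n' ncat ;;
    *)      printf '%s\n' other ;;
  esac
}

# Choose an option meaning roughly "do not send stdin to the peer".
# Assumption: --recv-only is an acceptable Ncat stand-in for BSD nc's -d.
nc_no_stdin_opt() {
  if [ "$(nc_flavor "$1")" = ncat ]; then
    printf '%s\n' --recv-only
  else
    printf '%s\n' -d
  fi
}
```

A caller could then build its command line from `nc_no_stdin_opt "$banner"` instead of hard-coding `-d`.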

@johnfonner
Contributor

Those last two errors are from if statements that expect a variable called LAUNCHER_BIND to be non-null. They look harmless, but it also wouldn't be hard to rewrite them more defensively.
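
A minimal sketch of a more defensive form, assuming the failing tests look roughly like `[ $LAUNCHER_BIND -eq 1 ]` (the actual lines in `paramrun` are not reproduced here). When `LAUNCHER_BIND` is unset, the unquoted expansion disappears and `[` sees only `-eq 1`, which is what triggers "unary operator expected". Quoting the expansion and giving it a default avoids that. The function name `bind_enabled` is hypothetical.

```shell
# Hypothetical helper: succeeds only when LAUNCHER_BIND is set to 1.
bind_enabled() {
  # ${LAUNCHER_BIND:-0} expands to 0 when the variable is unset or empty,
  # so the test always receives an operand on both sides of -eq.
  [ "${LAUNCHER_BIND:-0}" -eq 1 ]
}
```

With this form, `if bind_enabled; then ...; fi` quietly skips the branch when the variable is unset instead of printing an error.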

@schristley
Author

Should the environment variables be set up differently for stampede2? Supposedly each node has 63 cores.

Normally I define LAUNCHER_PPN to be the number of processes to run simultaneously on a node, but I'm seeing weird behavior. I run with LAUNCHER_PPN=8, connect to the node and run top, and it shows each igblastn process using about 50% CPU. Here is a snapshot:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
237141 vdj       20   0  415464  68828  15604 S  57.8  0.1   1:40.66 igblastn
237079 vdj       20   0  415616  71296  16008 S  56.9  0.1   1:53.81 igblastn
237156 vdj       20   0  415584  68968  15928 S  56.9  0.1   1:23.45 igblastn
237125 vdj       20   0  415452  68636  15644 S  56.6  0.1   1:47.74 igblastn
237109 vdj       20   0  415516  74856  15808 S  55.9  0.1   1:50.45 igblastn
237033 vdj       20   0  415584  71752  15876 S  51.6  0.1   2:27.99 igblastn
237298 vdj       20   0  415572  64628  15556 S  51.6  0.1   0:14.28 igblastn

Now if I set LAUNCHER_PPN=40, then I have 40 igblastn processes but they are only using about 10% CPU each?! It's like they are throttled: the CPU% is about 5x less, the same multiple that I increased LAUNCHER_PPN by.

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
180938 vdj       20   0  415532  73020  15668 S  10.9  0.1   1:16.09 igblastn
180847 vdj       20   0  415488  59732  15808 S  10.6  0.1   1:16.03 igblastn
180861 vdj       20   0  415584  72248  15812 S  10.6  0.1   1:15.88 igblastn
180866 vdj       20   0  415540  67872  15796 S  10.6  0.1   1:15.58 igblastn
180899 vdj       20   0  415520  59044  15808 S  10.6  0.1   1:15.82 igblastn
180903 vdj       20   0  415692  67168  15884 S  10.6  0.1   1:15.58 igblastn
180912 vdj       20   0  415648  67900  15808 S  10.6  0.1   1:16.68 igblastn

It shouldn't be an I/O thing because the files that igblastn processes are small, ~3MB input and ~40MB output.

If I run a single igblastn, it uses 400% CPU, i.e. 8x faster than LAUNCHER_PPN=8.
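
One way to read these snapshots is to multiply the process count by the approximate per-process CPU share. The sketch below uses the rounded figures quoted in this comment (one process at 400%, 8 at about 50%, 40 at about 10%); `aggregate_cpu` is a hypothetical helper, not part of launcher.

```shell
# Hypothetical helper: total CPU percent for N processes
# that each use the given per-process percent.
aggregate_cpu() {
  echo $(( $1 * $2 ))
}

aggregate_cpu 1 400   # single igblastn
aggregate_cpu 8 50    # LAUNCHER_PPN=8
aggregate_cpu 40 10   # LAUNCHER_PPN=40
```

All three cases come out at 400% in total, which is consistent with the tasks sharing about four cores' worth of CPU no matter how many are started. That is only an observation from rounded numbers, not a diagnosis.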

@johnfonner
Contributor

That looks suspiciously like an igblastn-specific thing. Are you manually setting -num_threads? It looks like igblast uses 4 threads by default, which explains why a single igblastn is using 400% CPU.

On Stampede2, the normal queue has Intel Xeon Phi processors with 68 cores. The skx-normal queue has Skylake nodes with 48 cores. Maybe setting LAUNCHER_BIND=1 on the Xeon Phi nodes will help. Launcher isn't throttling the CPU, but depending on how the tasks are being distributed on the processor, it could be exposing bottlenecks in memory or something. Do you see the same thing on the Skylake nodes?
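
To put the thread and core counts side by side, here is a rough check of whether `LAUNCHER_PPN` tasks times the threads per task fit within a node's cores. It assumes each igblastn keeps the 4-thread default mentioned above and uses the core counts quoted for the two queues; `fits_on_node` is a hypothetical helper, not part of launcher.

```shell
# Hypothetical helper: succeeds when ppn * threads_per_task <= cores.
# Arguments: cores per node, LAUNCHER_PPN, threads per task.
fits_on_node() {
  cores=$1
  ppn=$2
  threads=$3
  [ $(( ppn * threads )) -le "$cores" ]
}

# 8 tasks x 4 threads = 32 threads on a 68-core Xeon Phi node
fits_on_node 68 8 4 && echo "PPN=8 fits on 68 cores"
# 40 tasks x 4 threads = 160 threads on the same node
fits_on_node 68 40 4 || echo "PPN=40 oversubscribes 68 cores"
# 8 tasks x 4 threads = 32 threads on a 48-core Skylake node
fits_on_node 48 8 4 && echo "PPN=8 fits on 48 cores"
```

By this count, 8 tasks of 4 threads each fit on either node type, so simple oversubscription would not by itself account for the 50% figure at `LAUNCHER_PPN=8`. Lowering `-num_threads` would reduce the total for the 40-task case.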

@schristley
Author

I tried on the Skylake nodes and it works as expected: with 8 parallel processes, each is using 400% CPU. So the issue does seem specific to the KNL nodes.

@schristley
Author

I also tried LAUNCHER_BIND=1 on the KNL nodes, but it produces errors and igblastn isn't even run.
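
Without working binding, one way to see where the tasks actually land on a KNL node is to read each process's allowed-CPU list from `/proc`. This is a sketch assuming a Linux node where `/proc/<pid>/status` has a `Cpus_allowed_list` line; `cpus_allowed` is a hypothetical helper, not part of launcher.

```shell
# Hypothetical helper: print the Cpus_allowed_list value
# from /proc/<pid>/status text supplied on stdin.
cpus_allowed() {
  awk '/^Cpus_allowed_list:/ { print $2 }'
}
```

For example, `cpus_allowed < "/proc/$pid/status"` for each igblastn PID shown by top. If every task reports the same small set of cores, that would match the behavior seen in top; if they report the full range, the cause lies elsewhere.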
