
Einsum with certain tensors fails on Stampede2 with large number of processes #98

Open
HaoTy opened this issue May 26, 2020 · 0 comments

HaoTy commented May 26, 2020

The following code snippet fails with MPI exit code 11 on Stampede2 when run on 256 nodes with 64 processes per node (ppn):

    import ctf
    import numpy as np
    
    xq = ctf.astensor([-0.70710678-0.70710678j], dtype=np.complex128).reshape(1, 1, 1, 1)
    u = ctf.astensor([-1.+0.j, 0.+0.j, 0.+0.j, 0.83205029-0.5547002j], dtype=np.complex128).reshape(1, 2, 2, 1)
    s = ctf.astensor([0.84089642, 0.84089642], dtype=np.float64)
    result = ctf.einsum('abdi,isup,s->absdup', xq, u, s)

gives the following output:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 121038 RUNNING AT c405-013
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================
TACC:  MPI job exited with code: 11 
TACC:  Shutdown complete. Exiting. 

I believe this also leads to other unexpected behavior when the einsum is executed in the middle of a more complicated program, such as taking over 30 minutes (and being cancelled due to the time limit) for such small tensors, or producing the following error:

c405-041.stampede2.tacc.utexas.edu.75166Received eager message(s) ptype=0x1 opcode=0xcc from an unknown process (err=49)

The program behaved as expected, however, when run with either 128 nodes at 64 ppn or 256 nodes at 64 ppn.
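For reference, the same contraction computes fine on a single process when NumPy's `einsum` is substituted for CTF's, which suggests the problem is in the distributed execution rather than in the contraction itself. A minimal single-process sketch (NumPy stands in for `ctf` here; the inputs are the ones from the snippet above):

```python
import numpy as np

# Same inputs as in the failing CTF snippet
xq = np.array([-0.70710678 - 0.70710678j], dtype=np.complex128).reshape(1, 1, 1, 1)
u = np.array([-1. + 0.j, 0. + 0.j, 0. + 0.j, 0.83205029 - 0.5547002j],
             dtype=np.complex128).reshape(1, 2, 2, 1)
s = np.array([0.84089642, 0.84089642], dtype=np.float64)

# The contraction introduces no summed index except i (size 1);
# the output is simply an outer-product-like tensor of shape
# (a, b, s, d, u, p) = (1, 1, 2, 1, 2, 1).
result = np.einsum('abdi,isup,s->absdup', xq, u, s)
print(result.shape)  # (1, 1, 2, 1, 2, 1)
```

With only eight complex entries in the output, the size of the tensors clearly cannot explain a 30-minute hang at scale.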
