segfault executing sparse inner product #138
Comments
The |
The .tns file format encodes all tensor indices as 1-indexed. Does the CTF read operation assume they are zero-indexed? |
Yes, and I think the documentation is consistent with that. |
One fix is to read the tensor with each dimension larger by 1 and take a slice starting from index 1. I think we did that as a preprocessing step to get results elsewhere. |
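The pad-and-slice workaround has the same effect as shifting every coordinate down by one before the tensor is handed to CTF. Below is a minimal sketch of that shift in plain Python. It assumes the usual FROSTT text layout (whitespace-separated 1-based indices followed by a value, with `#` starting a comment line). `parse_tns_zero_based` and `infer_dims` are hypothetical helper names for illustration, not part of CTF.

```python
def parse_tns_zero_based(lines):
    """Parse FROSTT-style .tns lines into zero-based (coords, value) pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        *idx, val = line.split()
        # Shift each 1-based index down by one.
        coords = tuple(int(i) - 1 for i in idx)
        if min(coords) < 0:
            raise ValueError(f"expected 1-based indices, got: {line!r}")
        entries.append((coords, float(val)))
    return entries


def infer_dims(entries):
    """Smallest dimensions that hold every zero-based coordinate."""
    order = len(entries[0][0])
    return tuple(max(c[m] for c, _ in entries) + 1 for m in range(order))


# Tiny made-up sample, not taken from nell-2.
sample = ["# i j k value", "1 1 1 2.0", "3 2 1 0.5"]
entries = parse_tns_zero_based(sample)
print(entries)
print(infer_dims(entries))
```

With the coordinates shifted this way, the tensor can keep its original dimensions, so no extra slice is needed afterwards.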
If CTF should be reading in the coordinates correctly, why is incrementing the dimensions necessary? Either way, I’ll give it a try. |
tns files are just one standard |
I tested this by incrementing all of my dimensions by 1, and I'm still running into a segfault with both 1 and 40 processes. |
Are you seeing the segfault when reading the tensor? |
No, it seems to be after the tensors load. To replicate my exact setup, try running this code: https://github.com/rohany/ctf/blob/master/examples/spbench.cxx (and edit line 287 to be Then, run the binary with arguments: |
The segmentation fault is because CTF runs out of memory for the contraction. Can you try higher node counts? |
I'm skeptical that memory usage is the problem (I usually get a signal 9 from the job scheduler when a process OOMs). I tried running with up to 8 nodes and saw segfaults each time.
I'm running when B != C. |
CTF calculates the memory usage a priori. If the contraction cannot be performed then an assert is triggered and the computation is aborted. Can you recompile and run your code with |
I don't see any interesting output with those flags on. The output before the crash is:
and the backtrace is
That was a typo. The load of C should have used a different input filename. |
So if I have to reproduce this, what are the two tensor files I need to use? |
I'm currently running it with the same tensor files (nell-2 and nell-2), but I aim to use it for different tensor files once we can resolve the segfault. |
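For checking results on a small sample once the crash is resolved, a reference computation can help. The sketch below assumes the contraction in question is the full inner product named in the issue title, i.e. the sum of products of values at coordinates present in both tensors. It is plain Python over coordinate dictionaries and does not use CTF; `sparse_inner_product` is a hypothetical name.

```python
def sparse_inner_product(b, c):
    """Sum of b[k] * c[k] over coordinates k present in both tensors.

    Each tensor is a dict mapping a coordinate tuple to its value.
    """
    if len(c) < len(b):
        b, c = c, b  # iterate over the smaller operand
    return sum(v * c[k] for k, v in b.items() if k in c)


# Tiny made-up tensors, not taken from nell-2.
b = {(0, 0, 0): 2.0, (2, 1, 0): 0.5}
c = {(0, 0, 0): 3.0, (1, 1, 0): 4.0}

print(sparse_inner_product(b, c))  # only (0, 0, 0) overlaps: 2.0 * 3.0
print(sparse_inner_product(b, b))  # same tensor twice: sum of squared values
```

When both inputs are the same file, as in the nell-2 and nell-2 run, the result is simply the sum of squared values, which gives a cheap independent check.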
CTF runs out of memory for this contraction (with
|
Is this related to the shape of the tensor, or will tensors of similar or greater size also fail? Specifically, what about the other, larger tensors in the FROSTT suite? |
My guess is that it has to do with the size and the contraction type. We might have to try other tensors with this contraction before drawing a conclusion. |
The following code raises different segfaults depending on the process count (on a single node) when run on the nell-2 tensor.
When run with a single process, it segfaults with the following backtrace:
When run with 40 processes (1 process per core on my system), it segfaults with the following backtrace:
Both of the "segfaults" appear to be internal assertion failures.