segfault executing sparse inner product #138

rohany · 2022-01-25T00:38:34Z

The following code raises different segfaults depending on the process count (on a single node) when run on the nell-2 tensor.

void innerprod(int nIter, int warmup, std::string filename, std::string tensorC, std::vector<int> dims, World& dw) {
  Tensor<double> B(3, true /* is_sparse */, dims.data(), dw);
  Tensor<double> C(3, true /* is_sparse */, dims.data(), dw);
  Scalar<double> a(dw);

  B.read_sparse_from_file(filename.c_str());
  C.read_sparse_from_file(filename.c_str());
  
  a[""] = B["ijk"] * C["ijk"];
}

When run with a single process, it segfaults with the following backtrace:

/g/g15/yadav2/ctf/src/redistribution/sparse_rw.cxx:948 (discriminator 7)
/g/g15/yadav2/ctf/src/tensor/untyped_tensor.cxx:1302
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:609
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:940
/g/g15/yadav2/ctf/examples/../include/../src/interface/tensor.cxx:952
/g/g15/yadav2/ctf/examples/spbench.cxx:199
/g/g15/yadav2/ctf/examples/spbench.cxx:317 (discriminator 7)

When run with 40 processes (1 process per core on my system), it segfaults with the following backtrace:

/g/g15/yadav2/ctf/src/contraction/contraction.cxx:119 (discriminator 3)
/g/g15/yadav2/ctf/src/interface/term.cxx:983
/g/g15/yadav2/ctf/src/interface/idx_tensor.cxx:227
/g/g15/yadav2/ctf/examples/../include/../src/interface/idx_tensor.h:262
/g/g15/yadav2/ctf/examples/spbench.cxx:209 (discriminator 6)
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:299
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:687
/g/g15/yadav2/ctf/examples/spbench.cxx:9 (discriminator 2)
/g/g15/yadav2/ctf/examples/spbench.cxx:208 (discriminator 1)
/g/g15/yadav2/ctf/examples/spbench.cxx:317 (discriminator 7)
??:0
??:0

Both of the "segfaults" appear to be internal assertion failures.


raghavendrak · 2022-02-02T19:29:54Z

The nell-2.tensor dimensions specified are 12092 x 9184 x 28818. I am assuming you are using the same values in dims.data(), but if you look at the indices in the tensor file, there are entries with index 28818. Using 12093 x 9185 x 28819 will fix this.
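
For anyone who wants to check this before loading a file, here is a minimal sketch in plain C++ (not CTF code). It scans coordinate text and tests whether the declared dims can hold the largest index in each mode when indices are treated as zero-based. The helper names max_coords and dims_fit_zero_based are hypothetical, and the input layout (whitespace-separated coordinates followed by a value, with '#' starting a comment line) is an assumption about the .tns format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Scan .tns-style text ("i j k value" per line) and return the largest
// coordinate seen in each of the `order` modes (-1 if a mode has no entries).
// Hypothetical helper; not part of CTF.
std::vector<int64_t> max_coords(const std::string& tns_text, int order) {
  assert(order > 0);
  std::vector<int64_t> max_idx(order, -1);
  std::istringstream in(tns_text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.empty() || line[0] == '#') continue;  // skip blanks and comments
    std::istringstream fields(line);
    for (int m = 0; m < order; ++m) {
      int64_t idx;
      if (!(fields >> idx)) break;  // malformed line: ignore the remainder
      if (idx > max_idx[m]) max_idx[m] = idx;
    }
  }
  return max_idx;
}

// If coordinates are interpreted as zero-based, a mode of length n only
// admits indices 0..n-1, so every mode needs length >= max index + 1.
bool dims_fit_zero_based(const std::vector<int64_t>& max_idx,
                         const std::vector<int64_t>& dims) {
  if (max_idx.size() != dims.size()) return false;
  for (std::size_t m = 0; m < dims.size(); ++m) {
    if (max_idx[m] >= dims[m]) return false;
  }
  return true;
}
```

With the numbers from this thread, a largest third-mode index of 28818 does not fit a mode of length 28818 but does fit one of length 28819.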

rohany · 2022-02-02T19:36:06Z

The .tns file format encodes all tensor indices in 1-indexed format. Does the CTF read operation assume they are zero indexed?

solomonik · 2022-02-02T19:40:57Z

Yes, I think the documentation is consistent with that.

solomonik · 2022-02-02T19:42:19Z

One fix is to read a tensor with dims larger by 1 and then take a slice starting from 1. I think we did that as a preprocessing step to get results elsewhere.
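
To illustrate what that workaround does to the coordinates, here is a small sketch on an in-memory coordinate list. It is plain C++ rather than the CTF slice call; Entry and slice_from_one are hypothetical names, and the order is fixed at 3 for brevity. The idea is that a 1-indexed file read into a tensor with dims + 1 leaves the index-0 planes empty, and slicing from 1 in every mode recovers a zero-based tensor with the original dims.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One stored non-zero of an order-3 tensor (hypothetical type, not CTF's).
struct Entry {
  std::array<int64_t, 3> idx;
  double val;
};

// Emulate "read with dims + 1, then slice starting from 1 in every mode":
// entries lying on an index-0 plane fall outside the slice, and all other
// coordinates shift down by one so the result is zero-based.
std::vector<Entry> slice_from_one(const std::vector<Entry>& padded) {
  std::vector<Entry> out;
  out.reserve(padded.size());
  for (const Entry& e : padded) {
    bool inside = true;
    for (int64_t i : e.idx) {
      assert(i >= 0);  // coordinates are expected to be non-negative
      if (i < 1) inside = false;
    }
    if (!inside) continue;
    Entry shifted = e;
    for (int64_t& i : shifted.idx) i -= 1;
    out.push_back(shifted);
  }
  return out;
}
```

For a file whose coordinates all start at 1, no entry is dropped and each coordinate simply decreases by one.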

rohany · 2022-02-02T20:29:28Z

If ctf should be reading in the coordinates correctly, why is incrementing the dimensions necessary? Either way, I’ll give it a try.

solomonik · 2022-02-02T20:36:18Z

tns files are just one standard

rohany · 2022-02-03T19:39:18Z

I tested this by incrementing all of my dimensions by 1, and I'm still running into a segfault on both 1 and 40 processes.

raghavendrak · 2022-02-03T21:59:01Z

Are you seeing the segfault when reading the tensor?
(I tried running your code, and we were able to read the tensors on 1 process).

rohany · 2022-02-03T22:20:36Z

No, it seems to be after the tensors load.

To replicate my exact setup, try running this code: https://github.com/rohany/ctf/blob/master/examples/spbench.cxx (and edit line 287 to be dims.push_back(atoi(it.c_str()) + 1);).

Then, run the binary with arguments:
spbench -tensor <path to tns> -dims 12092,9184,28818 -n 20 -warmup 10 -bench spinnerprod -tensorC <path to tns>

raghavendrak · 2022-02-04T22:26:45Z

The segmentation fault is because CTF runs out of memory for the contraction. Can you try higher node counts?
Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).
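
As a sanity check on that equivalence, here is a minimal sketch in plain C++ (not CTF code) of a sparse inner product next to a Frobenius norm. SparseTensor, inner_product and frobenius_norm are hypothetical names for illustration. Note that the inner product of B with itself equals the square of the Frobenius norm, so the norm would need to be squared to match a[""] = B["ijk"] * B["ijk"].

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <tuple>

// Order-3 sparse tensor stored as coordinate -> value (illustrative only).
using Coord = std::tuple<int64_t, int64_t, int64_t>;
using SparseTensor = std::map<Coord, double>;

// sum over ijk of B(i,j,k) * C(i,j,k). Only coordinates stored in both
// operands contribute, so iterate the smaller map and probe the larger.
double inner_product(const SparseTensor& B, const SparseTensor& C) {
  const SparseTensor& small = (B.size() <= C.size()) ? B : C;
  const SparseTensor& large = (B.size() <= C.size()) ? C : B;
  double sum = 0.0;
  for (const auto& kv : small) {
    auto it = large.find(kv.first);
    if (it != large.end()) sum += kv.second * it->second;
  }
  return sum;
}

// Frobenius norm: square root of the sum of squared stored values.
double frobenius_norm(const SparseTensor& B) {
  double sq = 0.0;
  for (const auto& kv : B) sq += kv.second * kv.second;
  assert(sq >= 0.0);  // a sum of squares is never negative
  return std::sqrt(sq);
}
```

When B and C differ, only the overlap of their non-zero patterns matters, which is why the norm shortcut applies only to the B == C case.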

rohany · 2022-02-05T01:27:22Z

I'm skeptical that memory usage is the problem (I usually get a signal 9 from the job scheduler when a process OOMs). I tried running with up to 8 nodes and saw segfaults each time.

Also, this specific operation (if B == C) can be achieved by computing the Frobenius norm, i.e., B.norm2(norm).

I'm running when B != C.

raghavendrak · 2022-02-05T01:46:29Z

CTF calculates the memory usage a priori. If the contraction cannot be performed, then an assert is triggered and the computation is aborted. Can you recompile and run your code with -DDEBUG=4 and -DVERBOSE=4?
I was under the assumption that both B and C are loaded with the same tensor (filename.c_str()) (based on your code mentioned first here) [nell-2 tensor].

rohany · 2022-02-05T04:52:57Z

I don't see any interesting output with those flags on.

The output before the crash is:

CTF: Running with 4 threads
CTF: Total amount of memory available to process 0 is 170956357632
12093
9185
28819
debug:untyped_tensor.cxx:440 Created order 3 tensor ETXS03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 3 tensor AILI03, is_sparse = 1, allocated = 1
debug:untyped_tensor.cxx:440 Created order 0 tensor OBNO00, is_sparse = 0, allocated = 1
New tensor OBNO00 defined of size 1 elms (8 bytes):
printing lens of dense tensor OBNO00:
printing mapping of dense tensor OBNO00
CTF: OBNO00 mapped to order 4 topology with dims: 2  2  2  5
CTF: Tensor mapping is OBNO00[]
printing mapping of sparse tensor ETXS03
CTF: ETXS03 mapped to order 3 topology with dims: 10  2  2
CTF: Tensor mapping is ETXS03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.
printing mapping of sparse tensor AILI03
CTF: AILI03 mapped to order 3 topology with dims: 10  2  2
CTF: Tensor mapping is AILI03[p2(1)c0,p2(2)c0,p10(0)c0]
Read 76879419 non-zero entries from the file.

and the backtrace is

/g/g15/yadav2/ctf/src/contraction/contraction.cxx:119 (discriminator 3)
/g/g15/yadav2/ctf/src/interface/term.cxx:983
/g/g15/yadav2/ctf/src/interface/idx_tensor.cxx:227
/g/g15/yadav2/ctf/examples/../include/../src/interface/idx_tensor.h:262
/g/g15/yadav2/ctf/examples/spbench.cxx:209 (discriminator 6)
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:299
/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/include/c++/8/bits/std_function.h:687
/g/g15/yadav2/ctf/examples/spbench.cxx:9 (discriminator 2)
/g/g15/yadav2/ctf/examples/spbench.cxx:208 (discriminator 1)
/g/g15/yadav2/ctf/examples/spbench.cxx:323 (discriminator 7)
??:0
??:0

I was under the assumption that both B and C are loaded with the same tensor (filename.c_str()) (based on your code mentioned first here) [nell-2 tensor].

That was a typo. The load of C should have used a different input filename.

raghavendrak · 2022-02-05T05:45:24Z

So if I have to reproduce this, what are the two tensor files I need to use?
(I see that both tensors ETXS03 and AILI03 have the same number of non-zero entries: 76879419?)

rohany · 2022-02-05T05:52:41Z

I'm currently running it with the same tensor files (nell-2 and nell-2), but I aim to use it for different tensor files once we can resolve the segfault.

raghavendrak · 2022-02-11T02:42:43Z

CTF runs out of memory for this contraction (with the nell-2 tensor as input for both B and C). I tried up to 128 nodes with no luck. There is also a possibility of a bug in CTF. With -DDEBUG=4 and -DVERBOSE=4 you should be able to see output similar to the following:

debug:contraction.cxx:2942 [EXH] Not enough memory available for topo 2047 with order 1 memory 1778101471/1183301216
ERROR: Failed to map contraction!

rohany · 2022-02-11T03:59:17Z

Is this related to the shape of the tensor, or will tensors of similar or greater size also fail? Specifically, what about the other larger tensors in the FROSTT suite?

raghavendrak · 2022-02-13T02:18:00Z

My guess is that it has to do with the size and the contraction type. We might have to try other tensors with this contraction before drawing a conclusion.

rohany mentioned this issue Feb 14, 2022

oom/memory corruption running an SDDMM (using TTTP specialized routine) #139

Closed
