
Seg fault when running hcl.build() #472

Open
sjz38 opened this issue Sep 16, 2022 · 3 comments

sjz38 commented Sep 16, 2022

In the HCL Ultranet implementation https://github.com/sjz38/my_ultranet/tree/yolo_conv, I modified the convolution function I created to support a bias term (so that the convolution in the YOLO layer could be moved to the FPGA/hardware side). After making these modifications, I verified on CPU that the implementation produces the correct results. However, when I run hls_test.py to generate the HLS CPP code from the HCL code, it quite often segfaults (though it sometimes runs successfully). I was able to isolate the seg fault to the hcl.build() call on line 219 (https://github.com/sjz38/my_ultranet/blob/yolo_conv/hls_test.py#L219). To narrow down the issue further, I used gdb to get a stack trace of the code execution and got the following error:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
std::string::compare (this=<optimized out>, __s=0x7fff898ed995 "_top")
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:1422
1422    /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc: No such file or directory.
Missing separate debuginfos, use: debuginfo-install libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64

It seems as though the crash happens in a std::string::compare call (whose libstdc++ source file gdb cannot locate, hence the "No such file or directory" message). Looking at the stack trace below, this call appears to come from an HCL function that invokes TVM::Schedule::move_to.

#0  std::string::compare (this=<optimized out>, __s=0x7fff898ed995 "_top")
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/build/build-cc-gcc-final/x86_64-conda_cos6-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:1422
#1  0x00007fff86f4dffe in TVM::Schedule::move_to(TVM::Tensor const&, TVM::Stage, Halide::Internal::DeviceType, Halide::Internal::StreamType, int, TVM::Array<Halide::Expr, void>) ()
   from /home/sjz38/.local/lib/python3.7/site-packages/heterocl-0.3-py3.7.egg/lib/libhcl.so
#2  0x00007fff86f91cd7 in std::_Function_handler<void (TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*), TVM::{lambda(TVM::runtime::TVMArgs, TVM::runtime::TVMRetValue*)#59}>::_M_invoke(std::_Any_data const&, TVM::runtime::TVMArgs&&, TVM::runtime::TVMRetValue*&&) () from /home/sjz38/.local/lib/python3.7/site-packages/heterocl-0.3-py3.7.egg/lib/libhcl.so
#3  0x00007fff87263f08 in HCLTVMFuncCall ()
   from /home/sjz38/.local/lib/python3.7/site-packages/heterocl-0.3-py3.7.egg/lib/libhcl.so

I am not sure what the solution to this issue is. Any advice would be much appreciated!
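Not HeteroCL-specific, but in case it helps anyone reproducing this: the standard-library faulthandler module can report which Python line was executing when native code segfaults, which complements the gdb backtrace above. A minimal sketch (the reference to the hcl.build() call site is illustrative):

```python
import faulthandler
import sys

# Dump Python tracebacks for all threads to stderr when the process
# receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... build script continues; if native code crashes (e.g. inside
# hcl.build()), the handler prints the Python frame that was active,
# such as hls_test.py:219.
print(faulthandler.is_enabled())  # -> True
```

Alternatively, running the script as python3 -X faulthandler hls_test.py enables the same handler without modifying the code.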

hecmay (Collaborator) commented Sep 17, 2022

@sjz38
Hi Stephen, thanks for posting the backtrace here. That's very helpful for pinpointing the issue.

You are right. It seems that some illegal string operation happened inside TVM::Schedule::move_to. Can you please also run export HCL_DEBUG_ON=1 and re-run the program? This environment variable enables some additional debugging information to be printed to stdout.
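For reference, a sketch of the suggested rerun; the python3 -c line below stands in for python3 hls_test.py (the reporter's script) and just confirms the variable is inherited by child processes:

```shell
# Enable HeteroCL's extra debug logging for this shell session.
export HCL_DEBUG_ON=1

# Confirm the variable reaches child Python processes
# (stand-in for: python3 hls_test.py).
python3 -c 'import os; print(os.environ.get("HCL_DEBUG_ON"))'  # prints: 1
```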

sjz38 (Author) commented Sep 21, 2022

When I use export HCL_DEBUG_ON=1, the stack trace is the same, but the stdout I get when running hls_test.py is the following. Is this what you were looking for?

(unet_env) sjz38@zhang-x1:.../sjz38/new/my_ultranet$ python3 hls_test.py 
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv1_line_buffer, 0x55cd548d0530)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv1, 0x55cd548d6b60)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv1_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv2_line_buffer, 0x55cd548cf7f0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv2, 0x55cd548d8d70)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv2_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv3_line_buffer, 0x55cd548e55b0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv3, 0x55cd548d98c0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv3_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv4_line_buffer, 0x55cd548e6d60)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv4, 0x55cd548db290)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv4_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv5_line_buffer, 0x55cd548e8570)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv5, 0x55cd548dd550)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv5_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv6_line_buffer, 0x55cd548e9d80)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv6, 0x55cd548dee40)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv6_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv7_line_buffer, 0x55cd548eb590)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv7, 0x55cd548e0bd0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv7_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv8_line_buffer, 0x55cd548ecff0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv8, 0x55cd548e21a0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:280: Tensor conv8_pad has more than one consumers. Start multi-casting...
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm1, 0x55cd548d75a0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool1_pad, 0x55cd548d7b50)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool1, 0x55cd548d7ec0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv2_pad, 0x55cd548d8560)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm2, 0x55cd548d32b0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool2_pad, 0x55cd548d2be0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool2, 0x55cd548cf020)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv3_pad, 0x55cd548d9200)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm3, 0x55cd548d9e30)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool3_pad, 0x55cd548da3e0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool3, 0x55cd548daae0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv4_pad, 0x55cd548db000)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm4, 0x55cd548db9f0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool4_pad, 0x55cd548dbfe0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(pool4, 0x55cd548dc620)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv5_pad, 0x55cd548d83e0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm5, 0x55cd548ddf80)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv6_pad, 0x55cd548de530)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm6, 0x55cd548df900)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv7_pad, 0x55cd548dfeb0)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm7, 0x55cd548e1390)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(conv8_pad, 0x55cd548da750)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:274: Consumer stage(batch_norm8, 0x55cd548e2c20)
[14:10:05] src/schedule/schedule_dataflow_rewrite.cc:502: Moving tensor input_image to Host...
Segmentation fault

hecmay (Collaborator) commented Sep 21, 2022

@sjz38 Thanks! It does not really give any more useful information, though. I will take a closer look and get back to you.
