omnitrace hangs before hostCallback function #307

Closed
jakub-homola opened this issue Sep 21, 2023 · 6 comments

Comments

@jakub-homola

jakub-homola commented Sep 21, 2023

Hello,

I am trying to trace my AMDGPU application with Omnitrace, but I am running into an issue with a host callback function. Using hipStreamAddCallback I submit a host function into a stream. Without Omnitrace, the program works as expected. But with Omnitrace, the program hangs and the host function is never launched.

Reproducer program:

#include <cstdio>
#include <hip/hip_runtime.h>

#define CHECK(status) do { check((status), __FILE__, __LINE__); } while(false)
inline static void check(hipError_t error_code, const char *file, int line)
{
    if (error_code != hipSuccess)
    {
        fprintf(stderr, "HIP Error %d %s: %s. In file '%s' on line %d\n", error_code, hipGetErrorName(error_code), hipGetErrorString(error_code), file, line);
        fflush(stderr);
        exit(error_code);
    }
}

__global__ void dummy_kernel(int a)
{
    printf("I am dummy kernel %d\n", a);
}

int main()
{
    printf("AAA\n");
    CHECK(hipDeviceSynchronize());
    printf("BBB\n");
    dummy_kernel<<< 1,1 >>>(1);
    printf("CCC\n");
    CHECK(hipDeviceSynchronize());
    printf("DDD\n");
    CHECK(hipStreamAddCallback(0, [](hipStream_t stream_, hipError_t status_, void * arg){
        printf("I am host function\n");
    }, nullptr, 0));
    printf("EEE\n");
    CHECK(hipDeviceSynchronize());
    printf("FFF\n");
    dummy_kernel<<< 1,1 >>>(2);
    printf("GGG\n");
    CHECK(hipDeviceSynchronize());
    printf("HHH\n");

    return 0;
}

When running it without omnitrace, the program correctly outputs

AAA
BBB
CCC
I am dummy kernel 1
DDD
EEE
I am host function
FFF
GGG
I am dummy kernel 2
HHH

but with omnitrace, it only outputs

AAA
BBB
CCC
I am dummy kernel 1
DDD
EEE

and then hangs, seemingly forever, without printing anything else.
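For anyone reusing this reproducer in an automated test, a bounded wait can turn the hang into a reportable failure. The following is only a minimal sketch: `wait_until_done` is a hypothetical helper written for illustration (it is not part of HIP or Omnitrace), and the HIP usage in the trailing comment assumes that `hipStreamQuery` itself returns promptly under the profiler, which has not been verified here.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of HIP or Omnitrace): polls `is_done` until it
// returns true or `timeout` elapses. Returns true on completion, false on timeout.
template <typename Query>
bool wait_until_done(Query is_done,
                     std::chrono::milliseconds timeout,
                     std::chrono::milliseconds poll_interval = std::chrono::milliseconds(1))
{
    assert(timeout.count() >= 0 && "timeout must be non-negative");
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!is_done())
    {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // work still pending: report it instead of blocking
        std::this_thread::sleep_for(poll_interval);
    }
    return true;
}

// Intended use in the reproducer, in place of a blocking hipDeviceSynchronize():
//
//   bool ok = wait_until_done([] { return hipStreamQuery(0) == hipSuccess; },
//                             std::chrono::seconds(10));
//   if (!ok) { fprintf(stderr, "timed out waiting for stream 0\n"); return 1; }
```

With this approach any status other than hipSuccess (including a real error) is treated as "not finished yet", so the test fails only after the timeout. That keeps the sketch short but hides the specific error code.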

I am compiling the program using

hipcc -g -O2 source.hip.cpp -o program.x

And running it using

omnitrace-sample -- ./program.x

omnitrace-instrument seems to have the same problem.

I am on a LUMI-G compute node (MI250X), using rocm-5.2.3 (the only version properly supported there; module load LUMI/23.03 rocm/5.2.3).
I installed omnitrace using this guide, running the installation script and adding the appropriate directories to PATH and LD_LIBRARY_PATH.

$ omnitrace-sample --version
omnitrace-sample v1.10.2 (rev: 0b751d2aef7d32d8b4fab184d0b34d4013b6d986, tag: v1.10.2, compiler: GNU v7.5.0, rocm: v5.2.x)

In case I missed any details, please ask.

I would appreciate any help.

@jrmadsen
Collaborator

Could you provide a backtrace? There is usually one printed out when you hit Ctrl+C.

I suspect there is something funny going on in roctracer, which delivers callbacks to omnitrace about the HIP calls. Can you try disabling roctracer support and see if it still hangs? Could you also try running it with rocprof and seeing if it still hangs?

@jakub-homola
Author

Below is the full output of the program after hitting Ctrl+C.

Unfortunately, I don't have time to investigate the other things right now; I will get back to it on Monday.

$ omnitrace-sample -- ./program.x

HSA_TOOLS_LIB=/pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
OMNITRACE_CRITICAL_TRACE=false
OMNITRACE_USE_PROCESS_SAMPLING=false
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
ROCP_HSA_INTERCEPT=1
ROCP_TOOL_LIB=/pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2

[omnitrace][dl][11382] omnitrace_main
[omnitrace][11382][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.10.2 (rev: 0b751d2aef7d32d8b4fab184d0b34d4013b6d986, tag: v1.10.2, compiler: GNU v7.5.0, rocm: v5.2.x)
[761.763]       perfetto.cc:58656 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
AAA
BBB
CCC
I am dummy kernel 1
DDD
EEE
^C
[omnitrace][11382][0] Signal 2 caught : Interrupt (Signal sent by the kernel 0 0)

### ERROR ### [omnitrace][PID=11382][TID=0] signal=2 (SIGINT) interrupt program. code: 128
Backtrace:
[PID=11382][TID=0][0/7] __restore_rt
[PID=11382][TID=0][1/7] hsa_amd_image_get_info_max_dim +0x5e9
[PID=11382][TID=0][2/7] hsa_amd_image_get_info_max_dim +0x4ba
[PID=11382][TID=0][7/7] main +0xfa
[PID=11382][TID=0][8/7] omnitrace_main +0x3bd
[PID=11382][TID=0][9/7] __libc_start_main +0xef
[PID=11382][TID=0][10/7] _start +0x2a

Backtrace (demangled):
[PID=11382][TID=0][0/11] /lib64/libpthread.so.0(+0x168c0) [0x1531bc0268c0]
[PID=11382][TID=0][1/11] /opt/rocm/lib/libhsa-runtime64.so.1(+0x4ce49) [0x1531b26b2e49]
[PID=11382][TID=0][2/11] /opt/rocm/lib/libhsa-runtime64.so.1(+0x4cd1a) [0x1531b26b2d1a]
[PID=11382][TID=0][3/11] /opt/rocm/lib/libhsa-runtime64.so.1(+0x40ce9) [0x1531b26a6ce9]
[PID=11382][TID=0][4/11] /opt/rocm/lib/libamdhip64.so.5(+0x27f2cb) [0x1531bae0d2cb]
[PID=11382][TID=0][5/11] /opt/rocm/lib/libamdhip64.so.5(+0x26e6ea) [0x1531badfc6ea]
[PID=11382][TID=0][6/11] /opt/rocm/lib/libamdhip64.so.5(hipDeviceSynchronize+0xe9) [0x1531bac13e09]
[PID=11382][TID=0][7/11] ./program.x() [0x20cd6a]
[PID=11382][TID=0][8/11] /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2(+0x14f5d) [0x1531bc66af5d]
[PID=11382][TID=0][9/11] /lib64/libc.so.6(__libc_start_main+0xef) [0x1531ba3ab29d]
[PID=11382][TID=0][10/11] ./program.x() [0x20cb5a]

/proc/11382/maps:
    00200000-0020c000 r--p 00000000 a67:eeb30 144119912674099405             /pfs/lustrep1/users/homolaja/tests/hip_host_function/program.x
    0020c000-0020e000 r-xp 0000b000 a67:eeb30 144119912674099405             /pfs/lustrep1/users/homolaja/tests/hip_host_function/program.x
    0020e000-0020f000 r--p 0000c000 a67:eeb30 144119912674099405             /pfs/lustrep1/users/homolaja/tests/hip_host_function/program.x
    0020f000-00210000 rwxp 0000c000 a67:eeb30 144119912674099405             /pfs/lustrep1/users/homolaja/tests/hip_host_function/program.x
    00210000-021c0000 rw-p 00000000 00:00 0                                  [heap]
    152906710000-152914000000 rw-p 00000000 00:00 0
    152914000000-152914043000 rw-p 00000000 00:00 0
    152914043000-152918000000 ---p 00000000 00:00 0
    15291bc00000-15291fc2f000 rw-p 00000000 00:00 0
    15291fe00000-152923e2f000 rw-p 00000000 00:00 0
    152924000000-152924021000 rw-p 00000000 00:00 0
    152924021000-152928000000 ---p 00000000 00:00 0
    15292b200000-152a2b200000 ---p 00000000 00:00 0
    152a2b400000-152b2b400000 ---p 00000000 00:00 0
    152b2b600000-152c2b600000 ---p 00000000 00:00 0
    152c2b800000-152d2b800000 ---p 00000000 00:00 0
    152d2ba00000-152e2ba00000 ---p 00000000 00:00 0
    152e2bc00000-152f2bc00000 ---p 00000000 00:00 0
    152f2be00000-15302be00000 ---p 00000000 00:00 0
    15302c000000-15302c021000 rw-p 00000000 00:00 0
    15302c021000-153030000000 ---p 00000000 00:00 0
    153032800000-1530335d5000 rw-p 00000000 00:00 0
    153033600000-153035900000 ---p 10a186000 00:05 1606                      /dev/dri/renderD128
    153035900000-153133600000 ---p 00000000 00:00 0
    153138000000-153138021000 rw-p 00000000 00:00 0
    153138021000-15313c000000 ---p 00000000 00:00 0
    15313c5ff000-15313c600000 ---p 00000000 00:00 0
    15313c600000-15313c800000 rwxp 00000000 00:00 0
    15313c800000-15313ca00000 rw-s 10004c000 00:05 1649                      /dev/dri/renderD135
    15313cc00000-15313cd01000 rw-p 00000000 00:00 0
    15313ce00000-15313cf01000 rw-p 00000000 00:00 0
    15313d000000-15313d200000 rw-s 10004c000 00:05 1643                      /dev/dri/renderD134
    15313d400000-15313d501000 rw-p 00000000 00:00 0
    15313d600000-15313d701000 rw-p 00000000 00:00 0
    15313d7fe000-15313d7ff000 ---p 00000000 00:00 0
    15313d7ff000-15317bfff000 rw-p 00000000 00:00 0
    15317bfff000-15317c000000 ---p 00000000 00:00 0
    15317c000000-15317c021000 rw-p 00000000 00:00 0
    15317c021000-153180000000 ---p 00000000 00:00 0
    153180000000-153180021000 rw-p 00000000 00:00 0
    153180021000-153184000000 ---p 00000000 00:00 0
    153184200000-153184400000 rw-s 10004c000 00:05 1637                      /dev/dri/renderD133
    153184600000-153184701000 rw-p 00000000 00:00 0
    153184800000-153184901000 rw-p 00000000 00:00 0
    153184a00000-153184c00000 rw-s 10004c000 00:05 1631                      /dev/dri/renderD132
    153184e00000-153184f01000 rw-p 00000000 00:00 0
    153185000000-153185101000 rw-p 00000000 00:00 0
    153185200000-153185400000 rw-s 10004c000 00:05 1625                      /dev/dri/renderD131
    153185600000-153185701000 rw-p 00000000 00:00 0
    153185800000-153185901000 rw-p 00000000 00:00 0
    153185a00000-153185c00000 rw-s 10004c000 00:05 1619                      /dev/dri/renderD130
    153185e00000-153185f01000 rw-p 00000000 00:00 0
    153186000000-153186101000 rw-p 00000000 00:00 0
    153186200000-153186400000 rw-s 10004c000 00:05 1613                      /dev/dri/renderD129
    153186600000-153186701000 rw-p 00000000 00:00 0
    153186800000-153186901000 rw-p 00000000 00:00 0
    153186a00000-153186c00000 rw-s 100257000 00:05 1606                      /dev/dri/renderD128
    153186d00000-153186d80000 rw-p 00000000 00:00 0
    153186e00000-153186f01000 rw-p 00000000 00:00 0
    153187000000-153187101000 rw-p 00000000 00:00 0
    1531871a7000-1531871a8000 ---p 00000000 00:00 0
    1531871a8000-1531873a8000 rwxp 00000000 00:00 0
    1531873a8000-1531873eb000 r-xp 00000000 07:01 7774                       /opt/rocm-5.2.3/lib/libhsa-amd-aqlprofile64.so.1.0.50203
    1531873eb000-1531875eb000 ---p 00043000 07:01 7774                       /opt/rocm-5.2.3/lib/libhsa-amd-aqlprofile64.so.1.0.50203
    1531875eb000-1531875ee000 r--p 00043000 07:01 7774                       /opt/rocm-5.2.3/lib/libhsa-amd-aqlprofile64.so.1.0.50203
    1531875ee000-1531875fb000 rw-p 00046000 07:01 7774                       /opt/rocm-5.2.3/lib/libhsa-amd-aqlprofile64.so.1.0.50203
    1531875fb000-1531875fc000 ---p 00000000 00:00 0
    1531875fc000-1531877fc000 rwxp 00000000 00:00 0
    1531877fc000-1531877fd000 ---p 00000000 00:00 0
    1531877fd000-1531879fd000 rwxp 00000000 00:00 0
    1531879fd000-1531879fe000 ---p 00000000 00:00 0
    1531879fe000-153187dfe000 rw-p 00000000 00:00 0
    153187dfe000-153187dff000 ---p 00000000 00:00 0
    153187dff000-153187e00000 ---p 00000000 00:00 0
    153187e00000-153188000000 rwxp 00000000 00:00 0
    153188000000-153188021000 rw-p 00000000 00:00 0
    153188021000-15318c000000 ---p 00000000 00:00 0
    15318c000000-15318c021000 rw-p 00000000 00:00 0
    15318c021000-153190000000 ---p 00000000 00:00 0
    153190000000-15319002b000 rw-p 00000000 00:00 0
    15319002b000-153194000000 ---p 00000000 00:00 0
    15319406b000-15319406c000 ---p 00000000 00:00 0
    15319406c000-15319426c000 rwxp 00000000 00:00 0
    15319426c000-15319426d000 ---p 00000000 00:00 0
    15319426d000-15319446d000 rwxp 00000000 00:00 0
    15319446d000-1531959c0000 rw-p 00000000 00:00 0
    153195a1c000-153195a1d000 ---p 00000000 00:00 0
    153195a1d000-153195a60000 rwxp 00000000 00:00 0
    153195a60000-153195a64000 rw-p 00000000 00:00 0
    153195a68000-153195a6c000 rw-s 1093a9000 00:05 1606                      /dev/dri/renderD128
    153195a70000-153195a79000 rw-p 00000000 00:00 0
    153195a80000-153195ac0000 rw-p 00000000 00:00 0
    153195ac6000-153195ac8000 rw-p 00000000 00:00 0
    153195aca000-153195acb000 ---p 00000000 00:00 0
    153195acc000-153195acd000 rw-p 00000000 00:00 0
    153195ace000-153195acf000 rw-p 00000000 00:00 0
    153195ad0000-153195ad9000 rw-s 10938f000 00:05 1606                      /dev/dri/renderD128
    153195ada000-153195adb000 rw-p 00000000 00:00 0
    153195adc000-153195ade000 rw-p 00000000 00:00 0
    153195ae0000-153195ae1000 rw-p 00000000 00:00 0
    153195ae2000-153195ae3000 ---p 1052de000 00:05 1606                      /dev/dri/renderD128
    153195ae4000-153195ae5000 rw-p 00000000 00:00 0
    153195ae6000-153195ae7000 rw-p 00000000 00:00 0
    153195ae8000-153195ae9000 rw-p 00000000 00:00 0
    153195aea000-153195aec000 rw-s fd31c00000000000 00:05 1603               /dev/kfd
    153195aee000-153195aef000 ---p 101269000 00:05 1606                      /dev/dri/renderD128
    153195af0000-153195af1000 rw-p 00000000 00:00 0
    153195af2000-153195af3000 rw-p 00000000 00:00 0
    153195af4000-153195af5000 rw-p 00000000 00:00 0
    153195af6000-153195af7000 rw-p 00000000 00:00 0
    153195af8000-153195af9000 rw-p 00000000 00:00 0
    153195afa000-153195afb000 rw-p 00000000 00:00 0
    153195afc000-153195afd000 rw-p 00000000 00:00 0
    153195afe000-153195aff000 rw-p 00000000 00:00 0
    153195b00000-153195b08000 rw-s 100044000 00:05 1606                      /dev/dri/renderD128
    153195b0a000-153195b0b000 rw-p 00000000 00:00 0
    153195b0c000-153195b0d000 ---p 00000000 00:00 0
    153195b0d000-153195d0d000 rwxp 00000000 00:00 0
    153195d0d000-153195d0e000 ---p 00000000 00:00 0
    153195d0e000-153195f0e000 rwxp 00000000 00:00 0
    153195f0e000-1531ae593000 rw-p 00000000 00:00 0
    1531ae593000-1531ae617000 r-xp 00000000 07:01 7818                       /opt/rocm-5.2.3/lib/librocm_smi64.so.5.0.50203
    1531ae617000-1531ae817000 ---p 00084000 07:01 7818                       /opt/rocm-5.2.3/lib/librocm_smi64.so.5.0.50203
    1531ae817000-1531ae81a000 rwxp 00084000 07:01 7818                       /opt/rocm-5.2.3/lib/librocm_smi64.so.5.0.50203
    1531ae81a000-1531ae81b000 rwxp 00000000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531ae81b000-1531aea0b000 r-xp 00001000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531aea0b000-1531aec0a000 ---p 001f1000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531aec0a000-1531aecb2000 r--p 001f0000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531aecb2000-1531aecb3000 rwxp 00298000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531aecb3000-1531aed14000 rw-p 00299000 07:02 22662                      /opt/cray/pe/papi/6.0.0.17/lib/libpfm.so.4.12.1
    1531aed14000-1531aed16000 rw-p 00000000 00:00 0
    1531aed16000-1531aed58000 r-xp 00000000 07:01 7821                       /opt/rocm-5.2.3/lib/librocprofiler64.so.1.0.50203
    1531aed58000-1531aef58000 ---p 00042000 07:01 7821                       /opt/rocm-5.2.3/lib/librocprofiler64.so.1.0.50203
    1531aef58000-1531aef59000 r--p 00042000 07:01 7821                       /opt/rocm-5.2.3/lib/librocprofiler64.so.1.0.50203
    1531aef59000-1531aef5a000 rwxp 00043000 07:01 7821                       /opt/rocm-5.2.3/lib/librocprofiler64.so.1.0.50203
    1531aef5a000-1531aef96000 r-xp 00000000 07:01 7833                       /opt/rocm-5.2.3/lib/libroctracer64.so.1.0.50203
    1531aef96000-1531af196000 ---p 0003c000 07:01 7833                       /opt/rocm-5.2.3/lib/libroctracer64.so.1.0.50203
    1531af196000-1531af197000 r--p 0003c000 07:01 7833                       /opt/rocm-5.2.3/lib/libroctracer64.so.1.0.50203
    1531af197000-1531af198000 rwxp 0003d000 07:01 7833                       /opt/rocm-5.2.3/lib/libroctracer64.so.1.0.50203
    1531af198000-1531af199000 rw-p 00000000 00:00 0
    1531af199000-1531b12d8000 r-xp 00000000 a67:eeb30 144119826271531182     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2
    1531b12d8000-1531b1320000 r--p 0213e000 a67:eeb30 144119826271531182     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2
    1531b1320000-1531b1321000 rwxp 02186000 a67:eeb30 144119826271531182     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2
    1531b1321000-1531b132f000 rwxp 02187000 a67:eeb30 144119826271531182     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2
    1531b132f000-1531b13a0000 rw-p 02195000 a67:eeb30 144119826271531182     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace.so.1.10.2
    1531b13a0000-1531b19de000 rw-p 00000000 00:00 0
    1531b19de000-1531b19e7000 r-xp 00000000 00:2d 853                        /usr/lib64/libdrm_amdgpu.so.1.0.0
    1531b19e7000-1531b1be6000 ---p 00009000 00:2d 853                        /usr/lib64/libdrm_amdgpu.so.1.0.0
    1531b1be6000-1531b1be7000 r--p 00008000 00:2d 853                        /usr/lib64/libdrm_amdgpu.so.1.0.0
    1531b1be7000-1531b1be8000 rwxp 00009000 00:2d 853                        /usr/lib64/libdrm_amdgpu.so.1.0.0
    1531b1be8000-1531b1bfb000 r-xp 00000000 00:2d 861                        /usr/lib64/libdrm.so.2.4.0
    1531b1bfb000-1531b1dfa000 ---p 00013000 00:2d 861                        /usr/lib64/libdrm.so.2.4.0
    1531b1dfa000-1531b1dfb000 r--p 00012000 00:2d 861                        /usr/lib64/libdrm.so.2.4.0
    1531b1dfb000-1531b1dfc000 rwxp 00013000 00:2d 861                        /usr/lib64/libdrm.so.2.4.0
    1531b1dfc000-1531b1e13000 r-xp 00000000 00:2d 883                        /usr/lib64/libelf-0.185.so
    1531b1e13000-1531b2013000 ---p 00017000 00:2d 883                        /usr/lib64/libelf-0.185.so
    1531b2013000-1531b2014000 r--p 00017000 00:2d 883                        /usr/lib64/libelf-0.185.so
    1531b2014000-1531b2015000 rwxp 00018000 00:2d 883                        /usr/lib64/libelf-0.185.so
    1531b2015000-1531b203b000 r-xp 00000000 00:2d 57                         /lib64/libtinfo.so.6.1
    1531b203b000-1531b223a000 ---p 00026000 00:2d 57                         /lib64/libtinfo.so.6.1
    1531b223a000-1531b223b000 r--p 00025000 00:2d 57                         /lib64/libtinfo.so.6.1
    1531b223b000-1531b223c000 rwxp 00026000 00:2d 57                         /lib64/libtinfo.so.6.1
    1531b223c000-1531b2243000 rw-p 00027000 00:2d 57                         /lib64/libtinfo.so.6.1
    1531b2243000-1531b2259000 r-xp 00000000 00:2d 65                         /lib64/libz.so.1.2.11
    1531b2259000-1531b2458000 ---p 00016000 00:2d 65                         /lib64/libz.so.1.2.11
    1531b2458000-1531b2459000 rwxp 00015000 00:2d 65                         /lib64/libz.so.1.2.11
    1531b2459000-1531b245a000 rw-p 00016000 00:2d 65                         /lib64/libz.so.1.2.11
    1531b245a000-1531b2465000 r-xp 00000000 00:2d 1192                       /usr/lib64/libnuma.so.1.0.0
    1531b2465000-1531b2664000 ---p 0000b000 00:2d 1192                       /usr/lib64/libnuma.so.1.0.0
    1531b2664000-1531b2665000 r--p 0000a000 00:2d 1192                       /usr/lib64/libnuma.so.1.0.0
    1531b2665000-1531b2666000 rwxp 0000b000 00:2d 1192                       /usr/lib64/libnuma.so.1.0.0
    1531b2666000-1531b278c000 r-xp 00000000 07:01 7777                       /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203
    1531b278c000-1531b298b000 ---p 00126000 07:01 7777                       /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203
    1531b298b000-1531b2993000 r--p 00125000 07:01 7777                       /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203
    1531b2993000-1531b2995000 rwxp 0012d000 07:01 7777                       /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203
    1531b2995000-1531b2aad000 rw-p 0012f000 07:01 7777                       /opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203
    1531b2aad000-1531b2ab0000 rw-p 00000000 00:00 0
    1531b2ab0000-1531b9a35000 r-xp 00000000 07:01 7748                       /opt/rocm-5.2.3/lib/libamd_comgr.so.2.4.50203
    1531b9a35000-1531b9c35000 ---p 06f85000 07:01 7748                       /opt/rocm-5.2.3/lib/libamd_comgr.so.2.4.50203
    1531b9c35000-1531ba101000 r--p 06f85000 07:01 7748                       /opt/rocm-5.2.3/lib/libamd_comgr.so.2.4.50203
    1531ba101000-1531ba103000 rwxp 07451000 07:01 7748                       /opt/rocm-5.2.3/lib/libamd_comgr.so.2.4.50203
    1531ba103000-1531ba10f000 rw-p 07453000 07:01 7748                       /opt/rocm-5.2.3/lib/libamd_comgr.so.2.4.50203
    1531ba10f000-1531ba172000 rw-p 00000000 00:00 0
    1531ba172000-1531ba175000 r-xp 00000000 00:2d 32                         /lib64/libdl-2.31.so
    1531ba175000-1531ba374000 ---p 00003000 00:2d 32                         /lib64/libdl-2.31.so
    1531ba374000-1531ba375000 rwxp 00002000 00:2d 32                         /lib64/libdl-2.31.so
    1531ba375000-1531ba376000 rw-p 00003000 00:2d 32                         /lib64/libdl-2.31.so
    1531ba376000-1531ba55c000 r-xp 00000000 00:2d 30                         /lib64/libc-2.31.so
    1531ba55c000-1531ba75c000 ---p 001e6000 00:2d 30                         /lib64/libc-2.31.so
    1531ba75c000-1531ba75e000 r--p 001e6000 00:2d 30                         /lib64/libc-2.31.so
    1531ba75e000-1531ba75f000 rwxp 001e8000 00:2d 30                         /lib64/libc-2.31.so
    1531ba75f000-1531ba767000 rw-p 001e9000 00:2d 30                         /lib64/libc-2.31.so
    1531ba767000-1531ba76b000 rw-p 00000000 00:00 0
    1531ba76b000-1531ba97e000 r-xp 00000000 00:2d 1335                       /usr/lib64/libstdc++.so.6.0.30
    1531ba97e000-1531bab7d000 ---p 00213000 00:2d 1335                       /usr/lib64/libstdc++.so.6.0.30
    1531bab7d000-1531bab88000 r--p 00212000 00:2d 1335                       /usr/lib64/libstdc++.so.6.0.30
    1531bab88000-1531bab8b000 rwxp 0021d000 00:2d 1335                       /usr/lib64/libstdc++.so.6.0.30
    1531bab8b000-1531bab8e000 rwxp 00000000 00:00 0
    1531bab8e000-1531bab8f000 rwxp 00000000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531bab8f000-1531baf33000 r-xp 00001000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531baf33000-1531bb133000 ---p 003a5000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531bb133000-1531bb139000 r--p 003a5000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531bb139000-1531bb13b000 rwxp 003ab000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531bb13b000-1531bbaab000 rw-p 003ad000 07:01 7751                       /opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203
    1531bbaab000-1531bbabc000 rw-p 00000000 00:00 0
    1531bbabc000-1531bbac4000 r-xp 00000000 00:2d 52                         /lib64/librt-2.31.so
    1531bbac4000-1531bbcc3000 ---p 00008000 00:2d 52                         /lib64/librt-2.31.so
    1531bbcc3000-1531bbcc4000 rwxp 00007000 00:2d 52                         /lib64/librt-2.31.so
    1531bbcc4000-1531bbcc5000 rw-p 00008000 00:2d 52                         /lib64/librt-2.31.so
    1531bbcc5000-1531bbe0d000 r-xp 00000000 00:2d 34                         /lib64/libm-2.31.so
    1531bbe0d000-1531bc00d000 ---p 00148000 00:2d 34                         /lib64/libm-2.31.so
    1531bc00d000-1531bc00e000 rwxp 00148000 00:2d 34                         /lib64/libm-2.31.so
    1531bc00e000-1531bc010000 rw-p 00149000 00:2d 34                         /lib64/libm-2.31.so
    1531bc010000-1531bc02e000 r-xp 00000000 00:2d 48                         /lib64/libpthread-2.31.so
    1531bc02e000-1531bc22d000 ---p 0001e000 00:2d 48                         /lib64/libpthread-2.31.so
    1531bc22d000-1531bc22e000 rwxp 0001d000 00:2d 48                         /lib64/libpthread-2.31.so
    1531bc22e000-1531bc22f000 rw-p 0001e000 00:2d 48                         /lib64/libpthread-2.31.so
    1531bc22f000-1531bc233000 rw-p 00000000 00:00 0
    1531bc233000-1531bc250000 r-xp 00000000 00:2d 489                        /lib64/libgcc_s.so.1
    1531bc250000-1531bc450000 ---p 0001d000 00:2d 489                        /lib64/libgcc_s.so.1
    1531bc450000-1531bc451000 r--p 0001d000 00:2d 489                        /lib64/libgcc_s.so.1
    1531bc451000-1531bc452000 rwxp 0001e000 00:2d 489                        /lib64/libgcc_s.so.1
    1531bc452000-1531bc47c000 r-xp 00000000 00:2d 17                         /lib64/ld-2.31.so
    1531bc47e000-1531bc47f000 rw-p 00000000 00:00 0
    1531bc480000-1531bc481000 rw-p 00000000 00:00 0
    1531bc482000-1531bc483000 rw-p 00000000 00:00 0
    1531bc484000-1531bc485000 rw-p 00000000 00:00 0
    1531bc486000-1531bc487000 rw-s 21c3800000000000 00:05 1603               /dev/kfd
    1531bc488000-1531bc489000 rw-s 2b84c00000000000 00:05 1603               /dev/kfd
    1531bc48a000-1531bc48b000 rw-s 3c52800000000000 00:05 1603               /dev/kfd
    1531bc48c000-1531bc48d000 rw-s 2617c00000000000 00:05 1603               /dev/kfd
    1531bc48e000-1531bc5d8000 rw-p 00000000 00:00 0
    1531bc5d8000-1531bc5e6000 r-xp 00000000 a67:eeb30 144119826271531190     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libunwind.so.99.0.0
    1531bc5e6000-1531bc5e7000 rwxp 0000e000 a67:eeb30 144119826271531190     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libunwind.so.99.0.0
    1531bc5e7000-1531bc5e8000 rw-p 0000f000 a67:eeb30 144119826271531190     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libunwind.so.99.0.0
    1531bc5e8000-1531bc5f2000 rw-p 00000000 00:00 0
    1531bc5f2000-1531bc5fb000 r-xp 00000000 a67:eeb30 144119826271531238     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libgotcha.so.2.1.0
    1531bc5fb000-1531bc5fc000 rwxp 00009000 a67:eeb30 144119826271531238     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libgotcha.so.2.1.0
    1531bc5fc000-1531bc5fd000 rw-p 0000a000 a67:eeb30 144119826271531238     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/omnitrace/libgotcha.so.2.1.0
    1531bc5fd000-1531bc632000 r-xp 00000000 00:2d 177                        /usr/lib64/libudev.so.1.7.2
    1531bc632000-1531bc633000 rwxp 00034000 00:2d 177                        /usr/lib64/libudev.so.1.7.2
    1531bc633000-1531bc634000 rw-p 00035000 00:2d 177                        /usr/lib64/libudev.so.1.7.2
    1531bc634000-1531bc63f000 rw-p 00000000 00:00 0
    1531bc640000-1531bc641000 rw-s 38ed800000000000 00:05 1603               /dev/kfd
    1531bc642000-1531bc643000 rw-s 22a6c00000000000 00:05 1603               /dev/kfd
    1531bc644000-1531bc645000 rw-s 3b7f400000000000 00:05 1603               /dev/kfd
    1531bc645000-1531bc646000 rw-s 00000000 00:31 750293                     /dev/shm/hsakmt_shared_mem
    1531bc646000-1531bc647000 rw-s 3d31c00000000000 00:05 1603               /dev/kfd
    1531bc647000-1531bc648000 rw-s 00000000 00:31 750292                     /dev/shm/N2DkIP (deleted)
    1531bc648000-1531bc649000 rw-s 00000000 00:31 750287                     /dev/shm/rocm_smi_card7
    1531bc649000-1531bc64a000 rw-s 00000000 00:31 750286                     /dev/shm/rocm_smi_card6
    1531bc64a000-1531bc64b000 rw-s 00000000 00:31 750285                     /dev/shm/rocm_smi_card5
    1531bc64b000-1531bc64c000 rw-s 00000000 00:31 750284                     /dev/shm/rocm_smi_card4
    1531bc64c000-1531bc64d000 rw-s 00000000 00:31 750283                     /dev/shm/rocm_smi_card3
    1531bc64d000-1531bc64e000 rw-s 00000000 00:31 750282                     /dev/shm/rocm_smi_card2
    1531bc64e000-1531bc64f000 rw-s 00000000 00:31 750281                     /dev/shm/rocm_smi_card1
    1531bc64f000-1531bc650000 rw-s 00000000 00:31 750280                     /dev/shm/rocm_smi_card0
    1531bc650000-1531bc652000 rw-p 00000000 00:00 0
    1531bc652000-1531bc654000 r-xp 00000000 a67:eeb30 144119826271531269     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-user.so.1.10.2
    1531bc654000-1531bc655000 rwxp 00001000 a67:eeb30 144119826271531269     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-user.so.1.10.2
    1531bc655000-1531bc656000 rw-p 00002000 a67:eeb30 144119826271531269     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-user.so.1.10.2
    1531bc656000-1531bc678000 r-xp 00000000 a67:eeb30 144119826271531161     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
    1531bc678000-1531bc679000 rwxp 00021000 a67:eeb30 144119826271531161     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
    1531bc679000-1531bc67a000 rw-p 00022000 a67:eeb30 144119826271531161     /pfs/lustrep1/users/homolaja/Apps/omnitrace/lib/libomnitrace-dl.so.1.10.2
    1531bc67a000-1531bc67c000 rw-p 00000000 00:00 0
    1531bc67c000-1531bc67d000 rwxp 0002a000 00:2d 17                         /lib64/ld-2.31.so
    1531bc67d000-1531bc67f000 rw-p 0002b000 00:2d 17                         /lib64/ld-2.31.so
    7fff58cdf000-7fff58d18000 rwxp 00000000 00:00 0                          [stack]
    7fff58d18000-7fff58d21000 rw-p 00000000 00:00 0
    7fff58d9d000-7fff58da1000 r--p 00000000 00:00 0                          [vvar]
    7fff58da1000-7fff58da3000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Backtrace (demangled):
[PID=11382][TID=0][0/7] __restore_rt
[PID=11382][TID=0][1/7] hsa_amd_image_get_info_max_dim +0x5e9
[PID=11382][TID=0][2/7] hsa_amd_image_get_info_max_dim +0x4ba
[PID=11382][TID=0][7/7] main +0xfa
[PID=11382][TID=0][8/7] omnitrace_main +0x3bd
[PID=11382][TID=0][9/7] __libc_start_main +0xef
[PID=11382][TID=0][10/7] _start +0x2a

Backtrace (lineinfo):
[PID=11382][TID=0][0/9]
    [/lib64/libpthread.so.0:?] __restore_rt
[PID=11382][TID=0][1/9]
    [/opt/rocm-5.2.3/lib/libhsa-runtime64.so.1.5.50203:?] hsa_amd_image_get_info_max_dim
[PID=11382][TID=0][2/9]
    [/opt/rocm/lib/libhsa-runtime64.so.1:?] no unwind info found
[PID=11382][TID=0][3/9]
    [/opt/rocm/lib/libamdhip64.so.5:?] no unwind info found
[PID=11382][TID=0][4/9]
    [/opt/rocm/lib/libamdhip64.so.5:?] no unwind info found
[PID=11382][TID=0][5/9]
    [/opt/rocm-5.2.3/lib/libamdhip64.so.5.2.50203:?] no unwind info found
[PID=11382][TID=0][6/9]
    [/pfs/lustrep1/users/homolaja/tests/hip_host_function/source.hip.cpp:10] main
[PID=11382][TID=0][7/9]
    [/home/omnitrace/source/lib/omnitrace-dl/dl.cpp:1443] omnitrace_main
[PID=11382][TID=0][8/9]
    [/lib64/libc-2.31.so:?] __libc_start_main

[omnitrace][11382] Finalizing after signal 2 ::  Signal:     SIGINT (signal number:   2)
        interrupt program

[omnitrace][11382][0][omnitrace_finalize] finalizing...
[omnitrace][11382][0][omnitrace_finalize]
[omnitrace][11382][0][omnitrace_finalize] omnitrace/process/11382 : 13.174841 sec wall_clock,  252.280 MB peak_rss,  251.036 MB page_rss, 25.120000 sec cpu_clock,  190.7 % cpu_util [laps: 1]
[omnitrace][11382][0][omnitrace_finalize] omnitrace/process/11382/thread/0 : 13.173039 sec wall_clock, 12.751506 sec thread_cpu_clock,   96.8 % thread_cpu_util,  252.280 MB peak_rss [laps: 1]
[omnitrace][11382][0][omnitrace_finalize] omnitrace/process/11382/thread/1 : 0.002652 sec wall_clock, 0.000530 sec thread_cpu_clock,   19.9 % thread_cpu_util,    0.000 MB peak_rss [laps: 1]
[omnitrace][11382][0][omnitrace_finalize]
[omnitrace][11382][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][11382][perfetto]> Outputting '/users/homolaja/tests/hip_host_function/omnitrace-program.x-output/2023-09-22_16.27/perfetto-trace-11382.proto' (3331.21 KB / 3.33 MB / 0.00 GB)... Done
[omnitrace][11382][metadata]> Outputting 'omnitrace-program.x-output/2023-09-22_16.27/metadata-11382.json' and 'omnitrace-program.x-output/2023-09-22_16.27/functions-11382.json'
[omnitrace][11382][0][omnitrace_finalize] Finalized: 0.276525 sec wall_clock,  318.508 MB peak_rss,   12.448 MB page_rss, 0.550000 sec cpu_clock,  198.9 % cpu_util

@jakub-homola
Author

I also tried running it with rocprof. It hangs as well, apparently in the first hipDeviceSynchronize():

$ rocprof --sys-trace ./program.x
RPL: on '230925_095831' from '/opt/rocm-5.2.3' in '/users/homolaja/tests/hip_host_function'
RPL: profiling '"./program.x"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_230925_095831_123756'
RPL: result dir '/tmp/rpl_data_230925_095831_123756/input_results_230925_095831'
AAA
ROCTracer (pid=123776):
    HSA-trace()
    HIP-trace()

If I comment out the hipStreamAddCallback, it does not hang.

When I run the program with omnitrace-sample --exclude roctracer -- ./program.x, it works without any issue (but I don't get the GPU tracing data).
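
To tell a hang apart from a slow run without interrupting by hand, the commands above could be wrapped in a time limit. This is only a minimal sketch, assuming GNU coreutils timeout is installed; the helper name check_hang and the 30-second limit are illustrative and not part of the original report.

```shell
#!/bin/sh
# check_hang: hypothetical helper (not part of omnitrace or rocprof).
# Runs a command under a time limit and reports whether it finished
# on its own or was stopped because the limit was reached.
# Usage: check_hang SECONDS COMMAND [ARGS...]
check_hang() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit was reached
        echo "HANG: '$*' did not finish within ${limit}s"
    else
        echo "DONE: '$*' exited with status $status"
    fi
    return "$status"
}

# Example invocations, using the commands quoted in this thread:
# check_hang 30 rocprof --sys-trace ./program.x
# check_hang 30 omnitrace-sample --exclude roctracer -- ./program.x
```

A "HANG" line for the rocprof run and a "DONE" line for the run that excludes roctracer would match the behaviour described above.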

@jrmadsen
Collaborator

Yes, it seems like a fundamental issue in roctracer (i.e. outside of the scope of omnitrace). I’ll pass on the bug report and see if it can get patched.

@jamesxu2
Contributor

jamesxu2 commented Oct 25, 2024

Hi @jakub-homola,

Thank you for providing a minimal reproducer. I just reran this test with a recent ROCm release (6.2.3) and was not able to reproduce the failure with Rocprofiler V1 or V2, nor with omnitrace-sample.

Please try this again with a recent ROCm release and let us know if you're still seeing this on your side.

Example output from omnitrace-sample:

 omnitrace-sample -- ./a.out

HSA_TOOLS_LIB=/opt/rocm-6.2.3/lib/libomnitrace-dl.so.1.11.2
HSA_TOOLS_REPORT_LOAD_FAILURE=1
LD_PRELOAD=/opt/rocm-6.2.3/lib/libomnitrace-dl.so.1.11.2
OMNITRACE_USE_SAMPLING=true
OMP_TOOL_LIBRARIES=/opt/rocm-6.2.3/lib/libomnitrace-dl.so.1.11.2
ROCP_HSA_INTERCEPT=1
ROCP_TOOL_LIB=/opt/rocm-6.2.3/lib/libomnitrace.so.1.11.2

[omnitrace][dl][10983] omnitrace_main
[omnitrace][10983][omnitrace_init_tooling] Instrumentation mode: Sampling


      ______   .___  ___. .__   __.  __  .___________..______          ___       ______  _______
     /  __  \  |   \/   | |  \ |  | |  | |           ||   _  \        /   \     /      ||   ____|
    |  |  |  | |  \  /  | |   \|  | |  | `---|  |----`|  |_)  |      /  ^  \   |  ,----'|  |__
    |  |  |  | |  |\/|  | |  . `  | |  |     |  |     |      /      /  /_\  \  |  |     |   __|
    |  `--'  | |  |  |  | |  |\   | |  |     |  |     |  |\  \----./  _____  \ |  `----.|  |____
     \______/  |__|  |__| |__| \__| |__|     |__|     | _| `._____/__/     \__\ \______||_______|

    omnitrace v1.11.2 (rev: f35895a07b6571ed1a228c228cfdc5465b370c7d, x86_64-linux-gnu, compiler: GNU v11.4.0, rocm: v6.2.x)
[omnitrace][10983] /proc/sys/kernel/perf_event_paranoid has a value of 4. Disabling PAPI (requires a value <= 2)...
[omnitrace][10983] In order to enable PAPI support, run 'echo N | sudo tee /proc/sys/kernel/perf_event_paranoid' where N is <= 2
[687.259]       perfetto.cc:58649 Configured tracing session 1, #sources:1, duration:0 ms, #buffers:1, total buffer size:1024000 KB, total sessions:1, uid:0 session name: ""
AAA
BBB
CCC
I am dummy kernel 1
DDD
EEE
I am host function
FFF
GGG
I am dummy kernel 2
HHH

[omnitrace][10983][0][omnitrace_finalize] finalizing...
[omnitrace][10983][0][omnitrace_finalize]
[omnitrace][10983][0][omnitrace_finalize] omnitrace/process/10983 : 0.381777 sec wall_clock,  234.712 MB peak_rss,  240.345 MB page_rss, 0.510000 sec cpu_clock,  133.6 % cpu_util [laps: 1]
[omnitrace][10983][0][omnitrace_finalize] omnitrace/process/10983/thread/0 : 0.379796 sec wall_clock, 0.257318 sec thread_cpu_clock,   67.8 % thread_cpu_util,  234.148 MB peak_rss [laps: 1]
[omnitrace][10983][0][omnitrace_finalize] omnitrace/process/10983/thread/2 : 0.000075 sec wall_clock, 0.000075 sec thread_cpu_clock,  100.0 % thread_cpu_util,    0.000 MB peak_rss [laps: 1]
[omnitrace][10983][0][omnitrace_finalize] omnitrace/process/10983/thread/4 : 0.000104 sec wall_clock, 0.000103 sec thread_cpu_clock,  100.0 % thread_cpu_util,    0.000 MB peak_rss [laps: 1]
[omnitrace][10983][0][omnitrace_finalize]
[omnitrace][10983][0][omnitrace_finalize] Finalizing perfetto...
[omnitrace][10983][perfetto]> Outputting '/root/omnitrace-a.out-output/2024-10-25_15.22/perfetto-trace-10983.proto' (57.67 KB / 0.06 MB / 0.00 GB)... Done
[omnitrace][10983][metadata]> Outputting 'omnitrace-a.out-output/2024-10-25_15.22/metadata-10983.json' and 'omnitrace-a.out-output/2024-10-25_15.22/functions-10983.json'
[omnitrace][10983][0][omnitrace_finalize] Finalized: 1.040595 sec wall_clock,    1.212 MB peak_rss,    1.241 MB page_rss, 0.060000 sec cpu_clock,    5.8 % cpu_util

@jamesxu2
Contributor

jamesxu2 commented Nov 8, 2024

Closing due to inactivity. @jakub-homola, feel free to reopen this ticket or file a new one if you still need help.

jamesxu2 closed this as completed on Nov 8, 2024