kvikio still segfaults on program termination #497

Open
EricKern opened this issue Oct 15, 2024 · 23 comments

@EricKern

Hi everyone,

I'm getting a segfault when my Python script terminates. This only happens when kvikio is used.

Reproducer

mamba env create -f img2tensor_kvikio.yaml && mamba clean -afy

// img2tensor_kvikio.yaml
name: img2tensor
channels:
  - pytorch
  - nvidia
  - rapidsai
  - conda-forge
dependencies:
  - notebook
  - tifffile
  - python=3.11
  - pytorch
  - pytorch-cuda=12.4
  - kvikio

bug.py

import kvikio

file_name = 'file0.txt'

fd = kvikio.CuFile(file_name, "w")
fd.close()

I'm running in a Kubernetes environment. We use the NVIDIA open kernel driver 535.183.01.

I assumed #462 had fixed the issue, but it seems there is more to it.

You can find the concretized environment here:
exported_img2tensor_kvikio.txt

It uses kvikio 24.10, which should include the previously mentioned PR.

@jakirkham
Member

Could you please slim the environment further like so and retry?

# filename: kvikio2410_cuda122.yaml
name: kvikio2410_cuda122
channels:
  - rapidsai
  - conda-forge
dependencies:
  - cuda-version=12.2
  - python=3.11
  - kvikio=24.10

Asking because there are mismatched CUDA versions in the reproducing environment, plus some extra bits that appear unused in the example. So we'd like to simplify further to avoid other potential issues.

@EricKern
Author

Unfortunately, it still segfaults.
I have again attached the concretized dependency list: kvikio2410_cuda122.txt.

The CUDA version mismatch seems resolved.
Also, the cufile.log looks fine to me.
I'm using a MIG slice of an A100, and writing to a Weka filesystem works fine. It only segfaults on program termination.

@wence-
Contributor

wence- commented Oct 15, 2024

Can you show a backtrace from the segfault? E.g. with gdb:

gdb --args python bug.py
(gdb) run
(gdb) backtrace full

@EricKern
Author

(gdb) run
Starting program: /opt/conda/envs/kvikio2410_cuda122/bin/python bug.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff47eb700 (LWP 2675)]
[New Thread 0x7ffff3fea700 (LWP 2676)]
[New Thread 0x7fffeb7e9700 (LWP 2677)]
[New Thread 0x7fffdaae0700 (LWP 2678)]
[New Thread 0x7fffcdfff700 (LWP 2679)]
[New Thread 0x7fffcd21d700 (LWP 2691)]
[New Thread 0x7fffcca1c700 (LWP 2692)]
[New Thread 0x7fffc7fff700 (LWP 2693)]
[New Thread 0x7fffc77fe700 (LWP 2694)]
[New Thread 0x7fffc6ffd700 (LWP 2695)]
[New Thread 0x7fffc67fc700 (LWP 2696)]
[New Thread 0x7fffc5ffb700 (LWP 2697)]
[New Thread 0x7fffc57fa700 (LWP 2698)]
[Thread 0x7fffdaae0700 (LWP 2678) exited]
[Thread 0x7fffcd21d700 (LWP 2691) exited]
[Thread 0x7fffc57fa700 (LWP 2698) exited]
[Thread 0x7fffc5ffb700 (LWP 2697) exited]
[Thread 0x7fffc6ffd700 (LWP 2695) exited]
[Thread 0x7fffc77fe700 (LWP 2694) exited]
[Thread 0x7fffc7fff700 (LWP 2693) exited]
[Thread 0x7fffcca1c700 (LWP 2692) exited]
[Thread 0x7fffeb7e9700 (LWP 2677) exited]
[Thread 0x7ffff3fea700 (LWP 2676) exited]
[Thread 0x7ffff47eb700 (LWP 2675) exited]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffd7a8, __s=0x5555563aa252 "", __n=93824998875808)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
90      /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc: No such file or directory.
(gdb) backtrace full
#0  std::basic_streambuf<char, std::char_traits<char> >::xsputn (this=0x7fffffffd7a8, __s=0x5555563aa252 "", __n=93824998875808)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/streambuf.tcc:90
        __remaining = <optimized out>
        __len = <optimized out>
        __buf_len = 8388607
        __ret = <optimized out>
#1  0x00007ffff78c169d in std::__ostream_write<char, std::char_traits<char> > (__out=..., __s=<optimized out>, __n=93824998875808)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:325
        __put = <optimized out>
#2  0x00007ffff78c1774 in std::__ostream_insert<char, std::char_traits<char> > (__out=..., __s=0x555555baa298 "Read", __n=93824998875808)
    at /home/conda/feedstock_root/build_artifacts/gcc_compilers_1724798733686/work/build/x86_64-conda-linux-gnu/libstdc++-v3/include/bits/basic_ios.h:184
        __w = <error reading variable __w (dwarf2_find_location_expression: Corrupted DWARF expression.)>
        __cerb = {_M_ok = true, _M_os = @0x7fffffffd7a0}
#3  0x00007fffda13044f in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#4  0x00007fffda13206b in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#5  0x00007fffda080c82 in ?? () from /opt/conda/envs/kvikio2410_cuda122/lib/python3.11/site-packages/kvikio/_lib/../../../../libcufile.so.0
No symbol table info available.
#6  0x00007ffff7fe0f6b in _dl_fini () at dl-fini.c:138
        array = 0x7fffda2bc1d0
        i = <optimized out>
        l = 0x555555efa720
        maps = 0x7fffffffdb80
        i = <optimized out>
        l = <optimized out>
        nmaps = <optimized out>
        nloaded = <optimized out>
        ns = 0
        do_audit = <optimized out>
        __PRETTY_FUNCTION__ = "_dl_fini"
#7  0x00007ffff7c9a8a7 in __run_exit_handlers (status=0, listp=0x7ffff7e40718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
        atfct = <optimized out>
        onfct = <optimized out>
        cxafct = <optimized out>
        f = <optimized out>
        new_exitfn_called = 262
        cur = 0x7ffff7e41ca0 <initial>
#8  0x00007ffff7c9aa60 in __GI_exit (status=<optimized out>) at exit.c:139
No locals.
#9  0x00007ffff7c7808a in __libc_start_main (main=0x5555557dea20 <main>, argc=2, argv=0x7fffffffdec8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdeb8) at ../csu/libc-start.c:342
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {93824995523264, -3934155394888934001, 93824994896209, 140737488346816, 0, 0, 3934155393885101455, 3934172503229554063}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x2, 
              0x7fffffffdec8}, data = {prev = 0x0, cleanup = 0x0, canceltype = 2}}}
        not_first_call = <optimized out>
#10 0x00005555557de97a in _start () at /usr/local/src/conda/python-3.11.10/Parser/parser.c:33931
No symbol table info available.

@wence-
Contributor

wence- commented Oct 15, 2024

OK, thanks. Something in cuFile is running below main. We'll try to reproduce locally, perhaps with a debug build, so we can get a bit more information.

@EricKern
Author

Thanks a lot for looking into this. If there is something I can do to help you reproduce the error, please let me know.

@madsbk
Member

madsbk commented Oct 16, 2024

@EricKern, what if you run with KVIKIO_COMPAT_MODE=ON?
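
E.g. KVIKIO_COMPAT_MODE=ON python bug.py, or set it from Python before kvikio is imported (a minimal sketch; it assumes the variable is read when kvikio is first imported):

import os

# Must be set before kvikio reads its configuration, so do it before the import.
os.environ["KVIKIO_COMPAT_MODE"] = "ON"

import kvikio

fd = kvikio.CuFile("file0.txt", "w")
fd.close()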

@jakirkham
Member

JFYI, to get a debug build of Python, add the following channel above conda-forge in the channels list: conda-forge/label/python_debug
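
I.e., the channels section of the YAML above would become something like:

# kvikio2410_cuda122.yaml (channels only)
channels:
  - rapidsai
  - conda-forge/label/python_debug
  - conda-forge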

@EricKern
Author

EricKern commented Oct 16, 2024

@EricKern, what if you run with KVIKIO_COMPAT_MODE=ON ?

With compat mode on, there is no segmentation fault. If I set it to "off", the segfault appears again.

JFYI, to get a debug build of python add the following to channels above conda-forge: conda-forge/label/python_debug

Do you think this might produce a better backtrace from the crash, or is there anything else I could do with a debug build of Python?

@jakirkham
Member

Lawrence mentioned doing a debug build, so I wanted to share that resource.

If the segfault happens somewhere in KvikIO, it may help. If it happens in cuFile, we likely won't learn much.

@wence-
Contributor

wence- commented Oct 17, 2024

If Mads can't repro next week, I guess I'll try and figure out how to set up cufile/gds on my workstation and do some spelunking

@madsbk
Member

madsbk commented Oct 21, 2024

If Mads can't repro next week, I guess I'll try and figure out how to set up cufile/gds on my workstation and do some spelunking

I will take a look tomorrow

@madsbk
Member

madsbk commented Oct 22, 2024

I am not able to reproduce; the conda environment works fine for me :/
I have asked the cuFile team for input.

@kingcrimsontianyu
Contributor

kingcrimsontianyu commented Oct 22, 2024

cuDF is seeing the same issue (rapidsai/cudf#17121) arising from cuFile (there, the cuFile API is accessed directly from within cuDF, not through KvikIO).

Btw, when cuDF did use KvikIO to perform GDS I/O, we observed that the segfault manifested when KVIKIO_NTHREADS was set to 8, not the default 1. But I think this is a red herring. At the time of the crash, the backtrace points to some CUDA calls made by cuFile after main returns. This should be cuFile doing its implicit driver closing.

Also, adding cuFileDriverClose() before main returns seems to prevent the segfault in cuDF's benchmark.

@EricKern
Author

@madsbk May I ask if you have used a MIG slice or a full GPU in your tests? I'm currently not able to use a full A100, but as soon as one is available again I want to try to reproduce the segfault on a full A100. Before using kvikio, I had successfully used the cuFile C++ API without a problem, even with a MIG.

@madsbk
Member

madsbk commented Oct 24, 2024

I am running on a full GPU.

#514 implements Python bindings to cufileDriverOpen() and cufileDriverClose(). The hope is that we can prevent this issue in Python by calling cufileDriverClose() at module exit.
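
Until those bindings are available, a rough workaround sketch is to call cuFileDriverClose() through ctypes before the interpreter tears down (the error-struct layout below mirrors cufile.h and is my assumption; please double-check it):

import atexit
import ctypes

class CUfileError(ctypes.Structure):
    # Assumed layout of CUfileError_t from cufile.h: an op-error code plus a CUresult.
    _fields_ = [("err", ctypes.c_int), ("cu_err", ctypes.c_int)]

# Assumes libcufile.so.0 is on the loader path (it is inside the conda env's lib/).
_libcufile = ctypes.CDLL("libcufile.so.0")
_libcufile.cuFileDriverClose.restype = CUfileError

@atexit.register
def _close_cufile_driver():
    # Close the cuFile driver while the CUDA runtime is still alive, instead of
    # letting libcufile do it from a destructor after main has returned.
    _libcufile.cuFileDriverClose()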

@EricKern
Author

I continued playing around with the environment to ensure the issue was not related to my setup.
Just a few minutes ago I found out that the segmentation fault on termination does not occur when I set "cufile_stats": 0 in cufile.json. Any value of cufile_stats above 0 causes the segfault. But, as mentioned, during execution everything works fine: the READ-WRITE SIZE histogram is written to cufile.log and all. Only on termination does the segfault happen. I could observe this both inside and outside of a Docker container.
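
For reference, the excerpt of cufile.json I'm toggling looks roughly like this (the rest of the file is left at its defaults):

{
    "profile": {
        "cufile_stats": 0
    }
}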

Do you still think that this is related to cufileDriverClose()?

@madsbk
Member

madsbk commented Oct 28, 2024

Originally by @EricKern in #514 (comment):

I built and reran my small segfault reproducer script without explicitly opening and closing the driver. This still causes the segfault when I set profile.cufile_stats in cufile.json to anything above 0. It also still happens when I explicitly open and close the driver.

If profile.cufile_stats=0 everything works fine.

I guess my segfault (#497) is unrelated to the driver initialization and destruction.

I have tested this on my local machine, where I currently don't have a GDS-supported file system, so no actual writing happened; only initialization and then cuFile's switch to its own compatibility mode. But even then, the segfault was reproducible on another machine.

@tell-rebanta do you know of a cuFile bug related to setting profile.cufile_stats to something greater than zero?

@tell-rebanta
Contributor

@madsbk I am not aware of any cuFile bug related to a cufile_stats value > 0. I wrote a small program that dlopens libcufile directly (not through kvikio), without explicitly opening/closing the driver and with a positive cufile_stats value, but I could not reproduce the issue with the latest bits of libcufile. Which libcufile version were you using?

@EricKern
Author

@tell-rebanta
according to the cufile.log debug output:

GDS release version: 1.7.2.10
nvidia_fs version:  2.17
libcufile version: 2.12
Platform: x86_64

I can install gds-tools, point LD_LIBRARY_PATH at the libcufile from the conda installation (/opt/conda/envs/kvikio2410_cuda122/lib/), and then run gdsio with it. Then there is no problem: no segfault occurs, independent of the cufile_stats level.

The segfault only happens when libcufile is loaded by kvikio in Python, at the point the Python program terminates.

Of course the possibility of a user error on my side still exists. I remember that the segfault also happened a few weeks ago when I was trying out cuCIM, which was a hint to me that it might be caused by my environment. As far as I know, cuCIM has its own GDS wrapper and doesn't use kvikio under the hood. At that time I had no idea what the root cause could be and switched to kvikio. But since then, with kvikio, I could reproduce the segfault in a Kubernetes pod, on a VM both inside and outside a Docker container, and on my personal laptop. So I assume this error is not related to the machines I'm running on.

From the software perspective, the containerized environment should also rule out any software-environment issues.

My Docker image is basically:

FROM condaforge/miniforge3:24.3.0-0 as base

RUN apt-get update && \
    apt-get -y install ibverbs-providers libibverbs-dev librdmacm-dev \
    && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*

COPY kvikio2410_cuda122.yaml /tmp/
RUN mamba env create -f /tmp/kvikio2410_cuda122.yaml && mamba clean -afy

RUN apt-get update && \
    apt-get -y install libnuma-dev \
    && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*

Then I run the container with this docker wrapper, or with even slightly more privileges when using a WekaFS in Kubernetes (hostnetwork=true).

I don't know what else I could be doing wrong or what you might be doing differently.

@EricKern
Author

EricKern commented Nov 7, 2024

@madsbk How do we continue with this? Have you been able to reproduce the segfault with cufile_stats > 0?

@madsbk
Member

madsbk commented Nov 7, 2024

Sorry, I am still not able to reproduce :/

Can you try setting allow_compat_mode=false in the cufile.json config? This will force cuFile to use GDS or fail.
Also, try setting execution::parallel_io=false to rule out a threading issue.
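
In cufile.json that would look roughly like this (placing allow_compat_mode under properties is my assumption; please check against your file):

{
    "properties": {
        "allow_compat_mode": false
    },
    "execution": {
        "parallel_io": false
    }
}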

@EricKern
Author

EricKern commented Nov 8, 2024

Thanks for the suggestions, I'll try it again with these options.
