kvikio still segfaults on program termination #497
Could you please slim the environment further like so and retry?

```yaml
# filename: kvikio2410_cuda122.yaml
name: kvikio2410_cuda122
channels:
  - rapidsai
  - conda-forge
dependencies:
  - cuda-version=12.2
  - python=3.11
  - kvikio=24.10
```

Asking because there are mismatching CUDA versions in the reproducing environment, plus some extra bits that appear unused in the example. So I'd like to simplify further to rule out other potential issues. |
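Recreating the slimmed environment would use the same command as in the original report, pointed at the new file:

```sh
mamba env create -f kvikio2410_cuda122.yaml && mamba clean -afy
```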
Unfortunately, it still segfaults. The CUDA version mismatch seems resolved. |
Can you show a backtrace from the segfault, e.g. with gdb? |
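The exact gdb invocation wasn't captured in this export; a typical one, assuming the reproducer script is bug.py, looks like:

```sh
# Run the script under gdb and print a backtrace when it stops (e.g. on SIGSEGV)
gdb -ex run -ex backtrace --args python bug.py
```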
[The reporter's gdb backtrace was posted here but is not captured in this export; per the next reply, it showed cuFile frames running below main.] |
OK, thanks. Something in cuFile is running below main. We'll try to reproduce locally, and perhaps make a debug build so we can get a bit more information. |
Thanks a lot for looking into this. If there is something I can do to help you reproduce the error please let me know. |
@EricKern, what if you run with KVIKIO_COMPAT_MODE=ON? |
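For reference, the comparison can be driven entirely from the shell; KVIKIO_COMPAT_MODE is the environment variable named above, and bug.py is the reporter's script:

```sh
# Compat mode makes KvikIO fall back to POSIX I/O instead of cuFile/GDS
KVIKIO_COMPAT_MODE=ON  python bug.py
KVIKIO_COMPAT_MODE=OFF python bug.py
```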
JFYI, to get a debug build of […] |
With compat mode on, there is no segmentation fault. If I set it to "off", it appears again.
Do you think that this might produce a better backtrace from the crash, or is there anything else I could do with a debug build of Python? |
Lawrence mentioned doing a debug build, so I wanted to share that resource. If the segfault happens somewhere in KvikIO, it may help; if it happens in cuFile, we likely won't learn much. |
If Mads can't repro next week, I guess I'll try and figure out how to set up cufile/gds on my workstation and do some spelunking |
I will take a look tomorrow |
I am not able to reproduce, the conda environment works fine for me :/ |
cuDF is seeing the same issue (rapidsai/cudf#17121) arising from cuFile (there the cuFile API is accessed directly from within cuDF, not through KvikIO). Btw, when cuDF did use KvikIO to perform GDS I/O, we observed that the segfault manifested when […]. Also, adding […] |
@madsbk May I ask if you have used a MIG slice or a full GPU in your tests? I'm currently not able to use a full A100, but as soon as it's available again I want to try to reproduce the segfault on a full A100. Before using kvikio I had successfully used the cuFile C++ API without a problem, even with a MIG. |
I am running on a full GPU. #514 implements Python bindings to […] |
I continued playing around with the environment to ensure the issue was not related to my setup. Do you still think that this is related to […] |
Originally by @EricKern in #514 (comment):
@tell-rebanta do you know of a cuFile bug related to setting cufile_stats > 0? |
@madsbk I am not aware of any cuFile bug related to a > 0 cufile_stats value. I wrote a small program which does a direct dlopen of libcufile (not through kvikio), without explicitly opening/closing the driver, along with a non-zero cufile_stats value, but could not reproduce the issue with the latest bits of libcufile. Which libcufile version were you using? |
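A rough Python equivalent of that standalone check (ctypes.CDLL performs the dlopen; CUFILE_ENV_PATH_JSON and the "cufile_stats" profile key are from the GDS documentation, and the json path here is illustrative):

```python
# Load libcufile directly (not through KvikIO), never call
# cuFileDriverOpen()/cuFileDriverClose(), and let the process exit.
# Assumes a cufile.json with a non-zero stats level, e.g.
#   { "profile": { "cufile_stats": 3 } }
import ctypes
import os

# Point cuFile at the config before the library is loaded (path is illustrative)
os.environ["CUFILE_ENV_PATH_JSON"] = "/etc/cufile.json"

libcufile = ctypes.CDLL("libcufile.so.0")  # dlopen(3) under the hood
print("loaded:", libcufile._name)
# If the bug reproduces, the segfault occurs during process teardown,
# after the interpreter has finished running this script.
```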
@tell-rebanta
I can install gds-tools and set LD_LIBRARY_PATH to the libcufile of the conda installation […]. The segfault only happens when libcufile is loaded by kvikio in Python, when the Python program terminates. Of course, the possibility of a user error on my side still exists.

I remember that the segfault also happened a few weeks ago when I was trying out cuCIM. This was a hint to me that it might be caused by my environment; as far as I know, cuCIM has its own GDS wrapper and doesn't use kvikio under the hood. At that time I had no idea what the root cause could be and switched to kvikio. But since then, with kvikio, I have reproduced the segfault in a Kubernetes pod, on a VM inside and outside a Docker container, and on my personal laptop. So I assume this error is not related to the machines I'm running on. From the software perspective, the containerized environment should also rule out any software environment issues. My Docker image is basically:

```dockerfile
FROM condaforge/miniforge3:24.3.0-0 as base

RUN apt-get update && \
    apt-get -y install ibverbs-providers libibverbs-dev librdmacm-dev \
    && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*

COPY kvikio2410_cuda122.yaml /tmp/

RUN mamba env create -f /tmp/kvikio2410_cuda122.yaml && mamba clean -afy

RUN apt-get update && \
    apt-get -y install libnuma-dev \
    && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/*
```

Then I run the container with this docker wrapper, or with even slightly more privileges when using a wekaFS in Kubernetes (hostNetwork=true). I don't know what else I could be doing wrong or what you are doing differently. |
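The wrapper itself isn't reproduced above; as a rough illustration only (the image tag and script path are placeholders, and GDS generally needs extra device access, such as the nvidia-fs device nodes, beyond plain --gpus):

```sh
# Illustrative invocation; the reporter's actual wrapper grants more privileges
docker run --rm --gpus all --device /dev/nvidia-fs0 \
    my-kvikio-image:latest \
    mamba run -n kvikio2410_cuda122 python /tmp/bug.py
```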
@madsbk How do we continue with this? Have you been able to reproduce the segfault with cufile_stats > 0? |
Sorry, I am still not able to reproduce :/ Can you try setting […] |
Thanks for the suggestions. I'll try it with these options again. |
Hi everyone,
I'm getting a segfault when my Python script terminates. This only happens when kvikio is used.

Reproducer

```sh
mamba env create -f img2tensor_kvikio.yaml && mamba clean -afy
```

bug.py
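The bug.py attachment itself isn't reproduced in this export; a hypothetical sketch of the kind of script that triggers the problem (assumes cupy alongside kvikio in the environment):

```python
# Hypothetical reproducer sketch; the reporter's actual bug.py is not shown.
import cupy
import kvikio

a = cupy.arange(100)

# Write the device buffer to a file through cuFile/GDS
f = kvikio.CuFile("/tmp/kvikio-test", "w")
f.write(a)
f.close()

# Read it back into a fresh device buffer and verify
b = cupy.empty_like(a)
f = kvikio.CuFile("/tmp/kvikio-test", "r")
f.read(b)
f.close()
assert (a == b).all()
# The reported segfault happens after this point, during interpreter shutdown.
```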
I'm running in a Kubernetes environment. We use the open kernel driver 535.183.01.
I assumed #462 had fixed the issue, but it seems there is more to it.
You can find the concretized environment here:
exported_img2tensor_kvikio.txt
It uses kvikio 24.10, which should include the previously mentioned PR.