Handling fatal errors #603

Vipitis · 2024-09-26T18:46:53Z

This is a WIP; its purpose is to fail badly, so don't merge.

I am compiling all failure cases that cause the Python process to crash or silently exit, to think about how to handle them as Python exceptions. Debuggers will also just exit, so you have to step through the code repeatedly to figure out which part breaks it.
Where possible I link the appropriate upstream issues and minimal reproducers.
Some cases are dependent on their backend. I usually default to Vulkan, but test DX12 for comparison too.

Motivation

My main use case is running a large number of shaders to sort out which work and which don't. This is part of my thesis work on evaluating (generated) shadercode, so there can be all kinds of messed-up code. I am not interested in how to write working code; this is about being able to handle errors.

Previous attempts

Naga validation (and by extension `get_compilation_info`) will not catch these problems; it is equivalent to getting a GPUValidationError in `create_shader_module`.
Multiprocessing/threading doesn't currently work in wgpu-py, but changes for async support might make this possible. Some of the problems are that the cffi objects can't be pickled, and that there is no reliable way to detect whether spawned child processes time out or crash.
Writing to a temporary Python file and running that with subprocess: this actually works so far, but it is a horrible solution and really slow. On my machine the overhead to request an adapter etc. is easily 3+ seconds, so testing 30k shaders can take a whole day. (reference to awful code)

Plan

collect cases

I started a test_wgpu_errors_fatal.py file as part of the test suite (mostly reusing other test code). Running it will also crash pytest, so maybe we skip it by default. It would be great to add some more cases and also note down how they panic and where in our code (usually when calling the C function). Please contribute any kind of crash you encounter, even if you don't have a minimal reproducer. I have spent a few nights hunting bugs really far down, so I might spot something.

Whether the shader is written in WGSL or GLSL doesn't really matter, since it's always translated to WGSL, so I just went with that.

find a solution

Problems will eventually be fixed upstream and make it to wgpu-py, and that will be the best solution. But it can take months, and some issues haven't been fixed upstream for nearly a year.

This is the part where I am sort of lost myself. Maybe the changes to the device-lost logging could lead to an error being raised, as tried in #547. Otherwise, it might take changes to how SafeLibCalls works, so that the Python instance is not dropped when the Rust code reaches panic!.
Usually the last line that gets executed is this:

wgpu-py/wgpu/backends/wgpu_native/_helpers.py

Line 305 in cf59eb0

result = ob(*args)

Korijn · 2024-09-26T19:36:08Z

Not sure if it helps to solve this, but it might help investigate options: `faulthandler.enable()` is my go-to stdlib util for debugging hard crashes. It will tell you exactly where and when the crash happened.
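A minimal sketch of that suggestion, assuming the crash surfaces as a fatal signal (for example SIGSEGV or SIGABRT); a process that exits silently without a signal would leave the log empty. The helper name `enable_crash_traces` and the log path are made up for illustration.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python tracebacks to log_path when a fatal signal arrives."""
    # faulthandler writes through the file descriptor at crash time, so the
    # file has to stay open. The handle is returned instead of being closed.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces("crash.log")` before the first wgpu call should then leave the Python-side stack in `crash.log` if the process dies on a signal. The same handler can be switched on without code changes via `python -X faulthandler` or the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr.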

Aside from that, it's not technically possible to catch a segfault in the current process (threads or not). You can only detect whether a subprocess has (likely) segfaulted by examining its exit code.
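As a rough illustration of examining the exit code: on POSIX, the `returncode` reported by the `subprocess` module is the negated signal number when the child was killed by a signal. The helper below is hypothetical and only interprets that number; on Windows, crashes tend to show up as large positive codes instead, which this sketch simply reports as a plain exit code.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess return code into a short verdict string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # POSIX convention in the subprocess module: -N means "killed by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number without a name on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with code {returncode}"
```

The value to pass in is `proc.returncode` from `subprocess.run(...)` or `Popen.wait()`. A verdict of "killed by SIGSEGV" or "killed by SIGABRT" is only a strong hint that the native library crashed, not proof of where.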

The "proper" fix would be to modify the upstream rust libraries, such that they handle their internal exceptions without panicking or segfaulting.

fyellin · 2024-09-27T21:07:24Z

Are you looking for bad shader code? Or anything that makes wgpu crash and burn? And if we find something, how do we send it to you? I know of some issues with RenderBundles.

Vipitis · 2024-09-27T23:09:32Z

Exactly, I am concerned with shadercode that is buggy enough to crash an otherwise working program. There will be plenty of ways to get wgpu to reach a panic, but assume you are just accepting user/generated shadercode, nothing else. You want to avoid crashing your scripts, no matter how bad or even malicious the shadercode is.

There are two examples in the diff right now, with one or two more coming. For example, right now I am trying to root-cause and minimize the problem with this shadertoy. Another one I am trying to figure out is here; it just exits somewhere in create_render_pipeline with Vulkan but works in DX12.

almarklein · 2024-09-30T10:30:27Z

result = ob(*args)

This is where we make a call into the library, i.e. Rust land. Indeed, if something bad happens there (like a panic), there's no way to catch that.

Not sure if this helps, but maybe the test script can run the failing examples in a subprocess, so that pytest itself is still alive and can e.g. check the return code, and Rust tracebacks.
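A sketch of what such a test could look like, under a few assumptions: each fatal case lives in its own script, `RUST_BACKTRACE=1` is set so that a Rust panic prints a backtrace to stderr, and `os.abort()` stands in for a crashing shader so the example runs without a GPU. The names `run_case_in_subprocess` and `test_fatal_case_is_contained` are hypothetical; `tmp_path` is pytest's built-in temporary directory fixture.

```python
import os
import subprocess
import sys


def run_case_in_subprocess(script_path, timeout=120):
    """Run one case in a fresh interpreter so a hard crash cannot kill pytest."""
    # RUST_BACKTRACE=1 asks the Rust runtime to print a backtrace on panic.
    env = dict(os.environ, RUST_BACKTRACE="1")
    return subprocess.run(
        [sys.executable, script_path],
        capture_output=True,  # keep stdout/stderr for later inspection
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if the case hangs
        env=env,
    )


def test_fatal_case_is_contained(tmp_path):
    # Stand-in for a script that compiles a crashing shader:
    # os.abort() kills the child process without raising a Python exception.
    script = tmp_path / "fatal_case.py"
    script.write_text("import os\nos.abort()\n")

    result = run_case_in_subprocess(str(script))

    # The child died, but this test process is still alive and can inspect it.
    assert result.returncode != 0
```

In a real test the script would contain the wgpu reproducer, and `result.stderr` could be checked for the panic message or attached to the test report.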

first two fatal cases

d1ab41c
