From ae68635af9dfcae359f621dd3e1df3b3c3d97042 Mon Sep 17 00:00:00 2001
From: Nicolas Patry
Date: Wed, 2 Aug 2023 18:16:50 +0200
Subject: [PATCH] Add small error management.

---
 candle-book/src/error_manage.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/candle-book/src/error_manage.md b/candle-book/src/error_manage.md
index af7593d68d..c1a16bd9da 100644
--- a/candle-book/src/error_manage.md
+++ b/candle-book/src/error_manage.md
@@ -36,4 +36,16 @@ Another thing to note, is that since Rust is compiled it is not necessarily as e
 especially in release builds. We're using [`anyhow`](https://docs.rs/anyhow/latest/anyhow/) for that.
 The library is still young, please [report](https://github.com/LaurentMazare/candle/issues) any issues detecting where an error is coming from.

+## CUDA error management
+
+When running a model on CUDA, you might get a stack trace that does not point at the actual source of the error.
+The reason is that CUDA is asynchronous by nature, so the error might only be caught while you are launching totally different kernels.
+
+One way to avoid this is to set the environment variable `CUDA_LAUNCH_BLOCKING=1`. This forces every kernel launch to be synchronous, so kernels run one at a time.
+However, you might still see the error surface in another kernel: the faulty kernel can exit without an error after corrupting some pointer, and the error then only appears when the `CudaSlice` is dropped.
+
+
+If this occurs, you can use [`compute-sanitizer`](https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html).
+This tool is like `valgrind` but for CUDA. It will help locate the errors in the kernels.
+
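
Reviewer note: the two debugging steps described in the added section could be illustrated with a small launcher. This is only a sketch, not part of the patch or of candle. The helper names `blocking_cuda_command` and `sanitizer_command` are hypothetical, and it assumes the program to debug is passed on the command line and that `compute-sanitizer` is available on `PATH`.

```rust
use std::process::Command;

/// Build a command that runs `program` with `CUDA_LAUNCH_BLOCKING=1`
/// set in the child's environment only (hypothetical helper).
fn blocking_cuda_command(program: &str, args: &[String]) -> Command {
    let mut cmd = Command::new(program);
    cmd.args(args).env("CUDA_LAUNCH_BLOCKING", "1");
    cmd
}

/// Build a command that runs the same program under `compute-sanitizer`
/// (hypothetical helper; assumes the tool is on PATH).
fn sanitizer_command(program: &str, args: &[String]) -> Command {
    let mut cmd = Command::new("compute-sanitizer");
    cmd.arg(program).args(args);
    cmd
}

fn main() -> std::io::Result<()> {
    let mut argv = std::env::args().skip(1);
    let Some(program) = argv.next() else {
        eprintln!("usage: cuda-debug <program> [args...]");
        return Ok(());
    };
    let args: Vec<String> = argv.collect();

    // First attempt: synchronous kernel launches, so an error is more
    // likely to be reported at the launch that caused it.
    let status = blocking_cuda_command(&program, &args).status()?;

    // If the program still fails, rerun it under compute-sanitizer to
    // look for memory errors inside the kernels.
    if !status.success() {
        sanitizer_command(&program, &args).status()?;
    }
    Ok(())
}
```

The same effect is usually achieved directly from a shell by prefixing the program with the environment variable or with `compute-sanitizer`. The launcher form only makes the order of the two steps explicit.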