dask_cudf, when OOM or illegal access, hangs #6279
Comments
That's a really weird backtrace: from xgboost to cupy to numpy and then to cupy again, and from libstdc++ to libgcc and then back to libstdc++.
Yes, I noticed that too, but I didn't know whether it was odd. My guess is that it's a cupy issue, since the code is only doing some very basic numpy things that don't use CPU data, but I'm not sure.
To avoid hanging, the best way is to fix the segfault itself; a proper Python exception is fine and should not lead to a hang. Another way is to let RABIT detect whether the current allreduce is consistent with the rest of the workers, which is quite difficult to implement at the moment.
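To illustrate the difference between the two failure modes, here is a minimal sketch that uses only the Python standard library. It is not RABIT's or XGBoost's API: `threading.Barrier` stands in for an allreduce, and `run_workers`, `failing_rank`, and `notify` are hypothetical names made up for this example. A worker that fails with a catchable error can tell its peers, while a worker that crashes just never arrives, so the peers only get out if the collective has a timeout.

```python
import threading


def run_workers(n_workers, failing_rank=None, notify=True, timeout=0.5):
    """Run n_workers threads that meet at one barrier (a stand-in for allreduce).

    failing_rank: rank that fails before reaching the barrier, or None.
    notify: True models a Python exception (the worker can tell its peers);
            False models a hard crash such as a segfault (the worker vanishes).
    """
    barrier = threading.Barrier(n_workers)
    status = {}

    def worker(rank):
        if rank == failing_rank:
            if notify:
                # A catchable error lets the worker break the barrier for everyone.
                barrier.abort()
                status[rank] = "failed"
            else:
                # A crashed process cannot notify anyone; it simply never arrives.
                status[rank] = "crashed"
            return
        try:
            barrier.wait(timeout=timeout)
            status[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised when a peer calls abort(), or when the wait times out.
            status[rank] = "peer missing"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status
```

With `notify=True` the surviving workers are released as soon as the barrier is aborted. With `notify=False` they are released only when `timeout` expires, and with no timeout at all they would wait forever, which is the kind of hang described in this issue.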
I don't think we can handle a segfault with fault tolerance. If you have a specific example of a segfault, please share it and we will do our best to address it.
See #6232 for setup details.
Running dask_cudf in a way very similar to rapidsai/ucx-py#655.
illegal.txt.zip
fragment:
This happens when I'm using dask_cudf and fitting over and over again, all of which works. But then one more fit in a slightly different Python context (same fork/thread, though) leads to this. It doesn't always happen, and I'll try to make an MRE, but maybe something is clear from the backtrace.