[BUG] ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal in ParquetChunkedReader #11215
Comments
Thanks for filing the issue. Do you have any details on how we could reproduce this? To check whether the failure is specific to the chunked reader, you could try re-running with spark.rapids.sql.reader.chunked=false.
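The re-run suggested above could be set up roughly as follows. This is a minimal sketch: `spark.rapids.sql.reader.chunked` is the setting named in the comment, while the helper `to_submit_flags` and the `diagnostic_conf` dictionary are hypothetical names used only for illustration. The code builds `--conf` flags for `spark-submit` and does not start Spark itself.

```python
# Hypothetical helper (not part of Spark or the RAPIDS plugin):
# turn a dict of Spark settings into spark-submit "--conf" flags.
def to_submit_flags(conf):
    flags = []
    for key, value in sorted(conf.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


# Setting named in the thread: disable the chunked Parquet reader for one run.
diagnostic_conf = {"spark.rapids.sql.reader.chunked": "false"}

if __name__ == "__main__":
    # Append the printed flags to the usual spark-submit command line.
    print(" ".join(to_submit_flags(diagnostic_conf)))
```

If the exception disappears with this setting, that points at the chunked reader; if it persists, the problem is likely elsewhere in the read path.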
The RAPIDS Spark plugin will work on the driver versions that come with CUDA 11 and CUDA 12, so the CUDA version should not be an issue.
This likely indicates that we're trying to run a kernel with 0 blocks. @nvdbaranec is there a way the input data could trigger the Parquet chunked reader to do this?
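To make the "0 blocks" hypothesis concrete, here is the usual grid-size arithmetic as a small sketch. It is not cuDF or Thrust code; the function name and the default of 256 threads per block are assumptions chosen for illustration. It only shows that an empty work list yields a grid of zero blocks unless the caller guards against it. Whether that is what happens in the reader is not established in this thread.

```python
def blocks_needed(num_items, threads_per_block=256):
    """Ceiling division: how many blocks cover num_items work items."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    if num_items < 0:
        raise ValueError("num_items must be non-negative")
    return (num_items + threads_per_block - 1) // threads_per_block


def should_launch(num_items, threads_per_block=256):
    """A launcher would normally skip the kernel when no blocks are needed."""
    return blocks_needed(num_items, threads_per_block) > 0
```

A bookkeeping step that computes zero items for some input (for example, an empty set of pages) and then launches without a check like `should_launch` would request an empty grid.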
There shouldn't be, but there's certainly the possibility of a bug. Based on this trace, it looks like it's in one of the setup kernels that runs before the actual decoding, computing various bits of bookkeeping related to chunking.
Would it be possible to get a sample of the data here?
@nvdbaranec I tried to describe the problem more clearly. We encountered this error when executing the following SQL command.
where we LEFT JOIN table genotypesdt_src with another table, variantdt_src.
genotypesdt_src
variantdt_src
The exception does not appear when only the LEFT JOIN is executed; that request finishes in about half a minute. However, once we add GROUP BY sampleId, the exception starts to appear. This is the logical plan of our query
Moreover, after this exception occurs, we observe that the session does not stop and the Spark cluster keeps occupying an executor. Note: no exception occurs if we group by runName or SYMBOL.
Thanks for adding some details, @LIN-Yu-Ting! I suspect this is not related to the GROUPBY but instead is related to which columns are or are not being loaded from the tables. For reference, here's the working query:
Note that the number of columns being loaded during the scans between this and the failing query listed above is significantly different due to Spark's column pruning. In the original query, the
There's a chance the query may not fail if the join produces enough sampleId values to hit the limit before a task gets around to scanning the problematic file, since a limit can cause query processing to exit early. Increasing or removing the limit should cause it to fail as with the GROUP BY (assuming the driver has enough memory to hold the result, which may be infeasible for this scenario). So my guess is that the absence of the few columns I listed above is what triggers the error.
It would be interesting to see if we can trigger this without a join at all. I would expect one of these queries to fail, depending on whether the problematic file is in the GenotypeDT or VariantsDT table. Limits or writes to temporary tables may need to be added if the driver cannot hold the results, with the caveat that limits may early-out the query before it fails.
or
We could also check which stage is failing in the query and map that back to the table being scanned, which would tell us which table is the problematic one.
@nvdbaranec given that this provides clues that adding columns to the scan avoids the invalid device ordinal crash, I'm wondering if there's an issue where a large cluster of nulls in a column (or some other corner case) could cause one of the kernels to try to launch with zero blocks?
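One way to look for the "large cluster of nulls" mentioned above is to measure the longest run of consecutive nulls in a column. The following is a plain-Python sketch over values that have already been read into a list in file order; the function name is hypothetical and nothing here is part of the RAPIDS plugin or cuDF.

```python
def longest_null_run(values):
    """Return (start_index, length) of the longest run of None values.

    Returns (-1, 0) when the input contains no None at all.
    """
    best_start, best_len = -1, 0
    run_start, run_len = -1, 0
    for i, value in enumerate(values):
        if value is None:
            if run_len == 0:
                run_start = i  # a new run of nulls begins here
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0  # a non-null value ends the current run
    return best_start, best_len
```

The result is only meaningful if the values are in the same order as the rows in the Parquet file, so this is best applied to a single file read directly rather than to the output of a distributed query, where row order is not guaranteed.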
@jlowe Thanks for your comments. It seems that there is no exception while executing the two SQL queries you recommended. I think it is because table genotypesdt_src contains more than 9 billion records. I then tried the following SQL and got
Then, by executing the same query again, we reproduced the original exception.
OK, those new queries and exception messages help. That shows us that the problem can be reproduced by loading just the sampleId column from the genotypesdt_src and specifically that the issue can be reproduced when trying to load the concatenated data from these ranges, assuming this is the first occurrence of the error:
There are some other file range groups reporting an error, but it's hard to tell if these occurred before or after the one reported above. They seemed to occur afterwards, but there could be a race between threads for which one reports the error first. Once the error occurs, all reads and any other GPU operations will report the error since a GPU illegal address exception is unrecoverable for a CUDA process once it occurs.
@LIN-Yu-Ting it would be great if you have some time to investigate whether anything "interesting" is going on with the sampleId values in these files (e.g. long sequences of NULL values) that might trigger a corner case in the Parquet reader code. It would also be interesting to see if we can reproduce the error by running the query on one or more tables consisting of just the files in each group separately (e.g. running Spark in local mode with only 1 executor with only 1 core and a large setting for spark.sql.files.maxPartitionBytes, so that a single task tries to load all of the data).
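The single-task reproduction described above might be configured along these lines. This is a sketch under stated assumptions: `local[1]` is Spark's local master with one worker thread, and `spark.sql.files.maxPartitionBytes` is the setting named in the comment, but the value `64g` is an arbitrary placeholder for "large" and the helper name `local_repro_args` is hypothetical. The code only assembles `spark-submit` arguments and does not launch anything.

```python
# Hypothetical helper: arguments for a one-core local run in which file
# splits are large enough that one task reads a whole group of files.
def local_repro_args(max_partition_bytes="64g"):
    # "64g" is an assumed placeholder; pick a value larger than the
    # combined size of the files in the group being tested.
    return [
        "--master", "local[1]",
        "--conf", f"spark.sql.files.maxPartitionBytes={max_partition_bytes}",
    ]


if __name__ == "__main__":
    print(" ".join(local_repro_args()))
```

Running the query this way against a table that contains only the files from one suspect group should make the failure, if it occurs, attributable to that group rather than to a race between concurrent tasks.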
@LIN-Yu-Ting If you want to share a sample file that can reproduce this issue, or discuss further details that involve sensitive information, you can use spark-rapids-support [email protected] and we will keep the discussion internal with you.
@viadea Are there any updates from your side?
Describe the bug
We are using Spark Rapids 24.08-SNAPSHOT with delta table 2.3.0, and we encounter the following exception while executing a LEFT JOIN SQL query.
Steps/Code to reproduce bug
Expected behavior
Environment details (please complete the following information)
Additional context
Our nvidia-smi output reports that we are using CUDA 12.2. Could this be caused by a CUDA version mismatch?