[BUG] Parquet writer encodes timestamp statistics incorrectly #14315
Comments
Note that the "good" files are actually a bit smaller than the "bad" files, which is a bit surprising.
I tracked down in Apache Spark why this isn't working properly, and it's because Spark does not recognize the logical type when deciding whether to apply a filter. It expects the timestamp logical type to be adjusted to UTC for filtering, but after #14264 the logical type written for timestamps says they are not adjusted to UTC. The match on the logical type then fails when Spark checks whether the filter can be applied, so the filter ends up as a no-op. A no-op filter explains the behavior we're seeing. See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L147-L150 for Spark's definition of the expected timestamp logical types and their use when deciding whether a filter can be pushed down.
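A minimal sketch of the kind of check involved, for illustration only (this is not Spark's actual source; the helper name `supportsTimestampPushdown` and the restriction to microsecond precision are assumptions): the pushdown path only matches a timestamp annotation with `isAdjustedToUTC = true`, so an annotation written with `false` falls through and no filter gets built.

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}

// Illustrative helper (not Spark's code): is this a logical type that the
// pushdown code recognizes? Spark expects isAdjustedToUTC = true, so an
// annotation written with false never matches and the filter is dropped.
def supportsTimestampPushdown(annotation: LogicalTypeAnnotation): Boolean =
  annotation match {
    case ts: TimestampLogicalTypeAnnotation
        if ts.isAdjustedToUTC() && ts.getUnit() == TimeUnit.MICROS => true
    case _ => false
  }

// With the logical type written after #14264:
val written = LogicalTypeAnnotation.timestampType(/* isAdjustedToUTC = */ false, TimeUnit.MICROS)
println(supportsTimestampPushdown(written)) // false -> predicate pushdown is skipped
```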
From a discussion with @etseidl, it does not look like the logical type has changed. Instead, the issue likely happens now because we write the logical type (that was, and still is, incorrect*) more consistently.
Yes, that makes sense to me. |
Closes #14315
Closes #14326

The Parquet writer writes time and timestamp types with a logical type that has `isAdjustedToUTC` set to `false`. However, timestamps in libcudf tables are implicitly in UTC and don't need to be adjusted. This PR changes `isAdjustedToUTC` to `true`. It also adds a writer option to write timestamps as local, since this is the expected behavior on the Python side.

It also changes how the logical type is handled for UNKNOWN type columns in `merge_row_group_metadata`: the logical type is excluded from the merged metadata because of issues with type inference.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- David Wendt (https://github.com/davidwendt)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #14322
Describe the bug
After #14264, Parquet files written with timestamp columns have the timestamp statistics for the column in the footer encoded incorrectly. Apache Spark is no longer able to properly perform predicate pushdown on timestamps for these files.
Steps/Code to reproduce bug
Attached are two files with equivalent contents: one was created with the RAPIDS Accelerator for Apache Spark using a cudf build that includes #14264, and the other with the same software except cudf rolled back one commit to before #14264. Spark can properly evaluate timestamp predicates on the file created before #14264 but fails to do so afterwards, always loading all of the data.
You can see the failed predicate pushdown by running the following query in Spark and examining the Spark UI to see how many rows were produced by the file scan before the filter is applied. Replace "/tmp/pqgood" with the appropriate path to the good or bad Parquet files.
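The exact query from the original report isn't preserved in this excerpt; a minimal Scala sketch of the same shape follows (the column name `ts` and the timestamp literal are placeholders, not values from the report):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("timestamp-pushdown-repro").getOrCreate()

// Read the directory of Parquet files and filter on the timestamp column.
// With correct footer statistics, Spark can skip files/row groups whose
// min/max range cannot contain the value; with the "bad" files it scans everything.
val df = spark.read.parquet("/tmp/pqgood") // or "/tmp/pqbad"
val filtered = df.filter("ts = timestamp'2023-10-16 12:34:56'") // placeholder column and value
filtered.collect()

// In the Spark UI, compare the number of rows produced by the scan node
// against the row count remaining after the filter node.
```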
In the "good" case, Spark should load fewer than the total number of rows because it can see from the footers of the separate files that only one file is worth loading. In the "bad" case, Spark loads all of the data because it cannot detect that only one file could possibly contain the value the query is looking for.
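To check which logical type and statistics a given file actually carries, the footer can be inspected with Parquet's Java API. A minimal sketch, assuming the parquet-hadoop dependency is on the classpath (the file name under the directory is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open one of the attached files and print each column's logical type and the
// per-row-group statistics; this is where the isAdjustedToUTC flag and the
// timestamp min/max used for predicate pushdown live.
val input = HadoopInputFile.fromPath(new Path("/tmp/pqgood/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)
try {
  val footer = reader.getFooter
  footer.getFileMetaData.getSchema.getColumns.forEach { col =>
    println(s"${col.getPrimitiveType.getName}: ${col.getPrimitiveType.getLogicalTypeAnnotation}")
  }
  footer.getBlocks.forEach { block =>
    block.getColumns.forEach { chunk =>
      println(s"${chunk.getPath}: ${chunk.getStatistics}")
    }
  }
} finally reader.close()
```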
Expected behavior
Apache Spark should be able to correctly perform predicate pushdown on timestamp columns for Parquet files written by libcudf.
pqbad.zip
pqgood.zip