
[BUG] Parquet writer encodes timestamp statistics incorrectly #14315

Closed
jlowe opened this issue Oct 23, 2023 · 4 comments · Fixed by #14322
Labels
bug: Something isn't working
cuIO: cuIO issue
libcudf: Affects libcudf (C++/CUDA) code
Spark: Functionality that helps Spark RAPIDS

Comments

@jlowe
Member

jlowe commented Oct 23, 2023

Describe the bug
After #14264, Parquet files written with timestamp columns have those columns' timestamp statistics encoded incorrectly in the footer. Apache Spark is no longer able to perform predicate pushdown on timestamps for these files.

Steps/Code to reproduce bug
Attached are two files with equivalent contents: one was created with the RAPIDS Accelerator for Apache Spark using a cudf build that includes #14264, and the other with the same software except cudf rolled back one commit to just before #14264. Spark can properly evaluate timestamp predicates on the file created before #14264 but fails to do so on the file created afterwards, always loading all files.

You can see the failed predicate pushdown by running the following query in Spark and examining the Spark UI to see how many rows the file scan produced before the filter was applied. Replace "/tmp/pqgood" with the appropriate path to the good or bad Parquet files.

spark.read.parquet("/tmp/pqgood").filter(col("a") === java.sql.Timestamp.valueOf("1970-01-01 00:00:00.500")).collect

In the "good" case, Spark should load less than the total number of rows because it can properly see via the footers in the separate files that only one file is worth loading. In the "bad" case, Spark loads all of the data because it cannot properly detect that only one file could possibly contain the value trying to be found by the query.

Expected behavior
Apache Spark should be able to correctly perform predicate pushdown on timestamp columns for Parquet files written by libcudf.

pqbad.zip
pqgood.zip

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Oct 23, 2023
@jlowe
Member Author

jlowe commented Oct 23, 2023

Note that the "good" files are actually a bit smaller than the "bad" files which is a bit surprising.

@jlowe
Member Author

jlowe commented Oct 23, 2023

I tracked down in Apache Spark why this isn't working properly: Spark does not recognize the logical type when deciding whether to apply a filter. It expects the timestamp logical type to be adjusted to UTC for filtering, but after #14264 the logical type for timestamps says they are not adjusted to UTC. Spark therefore fails to find a match on the logical type when checking whether the filter can be applied, so the filter ends up a no-op. A no-op filter explains the behavior we're seeing.

See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L147-L150 for Spark's definition of the expected timestamp logical types and its use in valueCanMakeFilterOn at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L623
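For reference, the logical type that actually landed in a file's footer can be inspected with parquet-mr; a minimal sketch, where the part-file name inside the attached directory is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation

// Open one of the files and read its footer (the file name is a placeholder).
val input  = HadoopInputFile.fromPath(new Path("/tmp/pqbad/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)

// Look up the logical type annotation on the timestamp column "a".
val annotation = reader.getFooter.getFileMetaData.getSchema
  .getType("a").asPrimitiveType().getLogicalTypeAnnotation()

annotation match {
  case ts: TimestampLogicalTypeAnnotation =>
    // Spark's ParquetFilters only builds a timestamp filter when this is true.
    println(s"isAdjustedToUTC = ${ts.isAdjustedToUTC}")
  case other =>
    println(s"unexpected logical type: $other")
}
reader.close()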

@vuule
Contributor

vuule commented Oct 23, 2023

From a discussion with @etseidl, it does not look like the logical type has changed. Instead, the issue likely shows up now because we write the logical type (which was, and still is, incorrect*) more consistently.
*Since timestamps are implicitly in UTC in libcudf, we propose to simply modify the logical type to have isAdjustedToUTC = true. This would indicate that libcudf timestamps are in the UTC timezone and don't need to be adjusted when written to Parquet.
@jlowe does this make sense from your perspective? I want to make sure I'm not just hiding a more serious issue with this change.
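In parquet-mr terms, the difference boils down to the annotation below; a sketch, and the microsecond unit is an assumption for illustration:

import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit

// What is written after #14264 (not adjusted to UTC), which Spark's
// ParquetFilters refuses to build a timestamp filter against:
val written  = LogicalTypeAnnotation.timestampType(false, TimeUnit.MICROS)

// What the proposal above would have libcudf write instead,
// matching the annotation Spark expects:
val expected = LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS)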

@jlowe
Member Author

jlowe commented Oct 24, 2023

Yes, that makes sense to me.

rapids-bot bot pushed a commit that referenced this issue Oct 31, 2023
Closes #14315
Closes #14326

The Parquet writer writes time and timestamp types with a logical type whose `isAdjustedToUTC` is `false`. However, timestamps in libcudf tables are implicitly in UTC and don't need to be adjusted.
This PR changes `isAdjustedToUTC` to `true`.

Also added a writer option to write timestamps as local, as this is the expected behavior on the Python side.

Also changed the way the logical type is handled for UNKNOWN-type columns in `merge_row_group_metadata`: the logical type is excluded from the merged metadata because of issues with type inference.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14322
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024