
[BUG] Parquet writer encodes timestamp statistics incorrectly #14315

Closed
jlowe opened this issue Oct 23, 2023 · 4 comments · Fixed by #14322
Labels
bug: Something isn't working
cuIO: cuIO issue
libcudf: Affects libcudf (C++/CUDA) code
Spark: Functionality that helps Spark RAPIDS

Comments

@jlowe
Member

jlowe commented Oct 23, 2023

Describe the bug
After #14264, Parquet files written with timestamp columns have those columns' timestamp statistics encoded incorrectly in the footer. Apache Spark is no longer able to perform predicate pushdown on timestamps for these files.

Steps/Code to reproduce bug
Attached are two files with equivalent contents: one was created with the RAPIDS Accelerator for Apache Spark using a cudf build that includes #14264, and the other with the same software except cudf rolled back one commit to just before #14264. Spark can properly evaluate timestamp predicates on the file created before #14264 but fails to do so on the file created afterwards, always loading all files.

You can see the failed predicate pushdown by running the following query in Spark and examining the Spark UI to see how many rows the file scan produced before the filter was applied. Replace "/tmp/pqgood" with the appropriate path to the good or bad Parquet files.

spark.read.parquet("/tmp/pqgood").filter(col("a") === java.sql.Timestamp.valueOf("1970-01-01 00:00:00.500")).collect

In the "good" case, Spark should load less than the total number of rows because it can properly see via the footers in the separate files that only one file is worth loading. In the "bad" case, Spark loads all of the data because it cannot properly detect that only one file could possibly contain the value trying to be found by the query.

Expected behavior
Apache Spark should be able to correctly perform predicate pushdown on timestamp columns for Parquet files written by libcudf.

pqbad.zip
pqgood.zip

@jlowe jlowe added bug Something isn't working Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Oct 23, 2023
@jlowe
Member Author

jlowe commented Oct 23, 2023

Note that the "good" files are actually a bit smaller than the "bad" files which is a bit surprising.

@jlowe
Member Author

jlowe commented Oct 23, 2023

I tracked down in Apache Spark why this isn't working properly: Spark does not recognize the logical type when deciding whether to apply a filter. It expects the timestamp logical type to be adjusted to UTC for filtering, but after #14264 the logical type for timestamps says they are not adjusted to UTC. Spark therefore fails to find a match on the logical type when checking whether the filter can be applied, so the filter ends up a no-op. A no-op filter explains the behavior we're seeing.

See https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L147-L150 for Spark's definition of the expected timestamp logical types and its use in valueCanMakeFilterOn at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L623
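For reference, the logical type that actually landed in a file's footer can be inspected with parquet-mr; a minimal sketch, where the part-file name inside the attached directory is a hypothetical placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation

// Open one of the files and read its footer (the file name is a placeholder).
val input  = HadoopInputFile.fromPath(new Path("/tmp/pqbad/part-00000.parquet"), new Configuration())
val reader = ParquetFileReader.open(input)

// Look up the logical type annotation on the timestamp column "a".
val annotation = reader.getFooter.getFileMetaData.getSchema
  .getType("a").asPrimitiveType().getLogicalTypeAnnotation()

annotation match {
  case ts: TimestampLogicalTypeAnnotation =>
    // Spark's ParquetFilters only builds a timestamp filter when this is true.
    println(s"isAdjustedToUTC = ${ts.isAdjustedToUTC}")
  case other =>
    println(s"unexpected logical type: $other")
}
reader.close()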

@vuule
Contributor

vuule commented Oct 23, 2023

From a discussion with @etseidl, it does not look like the logical type has changed. Instead, the issue likely shows up now because we write the logical type (which was, and still is, incorrect*) more consistently.
*Since timestamps are implicitly in UTC in libcudf, we propose to simply modify the logical type to have isAdjustedToUTC = true. This would indicate that libcudf timestamps are in the UTC timezone and don't need to be adjusted when written to Parquet.
@jlowe does this make sense from your perspective? I want to make sure I'm not just hiding a more serious issue with this change.
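In parquet-mr terms, the difference boils down to the annotation below; a sketch, and the microsecond unit is an assumption for illustration:

import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit

// What is written after #14264 (not adjusted to UTC), which Spark's
// ParquetFilters refuses to build a timestamp filter against:
val written  = LogicalTypeAnnotation.timestampType(false, TimeUnit.MICROS)

// What the proposal above would have libcudf write instead,
// matching the annotation Spark expects:
val expected = LogicalTypeAnnotation.timestampType(true, TimeUnit.MICROS)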

@jlowe
Member Author

jlowe commented Oct 24, 2023

Yes, that makes sense to me.

rapids-bot bot pushed a commit that referenced this issue Oct 31, 2023
Closes #14315
Closes #14326

The Parquet writer writes time and timestamp types with a logical type whose `isAdjustedToUTC` is `false`. However, timestamps in libcudf tables are implicitly in UTC and don't need to be adjusted.
This PR changes `isAdjustedToUTC` to `true`.

Also added a writer option to write timestamps as local, as this is the expected behavior on the Python side.

Also changed the way the logical type is handled for UNKNOWN-type columns in `merge_row_group_metadata`: the logical type is excluded from the merged metadata because of issues with type inference.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14322
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024