[SPARK-45891][SQL] Rebuild variant binary from shredded data. #48851

Open · wants to merge 1 commit into master

Conversation

chenhao-db (Contributor)

What changes were proposed in this pull request?

It implements the variant rebuild functionality according to the current shredding spec in apache/parquet-format#461, so that the Parquet reader is able to read shredded variant data.

Why are the changes needed?

It gives Spark the basic ability to read shredded variant data. It can be improved in the future to read only requested fields.
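
For context, here is a minimal sketch of what "rebuilding" means, using plain Java maps in place of the Parquet columns and the variant binary; the PR itself operates on ShreddedRow, VariantSchema, and VariantBuilder, so everything below is illustrative only.

import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch only, not the PR's implementation.
// Shredded layout per the spec (roughly): a variant column becomes a struct of
//   metadata     -- binary, the variant metadata (dictionary of field names)
//   value        -- binary, the residual, unshredded part of the value
//   typed_value  -- strongly typed column(s) for the shredded part
public class RebuildSketch {
  public static void main(String[] args) {
    // Original variant value: {"a": 1, "b": "hello"}.
    // After shredding, "a" lives in a typed_value column and "b" stays in the
    // residual value blob.
    Map<String, Object> typedValue = Map.of("a", 1);
    Map<String, Object> residualValue = Map.of("b", "hello");

    // Rebuilding recombines the typed fields with the residual fields into one value.
    Map<String, Object> rebuilt = new LinkedHashMap<>(residualValue);
    rebuilt.putAll(typedValue);
    System.out.println(rebuilt); // prints {b=hello, a=1}
  }
}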

Does this PR introduce any user-facing change?

Yes, the Parquet reader will be able to read shredded variant data.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

chenhao-db (Contributor, Author):

@gene-db @cashmand @cloud-fan could you help review? Thanks!

cashmand (Contributor) left a comment:

Thanks, change LGTM!

int numElements();
}

public static Variant rebuild(ShreddedRow row, VariantSchema schema) {
Contributor:

Can you mention that this rebuild function should only be called on the top-level schema, and that the other one can be called on any recursively shredded sub-schema?
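
One possible wording for that doc comment, as a sketch only (the exact contract is for the author to confirm; the "recursive overload" referred to below is the one that takes a metadata and a VariantBuilder argument):

/**
 * Rebuilds the variant binary from a shredded row.
 *
 * This overload must only be called with the top-level shredding schema, since it
 * reads the metadata column and produces the final Variant. The recursive overload,
 * which appends into an existing VariantBuilder, can be called on any recursively
 * shredded sub-schema (object fields or array elements).
 */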

} else if (variantIdx >= 0 && !row.isNullAt(variantIdx)) {
builder.appendVariant(new Variant(row.getBinary(variantIdx), metadata));
} else {
builder.appendNull();
Contributor:

This is the case that shouldn't really be valid, right? For objects, we shouldn't be calling rebuild recursively, and for arrays or top-level values, we should be storing it in value? Might be worth a comment.

Row(metadata(Seq("b", "d")), null, Row(Row(1, null), Row(null, value("null")))),
Row(metadata(Seq("a", "b", "c", "d")),
shreddedValue("""{"a": 1, "c": 3}""", Seq("a", "b", "c", "d")),
Row(Row(2, null), Row(null, value("4")))),
Contributor:

Just to make sure I understand, would the result here be any different if the value "4" was put into the typed_value field of d? Is this an example where shredding made a suboptimal but valid decision?

gene-db (Contributor) left a comment:

@chenhao-db Thanks for this read feature! I left a few questions.

builder.appendDate(row.getInt(typedIdx));
} else if (scalar instanceof VariantSchema.TimestampType) {
builder.appendTimestamp(row.getLong(typedIdx));
} else {
Contributor:

Can this be

} else if (scalar instanceof VariantSchema.TimestampNTZType) {
  builder.appendTimestampNtz(row.getLong(typedIdx));
} else {
  // error handling
}

Should the error handling ultimately throw a malformed variant exception? Right now, it just crashes?

rebuild(array.getStruct(i, elementSchema.numFields), metadata, elementSchema, builder);
}
builder.finishWritingArray(start, offsets);
} else {
Contributor:

Is this the object case? Should we explicitly check for

} else if (schema.objectSchema != null) {
...
} else {
  // error handling
}

And the error handling should be malformed variant, right?

for (int i = 0; i < v.objectSize(); ++i) {
Variant.ObjectField field = v.getFieldAtIndex(i);
int id = builder.addKey(field.key);
fields.add(new VariantBuilder.FieldEntry(field.key, id, builder.getWritePos() - start));
Contributor:

Don't we have to check whether the variant blob contains fields that duplicate the shredded fields? The spec says the shredded field must overwrite the encoded blob field.
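
A sketch of the overwrite rule the spec describes, using plain Java collections instead of the VariantBuilder/FieldEntry machinery (names and structure are illustrative only): write the shredded fields first, remember their keys, and skip any field from the residual variant blob whose key was already written.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: shows the precedence rule, not the PR's implementation.
class DuplicateFieldSketch {
  static Map<String, Object> mergeObject(Map<String, Object> shreddedFields,
                                         Map<String, Object> blobFields) {
    Map<String, Object> result = new LinkedHashMap<>();
    result.putAll(shreddedFields); // shredded fields always win
    blobFields.forEach((key, value) -> {
      // copy a blob field only if shredding did not already produce that key
      if (!result.containsKey(key)) {
        result.put(key, value);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> shredded = Map.of("a", 1, "c", 3);
    Map<String, Object> blob = Map.of("a", 99, "d", 4); // "a" duplicates a shredded field
    System.out.println(mergeObject(shredded, blob));    // "a" keeps the shredded value 1
  }
}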

builder.finishWritingObject(start, fields);
}
} else if (variantIdx >= 0 && !row.isNullAt(variantIdx)) {
builder.appendVariant(new Variant(row.getBinary(variantIdx), metadata));
Contributor:

We should add a comment here. This is when there is no typed_value, and only the value parquet column?
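
If that reading is right, the comment could be something along these lines (wording is just a sketch, to be confirmed by the author):

// Only the `value` Parquet column is populated here (no typed_value, and no
// object/array schema applies), so it already holds the complete variant value
// for this position and is copied into the result as-is.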

SparkShreddedRow(row.getStruct(ordinal, numFields))
override def getArray(ordinal: Int): SparkShreddedRow =
SparkShreddedRow(row.getArray(ordinal))
override def numElements(): Int = row.asInstanceOf[ArrayData].numElements()
Contributor:

How is row guaranteed to be ArrayData?
