Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1787322 Fix InsertError for structured data type #888

Merged
merged 3 commits into from
Nov 8, 2024

Conversation

sfc-gh-alhuang
Copy link
Contributor

@sfc-gh-alhuang sfc-gh-alhuang commented Nov 5, 2024

Currently the InsertError doesn't populate extra columns, missing columns, null value for non null columns in InsertValidationResponse when ingesting structured data type to Iceberg tables. The PR is fixing this.

We use parquet dot path with escaping dot and back slash character in field name to represent sub-columns. For example, column x in map_col(string, object("a.b" array(object("c\d" object(x int))))) has a path MAP_COL.key_value.value.a\.b.list.element.c\\d.x

*
* @param missingNotNullColName the missing non-nullable column name
*/
public void addMissingNotNullColName(String missingNotNullColName) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any chance this can be called from multiple threads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert error instance are independent between rows, I think this should be safe.

error);
listVal.add(parsedValue.getValue());
estimatedParquetSize += parsedValue.getSize();
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this else block is (a) not incrementing estimatedParquetSize, and (b) not adding to listVal in one branch (when required=true). Unclear why you really need this else block?
You can also do the following I think and avoid the unnecessary if-else branching?

String fieldName = type.getFieldName(i);
Object val = structVal.getOrDefault(fieldName, null);
ParquetBufferValue parsedValue = parseColumnValueToParquet(val, ....);
listVal.add(parsedValue.getValue());
estParquetSize = ...;
if (type.getType(i).isRepetition(REQUIRED)) {
    missingFields.add(fieldName);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to add this else is to distinguish between missing column and column with null value. That is, we cannot use getOrDefault(fieldName, null) as this ruined the difference between null value and missing key.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, got it. please add a comment around this.

@sfc-gh-alhuang sfc-gh-alhuang force-pushed the alhuang-iceberg-column-name branch from d939c5a to af46e60 Compare November 5, 2024 21:03
@sfc-gh-alhuang sfc-gh-alhuang force-pushed the alhuang-schema-evolution branch 2 times, most recently from 8924513 to ba6cdba Compare November 6, 2024 00:22
Base automatically changed from alhuang-iceberg-column-name to master November 6, 2024 00:42
}
if (sb.length() > 0) {
sb.append(".");
}
sb.append(p);
sb.append(p.replace("\\", "\\\\").replace(".", "\\."));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does an empty string in p need to also be replaced by something? Have IT for an empty field name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think empty string does not need to be replaced. An empty string does not collision with other dot path. An IT with empty fields was included in previous PR in IcebergStructuredIT.testFieldName.

@sfc-gh-hmadan sfc-gh-hmadan force-pushed the alhuang-schema-evolution branch from ba6cdba to 960d35d Compare November 6, 2024 04:04
@@ -85,7 +85,7 @@ public void parseValueInt() {
};
ParquetBufferValue pv =
IcebergParquetValueParser.parseColumnValueToParquet(
Integer.MAX_VALUE, type, rowBufferStatsMap, mockSubColumnFinder, UTC, 0);
Integer.MAX_VALUE, type, rowBufferStatsMap, mockSubColumnFinder, UTC, 0, null);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for not adding a method overload :)

"object(k1 int not null, k2 object(k3 int not null, k4 object(k5 int not null) not"
+ " null) not null) not null");
SnowflakeStreamingIngestChannel channel =
openChannel(tableName, OpenChannelRequest.OnErrorOption.ABORT);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a pre-existing gap where someone using OnErrorOption.ABORT will not get the same set of information (insertErrors object) as OnErrorOption.CONTINUE ? (unrelated to your PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For single row, errors on different OnErrorOption should be the same.

channel.insertRow(createStreamingIngestRow(row), UUID.randomUUID().toString());
assertThat(insertValidationResponse.getInsertErrors().size()).isEqualTo(1);
assertThat(insertValidationResponse.getInsertErrors().get(0).getExtraColNames())
.containsOnly("VALUE.k2", "VALUE.k\\.3", "VALUE.k\\\\4");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOR KC team: The first VALUE is the default column name used by the test harness, and doesn't have any semantics. This is unlike the key-value/list/element/value words that you'll see in the examples that follow.

@sfc-gh-hmadan
Copy link
Collaborator

Signing off but please hold off on merging until we get a positive ack from warsaw on this change being sufficient, slack conversation ongoing.

@sfc-gh-hmadan sfc-gh-hmadan force-pushed the alhuang-schema-evolution branch from 960d35d to f622569 Compare November 8, 2024 18:24
@sfc-gh-hmadan sfc-gh-hmadan enabled auto-merge (squash) November 8, 2024 18:39
@sfc-gh-hmadan sfc-gh-hmadan merged commit 70943f7 into master Nov 8, 2024
45 checks passed
@sfc-gh-hmadan sfc-gh-hmadan deleted the alhuang-schema-evolution branch November 8, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants