Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1757554 Support quoted object fields name #869

Merged
merged 4 commits into from
Nov 6, 2024

Conversation

sfc-gh-alhuang
Copy link
Contributor

This PR aims to support quoted object fields name for Iceberg table ingestion. The PR include following change:

  • Preserve original column name in schema: Decode column name after schema creation to avoid sub-column name mismatch. As we use TypeToMessageType to convert from Iceberg schema to parquet schema which include Avro column name encoding for all non-digit/letters characters.
  • Escape dot character in EP info keys: Escape the dot character in column name of a dot path with backslash. Without this, it's possible for two different columns to have the same dot path. E.g. ("a.a" int, a object(a int)).
  • Change stats map's key to field id: The old logic build dot path as key of stats map along with structured data type validation, which might cause performance issue. Remove this logic and use fieldId as key instead to avoid string construction. Keep a map of fieldId -> dotPath in subcolumnFinder for logging purpose.

@sfc-gh-alhuang sfc-gh-alhuang marked this pull request as ready for review October 22, 2024 17:03
@sfc-gh-alhuang sfc-gh-alhuang requested review from sfc-gh-tzhang and a team as code owners October 22, 2024 17:03
@sfc-gh-alhuang sfc-gh-alhuang force-pushed the alhuang-iceberg-column-name branch from 6dc18a1 to ba48f45 Compare November 4, 2024 22:01
}
if (parquetType.getId() != null) {
builder.id(parquetType.getId().intValue());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The middle layer of Iceberg list/map does not have id. We don't want to set it.

@sfc-gh-alhuang sfc-gh-alhuang force-pushed the alhuang-iceberg-column-name branch from de23571 to d939c5a Compare November 5, 2024 21:01
for (String subColumn : subColumnFinder.getSubColumns(columnName)) {
RowBufferStats stats = statsMap.get(subColumn);
for (String subColumnId :
subColumnFinder.getSubColumns(fieldIndex.get(columnName).type.getId())) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fieldIndex.get(columnName) can return null, lets handle that and throw explicitly? OK in your next PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The columnName is from Set.difference(fieldIndex.keySet(), otherSet). I think this should be safe.

@sfc-gh-alhuang sfc-gh-alhuang enabled auto-merge (squash) November 5, 2024 23:24
@sfc-gh-alhuang sfc-gh-alhuang merged commit 8e6fcb9 into master Nov 6, 2024
45 checks passed
@sfc-gh-alhuang sfc-gh-alhuang deleted the alhuang-iceberg-column-name branch November 6, 2024 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants