Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1727532 Set number of values for repeated fields #861

Merged
merged 6 commits into from
Oct 18, 2024

Conversation

sfc-gh-alhuang
Copy link
Contributor

@sfc-gh-alhuang sfc-gh-alhuang commented Oct 11, 2024

This PR includes following change:

  1. Set numberOfValues for repeated fields (e.g. map, list) and used it in EP file registration.
  2. Increment null count for all sub-columns of a null column.
  3. Use hash for approximate NDV for string and bytes.
  4. Fix NPE when inserting null to a structured column, throwing SF custom exception instead.
  5. Added extendedMetadataSize and metadataSize fields for EP registration.
  6. Address comments of SNOW-1708577 Parquet V2 support for new table format #851

Base automatically changed from alhuang/parquetV2 to master October 14, 2024 21:16
@sfc-gh-alhuang sfc-gh-alhuang marked this pull request as ready for review October 15, 2024 22:45
@sfc-gh-alhuang sfc-gh-alhuang requested review from sfc-gh-tzhang and a team as code owners October 15, 2024 22:45
@sfc-gh-alhuang sfc-gh-alhuang force-pushed the alhuang/ndv branch 2 times, most recently from ae10eee to 9c77f1b Compare October 15, 2024 22:50
this.maxChunkSizeInBytes = maxChunkSizeInBytes;
this.maxAllowedRowSizeInBytes = maxAllowedRowSizeInBytes;
this.bdecParquetCompression = bdecParquetCompression;
this.enableNewJsonParsingLogic = enableNewJsonParsingLogic;
this.maxRowGroups = maxRowGroups;
this.isIcebergMode = isIcebergMode;
this.enableDistinctValuesCount = enableDistinctValuesCount;
this.enableValuesCount = enableValuesCount;
}

/** @param clientInternal reference to the client object where the relevant parameters are set */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never understood why clientInternal can or should be allowed to be null. Looks like a testcase author leaking their convenience needs into production code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clientInterTest need this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's another ctor for testcases to pass in whatever booleans they want, so still unclear why there's a need to pass in a null clientInternal to this ctor.

final int magicOffset = bytes.length - ParquetFileWriter.MAGIC.length;
final int footerSizeOffset = magicOffset - Integer.BYTES;
if (footerSizeOffset < 0
|| !ParquetFileWriter.MAGIC_STR.equals(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why is this MAGIC_STR comparison being added only now / why is it needed only when calculating ext metadta size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing this comment

@@ -427,7 +432,7 @@ Set<String> verifyInputColumns(
List<String> missingCols = new ArrayList<>();
for (String columnName : this.nonNullableFieldNames) {
if (!inputColNamesMap.containsKey(columnName)) {
missingCols.add(statsMap.get(columnName).getColumnDisplayName());
missingCols.add(fieldIndex.get(columnName).columnMetadata.getName());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove the getColumnDisplayName() method from the RowBufferStats class now?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, is there a need to even make this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a root column is a structured data type, statsMap.get(columnName) will trow NPE as we only stores leaf column stats. Use fieldIndex instead for logging to avoid this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add this as a code comment.

@@ -49,6 +48,7 @@ public class ParquetRowBuffer extends AbstractRowBuffer<ParquetChunkData> {
private final ParquetProperties.WriterVersion parquetWriterVersion;

private MessageType schema;
private SubColumnFinder statsFinder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to subColumnFinder

statsMap.get(columnName).incCurrentNullCount();
statsFinder
.getSubColumns(columnName)
.forEach(subColumn -> statsMap.get(subColumn).incCurrentNullCount());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullref if statsMap.get(subColumn) returns null.
can that happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No as both subColumnFinder and statsMap are built from the same schema. Added check here in case.

* @return the footer size
*/
public static long getParquetFooterSize(byte[] bytes) throws IOException {
final int magicOffset = bytes.length - ParquetFileWriter.MAGIC.length;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, does this handle both PARE and PAR1 ?
I guess they're both 4 bytes so the length won't change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they have the same length. Added check for PARE.

new String(bytes, magicOffset, ParquetFileWriter.MAGIC.length)));
}

return BytesUtils.readIntLittleEndian(bytes, footerSizeOffset);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it always little endian irrespective of parquet V1 and parquet V2? How can we add a test that ensures this method is correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the V1 and V2 only affects data page. (ref). For testing, I think we can use serializedBytes.length - writer.getColumnIndexReference().getOffset() - metadataSize - ParquetFileWriter.MAGIC.length - footerSizeBytes(4) = extenedMetadataSize. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test for parquet footer size & extendedMetadataSize in BlobBuilderTest

@@ -313,9 +352,27 @@ private float addRow(
RowBufferStats.getCombinedStats(statsMap.get(columnName), forkedColStats.getValue()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wow this looks like a thread unsafe operation here. RowBuffer.statsMap and tempStatsMap are both constructed as non-concurrent collections, but we don't prohibit concurrent insertRows on one channel!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nothig to fix for now we'll followup on this separately)

Copy link
Collaborator

@sfc-gh-hmadan sfc-gh-hmadan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sfc-gh-alhuang sfc-gh-alhuang enabled auto-merge (squash) October 18, 2024 20:20
@sfc-gh-alhuang sfc-gh-alhuang merged commit ee48554 into master Oct 18, 2024
45 checks passed
@sfc-gh-alhuang sfc-gh-alhuang deleted the alhuang/ndv branch October 18, 2024 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants