
SNOW-672156 support specifying compression algorithm to be used for BDEC Parquet files #579

Merged
5 commits merged into master from gdoci-SNOW-672156-compression on Oct 31, 2023

Conversation

@sfc-gh-gdoci (Collaborator) commented Aug 29, 2023

This PR adds support for specifying the compression algorithm to be used for BDEC Parquet files. The allowed value is just GZIP for now.
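For illustration, here is a minimal sketch of the shape such a setting typically takes: an enum of allowed compression values with GZIP as the only option for now, plus a case-insensitive lookup by name that rejects anything else. The class name, method name, and error message below are assumptions for illustration, not this PR's actual code.

```java
// Illustrative sketch only -- class name, method name, and error message are assumptions,
// not this PR's actual code.
public enum BdecParquetCompression {
  GZIP; // the only allowed value for now; it has been the effective default so far

  /** Looks up a user-supplied setting such as "gzip" case-insensitively. */
  public static BdecParquetCompression fromName(String name) {
    for (BdecParquetCompression candidate : values()) {
      if (candidate.name().equalsIgnoreCase(name)) {
        return candidate;
      }
    }
    // unknown names are rejected instead of silently falling back to a default
    throw new IllegalArgumentException(
        "Unsupported compression algorithm '" + name + "', allowed values: "
            + java.util.Arrays.toString(values()));
  }
}
```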

@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from d4273a7 to 9cab491 on August 29, 2023 13:08
@sfc-gh-gdoci marked this pull request as ready for review on August 29, 2023 15:04
@sfc-gh-gdoci requested review from sfc-gh-tzhang and a team as code owners on August 29, 2023 15:04
@sfc-gh-tzhang (Contributor) commented:

> This PR adds support for specifying the compression algorithm to be used for BDEC Parquet files. The allowed values are UNCOMPRESSED, GZIP, ZSTD, SNAPPY.

I think the intention for this parameter is to allow customers to go back to GZIP if there is any issue with ZSTD, so I suggest we only support GZIP and ZSTD. WDYT?

@@ -51,6 +54,9 @@ public class ParameterProvider {
It reduces memory consumption compared to using Java Objects for buffering.*/
public static final boolean ENABLE_PARQUET_INTERNAL_BUFFERING_DEFAULT = false;

public static final Constants.BdecParquetCompression BDEC_PARQUET_COMPRESSION_ALGORITHM_DEFAULT =
Contributor:

I thought we wanted ZSTD to be the default?

Contributor:

Was ZSTD already tested with all our tests?

@sfc-gh-gdoci (Collaborator, Author):

I have tested GZIP, ZSTD, and SNAPPY previously, but I will run the tests again; the SDK has changed a lot since then.
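To make the default-value question in this thread concrete, here is a minimal sketch of how a parameter provider could resolve the setting: read an optional override from a map and fall back to GZIP, reusing the illustrative BdecParquetCompression sketch shown earlier. The key name and method here are assumptions, not the SDK's actual API.

```java
import java.util.Map;

// Illustrative sketch only -- key name and method are assumptions, not the SDK's actual API.
class CompressionParameterSketch {
  static final String BDEC_PARQUET_COMPRESSION_ALGORITHM_KEY = "bdec_parquet_compression_algorithm";
  static final BdecParquetCompression DEFAULT_COMPRESSION = BdecParquetCompression.GZIP; // GZIP stays the default

  /** Resolves the compression algorithm from user overrides, falling back to the GZIP default. */
  static BdecParquetCompression resolveCompression(Map<String, Object> parameterOverrides) {
    Object value = parameterOverrides.get(BDEC_PARQUET_COMPRESSION_ALGORITHM_KEY);
    return value == null
        ? DEFAULT_COMPRESSION
        : BdecParquetCompression.fromName(value.toString());
  }
}
```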

@sfc-gh-tzhang (Contributor) commented:

Please hold this PR until we're done with the current release, thanks!

@sfc-gh-xhuang (Contributor) commented:

> This PR adds support for specifying the compression algorithm to be used for BDEC Parquet files. The allowed values are UNCOMPRESSED, GZIP, ZSTD, SNAPPY.
>
> I think the intention for this parameter is to allow customers to go back to GZIP if there is any issue with ZSTD, so I suggest we only support GZIP and ZSTD. WDYT?

I agree. Do we need to support anything else, given that customers do not have access to the underlying blob? We only need to allow the compression algorithms that we want to support.

@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from 9cab491 to d00a148 on September 8, 2023 13:43
@sfc-gh-azagrebin (Contributor) commented Sep 18, 2023

> This PR adds support for specifying the compression algorithm to be used for BDEC Parquet files. The allowed values are UNCOMPRESSED, GZIP, ZSTD, SNAPPY.
>
> I think the intention for this parameter is to allow customers to go back to GZIP if there is any issue with ZSTD, so I suggest we only support GZIP and ZSTD. WDYT?
>
> I agree. Do we need to support anything else, given that customers do not have access to the underlying blob? We only need to allow the compression algorithms that we want to support.

If other Parquet-supported compression algorithms are tested against the server side, why not allow them? We do not even have to expose them in the docs. It could be useful for certain use cases where, for example, SNAPPY behaves better, or is good enough but much faster. If we do not test some algorithms, +1 to not exposing them.

@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from d00a148 to bc911e4 on September 25, 2023 07:15
@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from 667af1b to f59dfc7 on October 23, 2023 13:38
@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch 2 times, most recently from 2f9ee4d to 8f104b2 on October 30, 2023 10:59
@sfc-gh-gdoci (Collaborator, Author) commented:

I narrowed the scope of this PR to just making the compression algorithm configurable. GZIP is the only allowed value, as it has been the default so far. I will have to do more correctness testing and performance evaluation before adding other compression algorithms.

@sfc-gh-tjones @sfc-gh-tzhang @sfc-gh-azagrebin PTAL.

@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from 8f104b2 to b3f19ad on October 30, 2023 11:30
        return e; // a matching compression algorithm was found
      }
    }
    // no enum constant matched the provided name
    throw new IllegalArgumentException(
Contributor:

Could we add a test case for invalid input?

@sfc-gh-gdoci (Collaborator, Author):

That's a good point, done.
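For reference, a sketch of what such a negative test could look like, written in JUnit 4 against the illustrative fromName lookup from earlier in this conversation; the test class, test names, and the invalid input string are assumptions, not the PR's actual test.

```java
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

import org.junit.Test;

// Illustrative sketch only -- not the PR's actual test code.
public class BdecParquetCompressionTest {

  @Test
  public void testValidCompressionName() {
    // the lookup is case-insensitive for the supported value
    assertEquals(BdecParquetCompression.GZIP, BdecParquetCompression.fromName("gzip"));
  }

  @Test
  public void testInvalidCompressionName() {
    try {
      BdecParquetCompression.fromName("no_such_algorithm");
      fail("Expected IllegalArgumentException for an unsupported compression name");
    } catch (IllegalArgumentException expected) {
      // expected: unsupported names are rejected
    }
  }
}
```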

@sfc-gh-gdoci force-pushed the gdoci-SNOW-672156-compression branch from 9ad933b to e9b3d36 on October 31, 2023 09:40
@sfc-gh-gdoci merged commit d0fd1d8 into master on Oct 31, 2023
10 checks passed
@sfc-gh-gdoci deleted the gdoci-SNOW-672156-compression branch on October 31, 2023 11:54