Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-983635 Allow ZSTD compression algorithm #654

Merged
merged 30 commits into from
Feb 1, 2024

Conversation

sfc-gh-lthiede
Copy link
Contributor

Test setting ZSTD as default compression algorithm

@sfc-gh-lthiede
Copy link
Contributor Author

sfc-gh-lthiede commented Jan 4, 2024

Maybe add zstd-jni as dependency in README.md Edit: done

@sfc-gh-lthiede
Copy link
Contributor Author

DEW testing

DEW tests using a local release of the Snowpipe Streaming SDK with my changes and ZSTD as the default compression algorithm passed.
To run successfully I had to add a dependency on zstd-jni to the DEW dependencies (PR)

@sfc-gh-lthiede sfc-gh-lthiede force-pushed the lthiede-SNOW-983635-ZSTD-compression branch 4 times, most recently from eed81af to d671c1f Compare January 19, 2024 10:01
@sfc-gh-lthiede
Copy link
Contributor Author

sfc-gh-lthiede commented Jan 22, 2024

How I ran the precommits

I ran a precommit with my branch rebased on sdk release 2.0.5.
I ran a second precommit with sdk release 2.0.5 to check which test failures are actually caused by my changes.
I ran both precommits with some changes to the BDEC file generator and its build in RegressionTests. (PR)
The changes fixed the file generator for SDK release 2.0.5 and improved the its build.

Results

4 tests failed, that also failed on master. 9 tests failed because of reduced byte counts. 1 test failed because of a higher byte count. 1 test failed unexpectedly.

Is zstd sometimes leading to a higher byte count a problem?
@sfc-gh-gdoci suggested that the unexpected failure is probably flaky. I ran another precommit and the failure didn't show up again.

@sfc-gh-lthiede sfc-gh-lthiede marked this pull request as ready for review January 22, 2024 14:49
@sfc-gh-lthiede sfc-gh-lthiede requested review from sfc-gh-tzhang and a team as code owners January 22, 2024 14:49
@sfc-gh-lthiede sfc-gh-lthiede changed the title SNOW-983635 Set ZSTD as default compression algorithm SNOW-983635 Allow ZSTD compression algorithm Jan 22, 2024
Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! I think we need more test coverage here, for example: we need coverage to make sure it works on other data types. We probably have most of them which you can just add a parameter like what you did in StreamingIngestBigFilesIT.java

@@ -17,6 +17,7 @@ The Snowflake Ingest Service SDK depends on the following libraries:

* snowflake-jdbc (3.13.30 to 3.13.33)
* slf4j-api
* com.github.luben:zstd-jni (1.5.0-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we decide to treat this as one of the special dependencies? @sfc-gh-lsembera has more context here if you need more help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, the problem is that zstd-jni can't be shaded. Because of that we can't include it in our sdk's shaded jar. It has to be provided separately when customers want to build their application with our sdk. I thought this was reason enough to list it with our dependencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-lsembera Any idea if we can do better here? Or any potential issue? I'm concerned that we keep adding to this list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good if shade fewer dependencies, and this change goes in that direction. Another advantage of the unshaded approach here is that if the customer does not want to use zstd and wants smaller distribution, they can exclude the zstd jar from their maven project.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we stop relocating com.github.luben, is there any concern that this is a behavior change?

@sfc-gh-tzhang
Copy link
Contributor

What's our plan of making ZSTD the default?

@sfc-gh-lthiede
Copy link
Contributor Author

sfc-gh-lthiede commented Jan 23, 2024

@sfc-gh-tzhang In the protocol of our Streaming Ingest Repeating Eng Sync on Dec 5th we wrote

[@sfc-gh-gdoci /Lorenz] Rolling out (risky) changes in the SDK, e.g. switching from Parquet page version v1 to v2

  • A way to control client parameters from Snowflake server-side to rollback in case of issues before customers are asked to upgrade to SDK version with the fix, e.g. override parameters in clientConfigure or openChannel request
  • [Toby] Wanted to do smth similar for Kafka connector SNOW-980722
  • [Toby] IMO makes sense to do, it’s a priority question. Could be a good starter task for Alec when he returns

I think this also applies to making ZSTD the default

@sfc-gh-lthiede sfc-gh-lthiede force-pushed the lthiede-SNOW-983635-ZSTD-compression branch from 7d03433 to ff91de2 Compare January 24, 2024 14:49
Copy link
Collaborator

@sfc-gh-gdoci sfc-gh-gdoci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider me as approved if you can check with @sfc-gh-lsembera to see if we can do better on the ZSTD dependency, thanks!

@@ -17,6 +17,7 @@ The Snowflake Ingest Service SDK depends on the following libraries:

* snowflake-jdbc (3.13.30 to 3.13.33)
* slf4j-api
* com.github.luben:zstd-jni (1.5.0-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-lsembera Any idea if we can do better here? Or any potential issue? I'm concerned that we keep adding to this list

public_pom.xml Show resolved Hide resolved
@sfc-gh-tzhang sfc-gh-tzhang self-requested a review January 27, 2024 01:54
@sfc-gh-tzhang
Copy link
Contributor

sfc-gh-tzhang commented Jan 27, 2024

@sfc-gh-tzhang In the protocol of our Streaming Ingest Repeating Eng Sync on Dec 5th we wrote

[@sfc-gh-gdoci /Lorenz] Rolling out (risky) changes in the SDK, e.g. switching from Parquet page version v1 to v2

  • A way to control client parameters from Snowflake server-side to rollback in case of issues before customers are asked to upgrade to SDK version with the fix, e.g. override parameters in clientConfigure or openChannel request
  • [Toby] Wanted to do smth similar for Kafka connector SNOW-980722
  • [Toby] IMO makes sense to do, it’s a priority question. Could be a good starter task for Alec when he returns

I think this also applies to making ZSTD the default

Ah, thanks for the reminder, we can do this, is this something you want to start if you have free time?

pom.xml Outdated
@@ -691,6 +694,9 @@
<configuration>
<failOnWarning>true</failOnWarning>
<ignoreNonCompile>true</ignoreNonCompile>
<ignoredDependencies>
<ignoredDependency>com.github.luben:zstd-jni</ignoredDependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We declare zstd-jni as a dependency even though it's not directly used in our code. If we didn't ignore zstd-jni here analyze-only would fail due to unused dependency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we can use the runtime maven scope?

pom.xml Outdated
@@ -1092,6 +1095,12 @@
<exclude>org/slf4j/**</exclude>
</excludes>
</filter>
<filter>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are not shading this dependency, why is this filter needed?

public_pom.xml Show resolved Hide resolved
@@ -17,6 +17,7 @@ The Snowflake Ingest Service SDK depends on the following libraries:

* snowflake-jdbc (3.13.30 to 3.13.33)
* slf4j-api
* com.github.luben:zstd-jni (1.5.0-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is good if shade fewer dependencies, and this change goes in that direction. Another advantage of the unshaded approach here is that if the customer does not want to use zstd and wants smaller distribution, they can exclude the zstd jar from their maven project.

…ncies filtered from shading, because it's already excluded

Description

Testing
@sfc-gh-lthiede
Copy link
Contributor Author

Ah, thanks for the reminder, we can do this, is this something you want to start if you have free time?
@sfc-gh-tzhang Currently I'm busy with other parts of my project

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@@ -17,6 +17,7 @@ The Snowflake Ingest Service SDK depends on the following libraries:

* snowflake-jdbc (3.13.30 to 3.13.33)
* slf4j-api
* com.github.luben:zstd-jni (1.5.0-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we stop relocating com.github.luben, is there any concern that this is a behavior change?

@sfc-gh-lthiede
Copy link
Contributor Author

The README.md of the ZSTD library say the following. This is the reason we can't shade it.

The Java classes cannot be renamed/minimized/relocated. JVM linking the native library depends on the class name that is trying to link the native part, so changing the class names will lead to failed linking at runtime.

https://github.com/luben/zstd-jni/

@sfc-gh-lthiede sfc-gh-lthiede merged commit 84420bf into master Feb 1, 2024
14 checks passed
@sfc-gh-lthiede sfc-gh-lthiede deleted the lthiede-SNOW-983635-ZSTD-compression branch February 1, 2024 09:49
@sfc-gh-tzhang sfc-gh-tzhang mentioned this pull request Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants