SNOW-1437885 Disable blob interleaving when in Iceberg mode #763

Merged: sfc-gh-alhuang merged 6 commits into iceberg-support from alhuang-iceberg-mode on Jun 11, 2024

Conversation

sfc-gh-alhuang
Contributor

  • Added a method to SnowflakeStreamingIngestClientInternal to activate/deactivate streaming to Iceberg tables.
  • A blob should only contain one chunk (parquet) under Iceberg mode. Disable blob interleaving under Iceberg mode.

@sfc-gh-alhuang sfc-gh-alhuang marked this pull request as ready for review May 23, 2024 17:01
@sfc-gh-alhuang sfc-gh-alhuang requested review from sfc-gh-tzhang and a team as code owners May 23, 2024 17:01
@@ -123,6 +123,10 @@ List<List<ChannelData<T>>> getData() {
// blob encoding version
private final Constants.BdecVersion bdecVersion;

// Indicates if it's flushing to Iceberg tables, a blob could only contain one chunk under Iceberg
// mode
private final boolean isIcebergMode;
Collaborator

@sfc-gh-gdoci sfc-gh-gdoci May 24, 2024

Can this have a more general name, e.g. isNonInterleavedMode or disableInterleavedBlobs?

Contributor

Yeah, we can make the internal names more general, since we might disable interleaved mode for other reasons in the future, not only for Iceberg.

Contributor Author

Removed the non-interleave parameter for now as we could control this via MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST
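
To illustrate why capping the chunk count is enough to turn interleaving off, here is a minimal Java sketch. It is not the SDK's flush code: `BlobPackingSketch` and `packIntoBlobs` are hypothetical names, and chunks are represented by plain strings. With a limit of 1, every blob ends up holding a single chunk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the SDK: groups per-table chunks into blobs. */
final class BlobPackingSketch {

  /**
   * Splits the chunks into consecutive blobs holding at most maxChunksInBlob chunks each.
   * With maxChunksInBlob == 1 every blob carries exactly one chunk, so nothing is interleaved.
   */
  static <T> List<List<T>> packIntoBlobs(List<T> chunks, int maxChunksInBlob) {
    if (maxChunksInBlob < 1) {
      throw new IllegalArgumentException("maxChunksInBlob must be >= 1");
    }
    List<List<T>> blobs = new ArrayList<>();
    for (int start = 0; start < chunks.size(); start += maxChunksInBlob) {
      int end = Math.min(start + maxChunksInBlob, chunks.size());
      // Copy the sublist so each blob owns its own list of chunks
      blobs.add(new ArrayList<>(chunks.subList(start, end)));
    }
    return blobs;
  }

  public static void main(String[] args) {
    List<String> chunks = Arrays.asList("table_a", "table_b", "table_c");
    // Limit of 100: all three chunks share one blob (interleaved)
    System.out.println(packIntoBlobs(chunks, 100));
    // Limit of 1: three blobs, one chunk each (non-interleaved)
    System.out.println(packIntoBlobs(chunks, 1));
  }
}
```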

}

/*** Constructor for TEST ONLY
*
* @param name the name of the client
*/
SnowflakeStreamingIngestClientInternal(String name) {
- this(name, null, null, null, true, null, new HashMap<>());
+ this(name, null, null, null, false, true, null, new HashMap<>());
Collaborator

There are about 20 test cases using this ctor; I'd want Iceberg mode test coverage for all of those tests too (unless there's a good reason not to need it).
Can you expose isIcebergMode on this test-mode ctor's signature too, and parameterize the calling tests to run with both icebergMode = on / off?

I see we're using JUnit which has good support for doing multiple test runs with different parameter values declared on an annotation / on a values provider defined as a test class member, so it'll be a one-time cost to set this up in all the test classes, but in exchange we'll get comprehensive ongoing test coverage.

cc @sfc-gh-tzhang in case you want to weigh in on this!

Contributor

agree, you can search for Parameterized and see how to use that

Contributor Author

Added parameters for unit tests.

@@ -437,7 +447,7 @@ && shouldStopProcessing(
}
// Add processed channels to the current blob, stop if we need to create a new blob
blobData.add(channelsDataPerTable.subList(0, idx));
- if (idx != channelsDataPerTable.size()) {
+ if (idx != channelsDataPerTable.size() || isIcebergMode) {
Collaborator

I want to guard against the possibility of having isIcebergMode peppered across 10 different places in code, making it that much harder to maintain the client.

Ideally we should have one place in the whole client that does if (isIcebergMode) { /* initialize booleans / ints / settings one way */ } else { /* initialize another way */ } - it seems that the ParameterProvider is the best place to do this, or we introduce a separate ClientSettings class that holds all these settings that control different behaviors in different places in the client.

I also see there are three parameters on the parameter provider that need different values for iceberg - MAX_CHUNK_SIZE_IN_BYTES (set to 512 MB rn) and MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST_DEFAULT (set to 100 rn) and MAX_BLOB_SIZE_IN_BYTES (set to 1 GB rn). This is in addition to MAX_CLIENT_LAG needing an iceberg-specific value.

It looks like if parameterProvider.getMaxChunksInBlobAndRegistrationRequest() returns 1 then you'll get this same desired behavior ?

Note - in the parameterProvider you'll need to validate that customers can't override MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST and it can only ever be 1.

Contributor Author

Deleted the isIcebergMode param; using MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST for now.
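
One way the single-place approach suggested in this thread could look is sketched here. This is an illustration under stated assumptions, not the actual ParameterProvider: the class `ClientSettingsSketch`, its `create` method, and the key string are hypothetical. Only the regular default of 100 and the Iceberg value of 1 come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical settings holder; names mirror the discussion but are not the SDK's API. */
final class ClientSettingsSketch {
  // Illustrative key string (assumption), not necessarily the SDK's real key
  static final String MAX_CHUNKS_KEY = "MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST";
  static final int REGULAR_DEFAULT = 100; // "set to 100 rn" per the review comment
  static final int ICEBERG_DEFAULT = 1; // one chunk per blob under Iceberg mode

  final int maxChunksInBlob;

  private ClientSettingsSketch(int maxChunksInBlob) {
    this.maxChunksInBlob = maxChunksInBlob;
  }

  /** The only place that branches on the mode; the rest of the client reads the settings. */
  static ClientSettingsSketch create(boolean isIcebergMode, Map<String, Object> overrides) {
    Object override = overrides.get(MAX_CHUNKS_KEY);
    if (isIcebergMode) {
      // Reject any customer override other than 1 instead of silently ignoring it
      if (override != null && Integer.parseInt(override.toString()) != ICEBERG_DEFAULT) {
        throw new IllegalArgumentException(
            MAX_CHUNKS_KEY + " must be " + ICEBERG_DEFAULT + " in Iceberg mode");
      }
      return new ClientSettingsSketch(ICEBERG_DEFAULT);
    }
    int value = override == null ? REGULAR_DEFAULT : Integer.parseInt(override.toString());
    return new ClientSettingsSketch(value);
  }

  public static void main(String[] args) {
    Map<String, Object> overrides = new HashMap<>();
    System.out.println(create(false, overrides).maxChunksInBlob); // regular default
    System.out.println(create(true, overrides).maxChunksInBlob); // Iceberg value
  }
}
```

Whether an invalid override should throw or be overwritten silently is a design choice; the sketch throws so that a misconfiguration is visible to the caller.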

Collaborator

@sfc-gh-hmadan sfc-gh-hmadan left a comment

Left a couple comments, PTAL.

Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

I think there're a few things missing in this PR:

  • Block creation of regular channels on Iceberg client and creation of iceberg channels on regular client
  • No need to call the client/configure endpoint for Iceberg client

Do you plan to do that in a separate PR?

@sfc-gh-alhuang
Contributor Author

sfc-gh-alhuang commented May 28, 2024

Thanks for the comments and tips! Added icebergMode on/off params for some unit tests. Some tests that involve blob registration and authentication only run in regular mode for now. @sfc-gh-tzhang I plan to delete or modify the client configure call in later PRs, as it's only called by StreamingIngestStage and we might not use this class in Iceberg mode later. The Iceberg client & table compatibility check is being added on the server side (ref); should we also add a check in the client?

// Required parameter override for Iceberg mode
if (this.isIcebergMode) {
this.parameterMap.put(
MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST,
Collaborator

Why not call this.updateValue here? That was the only method updating parameterMap before this block was added; let's keep it that way?

Contributor Author

updateValue considers props and paramOverrides before the default value. Under Iceberg mode we should not allow the user to set the chunk count > 1, which is why I use the put method directly. I think we can use updateValue(key, value, null, null) to achieve the same result, wdyt?
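
The precedence being discussed can be sketched as follows. This is a hypothetical stand-in for ParameterProvider, not its real code: the order in which overrides and props are consulted is an assumption, as is the key name `max_chunks`. It shows that passing null for both sources makes the supplied value win, which has the same effect as calling put directly.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical model of the parameter map; not the SDK's ParameterProvider. */
final class ParameterMapSketch {
  private final Map<String, Object> parameterMap = new HashMap<>();

  /**
   * Resolves a value with the assumed precedence: overrides, then props, then the default.
   * Null sources are skipped, so updateValue(key, value, null, null) always stores value.
   */
  void updateValue(
      String key, Object defaultValue, Map<String, Object> overrides, Properties props) {
    if (overrides != null && overrides.containsKey(key)) {
      parameterMap.put(key, overrides.get(key));
    } else if (props != null && props.containsKey(key)) {
      parameterMap.put(key, props.get(key));
    } else {
      parameterMap.put(key, defaultValue);
    }
  }

  Object get(String key) {
    return parameterMap.get(key);
  }

  public static void main(String[] args) {
    Map<String, Object> overrides = new HashMap<>();
    overrides.put("max_chunks", 50); // a customer-supplied override

    ParameterMapSketch params = new ParameterMapSketch();
    params.updateValue("max_chunks", 100, overrides, null);
    System.out.println(params.get("max_chunks")); // the override wins over the default

    // Iceberg mode: ignore customer input by passing null for both sources
    params.updateValue("max_chunks", 1, null, null);
    System.out.println(params.get("max_chunks")); // the forced value is stored
  }
}
```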

Collaborator

@sfc-gh-hmadan sfc-gh-hmadan left a comment

LGTM, left one small comment that you can take in the next PR too as you have other changes waiting for this to go in.

public static final int MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST_ICEBERG_MODE_DEFAULT = 1;

// If the provided parameters need to be verified and modified to meet Iceberg mode
private final boolean isIcebergMode;
Contributor

What happens if a customer creates an Iceberg client but overrides this value to false?

Contributor Author

I don't think a customer can inject ParameterProvider directly; it is instantiated by ClientInternal here.

public class FlushServiceTest {

@Parameterized.Parameters(name = "isIcebergMode: {0}")
Contributor

I thought what we should be doing is to test the isIcebergMode with {true, false}

Contributor Author

This line is the format of the test name, where the first parameter is either false or true from here.

@sfc-gh-alhuang
Contributor Author

Per offline discussion with @sfc-gh-tzhang , moved MAX_CHUNKS_IN_BLOB_AND_REGISTRATION_REQUEST_ICEBERG_MODE_DEFAULT back to ParameterProvider.java.

Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Left some comments, PTAL, otherwise LGTM! One request: could we merge all the Iceberg-related changes to a feature branch instead of main? We can move everything to main once we're ready; otherwise the next release will contain all the WIP changes and customers might do something with it.

@sfc-gh-alhuang sfc-gh-alhuang changed the base branch from master to iceberg-support June 11, 2024 00:30
@sfc-gh-alhuang sfc-gh-alhuang merged commit c6dfbf3 into iceberg-support Jun 11, 2024
15 checks passed
@sfc-gh-alhuang sfc-gh-alhuang deleted the alhuang-iceberg-mode branch June 11, 2024 00:31