Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1258064 Remove/Relax the limitations in order to generate bigger files #730

Merged
merged 8 commits into from
Apr 3, 2024

Conversation

sfc-gh-alhuang
Copy link
Contributor

@sfc-gh-alhuang sfc-gh-alhuang commented Mar 25, 2024

Relax file size limitation to deal with longer client flush lag which might produce larger file.

JIRA

@sfc-gh-alhuang sfc-gh-alhuang marked this pull request as ready for review March 26, 2024 20:59
@sfc-gh-alhuang sfc-gh-alhuang requested review from sfc-gh-tzhang and a team as code owners March 26, 2024 20:59
@@ -673,15 +671,6 @@ static <T> AbstractRowBuffer<T> createRowBuffer(
}
}

private void checkBatchSizeEnforcedMaximum(float batchSizeInBytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this still be here but with higher limits?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This change was added before to avoid OOM issue in the scanner, given that the Parquert scanner is much better now and the iceberg doc mentions that it works well for files upto 1GB compressed, I don't think this is necessary now
  2. We still have a warning on this, see checkBatchSizeRecommendedMaximum right below this function
  3. We have valid use cases where the batch size could be big, imagine the case where the Kafka Connector has a big flush interval and everything will be inserted as one batch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I somehow missed this PR until it got merged. I have concerns about this change because there was another reason why these limits were introduced - to prevent long GC pauses. We observed JVM freezes for customers who were passing hundreds of thousands of rows into insertRows(), see more details here.

I would prefer to keep some limit there, customers should not pass hundreds of megabytes of data in a single batch.

cc @sfc-gh-wfateem

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-alhuang Do you want to run some local tests to see if things are working with a few GBs of a batch and 1 minutes of flush interval? If not, then we can add the exception back but with a much higher limit, WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got GCLocker timeout error message after building a 600+MB of blobs. I'll add the assertion back with higher limit.

@@ -673,15 +671,6 @@ static <T> AbstractRowBuffer<T> createRowBuffer(
}
}

private void checkBatchSizeEnforcedMaximum(float batchSizeInBytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This change was added before to avoid OOM issue in the scanner, given that the Parquert scanner is much better now and the iceberg doc mentions that it works well for files upto 1GB compressed, I don't think this is necessary now
  2. We still have a warning on this, see checkBatchSizeRecommendedMaximum right below this function
  3. We have valid use cases where the batch size could be big, imagine the case where the Kafka Connector has a big flush interval and everything will be inserted as one batch

Copy link
Contributor

@sfc-gh-tzhang sfc-gh-tzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@sfc-gh-alhuang sfc-gh-alhuang merged commit f75124c into master Apr 3, 2024
15 checks passed
@sfc-gh-alhuang sfc-gh-alhuang deleted the alhuang-file-size branch April 3, 2024 17:03
@jdcaperon
Copy link

Keen to see this released, any idea on how long til this is cut?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants