SNOW-1845050: Proposal to speed up write_pandas for large dataframes #2114
Hello @mdwgrogan, thanks for raising the issue. We are looking into it and will update. Regards,
Hello @mdwgrogan,
The current implementation in write_pandas aligns with Snowflake's guidelines by creating manageable file sizes for the PUT command. Regards,
@sfc-gh-sghosh, thanks for your replies - I just want to clarify a few things.
I created a dataframe full of random data to test out the magnitude of the impact, and it is generally 20-30%:

```python
import pathlib
import string

import numpy as np
import pandas as pd
from snowflake.connector import pandas_tools as pd_tools

RNG = np.random.default_rng()

nrows = 100_000_000
ncols = 10
df = pd.DataFrame(RNG.random(size=(nrows, ncols)), columns=list(string.ascii_uppercase[0:ncols]))

# check the file size for a given chunk_size
chunk_size = 2_000_000
test_path = pathlib.Path("test.snappy.parquet")
df.iloc[0:chunk_size].to_parquet(test_path)
test_path.stat().st_size / (1 << 20)  # ~160 MB

# `conn` is an existing snowflake.connector connection
_ = pd_tools.write_pandas(
    conn,
    df,
    "CURR_WRITER",
    auto_create_table=True,
    overwrite=True,
    compression='snappy',
    chunk_size=chunk_size,
    parallel=10,
)
```

In my container, the current version of write_pandas took 3m30, and the changed version I proposed took 2m30. Hopefully this minimal example helps explain the value of updating this. Thanks!
This for-loop allows for parallelized upload of files in the PUT command, so it favors fewer, larger files (see `src/snowflake/connector/pandas_tools.py`, lines 374 to 391 at commit 703d7f4).
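As I understand it, that loop writes each chunk to its own temporary parquet file and issues one PUT per file, so PARALLEL only splits a single file into parts rather than uploading chunk files concurrently. A minimal sketch of that behavior (not the actual library code; `put_chunks_individually`, `cursor`, and `stage_location` are illustrative names):

```python
import os
import tempfile

def put_chunks_individually(cursor, df, chunk_size, stage_location, parallel=4):
    """Sketch of the current behavior: one temp parquet file and one PUT per chunk."""
    for i, start in enumerate(range(0, len(df), chunk_size)):
        chunk_path = os.path.join(tempfile.gettempdir(), f"chunk_{i}.parquet")
        df.iloc[start : start + chunk_size].to_parquet(chunk_path, compression="snappy")
        # Each PUT uploads exactly one file, so PARALLEL only splits that file
        # into parts; separate chunk files are not uploaded concurrently.
        cursor.execute(f"PUT 'file://{chunk_path}' @{stage_location} PARALLEL={parallel}")
        os.unlink(chunk_path)
```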
However, the efficiency of the COPY INTO statement is improved by having more, smaller files (up to a point). This tension between the two steps reduces overall efficiency. In practice, I found that for large data frames (10+ GB in memory) it can increase load times by ~20% compared to a directory upload (below).
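For comparison, a directory-style upload along the lines of the proposal could look roughly like the sketch below. This is a hypothetical illustration, not the actual patch; `put_chunks_as_directory`, `cursor`, and `stage_location` are made-up names:

```python
import os
import tempfile

def put_chunks_as_directory(cursor, df, chunk_size, stage_location, parallel=10):
    """Sketch of the proposal: persist every chunk first, then issue a single
    wildcard PUT so the connector can upload the files concurrently."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        for i, start in enumerate(range(0, len(df), chunk_size)):
            df.iloc[start : start + chunk_size].to_parquet(
                os.path.join(tmp_dir, f"chunk_{i}.parquet"), compression="snappy"
            )
        # One PUT over the whole directory: PARALLEL now applies across files,
        # uploading up to `parallel` chunk files at the same time.
        cursor.execute(f"PUT 'file://{tmp_dir}/*' @{stage_location} PARALLEL={parallel}")
```

This is also where the extra temp-filesystem usage mentioned below comes from: every chunk file exists on disk at once before the single PUT runs.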
In my testing on smaller dataframes, I observed comparable performance between the new and old versions of this loop. Note that this does result in more usage of the temp filesystem, as the full data frame is persisted to disk before being uploaded. However, since the dataframe to upload is in-memory, I wouldn't expect there to be filesystem space issues caused by this in most circumstances.