
SNOW-1845050: Proposal to speed up write_pandas for large dataframes #2114

Closed
mdwgrogan opened this issue Dec 4, 2024 · 3 comments
Labels
status-triage_done Initial triage done, will be further handled by the driver team

@mdwgrogan

This for-loop uploads each chunk with its own PUT command; the parallelized upload happens within each individual PUT, so it favors fewer, larger files.

    with TemporaryDirectory() as tmp_folder:
        for i, chunk in chunk_helper(df, chunk_size):
            chunk_path = os.path.join(tmp_folder, f"file{i}.txt")
            # Dump chunk into parquet file
            chunk.to_parquet(chunk_path, compression=compression, **kwargs)
            # Upload parquet file
            upload_sql = (
                "PUT /* Python:snowflake.connector.pandas_tools.write_pandas() */ "
                "'file://{path}' @{stage_location} PARALLEL={parallel}"
            ).format(
                path=chunk_path.replace("\\", "\\\\").replace("'", "\\'"),
                stage_location=stage_location,
                parallel=parallel,
            )
            logger.debug(f"uploading files with '{upload_sql}'")
            cursor.execute(upload_sql, _is_internal=True)
            # Remove chunk file
            os.remove(chunk_path)

However, the efficiency of the COPY INTO statement improves with more, smaller files (up to a point). This tension between the two commands reduces overall efficiency. In practice, I found that for large dataframes (10+ GB in memory) the current approach can increase load times by ~20% compared to a directory upload (shown below).

    with TemporaryDirectory() as tmp_folder:
        for i, chunk in chunk_helper(df, chunk_size):
            chunk_path = os.path.join(tmp_folder, f"file{i}.txt")
            # Dump chunks into parquet files
            chunk.to_parquet(chunk_path, compression=compression, **kwargs)

        # Upload directory
        upload_sql = (
            "PUT /* Python:snowflake.connector.pandas_tools.write_pandas() */ "
            "'file://{path}/*' @{stage_location} PARALLEL={parallel}"
        ).format(
            path=tmp_folder.replace("\\", "\\\\").replace("'", "\\'"),
            stage_location=stage_location,
            parallel=parallel,
        )
        logger.debug(f"uploading files with '{upload_sql}'")
        cursor.execute(upload_sql, _is_internal=True)

In my testing on smaller dataframes, I observed comparable performance between the new and old versions of this loop. Note that this does result in more usage of the temp filesystem, as the full dataframe is persisted to disk before being uploaded. However, since the compressed parquet files are smaller than the in-memory dataframe, I wouldn't expect this to cause filesystem space issues in most circumstances.
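If disk space in the temporary directory is ever a concern, a rough pre-check could compare the dataframe's in-memory footprint against the free space there. A minimal sketch (has_room_for_chunks is just an illustrative helper, not part of the proposal), assuming the compressed parquet files are no larger than the in-memory dataframe:

import shutil
import tempfile


def has_room_for_chunks(df, safety_factor: float = 1.0) -> bool:
    """Rough check that the temp filesystem can hold all staged parquet files at once."""
    # Assumption: compressed parquet output is no larger than the in-memory dataframe.
    in_memory_bytes = int(df.memory_usage(deep=True).sum())
    free_bytes = shutil.disk_usage(tempfile.gettempdir()).free
    return free_bytes > in_memory_bytes * safety_factor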

@github-actions github-actions bot changed the title Proposal to speed up write_pandas for large dataframes SNOW-1845050: Proposal to speed up write_pandas for large dataframes Dec 4, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this Dec 6, 2024
@sfc-gh-sghosh

Hello @mdwgrogan ,

Thanks for raising the issue; we are looking into it and will update.

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added the status-triage Issue is under initial triage label Dec 6, 2024
@sfc-gh-sghosh

Hello @mdwgrogan ,

  1. Snowflake recommends using data files between 100 MB and 250 MB (compressed) for optimal performance.
    If files exceed this size range, consider using a third-party tool to split them into smaller chunks before uploading.

  2. The PUT command does not support uploading files with divergent directory paths as it does not preserve directory structures. Wildcards (e.g., file:///tmp/data/*.parquet) can be used to match multiple files in the same directory but will fail for nested subdirectories (e.g., file:///tmp/data/**).

So, the current implementation in write_pandas aligns with Snowflake’s guidelines by creating manageable file sizes for the PUT command.
If optimizing for performance, ensure the chunk size creates files within the 100 MB to 250 MB range to balance PUT and COPY INTO efficiency. For very large datasets, adjusting the chunk size to produce appropriately sized files is key. While directory-based uploads might reduce the number of PUT commands, the file size constraints and directory structure limitations must be adhered to for seamless operation.
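One way to pick a chunk_size in that range is to write a small sample chunk to parquet and scale from its measured size. A rough sketch (the function name and its parameters are illustrative, not connector options):

import pathlib
import tempfile


def estimate_chunk_size(df, sample_rows: int = 100_000, target_mb: float = 175.0,
                        compression: str = "snappy") -> int:
    """Estimate rows per chunk so each compressed parquet file is roughly target_mb."""
    sample = df.iloc[:sample_rows]
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = pathlib.Path(tmp, "sample.parquet")
        sample.to_parquet(sample_path, compression=compression)
        mb_per_row = sample_path.stat().st_size / (1 << 20) / len(sample)
    return max(1, int(target_mb / mb_per_row))

The returned value could then be passed as chunk_size to write_pandas.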

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Dec 6, 2024
@mdwgrogan

@sfc-gh-sghosh, thanks for your replies - I just want to clarify a few things.

  1. Yes, I know. I'm taking some care in setting the chunk size to make sure the df is split appropriately by the write_pandas command; my point is that by uploading each file individually, you diminish the value of the parallelization.
  2. This is not relevant, since the write_pandas function only has a single dimension of chunking (it's entirely row based, with no nested subdirectories at all).

I created a dataframe full of random data to test the magnitude of the impact; the slowdown is generally 20-30%.

import pathlib
import string

import numpy as np
import pandas as pd
from snowflake.connector import pandas_tools as pd_tools

RNG = np.random.default_rng()
nrows = 100_000_000
ncols = 10

df = pd.DataFrame(RNG.random(size=(nrows, ncols)), columns=list(string.ascii_uppercase[0:ncols]))

# check the file size for a given chunk_size
chunk_size = 2_000_000
test_path = pathlib.Path("test.snappy.parquet")
df.iloc[0:chunk_size].to_parquet(test_path)
test_path.stat().st_size / (1 << 20)  # ~160 MB

# conn is an existing snowflake.connector connection to the target database
_ = pd_tools.write_pandas(
    conn,
    df,
    "CURR_WRITER",
    auto_create_table=True,
    overwrite=True,
    compression='snappy',
    chunk_size=chunk_size,
    parallel=10,
)

In my container, the current version of write_pandas took 3m30, and the changed version I proposed took 2m30. Hopefully this minimal example helps explain the value of updating this.

Thanks!
