
SNOW-1845050: Proposal to speed up write_pandas for large dataframes #2114

Closed
mdwgrogan opened this issue Dec 4, 2024 · 3 comments
Labels
status-triage_done Initial triage done, will be further handled by the driver team

@mdwgrogan

This for-loop uploads each chunk with its own PUT command; the parallelized upload happens within each individual PUT, so it favors fewer, larger files.

    with TemporaryDirectory() as tmp_folder:
        for i, chunk in chunk_helper(df, chunk_size):
            chunk_path = os.path.join(tmp_folder, f"file{i}.txt")
            # Dump chunk into parquet file
            chunk.to_parquet(chunk_path, compression=compression, **kwargs)
            # Upload parquet file
            upload_sql = (
                "PUT /* Python:snowflake.connector.pandas_tools.write_pandas() */ "
                "'file://{path}' @{stage_location} PARALLEL={parallel}"
            ).format(
                path=chunk_path.replace("\\", "\\\\").replace("'", "\\'"),
                stage_location=stage_location,
                parallel=parallel,
            )
            logger.debug(f"uploading files with '{upload_sql}'")
            cursor.execute(upload_sql, _is_internal=True)
            # Remove chunk file
            os.remove(chunk_path)

However, the efficiency of the COPY INTO statement improves with more, smaller files (up to a point). This tension between the two commands reduces overall efficiency. In practice, I found that for large dataframes (10+ GB in memory) the current approach can increase load times by ~20% compared to a directory upload (shown below).

    with TemporaryDirectory() as tmp_folder:
        for i, chunk in chunk_helper(df, chunk_size):
            chunk_path = os.path.join(tmp_folder, f"file{i}.txt")
            # Dump chunks into parquet files
            chunk.to_parquet(chunk_path, compression=compression, **kwargs)

        # Upload directory
        upload_sql = (
            "PUT /* Python:snowflake.connector.pandas_tools.write_pandas() */ "
            "'file://{path}/*' @{stage_location} PARALLEL={parallel}"
        ).format(
            path=tmp_folder.replace("\\", "\\\\").replace("'", "\\'"),
            stage_location=stage_location,
            parallel=parallel,
        )
        logger.debug(f"uploading files with '{upload_sql}'")
        cursor.execute(upload_sql, _is_internal=True)

In my testing on smaller dataframes, I observed comparable performance between the new and old versions of this loop. Note that this does result in more usage of the temp filesystem, as the full dataframe is persisted to disk before being uploaded. However, since the compressed parquet files are smaller than the in-memory dataframe, I wouldn't expect this to cause filesystem space issues in most circumstances.
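If disk space in the temporary directory is ever a concern, a rough pre-check could compare the dataframe's in-memory footprint against the free space there. A minimal sketch (has_room_for_chunks is just an illustrative helper, not part of the proposal), assuming the compressed parquet files are no larger than the in-memory dataframe:

import shutil
import tempfile


def has_room_for_chunks(df, safety_factor: float = 1.0) -> bool:
    """Rough check that the temp filesystem can hold all staged parquet files at once."""
    # Assumption: compressed parquet output is no larger than the in-memory dataframe.
    in_memory_bytes = int(df.memory_usage(deep=True).sum())
    free_bytes = shutil.disk_usage(tempfile.gettempdir()).free
    return free_bytes > in_memory_bytes * safety_factor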

@github-actions github-actions bot changed the title Proposal to speed up write_pandas for large dataframes SNOW-1845050: Proposal to speed up write_pandas for large dataframes Dec 4, 2024
@sfc-gh-sghosh sfc-gh-sghosh self-assigned this Dec 6, 2024
@sfc-gh-sghosh

Hello @mdwgrogan ,

Thanks for raising the issue; we are looking into it and will update.

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added the status-triage Issue is under initial triage label Dec 6, 2024
@sfc-gh-sghosh

Hello @mdwgrogan ,

  1. Snowflake recommends using data files between 100 MB and 250 MB (compressed) for optimal performance.
    If files exceed this size range, consider using a third-party tool to split them into smaller chunks before uploading.

  2. The PUT command does not support uploading files with divergent directory paths as it does not preserve directory structures. Wildcards (e.g., file:///tmp/data/*.parquet) can be used to match multiple files in the same directory but will fail for nested subdirectories (e.g., file:///tmp/data/**).

So, the current implementation in write_pandas aligns with Snowflake’s guidelines by creating manageable file sizes for the PUT command.
If optimizing for performance, ensure the chunk size creates files within the 100 MB to 250 MB range to balance PUT and COPY INTO efficiency. For very large datasets, adjusting the chunk size to produce appropriately sized files is key. While directory-based uploads might reduce the number of PUT commands, the file size constraints and directory structure limitations must be adhered to for seamless operation.
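One way to pick a chunk_size in that range is to write a small sample chunk to parquet and scale from its measured size. A rough sketch (the function name and its parameters are illustrative, not connector options):

import pathlib
import tempfile


def estimate_chunk_size(df, sample_rows: int = 100_000, target_mb: float = 175.0,
                        compression: str = "snappy") -> int:
    """Estimate rows per chunk so each compressed parquet file is roughly target_mb."""
    sample = df.iloc[:sample_rows]
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = pathlib.Path(tmp, "sample.parquet")
        sample.to_parquet(sample_path, compression=compression)
        mb_per_row = sample_path.stat().st_size / (1 << 20) / len(sample)
    return max(1, int(target_mb / mb_per_row))

The returned value could then be passed as chunk_size to write_pandas.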

Regards,
Sujan

@sfc-gh-sghosh sfc-gh-sghosh added status-triage_done Initial triage done, will be further handled by the driver team and removed status-triage Issue is under initial triage labels Dec 6, 2024
@mdwgrogan

@sfc-gh-sghosh, thanks for your replies - I just want to clarify a few things.

  1. Yes, I know. I'm taking some care in setting the chunk size to make sure the df is split appropriately by the write_pandas command; my point is that by uploading each file individually, you diminish the value of the parallelization.
  2. This is not relevant, since the write_pandas function only has a single dimension of chunking (it's entirely row based, with no nested subdirectories at all).

I created a dataframe full of random data to test the magnitude of the impact; the slowdown is generally 20-30%.

import pathlib
import string

import numpy as np
import pandas as pd
from snowflake.connector import pandas_tools as pd_tools

RNG = np.random.default_rng()
nrows = 100_000_000
ncols = 10

df = pd.DataFrame(RNG.random(size=(nrows, ncols)), columns=list(string.ascii_uppercase[0:ncols]))

# check the file size for a given chunk_size
chunk_size = 2_000_000
test_path = pathlib.Path("test.snappy.parquet")
df.iloc[0:chunk_size].to_parquet(test_path)
test_path.stat().st_size / (1 << 20)  # ~160 MB

# conn is an existing snowflake.connector connection to the target database
_ = pd_tools.write_pandas(
    conn,
    df,
    "CURR_WRITER",
    auto_create_table=True,
    overwrite=True,
    compression='snappy',
    chunk_size=chunk_size,
    parallel=10,
)

In my container, the current version of write_pandas took 3m30, and the changed version I proposed took 2m30. Hopefully this minimal example helps explain the value of updating this.

Thanks!
