
Pyarrow writer not encoding correct URL for partitions in delta table #2978

Open
gprashmi opened this issue Nov 5, 2024 · 22 comments
Labels: bug Something isn't working

gprashmi commented Nov 5, 2024

Environment

Delta-rs version: 0.19.0

What happened:
We write data to a Delta table using delta-rs with the PyArrow engine, with DayHour as the partition column.

deltalake.write_deltalake(
    table_or_uri=delta_table_path,
    data=df,
    partition_by=[dayhour_partition_column],
    schema_mode='overwrite',
    mode="append",
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)

I then ran the OPTIMIZE command on the Delta table using the Spark SQL query below:

optimize_query = f"""
OPTIMIZE delta.`s3_table_path`
ZORDER BY (col1, col2)
"""
spark.sql(optimize_query)

After optimize, it creates partition directories with spaces and does not properly encode the partition URLs, i.e. the new partition URLs contain spaces (*.zstd.parquet files), as shown in the screenshot below.

[screenshot: partition paths with unencoded spaces]

@ion-elgreco Can you please let me know how we can run optimize.compact without ending up with partitions containing spaces?

A similar issue was raised in June (#2634), where it was said to be fixed in version 0.18.3, but I still see the same issue when I optimize now. To clarify, I use the PyArrow engine, not Rust, in case that is what breaks the partitions.

gprashmi added the bug label Nov 5, 2024

gprashmi commented Nov 5, 2024

I tried to install version 0.18.3, but it is not available, so I installed 0.19.0 and tried to optimize. The table was written with version 0.17.4.

ERROR: Could not find a version that satisfies the requirement deltalake==0.18.3 (from versions: 0.2.0, 0.2.1, 0.3.0, 0.4.0, 0.4.1, 0.4.4, 0.4.5, 0.4.6, 0.4.7, 0.4.8, 0.5.0, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.6.0, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.7.0, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.10.2, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 0.15.0, 0.15.1, 0.15.2, 0.15.3, 0.16.0, 0.16.1, 0.16.2, 0.16.3, 0.16.4, 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.18.0, 0.18.1, 0.18.2, 0.19.0, 0.19.1, 0.19.2, 0.20.0, 0.20.1, 0.20.2, 0.21.0)
ERROR: No matching distribution found for deltalake==0.18.3


gprashmi commented Nov 6, 2024

@ion-elgreco @rtyler

ion-elgreco (Collaborator)

Please use the latest version; this is already resolved.


gprashmi commented Nov 6, 2024

@ion-elgreco
@ion-elgreco I used the latest version, 0.21.0, to both write the table and optimize/vacuum it. Even then I still see the same encoding; please find the details below:

deltalake = "0.21.0"

I even tried to optimize without Z-order:

optimize_query = f"""
OPTIMIZE delta.`{s3_table_path}`
"""
spark.sql(optimize_query)

vacuum_query = f"""
VACUUM delta.`{s3_table_path}`
RETAIN 168 HOURS
"""

# Execute the VACUUM command
spark.sql(vacuum_query)

But I still see partitions like this after optimize/vacuum:

[screenshot: partition paths with unencoded spaces]


gprashmi commented Nov 6, 2024

With optimize, I also do not see the number of partition files reduced; in fact, with the new partitions (with spaces) the file count has increased. Is this expected?


gprashmi commented Nov 6, 2024

@ion-elgreco It only works sometimes; since broken partitions are created, optimize sometimes fails with the error below:

pyspark.errors.exceptions.connect.SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 989746.0 failed 4 times, most recent failure: Lost task 0.3 in stage 989746.0 (TID 9353124) (10.13.14.133 executor 6): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] User defined function (CanonicalPathFunction: (string) => string) failed due to: java.net.URISyntaxException: Illegal character in path at index 18: DayHour=2024-11-06 22%253A00%253A00/part-00000-eac21980-5584-4515-9a80-bfe806ab5ceb.c000.snappy.parquet. SQLSTATE: 39000

ion-elgreco (Collaborator)

Try recreating the table with the latest version.


gprashmi commented Nov 7, 2024

@ion-elgreco Yes, as I mentioned in the previous comment, both the write to the table and the optimize/vacuum were done with the latest version (0.21.0), and it still breaks due to spaces in the partitions.

When I write to the table there are no spaces in the partition paths, but after optimize the spaces appear. My partition column is DayHour, with values like 2024-1-09 21:00:00; are the spaces introduced during optimize because of this? Should we avoid having date and hour together in a partition column? Is there an alternative we can use?
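For intuition on where the `%253A` in the earlier stack trace can come from, here is a stdlib illustration of double percent-encoding (this is only an illustration, not a claim about delta-rs or Spark internals):

```python
from urllib.parse import quote

value = "2024-11-06 22:00:00"

# One round of percent-encoding escapes both the space and the colons.
once = quote(value)
print(once)   # 2024-11-06%2022%3A00%3A00

# Encoding the already-encoded value escapes the '%' itself, yielding %25..,
# which is the %253A pattern seen in the failed-UDF error above.
twice = quote(once)
print(twice)  # 2024-11-06%252022%253A00%253A00
```

A path like `DayHour=2024-11-06 22%253A00%253A00` mixes a raw space with a double-encoded colon, which is consistent with two writers applying different encoding rules to the same partition value.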

thomasfrederikhoeck (Contributor)

@gprashmi are you on Windows by any chance?


gprashmi commented Nov 11, 2024

@thomasfrederikhoeck I have a Windows laptop, but I run these in a Kubeflow experiment on a Databricks cluster.


thomasfrederikhoeck commented Nov 12, 2024

Okay. I asked because a similar issue (apache/arrow-rs#5592) has been fixed upstream, but I don't think object_store has been upgraded in delta-rs, so the fix is not part of delta-rs yet.

thomasfrederikhoeck (Contributor)

Yeah, delta-rs uses object-store=0.10.1, and the fix was added in object-store=0.10.2:

https://github.com/delta-io/delta-rs/blob/7a3b3ec38ce1004eab1998669a5a80f8e61c5589/Cargo.toml#L44C1-L45C1

thomasfrederikhoeck (Contributor)

I guess once this is fixed, it should also fix #2843.


gprashmi commented Nov 12, 2024

@thomasfrederikhoeck Thank you for the update. Can you please let me know when delta-rs will be updated to object-store=0.10.2?

@ion-elgreco Based on the comment from @thomasfrederikhoeck, it looks like this would be fixed once delta-rs uses the updated object-store=0.10.2. Can you please let me know if updating delta-rs to the latest object-store version is planned?

ion-elgreco (Collaborator)

> @thomasfrederikhoeck thank you for the update. Can you please let me know when would the delta-rs be updated to have the object-store=0.10.2?
>
> @ion-elgreco Based on the comment from @thomasfrederikhoeck it looks like this would be fixed when delta-rs uses the updated object-store=0.10.2 version. Can you please let me know if this is in plan to have the delta-rs updated to latest object-store version?

Feel free to create a PR for it

thomasfrederikhoeck (Contributor)

Maybe fixed by #2994

thomasfrederikhoeck (Contributor)

I'm not 100% sure this fixes this case, so maybe leave it open, @ion-elgreco?


gprashmi commented Nov 15, 2024

@thomasfrederikhoeck @ion-elgreco This did not fix the issue. I built delta-rs as a Python package with object_store = 0.10.2 in Cargo.toml and tested the Delta write, optimize, and vacuum. It still shows spaces in the URL.

Sample code to reproduce:

import deltalake
import pandas as pd
import pyarrow as pa

# delta_table_path: S3 URI of the Delta table (elided here)

# Dummy data
initial_data = {
    'dayhour': ['2024-10-09 19:00:00', '2024-10-10 20:00:00'],
    'value1': [10, 20],
    'value2': [1.5, 2.5]
}
initial_df = pd.DataFrame(initial_data)

initial_df['dayhour'] = pd.to_datetime(initial_df['dayhour'])

# Define the schema for the Delta Lake table
schema = pa.schema([
    pa.field('dayhour', pa.timestamp('us')),
    pa.field('value1', pa.int32()),
    pa.field('value2', pa.float32())
])

# Initialize the Delta table with the schema
deltalake.write_deltalake(
    table_or_uri=delta_table_path,
    data=initial_df,
    schema=schema,
    partition_by=['dayhour'],
    schema_mode='overwrite',
    mode="overwrite",
    storage_options={"AWS_S3_ALLOW_UNSAFE_RENAME": "true"},
)

optimize_query = f"""
OPTIMIZE delta.`{delta_table_path}`
"""
spark.sql(optimize_query)

vacuum_query = f"""
VACUUM delta.`{delta_table_path}`
RETAIN 168 HOURS
"""
spark.sql(vacuum_query)

This resulted in spaces in the URL encoding after optimize, as below:

Before optimize:
[screenshot: encoded partition paths]

After optimize:
[screenshot: partition paths containing spaces]

Can you please re-open this ticket? I am unable to reopen it from my end.
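As a possible workaround until the encoding behavior is consistent, one could partition on a derived string column that contains no spaces or colons. A minimal sketch (the helper name and the `%Y-%m-%d-%H` format are illustrative assumptions, not from this thread):

```python
from datetime import datetime

def partition_key(ts: datetime) -> str:
    # Space- and colon-free partition value, e.g. "2024-10-09-19";
    # such values need no percent-encoding in object-store paths.
    return ts.strftime("%Y-%m-%d-%H")

print(partition_key(datetime(2024, 10, 9, 19, 0, 0)))  # 2024-10-09-19
```

You would add this as a derived column (e.g. a hypothetical `dayhour_part`) and pass `partition_by=['dayhour_part']` to `write_deltalake`, keeping the original timestamp column for queries.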

@ion-elgreco ion-elgreco reopened this Nov 15, 2024

gprashmi commented Nov 15, 2024

@ion-elgreco Thank you for re-opening. So I guess the updated version of object_store did not help optimize here. Please let me know if there are any other suggestions/alternatives we can use?
@thomasfrederikhoeck


thomasfrederikhoeck commented Nov 15, 2024

Edit:
@gprashmi So this indicates that Spark encodes the colon but not the space, right? And you are on AWS, right?

gprashmi (Author)

@thomasfrederikhoeck I think it encodes the colon, but in some URLs not the space between date and hour in the dayhour column. And yes, on AWS.

thomasfrederikhoeck (Contributor)

@gprashmi Does Spark encode the space in all instances (write/merge/etc.)?
