Concurrent merges not working >=0.19.0 #2980

Open
echai58 opened this issue Nov 7, 2024 · 0 comments
Labels
bug Something isn't working

echai58 commented Nov 7, 2024

Environment

Delta-rs version: 0.19.0 and later

Binding: python


Bug

What happened:
On 0.18.2, concurrent merges to different partitions worked, at least for some partition column types such as strings. After upgrading, they appear to be completely broken. The regression is present in every version >= 0.19.0.
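To check whether an environment falls in the affected range, something like the following could be used. It is a minimal sketch: `FIRST_AFFECTED` and `is_affected` are hypothetical names introduced here, and the range only reflects what this report states (0.18.2 works, 0.19.0 and later fail), with no known fixed version.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First release reported as failing in this issue (assumption: no fix yet).
FIRST_AFFECTED = (0, 19, 0)


def is_affected(installed: str) -> bool:
    """Return True if `installed` is 0.19.0 or later, per this report."""
    # Only the leading major.minor.patch digits are compared.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", installed)
    if match is None:
        raise ValueError(f"unrecognized version string: {installed!r}")
    return tuple(int(part) for part in match.groups()) >= FIRST_AFFECTED


if __name__ == "__main__":
    try:
        installed = version("deltalake")
    except PackageNotFoundError:
        print("deltalake is not installed")
    else:
        state = "affected" if is_affected(installed) else "not affected"
        print(f"deltalake {installed}: {state}")
```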

What you expected to happen:
Concurrent merges to different partitions continue to work for the partition column types that worked on 0.18.2.
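As a rough illustration of that expectation, here is a toy model of a partition-level conflict check. It is not delta-rs's conflict checker; `ToyCommit` and `conflicts` are hypothetical names, and the model only captures the idea that two commits touching disjoint partition values should both be allowed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCommit:
    """Hypothetical stand-in for a merge: the partition values it touched."""
    partitions: frozenset


def conflicts(committed: ToyCommit, pending: ToyCommit) -> bool:
    """Two commits conflict in this toy model only if they share a partition."""
    return not committed.partitions.isdisjoint(pending.partitions)


# Mirrors the repro: one merge touches string_key="foo", the other "foo2".
first = ToyCommit(frozenset({"foo"}))
second = ToyCommit(frozenset({"foo2"}))

if conflicts(first, second):
    print("conflict: second merge should be rejected")
else:
    print("no conflict: both merges should commit")
```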

How to reproduce it:
The following code block succeeds on 0.18.2, but fails with

CommitFailedError: Failed to commit transaction: Error evaluating predicate: Generic DeltaTable error: Internal error: Failed to coerce types Date32 and Int64 in BETWEEN expression.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

on 0.19.0 and later.

from deltalake import DeltaTable, write_deltalake
import pyarrow as pa
import datetime
import pandas as pd 

path = "test_dt"
partition_columns = ["string_key"]
schema = pa.schema(
    [
        pa.field("date_key", pa.date32()),
        pa.field("string_key", pa.string()),
        pa.field("int32_key", pa.int32()),
        pa.field("value", pa.float64()),
    ]
)

DeltaTable.create(
    table_uri=path,
    schema=schema,
    mode="error",
    partition_by=partition_columns,
)

# Load two handles at the same table version to simulate two concurrent writers.
delta_table_1 = DeltaTable(path)
delta_table_2 = DeltaTable(path)

delta_table_1.merge(
    pa.Table.from_pandas(
        pd.DataFrame(
            {
                "date_key": [datetime.date(2020, 1, 1)],
                "string_key": ["foo"],
                "int32_key": [1],
                "value": [2.0],
            }
        ),
        schema=schema,
    ),
    predicate="s.date_key = t.date_key AND s.string_key = t.string_key AND s.int32_key = t.int32_key",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()

delta_table_2.merge(
    pa.Table.from_pandas(
        pd.DataFrame(
            {
                "date_key": [datetime.date(2021, 1, 1)],
                "string_key": ["foo2"],
                "int32_key": [2],
                "value": [3.0],
            }
        )
    ),
    predicate="s.date_key = t.date_key AND s.string_key = t.string_key AND s.int32_key = t.int32_key",
    source_alias="s",
    target_alias="t",
).when_matched_update_all().when_not_matched_insert_all().execute()

echai58 added the bug label on Nov 7, 2024