[Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash #44643

Open

snakingfire opened this issue Nov 5, 2024 · 7 comments

Describe the bug, including details regarding any error messages, version, and platform.

Related to #44640

When converting a pandas dataframe that has a dict-typed column to a pyarrow table with a map column, the conversion fails with the following error once the dataframe and column are sufficiently large:

/.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder

This is immediately followed by SIGABRT and the process crashing.

When the dataframe is smaller, the conversion succeeds without error. See the reproduction code below: when dataframe_size is set to a small value (e.g. 1M rows) there is no error, but at a certain size (e.g. 10M rows) the crash occurs.

# Example DataFrame creation
import random
import string

import numpy as np
import pandas as pd
import pyarrow

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
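
Until this is fixed, a possible mitigation is to convert the dataframe in slices and concatenate the resulting tables. This is only a sketch, assuming the crash stems from accumulating too much string data in a single conversion pass (see the maintainers' analysis below); batch_size is an illustrative value, not a tested threshold.

batch_size = 1_000_000  # illustrative; keep each slice well under ~2 GiB of string data
tables = [
    pyarrow.Table.from_pandas(
        df=df_with_map_col.iloc[start : start + batch_size],
        schema=schema,
        preserve_index=False,
        safe=True,
    )
    for start in range(0, len(df_with_map_col), batch_size)
]
table = pyarrow.concat_tables(tables)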

Environment Details:

  • Python Version: Python 3.11.8
  • Pyarrow version: 18.0.0

Component(s)

Python

@snakingfire snakingfire changed the title Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 5, 2024

pitrou commented Nov 9, 2024

cc @jorisvandenbossche @raulcd

@raulcd raulcd changed the title [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 14, 2024

raulcd commented Nov 14, 2024

really weird:

item_builder_->length(): 19456153
key_builder_->length():  19456154

I'll have to debug a bunch to understand where this mismatch is coming from :)


pitrou commented Nov 14, 2024

Intuitively, I think what happens is that the item_builder_ overflows because it's a StringBuilder and we try to append more than 2 GiB to it. The converter logic then tries to finish the chunk and start another one, but the key and item builders are out of sync.
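
For scale, a back-of-the-envelope estimate from the repro parameters supports this; the averages below are assumptions taken from the script's random.randint bounds, not measurements.

rows = 10_000_000
entries_per_map = (5 + 10) / 2   # avg of random.randint(5, 10)
value_bytes = (20 + 200) / 2     # avg of random.randint(20, 200)
total_item_bytes = rows * entries_per_map * value_bytes
print(total_item_bytes / 2**30)  # ~7.7 GiB, far past the 2 GiB (2**31 - 1)
                                 # offset limit of a 32-bit-offset StringBuilder;
                                 # at 1M rows the same estimate is ~0.77 GiB,
                                 # which fits, matching the observed behavior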


pitrou commented Nov 14, 2024

It looks like the rewind-on-overflow in arrow/util/converter.h is too naive. In particular, if appending to one of a StructBuilder's child builders raises CapacityError, then all child builders should be rewound to the same length to ensure consistency.
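
As a toy illustration of that invariant (a Python sketch of the idea, not Arrow's actual C++ implementation): if appending the item fails after the key was already appended, the key has to be rolled back too, or the two child builders drift apart by exactly one element, as in the lengths logged above.

class CapacityError(Exception):
    pass

class ToyMapBuilder:
    """Toy model of MapBuilder's key/item invariant; not Arrow code."""

    def __init__(self, item_capacity):
        self.keys = []
        self.items = []
        self.item_capacity = item_capacity

    def append(self, key, item):
        self.keys.append(key)
        try:
            if len(self.items) >= self.item_capacity:
                raise CapacityError("item builder is full")
            self.items.append(item)
        except CapacityError:
            self.keys.pop()  # rewind the key so both children keep the same length
            raise

builder = ToyMapBuilder(item_capacity=1)
builder.append("k1", "v1")
try:
    builder.append("k2", "v2")   # the item append overflows...
except CapacityError:
    pass
assert len(builder.keys) == len(builder.items)  # ...but the lengths still match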


raulcd commented Nov 15, 2024

Thanks @pitrou, that was also what I thought, but I am confused after some debugging, because even if I decide not to rewind (just for testing purposes):

-          length_ -= 1;
-          offset -= 1;
+          // length_ -= 1;
+          // offset -= 1;
+          std::cout << "Let's not rewind" << '\n';

The two builders still have a mismatch; below is some logging from my tests when not rewinding:

Chunker::Extend length_before 0 length_after: 2601690
Chunker::Extend num_converted 2601690
Let's not rewind
FinishChunk length_: 2601690
FinishChunk reserved_: 0
ToArray(int64_t length) length: 2601690
ToArray length: 2601690
item_builder_->length(): 19530430
key_builder_->length():  19530431


pitrou commented Nov 15, 2024

Yes, I don't think the explicit rewinding is the problem. The problem is that, when a field of a struct builder fails appending, the other fields are not rewound.


raulcd commented Nov 15, 2024

The error is coming from here: https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/python_to_arrow.cc#L1059-L1062
If an element of the tuple fails to be added due to capacity, we should rewind the already-appended tuple elements from the other child arrays; otherwise there is a mismatch. I was trying to understand whether there is a method to Pop an element, or whether we should first check that all elements of the tuple are going to fit (maybe through Builder::CheckCapacity, even though that method seems to be private) and, if so, append to the individual child arrays.
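
The check-before-append option raulcd mentions would look roughly like this, reusing the toy builder sketched above (again a Python model of the idea; Builder::CheckCapacity itself is private C++ and is not reproduced here):

def append_pair_checked(builder, key, item):
    # Verify that every child builder can take one more element before
    # touching any of them, so a failure can never leave the children at
    # different lengths.
    if len(builder.items) + 1 > builder.item_capacity:
        raise CapacityError("appending would overflow the item builder")
    builder.keys.append(key)
    builder.items.append(item)

Either approach preserves the invariant that key_builder_ and item_builder_ have equal lengths when a chunk is finished.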
