[Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash #44643

Open

snakingfire opened this issue Nov 5, 2024 · 7 comments

Describe the bug, including details regarding any error messages, version, and platform.

Related to #44640

When converting a pandas dataframe that has a dict-typed column to a pyarrow table with a map column, the conversion fails with the following error once the dataframe and column are sufficiently large:

/.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder

This is immediately followed by SIGABRT and the process crashing.

When the dataframe is smaller, the conversion succeeds without error. See the reproduction code below: when dataframe_size is set to a small value (e.g. 1M rows) there is no error, but at a certain size (e.g. 10M rows) the crash occurs.

# Example DataFrame creation
import random
import string

import numpy as np
import pandas as pd
import pyarrow

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
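
Until this is fixed, a possible mitigation is to convert the dataframe in slices and concatenate the resulting tables. This is only a sketch, assuming the crash stems from accumulating too much string data in a single conversion pass (see the maintainers' analysis below); batch_size is an illustrative value, not a tested threshold.

batch_size = 1_000_000  # illustrative; keep each slice well under ~2 GiB of string data
tables = [
    pyarrow.Table.from_pandas(
        df=df_with_map_col.iloc[start : start + batch_size],
        schema=schema,
        preserve_index=False,
        safe=True,
    )
    for start in range(0, len(df_with_map_col), batch_size)
]
table = pyarrow.concat_tables(tables)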

Environment Details:

  • Python Version: Python 3.11.8
  • Pyarrow version: 18.0.0

Component(s)

Python

@snakingfire snakingfire changed the title Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 5, 2024

pitrou commented Nov 9, 2024

cc @jorisvandenbossche @raulcd

@raulcd raulcd changed the title [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 14, 2024

raulcd commented Nov 14, 2024

really weird:

item_builder_->length(): 19456153
key_builder_->length():  19456154

I'll have to debug a bunch to understand where this mismatch is coming from :)


pitrou commented Nov 14, 2024

Intuitively, I think what happens is that the item_builder_ overflows because it's a StringBuilder and we try to append more than 2 GiB to it. The converter logic then tries to finish the chunk and start another one, but the key and item builders are out of sync.
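
For scale, a back-of-the-envelope estimate from the repro parameters supports this; the averages below are assumptions taken from the script's random.randint bounds, not measurements.

rows = 10_000_000
entries_per_map = (5 + 10) / 2   # avg of random.randint(5, 10)
value_bytes = (20 + 200) / 2     # avg of random.randint(20, 200)
total_item_bytes = rows * entries_per_map * value_bytes
print(total_item_bytes / 2**30)  # ~7.7 GiB, far past the 2 GiB (2**31 - 1)
                                 # offset limit of a 32-bit-offset StringBuilder;
                                 # at 1M rows the same estimate is ~0.77 GiB,
                                 # which fits, matching the observed behavior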


pitrou commented Nov 14, 2024

It looks like the rewind-on-overflow in arrow/util/converter.h is too naive. In particular, if appending to one of a StructBuilder's child builders raises CapacityError, then all child builders should be rewound to the same length to ensure consistency.
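
As a toy illustration of that invariant (a Python sketch of the idea, not Arrow's actual C++ implementation): if appending the item fails after the key was already appended, the key has to be rolled back too, or the two child builders drift apart by exactly one element, as in the lengths logged above.

class CapacityError(Exception):
    pass

class ToyMapBuilder:
    """Toy model of MapBuilder's key/item invariant; not Arrow code."""

    def __init__(self, item_capacity):
        self.keys = []
        self.items = []
        self.item_capacity = item_capacity

    def append(self, key, item):
        self.keys.append(key)
        try:
            if len(self.items) >= self.item_capacity:
                raise CapacityError("item builder is full")
            self.items.append(item)
        except CapacityError:
            self.keys.pop()  # rewind the key so both children keep the same length
            raise

builder = ToyMapBuilder(item_capacity=1)
builder.append("k1", "v1")
try:
    builder.append("k2", "v2")   # the item append overflows...
except CapacityError:
    pass
assert len(builder.keys) == len(builder.items)  # ...but the lengths still match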


raulcd commented Nov 15, 2024

Thanks @pitrou, that was also what I thought, but I am confused after some debugging, because even if I decide not to rewind (just for testing purposes):

-          length_ -= 1;
-          offset -= 1;
+          // length_ -= 1;
+          // offset -= 1;
+          std::cout << "Let's not rewind" << '\n';

The two builders still have a mismatch; below is some logging from my tests when not rewinding:

Chunker::Extend length_before 0 length_after: 2601690
Chunker::Extend num_converted 2601690
Let's not rewind
FinishChunk length_: 2601690
FinishChunk reserved_: 0
ToArray(int64_t length) length: 2601690
ToArray length: 2601690
item_builder_->length(): 19530430
key_builder_->length():  19530431


pitrou commented Nov 15, 2024

Yes, I don't think the explicit rewinding is the problem. The problem is that, when a field of a struct builder fails appending, the other fields are not rewound.


raulcd commented Nov 15, 2024

The error is coming from here: https://github.com/apache/arrow/blob/main/python/pyarrow/src/arrow/python/python_to_arrow.cc#L1059-L1062
If an element of the tuple fails to be added due to capacity, we should rewind the already-appended tuple elements from the other child arrays; otherwise there is a mismatch. I was trying to understand whether there is a method to Pop an element, or whether we should first check that all elements of the tuple are going to fit (maybe through Builder::CheckCapacity, even though that method seems to be private) and, if so, append to the individual child arrays.
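
The check-before-append option raulcd mentions would look roughly like this, reusing the toy builder sketched above (again a Python model of the idea; Builder::CheckCapacity itself is private C++ and is not reproduced here):

def append_pair_checked(builder, key, item):
    # Verify that every child builder can take one more element before
    # touching any of them, so a failure can never leave the children at
    # different lengths.
    if len(builder.items) + 1 > builder.item_capacity:
        raise CapacityError("appending would overflow the item builder")
    builder.keys.append(key)
    builder.items.append(item)

Either approach preserves the invariant that key_builder_ and item_builder_ have equal lengths when a chunk is finished.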
