Loading List of List of Strings leads to nans #917

olegsinavski · 2024-02-03T18:47:28Z

Describe the issue:
Hello,
I have a peculiar field type in my parquet file:
List of Lists of strings.

For example:
0 []
1 [["hello"]]
2 [["hello", "bye"]]
3 [["hello"], ["bye"]]
...

I found that pyarrow loads those fine (default pandas engine), while fastparquet silently converts them to nans.

Minimal Complete Verifiable Example:

import pandas as pd
import numpy as np
import sys
import fastparquet  # needed for fastparquet.__version__ below
from fastparquet import ParquetFile

print(f"Python {sys.version}")
print(f"Numpy {np.__version__}")
print(f"Fasparquet {fastparquet.__version__}")
print(f"Pandas {pd.__version__}")

data = {
    "texts": [[["Message1", "Message2"]]],
}
df = pd.DataFrame(data)
df.to_parquet('test.parquet')

df_fast = ParquetFile('test.parquet')
df_fast = df_fast.to_pandas()
print("fastparquet:")
print(df_fast)

# Read back the same file with the default pandas engine (pyarrow)
df_pandas = pd.read_parquet('test.parquet')
print("pyarrow:")
print(df_pandas)

prints out:

Python 3.9.18 (main, Oct  3 2023, 01:30:02) 
[Clang 17.0.1 ]
Numpy 1.25.2
Fasparquet 2023.10.1
Pandas 1.5.3
fastparquet:
  texts
0  None
pyarrow:
                    texts
0  [[Message1, Message2]]

Environment:
Fastparquet version: 2023.10.1

martindurant · 2024-02-03T20:16:42Z

Fastparquet only supports one-level nested structures (struct, list, map, or primitive types); see https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema
This is because pandas without arrow doesn't really support anything else either.
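To find out which columns go beyond the one-level limit described above, one option is to measure the list nesting depth of the values before writing them. The sketch below is plain Python; the helper names `list_depth` and `columns_deeper_than` are hypothetical and are not part of fastparquet or pandas.

```python
def list_depth(value):
    """Return how many levels of list nesting a value has (0 for scalars)."""
    if not isinstance(value, (list, tuple)):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((list_depth(item) for item in value), default=0)


def columns_deeper_than(columns, max_depth=1):
    """Map column name -> observed depth for columns nested deeper than max_depth."""
    too_deep = {}
    for name, values in columns.items():
        depth = max((list_depth(v) for v in values), default=0)
        if depth > max_depth:
            too_deep[name] = depth
    return too_deep


columns = {
    "texts": [[], [["hello"]], [["hello", "bye"]], [["hello"], ["bye"]]],
    "tags": [["a"], ["b", "c"], [], ["d"]],
    "ids": [0, 1, 2, 3],
}
print(columns_deeper_than(columns))  # {'texts': 2}
```

Only `texts` (a list of lists) is reported; `tags` is a one-level list and `ids` is primitive. Values that were read back through pyarrow arrive as NumPy arrays rather than Python lists, so they would need converting (for example with `.tolist()`) before a check like this applies.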

olegsinavski · 2024-02-04T13:42:13Z

Thanks, that is indeed in the docs, so not a bug, unfortunately.

@martindurant What do you mean it's not supported by pandas? After parquet is read (with arrow by default), I get a nice numpy array of strings there which I can access, map through etc.

Also, wouldn't it be better to hard-crash on parse failure as opposed to silently producing nans? It took me ages to find this, since it's just one column out of the many I have in this format. Maybe I should file a separate issue on that, though.

martindurant · 2024-02-05T14:01:16Z

> After parquet is read (with arrow by default)

Either you get a native arrow column with a set of arrays in the background, or a numpy array of objects. fastparquet could in theory do the latter, but it would be a lot of work and be really slow, loopy python code.
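To give a feel for the "loopy python code" in question, here is a heavily simplified sketch of rebuilding list-of-list rows from a flat column of values plus Parquet-style repetition levels. It is not fastparquet's code: the function name `assemble_rows` is hypothetical, definition levels are ignored (so empty lists and nulls cannot be represented), and the first repetition level is assumed to be 0.

```python
def assemble_rows(values, rep_levels):
    """Rebuild list<list<value>> rows from flat values and repetition levels.

    Repetition level 0 starts a new row, 1 starts a new inner list in the
    current row, and 2 appends to the current inner list.
    """
    rows = []
    for value, rep in zip(values, rep_levels):
        if rep == 0:
            rows.append([[value]])      # new row with its first inner list
        elif rep == 1:
            rows[-1].append([value])    # new inner list in the same row
        else:
            rows[-1][-1].append(value)  # continue the current inner list
    return rows


values = ["hello", "hello", "bye", "hello", "bye"]
rep_levels = [0, 0, 2, 0, 1]
print(assemble_rows(values, rep_levels))
# [[['hello']], [['hello', 'bye']], [['hello'], ['bye']]]
```

Every value costs one pass through the Python-level loop and produces Python objects, which is why this approach scales poorly compared with keeping the data in flat arrays.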

olegsinavski · 2024-02-07T14:38:54Z

Sounds good! So what do you think about hard-crashing instead of silently ignoring data in this case? I don't think ignoring a specific column would fly in production systems.

martindurant · 2024-02-07T14:40:42Z

A warning may be OK, but I don't think we want to force users to list the specific columns they want to read just to avoid those they can't.

olegsinavski · 2024-02-07T14:49:41Z

It's probably more frequent to either want all the data (in that case, a user would expect things to be read correctly) or already specify what you want anyway. I haven't seen a library where data is ignored silently just for the syntactic convenience. Also, "ignore_columns" could be an option. But that's just an opinion. Thank you a lot for the very fast library!
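Until something like this exists in the library, a user-side guard can approximate both behaviours discussed here (hard failure or warning, with an opt-out list). The sketch below is an assumption-laden illustration: `check_no_silent_nulls` and its `ignore_columns` and `strict` parameters are hypothetical names, not a fastparquet API, and a column that is legitimately all-null would also be flagged.

```python
import warnings


def is_null(value):
    """True for None and float NaN (NaN is the only float not equal to itself)."""
    return value is None or (isinstance(value, float) and value != value)


def check_no_silent_nulls(columns, ignore_columns=(), strict=True):
    """Flag non-empty columns whose values are all null.

    `columns` is anything with .items() yielding (name, values) pairs,
    such as a dict of lists or a pandas DataFrame.
    """
    suspicious = []
    for name, values in columns.items():
        values = list(values)
        if name not in ignore_columns and values and all(is_null(v) for v in values):
            suspicious.append(name)
    if suspicious:
        message = f"columns read as all-null, possibly unsupported types: {suspicious}"
        if strict:
            raise ValueError(message)
        warnings.warn(message)
    return suspicious


loaded = {"texts": [None, None], "ids": [1, 2]}
try:
    check_no_silent_nulls(loaded)
except ValueError as err:
    print(f"strict mode: {err}")
print(check_no_silent_nulls(loaded, ignore_columns=("texts",)))  # []
```

Calling it on the frame returned by `ParquetFile(...).to_pandas()` right after loading would turn the silent `None` column from the example above into an explicit error or warning.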
