Loading List of List of Strings leads to nans #917

Open
olegsinavski opened this issue Feb 3, 2024 · 6 comments
Comments

@olegsinavski

Describe the issue:
Hello,
I have a peculiar field type in my parquet file:
List of Lists of strings.

For example:
0 []
1 [["hello"]]
2 [["hello", "bye"]]
3 [["hello"], ["bye"]]
...

I found that pyarrow loads those fine (default pandas engine), while fastparquet silently converts them to nans.

Minimal Complete Verifiable Example:

import sys

import numpy as np
import pandas as pd
import fastparquet  # needed for fastparquet.__version__ below
from fastparquet import ParquetFile

print(f"Python {sys.version}")
print(f"Numpy {np.__version__}")
print(f"Fastparquet {fastparquet.__version__}")
print(f"Pandas {pd.__version__}")

# One row holding a list of lists of strings (two levels of nesting).
data = {
    "texts": [[["Message1", "Message2"]]],
}
df = pd.DataFrame(data)
df.to_parquet('test.parquet')  # written with pandas' default engine

# Read the file back with fastparquet.
pf = ParquetFile('test.parquet')
df_fast = pf.to_pandas()
print("fastparquet:")
print(df_fast)

# Read the same file back with pandas' default engine (pyarrow).
df_pandas = pd.read_parquet('test.parquet')
print("pyarrow:")
print(df_pandas)

prints out:

Python 3.9.18 (main, Oct  3 2023, 01:30:02) 
[Clang 17.0.1 ]
Numpy 1.25.2
Fastparquet 2023.10.1
Pandas 1.5.3
fastparquet:
  texts
0  None
pyarrow:
                    texts
0  [[Message1, Message2]]

Environment:
Fastparquet version: 2023.10.1

@martindurant
Member

Fastparquet only supports one-level nested structures (a struct, list or map of primitive types): https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema
This is because pandas without arrow doesn't really support anything else either.
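To make the one-level limit concrete, here is a minimal sketch of a check that could be run on the Python data before it is written, flagging columns nested more than one level deep. The helper names `list_depth` and `columns_beyond_depth` are hypothetical and not part of fastparquet or pandas, and the sketch only looks at plain Python lists and tuples (not numpy arrays or dicts).

```python
def list_depth(value):
    """Return how many levels of list nesting a Python value has (0 for scalars)."""
    if not isinstance(value, (list, tuple)):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((list_depth(item) for item in value), default=0)


def columns_beyond_depth(data, max_depth=1):
    """Return names of columns holding any value nested deeper than max_depth."""
    return [
        name
        for name, values in data.items()
        if any(list_depth(v) > max_depth for v in values)
    ]


data = {
    "texts": [[["Message1", "Message2"]]],  # list of lists: depth 2
    "tags": [["a", "b"]],                   # flat list: depth 1
    "count": [1],                           # scalar: depth 0
}
print(columns_beyond_depth(data))  # ['texts']
```

Under this assumption, `texts` from the example above is the only column that exceeds one level, which matches the column that came back empty.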

@olegsinavski
Author

Thanks, that is indeed in the docs, so not a bug, unfortunately.

@martindurant What do you mean it's not supported by pandas? After parquet is read (with arrow by default), I get a nice numpy array of strings there which I can access, map through etc.

Also, wouldn't it be better to hard-crash on parse failure as opposed to silently producing nans? It took me ages to find this, since it's just one column out of the many I have in this format. Maybe I should file a separate issue on that, though.

@martindurant
Member

After parquet is read (with arrow by default)

Either you get a native arrow column with a set of arrays in the background, or a numpy array of objects. fastparquet could in theory do the latter, but it would be a lot of work and would be really slow, loopy Python code.

@olegsinavski
Author

Sounds good! So what do you think about hard-crashing instead of silently ignoring data in this case? I don't think ignoring a specific column would fly in production systems.

@martindurant
Member

A warning may be OK, but I don't think we want to force users to list the specific columns they want to read just to avoid the ones they can't.

@olegsinavski
Author

It's probably more frequent to either want all the data (in that case, a user would expect things to be read correctly) or already specify what you want anyway. I haven't seen a library where data is ignored silently just for the syntactic convenience. Also, "ignore_columns" could be an option. But that's just an opinion. Thank you a lot for the very fast library!
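
For illustration only, the three behaviours discussed in this thread (hard crash, warning, and an explicit ignore list) could be sketched as a column-selection policy like the one below. Everything here is an assumption: `columns_to_read`, `on_unsupported` and `ignore_columns` are hypothetical names and are not part of the fastparquet API, and the set of unsupported columns is taken as given rather than detected.

```python
import warnings


def columns_to_read(all_columns, unsupported, ignore_columns=(), on_unsupported="raise"):
    """Hypothetical policy helper (not fastparquet API): choose the columns to load.

    Columns in ignore_columns are skipped quietly. Any remaining column in
    `unsupported` either raises or triggers a warning, depending on on_unsupported.
    """
    if on_unsupported not in ("raise", "warn"):
        raise ValueError(f"unknown on_unsupported value: {on_unsupported!r}")
    ignored = set(ignore_columns)
    problem = [c for c in all_columns if c in unsupported and c not in ignored]
    if problem and on_unsupported == "raise":
        raise ValueError(f"cannot read columns {problem}; list them in ignore_columns to skip")
    if problem:
        # "warn" mode: tell the user which columns are being dropped.
        warnings.warn(f"skipping unreadable columns {problem}", UserWarning, stacklevel=2)
    skip = ignored | set(problem)
    return [c for c in all_columns if c not in skip]


# Explicitly ignoring the nested column avoids both the error and the warning.
print(columns_to_read(["id", "texts"], {"texts"}, ignore_columns=["texts"]))  # ['id']
```

With the default `on_unsupported="raise"`, the same call without `ignore_columns` raises a `ValueError`, while `on_unsupported="warn"` keeps the current convenience but makes the dropped column visible.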
