Loading List of List of Strings leads to nans #917
Comments
Fastparquet only supports one-level nested structures (struct, list or map or primitive types): https://fastparquet.readthedocs.io/en/latest/details.html#reading-nested-schema
Thanks, that is indeed in the docs, so unfortunately not a bug. @martindurant What do you mean it's not supported by pandas? After the parquet file is read (with arrow by default), I get a nice numpy array of strings there which I can access, map over, etc. Also, wouldn't it be better to hard-crash on parse failure as opposed to silently producing NaNs? It took me ages to find this, since it's just one column out of many I have in that format. Maybe I should file a separate issue on that, though.
Either you get a native arrow column with a set of arrays in the background, or a numpy array of objects. fastparquet could in theory do the latter, but it would be a lot of work and would be really slow, loopy Python code.
Sounds good! So what do you think about hard-crashing instead of silently ignoring data in this case? I don't think ignoring a specific column would fly in production systems.
A warning may be OK, but I don't think we want to force users to list specific columns they want to read just to avoid those they can't.
It's probably more frequent to either want all the data (in that case, a user would expect things to be read correctly) or already specify what you want anyway. I haven't seen a library where data is ignored silently just for the syntactic convenience. Also, "ignore_columns" could be an option. But that's just an opinion. Thank you a lot for the very fast library!
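To make the three options discussed above (hard-crash, warn, or skip the column) concrete, here is a rough sketch in plain Python. It is not fastparquet code and does not use any fastparquet API: `nesting_depth`, `check_columns`, and the `on_unsupported` switch are hypothetical names invented for illustration, and the input is assumed to be a plain dict of column name to row values rather than a real parquet schema.

```python
import warnings


def nesting_depth(value):
    """Return how many list levels wrap the innermost items (0 for a scalar)."""
    if not isinstance(value, list):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((nesting_depth(item) for item in value), default=0)


def check_columns(columns, max_depth=1, on_unsupported="raise"):
    """Return the names of columns whose nesting depth is within max_depth.

    columns: dict of column name -> list of row values (hypothetical input).
    on_unsupported: "raise", "warn" or "skip" (hypothetical policy switch).
    """
    if on_unsupported not in ("raise", "warn", "skip"):
        raise ValueError(f"unknown policy: {on_unsupported!r}")
    readable = []
    for name, rows in columns.items():
        depth = max((nesting_depth(row) for row in rows), default=0)
        if depth <= max_depth:
            readable.append(name)
            continue
        message = f"column {name!r} has nesting depth {depth}, limit is {max_depth}"
        if on_unsupported == "raise":
            raise TypeError(message)
        if on_unsupported == "warn":
            warnings.warn(message)
        # "skip" (and "warn") leave the column out of the result.
    return readable


# Sample rows taken from the issue description, plus a flat column.
columns = {
    "id": [0, 1, 2, 3],
    "tags": [[], [["hello"]], [["hello", "bye"]], [["hello"], ["bye"]]],
}
print(check_columns(columns, on_unsupported="skip"))
```

With `on_unsupported="raise"` the same call fails loudly on the `tags` column instead of dropping it, which is the behaviour asked for above; `"warn"` corresponds to the maintainer's suggestion.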
Describe the issue:
Hello,
I have a peculiar field type in my parquet file:
List of Lists of strings.
For example:
0 []
1 [["hello"]]
2 [["hello", "bye"]]
3 [["hello"], ["bye"]]
...
I found that pyarrow (the default pandas engine) loads those fine, while fastparquet silently converts them to NaNs.
Minimal Complete Verifiable Example:
prints out:
Environment:
Fastparquet version: 2023.10.1