pure-numpy interface to parquet #931
Hi @martindurant
{0: {
'foo.with.strings-data': array([0, 1, -1], dtype=int8),
'foo.with.strings-cats': ["hey", "there"],
'foo.with.ints-data': array([1, 2, 3], dtype=uint8),
'foo.with.lists.list-offsets': array([0, 1, 2, 3]),
'foo.with.lists.list.element-data': array([0, 0, 0], dtype=uint8),
'foo.with.lists.list.element-cats': [0]}
}
I am also curious to know what will be the input for the general
Thank you for your feedback!
These are complex columns. In this case, a list-of-lists is made up of the data values, offsets and maybe an index (in the case of categoricals). There will be some simple wrappers in https://github.com/dask/fastparquet/blob/a9d3f309068189043f5ecec5f616de90c11fa305/fastparquet/wrappers.py to provide access to these nested structures, or the arrays could be passed directly to arrow, awkward or other libraries that know what to do with them.
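To make the layout concrete, here is a minimal sketch in plain numpy and Python of how the arrays in the output quoted above could be reassembled by hand. The helper names `decode_categorical` and `split_by_offsets` are hypothetical and are not part of fastparquet or its wrappers module. The sketch assumes that a code of -1 marks a null and that offsets follow the arrow/awkward convention, where sublist `i` spans `offsets[i]` to `offsets[i + 1]`.

```python
import numpy as np


def decode_categorical(codes, cats):
    """Map integer codes to category values (hypothetical helper).

    Assumption: a negative code (-1) stands for a missing value.
    """
    return [cats[c] if c >= 0 else None for c in codes]


def split_by_offsets(values, offsets):
    """Cut a flat sequence into sublists (hypothetical helper).

    Assumption: sublist i covers values[offsets[i]:offsets[i + 1]].
    """
    return [list(values[start:stop]) for start, stop in zip(offsets[:-1], offsets[1:])]


# Arrays copied from the row-group output quoted in this thread.
columns = {
    "foo.with.strings-data": np.array([0, 1, -1], dtype="int8"),
    "foo.with.strings-cats": ["hey", "there"],
    "foo.with.lists.list-offsets": np.array([0, 1, 2, 3]),
    "foo.with.lists.list.element-data": np.array([0, 0, 0], dtype="uint8"),
    "foo.with.lists.list.element-cats": [0],
}

# Categorical string column: codes index into the category list.
strings = decode_categorical(
    columns["foo.with.strings-data"], columns["foo.with.strings-cats"]
)

# List column: decode the flat categorical elements first, then apply the offsets.
elements = decode_categorical(
    columns["foo.with.lists.list.element-data"],
    columns["foo.with.lists.list.element-cats"],
)
lists = split_by_offsets(elements, columns["foo.with.lists.list-offsets"])

print(strings)
print(lists)
```

In practice the same buffers could be handed to arrow or awkward instead of being expanded into Python lists, which avoids the per-element loop.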
becomes ["hey", "there", None] as a list
becomes Yes,
Thanks a lot for your quick feedback!
Yes, I think so. So in the simple case of tabular data (nothing nested), this is essentially what pandas gives you anyway:
Due to the upcoming hard dependence of pandas on pyarrow, this branch investigates what it would look like to have a fastparquet that avoids pandas altogether and deals with numpy arrays alone. For complex columns, the representation will be similar to, and compatible with, awkward/arrow buffers, but will not require those packages.