[Question] Support for arrow data types when reading data with fastparquet #854
Comments
Hi @j-bennet,
Bests,
It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

As @yohplala points out, generic (nested, variable-length) arrow support is possible but complex, and I don't think it is on anyone's roadmap - especially since the main use case is via pandas, which has nothing useful it can do with those types. If we were to do this, it would more likely be for awkward, which has a very similar layout in memory but a significantly better API for dealing with it. (awkward also has some more flexibility, allowing start/stop indexing into a buffer rather than just offsets, for instance.)

As for strings, there is indeed a bigger possible improvement to be made, and pandas is already much better integrated with the arrow backend (for string->string and string->simple-type operations). Nevertheless, I don't think we're going to work on this, or on a possible/optional arrow dependency, without significant appetite from the community.
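To make the layout point concrete: an Arrow-style variable-length list column is described by a single monotonically increasing offsets buffer, whereas Awkward-style layouts can carry independent start/stop buffers into the same content, so rows can be dropped or reordered without copying. A minimal sketch in plain numpy (buffer names and values are illustrative, not any library's internal API):

```python
import numpy as np

# Logical column of variable-length lists: [[1, 2], [3, 4, 5], [], [6, 7]]
content = np.array([1, 2, 3, 4, 5, 6, 7])

# Arrow-style: one offsets buffer; row i is content[offsets[i]:offsets[i+1]]
offsets = np.array([0, 2, 5, 5, 7])
row_1 = content[offsets[1]:offsets[2]]          # -> [3, 4, 5]

# Awkward-style: separate starts/stops; rows may point anywhere in the buffer,
# so dropping or reordering rows needs no copy of `content`
starts = np.array([0, 2, 5, 5])
stops  = np.array([2, 5, 5, 7])
reordered_starts = starts[[3, 0]]               # keep rows 3 and 0 only
reordered_stops  = stops[[3, 0]]                # content buffer untouched
```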
Yes, strings being the most valuable use case, we wanted to make sure we can support them in
Thank you @yohplala, I don't know how I missed that one - probably searched for
Hello @martindurant,
Martin, you may have spotted in the latest pandas 2.2 release notes this comment about PyArrow.
I am raising this because it could have an impact on the direction fastparquet takes, if future developments are intended. Unfortunately I have no time to contribute so far, but I do appreciate that fastparquet is open to contributions.
No definitive answer is really requested; it can be thought through later, when the question arises in a real use case, but I think it may help to identify it now. I read this article about Arrow in pandas, and it seems that Arrow may "naturally" impose itself gradually. So possibly option 1 makes more sense.
This could well mean the deprecation of this whole project, since there will be no one that doesn't have arrow, and we can't offer enough benefits. We'll see if there is any demand.
Not really, not any more. We can make metadata easier to manipulate, as you have done.
This is an interesting idea. The core.py functions actually do support a dict of numpy arrays, and variable bytes/UTF8 support is coming to numpy (https://numpy.org/neps/nep-0055-string_dtype.html). I think it would be a decent amount of effort to have writer and ParquetFile work like that, and of course existing users expecting pandas would be disappointed.
We appear to be nearing the final stages of that.
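For reference, the variable-width string support mentioned above (NEP 55) shipped in NumPy 2.0 as StringDType. A minimal sketch of what numpy-native string columns look like, assuming NumPy >= 2.0; the dict-of-arrays shape below is only a hypothetical illustration of the idea, not an existing fastparquet return type:

```python
import numpy as np
from numpy.dtypes import StringDType  # NEP 55 dtype, available in NumPy >= 2.0

# Variable-length UTF-8 strings stored natively in numpy, no object dtype needed
names = np.array(["alpha", "beta", "a much longer value"], dtype=StringDType())

# Hypothetically, a row group could be handed back as a plain dict of arrays
table = {
    "id": np.array([1, 2, 3], dtype="int64"),
    "name": names,
}
print(table["name"].dtype)      # StringDType()
print(np.strings.upper(names))  # vectorized string ufuncs from numpy.strings
```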
In dask/dask#9979, we added support for using Arrow data types when reading parquet with the pyarrow engine. I want to start a discussion on whether it makes sense to also support using Arrow data types when reading or writing the data using the fastparquet engine. I currently don't have a very good idea of why people use fastparquet over pyarrow, how often this happens, whether such a use case would be useful to some users, how large that userbase would be, and if, in dask, we should just say "if you would like arrow data types, just use pyarrow".
@martindurant Do you have any plans of supporting Arrow data types with fastparquet? Do you have any insights, suggestions, or any reasons why this would/would not be a good idea? Your point of view would be very valuable.
cc @jrbourbeau
Existing issue for Arrow strings:
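For concreteness, the Arrow-data-types behaviour referred to in the opening post looks roughly like the following. This sketch uses pandas rather than dask for brevity and assumes pandas >= 2.0 with both pyarrow and fastparquet installed; the file name is illustrative:

```python
import pandas as pd

# Write a small file, then read it back with each engine.
df = pd.DataFrame({"name": ["a", "b", None], "value": [1.5, 2.5, 3.5]})
df.to_parquet("example.parquet", engine="pyarrow")

# pyarrow engine + Arrow-backed dtypes: columns come back as ArrowDtype
arrow_backed = pd.read_parquet(
    "example.parquet", engine="pyarrow", dtype_backend="pyarrow"
)
print(arrow_backed.dtypes)  # e.g. string[pyarrow], double[pyarrow]

# fastparquet engine: columns come back as numpy-backed dtypes, not ArrowDtype
numpy_backed = pd.read_parquet("example.parquet", engine="fastparquet")
print(numpy_backed.dtypes)
```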