[Question] Support for arrow data types when reading data with fastparquet #854

Open
j-bennet opened this issue Mar 14, 2023 · 5 comments

j-bennet commented Mar 14, 2023

In dask/dask#9979, we added support for using Arrow data types when reading parquet with pyarrow engine.

I want to start a discussion on whether it makes sense to also support Arrow data types when reading or writing data with the fastparquet engine. I currently don't have a good idea of why people use fastparquet over pyarrow, how often this happens, whether Arrow support would be useful to some users, how large that userbase would be, or whether, in dask, we should just say "if you would like arrow data types, just use pyarrow".

@martindurant Do you have any plans of supporting Arrow data types with fastparquet? Do you have any insights, suggestions, or any reasons why this would/would not be a good idea? Your point of view would be very valuable.

cc @jrbourbeau

Existing issue for Arrow strings:

yohplala commented Mar 14, 2023

Hi @j-bennet,
@martindurant will obviously be of more help, but in the meantime, my understanding is:

  • while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?
  • the Arrow string type, known to be lighter than the Python string, was already raised in issue "arrow string type??" #640

Best,

@martindurant
Member

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

As @yohplala points out, generic (nested, variable-length) arrow support is possible but complex, and I don't think it is on anyone's roadmap, especially since the main use case is via pandas, which has nothing useful it can do with those types. If we were to do this, it would more likely be for awkward, which has a very similar layout in memory but a significantly better API to deal with. (awkward also has some more flexibility, allowing start/stop indexing into a buffer rather than just offsets, for instance.)
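To make the offsets versus start/stop distinction concrete, here is a minimal sketch. It is an illustration only: plain Python lists stand in for the real buffers, and the helper names `from_offsets` and `from_starts_stops` are hypothetical, not part of arrow, awkward, or fastparquet.

```python
# Illustrative only: plain Python lists stand in for Arrow/awkward buffers.

def from_offsets(values, offsets):
    """Offsets layout: row i spans values[offsets[i]:offsets[i + 1]]."""
    return [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def from_starts_stops(values, starts, stops):
    """Start/stop layout: row i spans values[starts[i]:stops[i]]."""
    return [values[a:b] for a, b in zip(starts, stops)]


values = [10, 20, 30, 40, 50, 60]

# With offsets, rows are contiguous and in order: N rows need N + 1 offsets.
rows_a = from_offsets(values, [0, 2, 2, 6])

# With starts/stops, rows can skip, reorder, or overlap regions of the buffer.
rows_b = from_starts_stops(values, starts=[3, 0, 1], stops=[6, 2, 3])
```

The second form can describe a reordered or filtered view without copying the underlying values, which is the extra flexibility mentioned above.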

As for strings, there is indeed a bigger possible improvement to be made, and pandas is already much better integrated with the arrow backend (for string->string and string->simple type operations). Nevertheless, I don't think we're going to work on this and a possible/optional arrow dependency without significant appetite from the community.
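As a rough illustration of why a packed string column is lighter than a column of Python string objects, the sketch below compares the two using only the standard library. It assumes CPython (for `sys.getsizeof`) and mimics the general idea of one UTF-8 data buffer plus an offsets array; it does not use arrow itself, and the `get` helper is hypothetical.

```python
import sys
from array import array

# 1000 distinct strings, so every row is its own Python object.
words = [f"row-{i}" for i in range(1000)]

# Object-style column: a list of pointers plus one full str object per row.
object_bytes = sys.getsizeof(words) + sum(sys.getsizeof(w) for w in words)

# Packed column: one contiguous UTF-8 buffer plus N + 1 integer offsets.
encoded = [w.encode("utf-8") for w in words]
data = b"".join(encoded)
offsets = array("i", [0])
for chunk in encoded:
    offsets.append(offsets[-1] + len(chunk))
packed_bytes = len(data) + offsets.itemsize * len(offsets)


def get(i):
    """Decode row i by slicing the shared buffer between two offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


print(f"objects: {object_bytes} bytes, packed: {packed_bytes} bytes")
```

The exact numbers depend on the interpreter and platform, but the per-object overhead of Python strings dominates for short values.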

@j-bennet changed the title from "[Question] Support for pyarrow data types when reading data with fastparquet" to "[Question] Support for arrow data types when reading data with fastparquet" on Mar 14, 2023
@j-bennet
Author

@yohplala

  • while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?

Yes, strings being the most valuable use case, we wanted to make sure we can support them in dask as soon as possible. But eventually, we'll want to support other arrow-backed data types.

Thank you @yohplala, I don't know how I missed that one - probably searched for pyarrow vs arrow.

yohplala commented Feb 9, 2024

Hello @martindurant,
I hope you are well. It has been some time since I was last around here.
You mentioned previously:

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

Martin, you may have spotted this comment about PyArrow in the pandas 2.2 release notes:

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

I am raising this because it could have an impact on the direction fastparquet follows, if future developments are intended. Unfortunately, I have no time to contribute at the moment, but I do appreciate that fastparquet is open to contributions.
For future evolutions or contributions that may relate to or interact with arrow, maybe there is some "general direction" to be identified? For instance:

  • 1/ fastparquet could be favored over arrow because it is "more" open and implements some specificities not in PyArrow? But it could also rely on arrow, as pandas does (or on arrow through pandas)?
  • 2/ or fastparquet could keep its specificity of doing without arrow; but then, if pandas relies on arrow even more, should fastparquet try to favor full numpy use without pandas, and offer numpy-to-pandas conversion only when pandas is installed?

No definitive answer is really requested; this can be thought through later, when the question arises in a real use case. But I think it may help to identify a direction.

I read this article about arrow in pandas, and it seems like arrow may "naturally" impose itself gradually. So option 1 possibly makes more sense.

@martindurant
Member

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

This could well mean the deprecation of this whole project, since there will be no one that doesn't have arrow, and we can't offer enough benefits. We'll see if there is any demand.

implements some specificities not in PyArrow

Not really, not any more. We can make metadata easier to manipulate, as you have done.

should fastparquet try to favor full numpy use without pandas

This is an interesting idea. The core.py functions actually do support a dict of numpy arrays, and variable bytes/UTF8 support is coming to numpy (https://numpy.org/neps/nep-0055-string_dtype.html). I think it would be a decent amount of effort to have writer and ParquetFile work like that, and of course existing users expecting pandas would be disappointed.
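For illustration, the "columns first, pandas only when installed" idea could look roughly like the sketch below. This is not fastparquet's API: `read_columns` and `to_frame_if_possible` are hypothetical names, and plain lists stand in for numpy arrays so the sketch has no dependencies.

```python
import importlib.util


def read_columns():
    """Hypothetical reader: returns one sequence per column name."""
    return {"id": [1, 2, 3], "name": ["a", "b", "c"]}


def to_frame_if_possible(columns):
    """Return a pandas DataFrame when pandas is installed, else the dict unchanged."""
    if importlib.util.find_spec("pandas") is None:
        return columns
    import pandas as pd  # imported lazily so pandas stays optional

    return pd.DataFrame(columns)


result = to_frame_if_possible(read_columns())

# Works for both return types: dict keys or DataFrame column labels.
column_names = list(result.keys())
```

The design choice here is that the column dict is the primary result and the DataFrame is a convenience layer, so callers without pandas still get usable data.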

arrow may "naturally" impose itself gradually

We appear to be nearing the final stages of that.
