[Question] Support for `arrow` data types when reading data with `fastparquet` #854

j-bennet · 2023-03-14T01:18:01Z

In dask/dask#9979, we added support for using Arrow data types when reading parquet with pyarrow engine.

I want to start a discussion on whether it makes sense to also support using Arrow data types when reading or writing the data using the fastparquet engine. I currently don't have a very good idea of why people use fastparquet over pyarrow, how often this happens, whether such a use case would be useful to some users, how large would that userbase be, and if in dask, we should just say "if you would like arrow data types, just use pyarrow".

@martindurant Do you have any plans of supporting Arrow data types with fastparquet? Do you have any insights, suggestions, or any reasons why this would/would not be a good idea? Your point of view would be very valuable.

cc @jrbourbeau

Existing issue for Arrow strings:

arrow string type?? #640

The text was updated successfully, but these errors were encountered:

yohplala · 2023-03-14T08:43:03Z

Hi @j-bennet ,
@martindurant will obviously be of more help, but meanwhile, I understand that:

while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?
the struct Arrow string, known to be lighter than python string, is already spotted in issue arrow string type?? #640

Bests,

martindurant · 2023-03-14T13:18:03Z

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

As @yohplala points out, generic (nested, variable-length) arrow support is possible but complex and I don't think on anyone's map - especially since the main use case is via pandas, which has nothing useful it can do with those types. If we were to do this, it would be more likely for awkward, which has very similar layout in memory, but significantly better API to deal with. (awkward also has some more flexibility, allowing start/stop indexing into a buffer rather than just offsets, for instance)

As for strings, there is indeed a bigger possible improvement to be made, and pandas is already much better integrated with the arrow backend (for string->string and string->simple type operations). Nevertheless, I don't think we're going to work on this and a possible/optional arrow dependency without significant appetite from the community.

j-bennet · 2023-03-14T16:39:33Z

@yohplala

while you refer in a generic manner to Arrow data types, the PR you point to refers specifically to Arrow string, right?

Yes, strings being the most valuable use case, we wanted to make sure we can support them in dask as soon as possible. But eventually, we'll want to support other arrow-backed data types.

the struct Arrow string, known to be lighter than python string, is already spotted in issue arrow string type?? #640

Thank you @yohplala, I don't know how I missed that one - probably searched for pyarrow vs arrow.

yohplala · 2024-02-09T21:48:54Z

Hello @martindurant
I hope you are well. It is sometime I have not come here around.
You mentioned previously:

It is true that we don't know our userbase very well, but I think it is fair to say that most people use fastparquet over arrow because they don't want to install both. There remain a few features that we have and they don't, but not many.

Martin, you may have spotted in the last pandas 2.2 release notes this comment about PyArrow.

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

I am raising this as maybe it could have an impact on what direction may follow fastparquet, if future developments are intended. I have unfortunately no time to contribute so far. But I do appreciate that fastparquet is open to contributions.
For next evolutions/contributions possibly related/interacting with arrow, maybe there are some "general direction" to be identified? For instance:

1/ fastparquet could be favored over arrow because it is "more" open and implements some specificities not in PyArrow? But it could rely on arrow as well as pandas does? (or on arrow through pandas)
2/ or fastparquet could keep its specificity to do without arrow, but then if pandas is using it even more, should fastparquet try to favor full numpy use without pandas, and then offer numpy to pandas conversion only when pandas is installed?

No definitive answer is really requested, it may be thought later on when the question arises on a real use case. But I am thinking it may help to identify it.

I could read this article about arrow in pandas, and it seems like arrow may "naturally" impose itself gradually. So possibly option 1 makes more sense.

martindurant · 2024-02-12T21:14:07Z

PyArrow will become a required dependency with pandas 3.0 to accommodate this change.

This could well mean the deprecation of this whole project, since there will be no one that doesn't have arrow, and we can't offer enough benefits. We'll see if there is any demand.

implements some specificities not in PyArrow

Note really, not any more. We can make metadata easier to manipulate, as you have done.

should fastparquet try to favor full numpy use without pandas

This is an interesting idea. The core.py functions actually do support a dict of numpy arrays, and variable bytes/UTF8 support is coming to numpy https://numpy.org/neps/nep-0055-string_dtype.html ). I think it would be a decent amount of effort to have writer and ParquetFile work like that, and of course existing users expecting pandas would be disappointed.

arrow may "naturally" impose itself gradually

We appear to be nearing the final stages of that.

j-bennet changed the title ~~[Question] Support for pyarrow data types when reading data with fastparquet~~ [Question] Support for arrow data types when reading data with fastparquet Mar 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] Support for `arrow` data types when reading data with `fastparquet` #854

[Question] Support for `arrow` data types when reading data with `fastparquet` #854

j-bennet commented Mar 14, 2023 •

edited

Loading

yohplala commented Mar 14, 2023 •

edited

Loading

martindurant commented Mar 14, 2023

j-bennet commented Mar 14, 2023

yohplala commented Feb 9, 2024

martindurant commented Feb 12, 2024

[Question] Support for arrow data types when reading data with fastparquet #854

[Question] Support for arrow data types when reading data with fastparquet #854

Comments

j-bennet commented Mar 14, 2023 • edited Loading

yohplala commented Mar 14, 2023 • edited Loading

martindurant commented Mar 14, 2023

j-bennet commented Mar 14, 2023

yohplala commented Feb 9, 2024

martindurant commented Feb 12, 2024

[Question] Support for `arrow` data types when reading data with `fastparquet` #854

[Question] Support for `arrow` data types when reading data with `fastparquet` #854

j-bennet commented Mar 14, 2023 •

edited

Loading

yohplala commented Mar 14, 2023 •

edited

Loading