Is there any known issue or incompatibility when any of these libraries is used with fastparquet?
To simplify the code and the tests, would it be enough to choose one of the libraries and add it as a fixed dependency?
I could provide a PR if useful.
In the use case I'm considering, a column contains short lists of floats of variable length, and orjson is the fastest of the list.
Loading a column as a nested array using Dremel encoding is even faster, but from my understanding fastparquet doesn't support writing it. There was an attempt to support writing nested objects in #272, but it seems it wasn't completed.
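For anyone wanting to check the speed claim on their own data, here is a rough timing sketch. It is not taken from fastparquet: the synthetic column (10,000 JSON-encoded lists of 1 to 8 floats) is made up for illustration, and only the libraries that happen to be installed are measured, so results will vary by machine and library version.

```python
import importlib
import json
import random
import timeit

random.seed(0)

# Synthetic stand-in for the column described above (an assumption, not real data):
# each row is a JSON-encoded list of 1 to 8 floats.
rows = [
    json.dumps([random.random() for _ in range(random.randint(1, 8))])
    for _ in range(10_000)
]

timings = {}
for name in ("orjson", "ujson", "rapidjson", "json"):
    try:
        loads = importlib.import_module(name).loads
    except ImportError:
        continue  # library not installed, skip it
    # Decode every row, repeated 5 times, and record the total wall time.
    timings[name] = timeit.timeit(lambda: [loads(r) for r in rows], number=5)

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds:.3f} s")
```

Only decoding is timed here; encoding would need a separate measurement.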
I certainly think that the same JSON encoder/decoder should be used throughout the library, and it is an oversight that the indicated line in the writer does not do so.
I would be surprised if the same choice were the fastest for all types of input, but any library in the list would be faster than the Python builtin one.
The most commonly used library appears (to me) to be ujson.
In short, I would approve of either sticking with the current logic for picking the library (and documenting it) and using it everywhere, or going with ujson.
I created #805 keeping the same fallback logic based on the availability of the libraries (orjson, ujson, rapidjson, json).
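As a rough illustration of that kind of fallback (not the actual code in #805 or in fastparquet), the sketch below tries each module in the order listed and exposes one `dumps`/`loads` pair for both reading and writing. The helper name `pick_json_codec` and the choice to normalise encoder output to bytes are assumptions made for this example.

```python
import importlib

# Preference order taken from the discussion above.
CANDIDATES = ("orjson", "ujson", "rapidjson", "json")


def pick_json_codec(candidates=CANDIDATES):
    """Return (name, dumps, loads) for the first importable module.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate

        def dumps(obj, _dumps=module.dumps):
            out = _dumps(obj)
            # orjson returns bytes, the other modules return str;
            # normalise to UTF-8 bytes so callers see one type.
            return out if isinstance(out, bytes) else out.encode("utf-8")

        def loads(data, _loads=module.loads):
            # Every candidate accepts str, so decode bytes first.
            if isinstance(data, (bytes, bytearray)):
                data = bytes(data).decode("utf-8")
            return _loads(data)

        return name, dumps, loads
    raise ImportError("none of the candidate JSON modules could be imported")


name, json_dumps, json_loads = pick_json_codec()
encoded = json_dumps({"values": [1.5, 2.25, 3.0]})
decoded = json_loads(encoded)
```

Resolving the pair once and importing it from a single place is one way to make sure the reader and the writer cannot drift apart again.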
An alternative would be to let the user choose the library through an environment variable (for example, FASTPARQUET_JSON_ENGINE=ujson). I can implement that if it is considered a better approach.
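A minimal sketch of how such an override could work, assuming the variable name FASTPARQUET_JSON_ENGINE proposed above. Nothing here exists in fastparquet; `resolve_json_engine` is a hypothetical helper, and falling back to the availability-based order when the variable is unset is one possible design among several.

```python
import importlib
import os

SUPPORTED = ("orjson", "ujson", "rapidjson", "json")


def resolve_json_engine(environ=None):
    """Return the JSON module to use (hypothetical helper for illustration).

    If FASTPARQUET_JSON_ENGINE is set, that module is required;
    otherwise the first importable module in SUPPORTED is used.
    """
    environ = os.environ if environ is None else environ
    requested = environ.get("FASTPARQUET_JSON_ENGINE")
    if requested is not None:
        if requested not in SUPPORTED:
            raise ValueError(
                f"unsupported JSON engine {requested!r}; expected one of {SUPPORTED}"
            )
        # An explicit request fails loudly (ImportError) if the module is missing.
        return importlib.import_module(requested)
    for name in SUPPORTED:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of the supported JSON modules could be imported")


engine = resolve_json_engine({"FASTPARQUET_JSON_ENGINE": "json"})
print(engine.__name__)
```

Raising on an explicit but unavailable engine, rather than silently falling back, keeps the user's choice from being ignored without notice.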
fastparquet can currently use any of the orjson, ujson, rapidjson, json modules for decoding (loading) if available, according to fastparquet/fastparquet/util.py (line 499 in 34069fe).
Referenced code: fastparquet/fastparquet/writer.py, lines 275 to 277 in 34069fe.