Use orjson for both json encoder and decoder? #798

Open
GianlucaFicarelli opened this issue Aug 12, 2022 · 2 comments

Comments

@GianlucaFicarelli

fastparquet can currently use any of the orjson, ujson, rapidjson, or json modules for decoding (loading), whichever is available first, according to:

    for lib in ['orjson', 'ujson', 'rapidjson', 'json']:

  1. However, this isn't publicly documented, so I wonder whether it's possible to rely on it, or whether it can change without notice.
  2. I also wonder if the same libraries can be used for encoding (dumping), as mentioned in the writer (see the sketch after this list):

     # TODO: avoid list, use better JSON
     out = np.array([json.dumps(x).encode('utf8') for x in data], dtype="O")

  3. Is there any known issue or incompatibility when any of these libraries is used with fastparquet?
  4. To simplify the code and the tests, could it be enough to choose one of the libraries and add it as a fixed dependency?
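For illustration, a minimal sketch of what a shared helper covering both directions might look like, keeping the same fallback order; the names here (_find_json_lib, json_decode, json_encode) are hypothetical, not fastparquet's actual API:

    import importlib

    def _find_json_lib():
        # Same preference order as the current decoder fallback
        for name in ["orjson", "ujson", "rapidjson", "json"]:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue

    _lib = _find_json_lib()

    def json_decode(data):
        # All four libraries accept str or bytes in loads()
        return _lib.loads(data)

    def json_encode(obj):
        # orjson.dumps returns bytes; ujson, rapidjson and json return str
        out = _lib.dumps(obj)
        return out if isinstance(out, bytes) else out.encode("utf8")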

I could provide a PR if useful.

In the use case I'm considering, a column contains short lists of floats of variable length, and orjson is the fastest of the libraries listed above.
Loading a column as a nested array using Dremel encoding is even faster, but to my understanding fastparquet doesn't support writing it. There was an attempt to support writing nested objects in #272, but it seems it wasn't completed.
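For context, a tiny round-trip of the kind of column described (short, variable-length float lists) using orjson directly; this only illustrates the data shape and the bytes-vs-str difference, it is not fastparquet code:

    import numpy as np
    import orjson

    # A column of short float lists with variable length, one JSON document per row
    data = [[1.0, 2.5], [0.1, 0.2, 0.3], [4.2]]

    # orjson.dumps already returns bytes, so the .encode('utf8') step is unnecessary
    encoded = np.array([orjson.dumps(x) for x in data], dtype="O")

    # orjson.loads accepts bytes directly
    decoded = [orjson.loads(b) for b in encoded]
    assert decoded == data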

@martindurant
Member

  • I certainly think that the same JSON encoder/decoder should be used throughout the library, and it is an oversight that the indicated line in the writer does not use it.
  • I would be surprised if the same choice were the fastest for all types of input, but any of the listed libraries would be faster than the Python built-in one.
  • The most commonly used library appears (to me) to be ujson.

In short, I would approve of either sticking with the current logic for picking the library (and documenting it) and using it everywhere, or going with ujson.

@GianlucaFicarelli
Author

I created #805, keeping the same fallback logic based on the availability of the libraries (orjson, ujson, rapidjson, json).

An alternative could be to let the user decide which library to use via an environment variable (something like FASTPARQUET_JSON_ENGINE=ujson, for example); this could be implemented if it's considered a better approach.
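As a rough sketch, such an override could sit on top of the availability-based fallback like this (FASTPARQUET_JSON_ENGINE is the hypothetical variable suggested above, not an existing setting):

    import importlib
    import os

    def _select_json_lib():
        # Honour an explicit user choice first, then fall back by availability
        preferred = os.environ.get("FASTPARQUET_JSON_ENGINE")
        candidates = ([preferred] if preferred else []) + ["orjson", "ujson", "rapidjson", "json"]
        for name in candidates:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue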
