Use orjson for both json encoder and decoder? #798

Open
GianlucaFicarelli opened this issue Aug 12, 2022 · 2 comments

Comments

@GianlucaFicarelli

fastparquet can currently use any of the orjson, ujson, rapidjson, or json modules for decoding (loading), whichever is available first, according to:

    for lib in ['orjson', 'ujson', 'rapidjson', 'json']:

  1. However, this isn't publicly documented, so I wonder whether it's possible to rely on it, or whether it can change without notice.
  2. I also wonder if the same libraries can be used for encoding (dumping), as mentioned in the writer (see the sketch after this list):

     # TODO: avoid list, use better JSON
     out = np.array([json.dumps(x).encode('utf8') for x in data], dtype="O")

  3. Is there any known issue or incompatibility when any of these libraries is used with fastparquet?
  4. To simplify the code and the tests, could it be enough to choose one of the libraries and add it as a fixed dependency?
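For illustration, a minimal sketch of what a shared helper covering both directions might look like, keeping the same fallback order; the names here (_find_json_lib, json_decode, json_encode) are hypothetical, not fastparquet's actual API:

    import importlib

    def _find_json_lib():
        # Same preference order as the current decoder fallback
        for name in ["orjson", "ujson", "rapidjson", "json"]:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue

    _lib = _find_json_lib()

    def json_decode(data):
        # All four libraries accept str or bytes in loads()
        return _lib.loads(data)

    def json_encode(obj):
        # orjson.dumps returns bytes; ujson, rapidjson and json return str
        out = _lib.dumps(obj)
        return out if isinstance(out, bytes) else out.encode("utf8")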

I could provide a PR if useful.

In the use case I'm considering, a column contains short lists of floats of variable length, and orjson is the fastest of the libraries listed above.
Loading a column as a nested array using Dremel encoding is even faster, but to my understanding fastparquet doesn't support writing it. There was an attempt to support writing nested objects in #272, but it seems it wasn't completed.
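For context, a tiny round-trip of the kind of column described (short, variable-length float lists) using orjson directly; this only illustrates the data shape and the bytes-vs-str difference, it is not fastparquet code:

    import numpy as np
    import orjson

    # A column of short float lists with variable length, one JSON document per row
    data = [[1.0, 2.5], [0.1, 0.2, 0.3], [4.2]]

    # orjson.dumps already returns bytes, so the .encode('utf8') step is unnecessary
    encoded = np.array([orjson.dumps(x) for x in data], dtype="O")

    # orjson.loads accepts bytes directly
    decoded = [orjson.loads(b) for b in encoded]
    assert decoded == data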

@martindurant
Member

  • I certainly think that the same JSON encoder/decoder should be used throughout the library, and it is an oversight that the indicated line in the writer does not use it.
  • I would be surprised if the same choice were the fastest for all types of input, but any of the listed libraries would be faster than the Python built-in one.
  • The most commonly used library appears (to me) to be ujson.

In short, I would approve of either sticking with the current logic for picking the library (and documenting it) and using it everywhere, or going with ujson.

@GianlucaFicarelli
Author

I created #805, keeping the same fallback logic based on the availability of the libraries (orjson, ujson, rapidjson, json).

An alternative could be to let the user decide which library to use via an environment variable (something like FASTPARQUET_JSON_ENGINE=ujson, for example); this could be implemented if it's considered a better approach.
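As a rough sketch, such an override could sit on top of the availability-based fallback like this (FASTPARQUET_JSON_ENGINE is the hypothetical variable suggested above, not an existing setting):

    import importlib
    import os

    def _select_json_lib():
        # Honour an explicit user choice first, then fall back by availability
        preferred = os.environ.get("FASTPARQUET_JSON_ENGINE")
        candidates = ([preferred] if preferred else []) + ["orjson", "ujson", "rapidjson", "json"]
        for name in candidates:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue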
