Ran latest thrift. #18
cc @nevi-me. LGTM, but I'm curious what improvements we get from the version bump. Because of the incompatibility, we'll need to bump the major version number. |
Meanwhile I found this discussion here, which may cause some of the changes to be reverted. So, we may hold this back until that is clarified. :/ |
Hi @jorgecarleitao and @sunchao, I've been thinking about this wrt versioning. The only 3 crates that use parquet-format-rs are under our control or within our influence: https://crates.io/crates/parquet-format/reverse_dependencies (parquet-rs, parquet2, and deltalake). All 3 projects are currently targeting v2.6.0 of the format, so perhaps we could generate updated code from the 2.6.0 version and release that as v5.0.0 of this crate. |
Hi @jorgecarleitao and @sunchao, does this PR also need to update the thrift dependency to the latest version (0.15.0), or should I open a separate issue for the update? I don't know the exact changes between those versions; my motivation is only that a security scanning tool is complaining about a CVE in the thrift RPC server code. |
Thrift 0.16 has now been released. I wonder if we might be able to unblock this somehow? Could we update thrift, regenerate, and make a new major release? Currently we're stuck on a rather old version of thrift, which is unfortunate. |
Sure, let me take a stab at this today or tomorrow. |
It appears that @jorgecarleitao made this one https://crates.io/crates/parquet-format-safe for parquet2, for what it is worth |
And the incantation is: https://github.com/jorgecarleitao/parquet-format-safe/blob/main/generate_parquet_format.sh |
Yeap, proposed some changes to thrift to support
The main benefits are:
|
Ah, thanks, not aware of the @jorgecarleitao it'd be nice if we can merge these back to the official Thrift repo, do you have JIRAs or open PRs tracking these? |
I found this one: apache/thrift#2426. It has been stale for some time... |
Could you perhaps expand a bit on how this could occur? I'm guessing it is pre-allocating buffer sizes based on their encoded lengths?
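To make the question concrete, here is a minimal sketch of the failure mode being guessed at. It is an illustration under stated assumptions, not the actual thrift or parquet-format code: it uses a simplified 4-byte big-endian length prefix rather than the compact protocol's varint encoding, and the function names and the `MAX_FIELD_BYTES` cap are hypothetical.

```rust
use std::io::{self, Read};

/// Hypothetical cap on one length-prefixed field; not a value taken from thrift or parquet.
const MAX_FIELD_BYTES: usize = 16 * 1024 * 1024;

/// Risky pattern: trusts the declared length and allocates it up front.
/// A corrupt or malicious prefix can request gigabytes before any payload is read.
fn read_bytes_trusting<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut buf = vec![0u8; len]; // allocation size is controlled by the input
    reader.read_exact(&mut buf)?;
    Ok(buf)
}

/// Safer pattern: reject lengths above a cap and grow the buffer only as data arrives.
fn read_bytes_bounded<R: Read>(reader: &mut R, max: usize) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > max {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "declared length exceeds limit",
        ));
    }
    let mut buf = Vec::new();
    // `take` bounds how much is read; `read_to_end` grows `buf` incrementally.
    let read = reader.by_ref().take(len as u64).read_to_end(&mut buf)?;
    if read != len {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "payload shorter than declared length",
        ));
    }
    Ok(buf)
}
```

With the bounded variant, a prefix of `0xFFFFFFFF` is rejected with `InvalidData` instead of triggering a 4 GiB allocation, and a truncated payload surfaces as `UnexpectedEof`.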
FWIW
My 2 cents is that moving away from a standard ecosystem implementation requires a pretty compelling motivator, I'm not sure that bar is met in this case...
Regarding the primary topic of this PR, I'm not sure if donating parquet-format to arrow-rs might be something on the cards, as it might help provide more hands to keep this moving forward? It might also speed up validation. As the code is largely auto-generated I'm not sure it would even need to go through the donation process?? |
Ah, I see now, it seems like https://crates.io/crates/parquet-format-safe is a hand-modified version of an auto-generated file (maybe also created by a fork of the compiler). I didn't realize it had so much hand editing. I agree that taking the same approach in arrow-rs would be hard to justify.
I agree; I think checking in the output of a code generator does not require a donation process. |
I'm more than happy to donate the repo. In addition, I can add you folks to the admin list of this repo if you need. |
I created a ticket proposing this here, please do voice any thoughts, concerns, objections, etc... |
There is no hand-editing, I had to modify the thrift implementation, the C compiler (this by hand, yes ^^) and re-generate it. I.e. if you clone the forked compiler and run the
OOM and panics: yes - cases like getting
Sure, and we also support that strategy in parquet2/arrow2, but not all files have page indexes or users that want to load whole column chunks (especially in very wide tables).
Imo that is subjective. It all depends on the value that we give to the different aspects of the software. I value Rust's premise of correct use of Fwiw arrow2 is also not using Google's official flatbuffers implementation but instead a lesser-known, easier-to-use crate, also because there is some unsoundness and panics around (and the design makes it difficult to fix).
The |
Thank you for the clarification -- I think I was confused by commits such as jorgecarleitao/parquet-format-safe@7e05c29. I see now there is a correspondence between changes to the thrift compiler fork https://github.com/jorgecarleitao/thrift/commits/safe and the changes in Rust. |
This runs the latest thrift against the existing parquet.thrift format. It seems that the generation changed in backward incompatible ways. :/