Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ran latest thrift. #18

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jorgecarleitao
Copy link

This runs the latest thrift against the existing parquet.thrift format.

It seems that the generation changed in backward incompatible ways. :/

@sunchao
Copy link
Owner

sunchao commented Aug 1, 2021

cc @nevi-me . LGTM but curious what improvements we can get from the version bump. Because of the incompatibility, we'll need to bump up the major version number.

@jorgecarleitao
Copy link
Author

Meanwhile I found this discussion here, which may cause some of the changes to be reverted. So, we may hold this back until that is clarified. :/

@nevi-me
Copy link
Collaborator

nevi-me commented Aug 15, 2021

Hi @jorgecarleitao and @sunchao, I've been thinking about this wrt versioning.

The only 3 crates that use parquet-format-rs are under our control or within our influence https://crates.io/crates/parquet-format/reverse_dependencies (parquet-rs, parquet2, and deltalake). All 3 projects are currently targeting v2.6.0 of the format, so perhaps we could generate updated code from the 2.6.0 version, and release that as v5.0.0 of this crate.

@jhorstmann
Copy link

Hi @jorgecarleitao and @sunchao, does this PR also need to update the thrift dependency to the latest version (0.15.0) or should I open a separate issue for the update? I don't know the exact changes between those versions, my motivation is only that a security scanning tool is complaining about a CVE in the thrift rpc server code.

@tustvold
Copy link

tustvold commented Jul 5, 2022

Thrift 0.16 has now been released, I wonder if we might be able unblock this somehow? Could we update thrift, regenerate and make a new major release? Currently we're stuck on a rather old version of thrift, which is unfortunate

@sunchao
Copy link
Owner

sunchao commented Jul 5, 2022

Sure, let me take a stab on this today or tomorrow.

@alamb
Copy link

alamb commented Aug 18, 2022

It appears that @jorgecarleitao made this one https://crates.io/crates/parquet-format-safe for parquet2, for what it is worth

@alamb
Copy link

alamb commented Aug 18, 2022

@jorgecarleitao
Copy link
Author

Yeap, proposed some changes to thrift to support async page readers, had limited engagement for some time and forked thrift to support it.

parquet-format-safe essentially contains (i.e. inside the crate)

  • thrift compact protocol (the only relevant one for parquet), fixed to not OOM nor panic on read
  • an async implementation of the protocol
  • the varint (and async) code required by thrift (unsafe free)
  • depends on a fork of the thrift compiler fixed to not generate code that panics and/or OOM.

The main benefits are:

  • it has no unsafe (in any transitive dependency)
  • so far we have been unable to trigger a panic or OOM on it from fuzzy generators
  • supports async (which AFAIK is required for async page readers without page indexes)

@sunchao
Copy link
Owner

sunchao commented Aug 18, 2022

Ah, thanks, not aware of the async and unsafe issues from Thrift.

@jorgecarleitao it'd be nice if we can merge these back to the official Thrift repo, do you have JIRAs or open PRs tracking these?

@sunchao
Copy link
Owner

sunchao commented Aug 18, 2022

I found this one: apache/thrift#2426. It has been stale for sometime ..

@tustvold
Copy link

tustvold commented Aug 18, 2022

fixed to not OOM nor panic on read

Could you perhaps expand a bit on how this could occur? I'm guessing it is pre-allocating buffer sizes based on their encoded lengths?

which AFAIK is required for async page readers without page indexes

FWIW parquet supports async without needing this, it is largely based on the observation that if the fetch latencies are high enough to warrant async, e.g. object store, you want to prefetch the entire column chunk or pages from that column chunk in one shot.

forked thrift to support it.

My 2 cents is that moving away from a standard ecosystem implementation requires a pretty compelling motivator, I'm not sure that bar is met in this case...

I wonder if we might be able unblock this somehow

Regarding the primary topic of this PR, I'm not sure if donating parquet-format to arrow-rs might be something on the cards, as it might help provide more hands to keep this moving forward? It might also speed up validation. As the code is largely auto-generated I'm not sure it would even need to go through the donation process??

@alamb
Copy link

alamb commented Aug 18, 2022

Ah, I see now, it seems like https://crates.io/crates/parquet-format-safe is a hand modified version of an auto generated file (maybe also created by a fork of the compiler). I didn't realize it had so much by hand editing.

I agree that taking the same approach in arrow-rs would be hard to justify

As the code is largely auto-generated I'm not sure it would even need to go through the donation process??

I agree think checking in the output of a code generator does not required a donation process

@sunchao
Copy link
Owner

sunchao commented Aug 18, 2022

Regarding the primary topic of this PR, I'm not sure if donating parquet-format to arrow-rs might be something on the cards, as it might help provide more hands to keep this moving forward?

I'm more than happy to donate the repo. In addition, I can add you folks to the admin list of this repo if you need.

@tustvold
Copy link

I'm more than happy to donate the repo

I created a ticket proposing this here, please do voice any thoughts, concerns, objections, etc...

@jorgecarleitao
Copy link
Author

Ah, I see now, it seems like https://crates.io/crates/parquet-format-safe is a hand modified version of an auto generated file (maybe also created by a fork of the compiler). I didn't realize it had so much by hand editing.

There is no hand-editing, I had to modify the thrift implementation, the C compiler (this by hand, yes ^^) and re-generate it. I.e. if you clone the forked compiler and run the generate....sh, it will produce the same file as the checkout .rs.

Could you perhaps expand a bit on how this could occur? I'm guessing it is pre-allocating buffer sizes based on their encoded lengths?

OOM and panics: yes - cases like getting Vec<u8> in thrift are done with [size as varint][values], which an attacker can abuse since size can be arbitrarily large and unchecked (and it currently reserves it).

FWIW parquet supports async without needing this, it is largely based on the observation that if the fetch latencies are high enough to warrant async, e.g. object store, you want to prefetch the entire column chunk or pages from that column chunk in one shot.

Sure, and we also support that strategy in parquet2/arrow2, but not all files have page indexes or users that want to load whole column chunks (specially in very wide tables).

My 2 cents is that moving away from a standard ecosystem implementation requires a pretty compelling motivator, I'm not sure that bar is met in this case...

Imo that is subjective. It all depends on the value that we give to the different aspects of the software. I value Rust's premise of correct use of unsafe and security more than an "official" stamp.

Fwiw arrow2 is also not using the official Google's flatbuffer implementation and instead a less known, easier to use crate, also because there is some unsoundness and panics around (and the design makes it difficult to fix).

Ah, thanks, not aware of the async and unsafe issues from Thrift.

The unsafe is not very important to us in this case as we do not hit it, it comes from dermesser/integer-encoding-rs#29, the main benefit for me is that it makes parquet2 unsafe free when compression is off.

@alamb
Copy link

alamb commented Aug 19, 2022

There is no hand-editing, I had to modify the thrift implementation,

Thank you for the clarification -- I think I was confused by commits such as jorgecarleitao/parquet-format-safe@7e05c29

I see now there is a correspondence between changes to the thrift compiler fork https://github.com/jorgecarleitao/thrift/commits/safe and the changes in rust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants