Replies: 2 comments
-
Hi @Xuanwo - this is super exciting indeed. I was looking forward to reaching out to the iceberg-rust community about working on Python bindings for a different reason (to support Bucket Transforms on an Arrow Array), so I'd be very excited to help out with setting up the packaging for the Python bindings when we are ready to start exposing some functions. One of the bottlenecks I have seen in PyIceberg is how much memory we need to represent the table as a fully materialized Arrow table, even when reading relatively small row groups of Parquet files. So I think it would be best to first build an API that returns a RecordBatchReader, and then think about the best way to parallelize work on top of that API for the cases where we don't want to lazily load those RecordBatches in sequence and instead want to materialize an Arrow Table right away.
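To illustrate the distinction, here is a minimal standalone pyarrow sketch (not PyIceberg's actual API; the `batches` generator is just a stand-in for reading Parquet row groups):

```python
import pyarrow as pa

# Illustrative stand-in for reading Parquet row groups one at a time.
def batches():
    for i in range(3):
        yield pa.RecordBatch.from_pydict({"id": list(range(i * 4, (i + 1) * 4))})

schema = pa.schema([("id", pa.int64())])

# Streaming path: batches are pulled lazily, so peak memory stays around one batch.
reader = pa.RecordBatchReader.from_batches(schema, batches())
for batch in reader:
    print(batch.num_rows)

# Materializing path: the whole table is built in memory up front.
table = pa.RecordBatchReader.from_batches(schema, batches()).read_all()
print(table.num_rows)
```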
-
Thanks for starting this discussion! +1 on having pluggable file IO. Users can pick and choose based on their constraints. My gut feeling tells me that the current pyiceberg codebase is tightly coupled with pyarrow on read and write. It'll be great to add another FileIO implementation to standardize the interface.
-
Motivation
Many PyIceberg users need to run the software in resource-limited environments like AWS Lambda. They sometimes complain that `pyarrow` is too large. For instance, `pyarrow-17.0.0` for `linux x86_64` needs 40 MiB to download and 100 MiB on disk. The `pyiceberg` library utilizes `fsspec` for assistance; however, since `fsspec` lacks support for `arrow`, users might still require `pyarrow` to handle arrow-related work. So I propose to use iceberg-rust as the pyiceberg file IO, which can be both fast and small.
Benefits
It's a significant loss that our community cannot benefit from the existing iceberg-java implementations. We have to build many things from scratch. However, thanks to Rust's excellent interoperability, we can address this issue.
By incorporating parts of iceberg-rust into pyiceberg, we can evolve the community together and ultimately power pyiceberg with a Rust core.
For Fast:
I apologize for saying this without having conducted any benchmarks yet, but we can imagine a pyiceberg core free of the GIL and Python runtime cost. We can revisit this part after we've actually built it.
For Small:
Someone has built arro3, a Python binding for `arrow-rs`. It only needs 1 MiB on disk, and no numpy!
Plan
Pyiceberg features a dynamic file IO system that enables users to implement their own solutions.
So our first step could be to plug into this file IO system. At this stage, users can use `iceberg-rust-fileio` as an alternative to `fsspec`, as sketched below. The next step is to make `pyarrow` optional too: we can provide `project_table` and the other APIs that `pyiceberg` needs, and maybe do some refactoring to allow users to provide their own `py-arrow-impl`. The details could be extended later.
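As a rough illustration of what that first step could look like for users: pyiceberg already selects a FileIO implementation via the `py-io-impl` property, so a Rust-backed FileIO could be dropped in the same way. This is only a sketch; the `iceberg_rust_fileio.RustFileIO` module path and class are hypothetical.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical: select an iceberg-rust-backed FileIO instead of fsspec/pyarrow
# via pyiceberg's existing "py-io-impl" property. The RustFileIO class name
# and module path are illustrative only.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "py-io-impl": "iceberg_rust_fileio.RustFileIO",
    },
)
```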
Questions
How does Python call Rust code?
pyo3 is a great library that is widely used in the Rust community to build Python bindings. It allows us to build interoperable, zero-copy Python bindings easily.
Here is a quick example.
In Rust, we write:
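A minimal sketch based on pyo3's standard module pattern (assuming pyo3 0.21+; the `string_sum` module and `sum_as_string` function names are illustrative):

```rust
use pyo3::prelude::*;

/// Formats the sum of two numbers as a string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust; the name must match the built extension.
#[pymodule]
fn string_sum(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
```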
In Python, we can call it:
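Assuming the sketch above was built and installed (for example with maturin), the call is plain Python:

```python
# Hypothetical module name from the sketch above.
import string_sum

print(string_sum.sum_as_string(5, 20))  # "25"
```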
With pyo3, we can export a Python API without extra effort.
Are you trying to rewrite pyiceberg in Rust?
No, I'm not.
pyiceberg exists, and it works well. We should not break things that work.
In the future, pyiceberg might be powered by a Rust core, but we will ensure it's implemented without any breaking changes. As outlined in the plan section, we are introducing new features to pyiceberg and offering them as optional for users to try, allowing us to gradually stabilize these additions.
I expect pyiceberg to become rusty without any visible changes to users.