Replies: 2 comments
-
Hi @Xuanwo - this is super exciting indeed. I was looking forward to reaching out to the iceberg-rust community about working on Python bindings for a different reason (to support Bucket Transforms on an Arrow Array), so I'd be very excited to help out with setting up the packaging for the Python bindings when we are ready to start exposing some functions. One of the bottlenecks I have seen in PyIceberg is how much memory we need to represent the table as a fully materialized Arrow table, even when reading relatively small row groups of Parquet files. So I think it would be best to first build an API that returns a RecordBatchReader, and then think about the best way to parallelize work on top of that API for the cases where we don't want to lazily load those RecordBatches in sequence and instead want to materialize an Arrow Table right away.
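To illustrate the distinction, here is a minimal standalone pyarrow sketch (not PyIceberg's actual API; the `batches` generator is just a stand-in for reading Parquet row groups):

```python
import pyarrow as pa

# Illustrative stand-in for reading Parquet row groups one at a time.
def batches():
    for i in range(3):
        yield pa.RecordBatch.from_pydict({"id": list(range(i * 4, (i + 1) * 4))})

schema = pa.schema([("id", pa.int64())])

# Streaming path: batches are pulled lazily, so peak memory stays around one batch.
reader = pa.RecordBatchReader.from_batches(schema, batches())
for batch in reader:
    print(batch.num_rows)

# Materializing path: the whole table is built in memory up front.
table = pa.RecordBatchReader.from_batches(schema, batches()).read_all()
print(table.num_rows)
```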
-
Thanks for starting this discussion! +1 on having pluggable file IO. Users can pick and choose based on their constraints. My gut feeling tells me that the current pyiceberg codebase is tightly coupled with pyarrow on read and write. It'll be great to add another FileIO implementation to standardize the interface.
-
Motivation
Many PyIceberg users need to run the software in resource-limited environments like AWS Lambda. They sometimes complain that `pyarrow` is too large. For instance, `pyarrow-17.0.0` for `linux x86_64` needs 40 MiB to download and 100 MiB on disk. The `pyiceberg` library utilizes `fsspec` for assistance; however, since `fsspec` lacks support for `arrow`, users might still require `pyarrow` to handle arrow-related work. So I propose to use iceberg-rust as the pyiceberg file IO, which can be both fast and small.
Benefits
It's a significant loss that our community cannot benefit from the existing iceberg-java implementations. We have to build many things from scratch. However, thanks to Rust's excellent interoperability, we can address this issue.
By incorporating parts of iceberg-rust into pyiceberg, we can evolve the community together and ultimately power pyiceberg with a Rust core.
For Fast:
I apologize for saying this without having conducted any benchmarks yet, but we can imagine a pyiceberg core free of the GIL and Python runtime cost. We can revisit this part after we've actually built it.
For Small:
Someone has built arro3, a Python binding for `arrow-rs`. It only needs 1 MiB on disk, and no numpy!
Plan
Pyiceberg features a dynamic file IO system that enables users to implement their own solutions.
So our first step could be to plug into this file IO system. At this stage, users can use `iceberg-rust-fileio` as an alternative to `fsspec`, as sketched below. The next step is to make `pyarrow` optional too: we can provide `project_table` and the other APIs that `pyiceberg` needs, and maybe do some refactoring to allow users to provide their own `py-arrow-impl`. The details could be extended later.
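As a rough illustration of what that first step could look like for users: pyiceberg already selects a FileIO implementation via the `py-io-impl` property, so a Rust-backed FileIO could be dropped in the same way. This is only a sketch; the `iceberg_rust_fileio.RustFileIO` module path and class are hypothetical.

```python
from pyiceberg.catalog import load_catalog

# Hypothetical: select an iceberg-rust-backed FileIO instead of fsspec/pyarrow
# via pyiceberg's existing "py-io-impl" property. The RustFileIO class name
# and module path are illustrative only.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",
        "py-io-impl": "iceberg_rust_fileio.RustFileIO",
    },
)
```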
Questions
How does Python call Rust code?
pyo3 is a great library that is widely used in the Rust community to build Python bindings. It allows us to build interoperable, zero-copy Python bindings easily.
Here is a quick example.
In Rust, we write:
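A minimal sketch based on pyo3's standard module pattern (assuming pyo3 0.21+; the `string_sum` module and `sum_as_string` function names are illustrative):

```rust
use pyo3::prelude::*;

/// Formats the sum of two numbers as a string.
#[pyfunction]
fn sum_as_string(a: usize, b: usize) -> PyResult<String> {
    Ok((a + b).to_string())
}

/// A Python module implemented in Rust; the name must match the built extension.
#[pymodule]
fn string_sum(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(sum_as_string, m)?)?;
    Ok(())
}
```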
In Python, we can call it:
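Assuming the sketch above was built and installed (for example with maturin), the call is plain Python:

```python
# Hypothetical module name from the sketch above.
import string_sum

print(string_sum.sum_as_string(5, 20))  # "25"
```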
With pyo3, we can export a Python API without extra effort.
Are you trying to rewrite pyiceberg in Rust?
No, I'm not.
pyiceberg exists, and it works well. We should not break things that work.
In the future, pyiceberg might be powered by a Rust core, but we will ensure it's implemented without any breaking changes. As outlined in the plan section, we are introducing new features to pyiceberg and offering them as optional for users to try, allowing us to gradually stabilize these additions.
I expect pyiceberg to become rusty without any visible changes to users.