Update delta-kernel to at least 0.4.0 to leverage a lazy scan.execute for large tables #602

Open
BdeUtra opened this issue Oct 31, 2024 · 0 comments

BdeUtra commented Oct 31, 2024

Currently, delta-kernel loads the entire table into memory, which causes all sorts of problems with large tables.
As of delta-kernel 0.4.0, scan.execute is lazy and only loads data as the iterator is consumed.

I have a fork of your Python library in which I don't load the entire dataset into a pyarrow.Table but instead process each RecordBatch separately, for memory efficiency. This is currently pointless, because the supported delta-kernel 0.2.x version loads everything eagerly.
Supporting 0.4.x would open up a lot of possibilities for large data processing.
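To illustrate the difference I mean, here is a minimal pure-Python sketch. It does not use the delta-kernel or pyarrow APIs: `read_batches` is a hypothetical stand-in for a lazy scan, a plain list of ints stands in for a RecordBatch, and `log` only exists to show when a batch is actually read.

```python
from typing import Iterator, List

# Hypothetical stand-in for a pyarrow.RecordBatch: just a list of ints.
Batch = List[int]


def read_batches(num_batches: int, rows_per_batch: int, log: List[int]) -> Iterator[Batch]:
    """Hypothetical lazy scan: reads one batch at a time, only when asked."""
    for i in range(num_batches):
        log.append(i)  # record that batch i was actually read
        start = i * rows_per_batch
        yield list(range(start, start + rows_per_batch))


def eager_total(num_batches: int, rows_per_batch: int, log: List[int]) -> int:
    """Eager style: every batch is held in memory before any work starts."""
    batches = list(read_batches(num_batches, rows_per_batch, log))
    return sum(sum(batch) for batch in batches)


def streaming_total(num_batches: int, rows_per_batch: int, log: List[int]) -> int:
    """Streaming style: only one batch is alive at any point in the loop."""
    total = 0
    for batch in read_batches(num_batches, rows_per_batch, log):
        total += sum(batch)
    return total


def first_batch_only(num_batches: int, rows_per_batch: int, log: List[int]) -> Batch:
    """With a lazy source, stopping early means later batches are never read."""
    return next(read_batches(num_batches, rows_per_batch, log))
```

Both totals come out the same, but only the streaming version keeps peak memory at roughly one batch, and `first_batch_only` reads a single batch instead of the whole table. That benefit disappears if the underlying scan has already materialized everything before handing out the first batch.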

Are there any short-term plans to support delta-kernel 0.4.x?
I'm new to Rust and sadly couldn't get it working myself.

Relevant bit of the changelog:

Scan's execute(..) method now returns a lazy iterator instead of materializing a Vec<ScanResult>

Source: https://github.com/delta-incubator/delta-kernel-rs/blob/bd2ea9f2fa44d8bc559659e53d38374309ecf63a/CHANGELOG.md#v040-2024-10-23
