rust engine consume a lot of memory compared to pyarrow #2968

Open
djouallah opened this issue Oct 31, 2024 · 5 comments
Labels
binding/python Issues for the Python package bug Something isn't working

Comments

@djouallah

Environment

Delta-rs version:
0.21.0

Binding:

Environment:

  • OS:
    Linux

Bug

Switching from the pyarrow engine to the rust engine increases memory usage by nearly 3x. The job used to work fine, but now it fails with OOM errors.

I added a reproducible example with only 60 input files to demonstrate the issue:

https://colab.research.google.com/drive/1fahlV0FgKSAS8sQvRMu47s3bDP1ekLbb#scrollTo=333a177b-f075-412e-8ca1-32d44f8c07eb

@djouallah djouallah added the bug Something isn't working label Oct 31, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Oct 31, 2024
@rtyler
Member

rtyler commented Oct 31, 2024

@djouallah 👋 In the attached notebook, which write_deltalake call is causing the memory pressure? I'm less familiar with duckdb, but I assume the df objects it produces are pyarrow.dataset.Dataset objects? Or are they another type?

@djouallah
Author

I think it is an Arrow RecordBatchReader.

@ion-elgreco
Collaborator

ion-elgreco commented Nov 2, 2024

The Rust engine materializes everything in memory before starting the write process. Pyarrow probably writes batch by batch when you pass a reader.
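
To make the difference concrete, here is a minimal conceptual sketch in plain Python. It is not delta-rs or pyarrow code: read_batches, CountingSink, write_materialized, and write_streaming are hypothetical names, and the batch count and batch size are arbitrary. It only models how many rows are held in memory at once when a reader is fully collected first versus consumed one batch at a time.

```python
from typing import Iterable, Iterator, List

BATCH_ROWS = 1_000  # arbitrary batch size, chosen for illustration only


def read_batches(num_batches: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a batch reader: yields one batch at a time."""
    for i in range(num_batches):
        yield list(range(i * BATCH_ROWS, (i + 1) * BATCH_ROWS))


class CountingSink:
    """Hypothetical writer that tracks rows written and peak rows held in memory."""

    def __init__(self) -> None:
        self.rows_written = 0
        self.peak_rows_in_memory = 0

    def write(self, batch: List[int], rows_in_memory: int) -> None:
        self.peak_rows_in_memory = max(self.peak_rows_in_memory, rows_in_memory)
        self.rows_written += len(batch)


def write_materialized(batches: Iterable[List[int]], sink: CountingSink) -> None:
    # Collect every batch before writing anything, so all rows are held at once.
    all_batches = list(batches)
    held = sum(len(b) for b in all_batches)
    for batch in all_batches:
        sink.write(batch, held)


def write_streaming(batches: Iterable[List[int]], sink: CountingSink) -> None:
    # Write each batch as it arrives, so only one batch is held at a time.
    for batch in batches:
        sink.write(batch, len(batch))


materialized = CountingSink()
write_materialized(read_batches(60), materialized)

streaming = CountingSink()
write_streaming(read_batches(60), streaming)

print(materialized.peak_rows_in_memory, streaming.peak_rows_in_memory)
# prints: 60000 1000
```

Both paths write the same rows; only the peak amount held in memory differs, and that difference grows with the number of input batches.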

@djouallah
Author

Just for my own understanding, is this something that can be fixed by datafusion?

@ion-elgreco
Collaborator

@djouallah There is a PR to address this, but the contributor hasn't had time to finish it yet: #2289
