[Python] Dataset.to_batches accumulates memory usage and leaks #39808
Can you print the output of what you see locally? (And ideally share a full reproducer that includes creating a dummy parquet file that reproduces the issue for you, in case it is dependent on the content/characteristics of the parquet file.)
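For anyone who wants a fully self-contained reproducer, a dummy Parquet file could be generated along these lines; the schema, row count, and file name here are illustrative assumptions, not taken from the report:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative dummy data: an int64 id column plus a float column,
# written in several row groups so the file streams as many batches.
n_rows = 10_000_000
table = pa.table({
    "id": np.arange(n_rows, dtype=np.int64),
    "value": np.random.default_rng(0).random(n_rows),
})
pq.write_table(table, "dummy.parquet", row_group_size=1_000_000)
```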
I am observing the same thing with a single parquet file of 10M records (https://storage.googleapis.com/pinecone-datasets-dev/yfcc-10M-filter-euclidean-formatted/passages/part-0.parquet, 2.3 GB). Using code equivalent to the OP's, I see the same behaviour. Here is a variation of the OP's reproducer code which includes the two modes:

```python
#!/usr/bin/env python3
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import sys

path = "part-0.parquet"

if sys.argv[1] == "Dataset":
    print("Using Dataset.to_batches API")
    batches = ds.dataset(path).to_batches(batch_size=100,
                                          batch_readahead=0,
                                          fragment_readahead=0)
else:
    print("Using ParquetFile.iter_batches API")
    batches = pq.ParquetFile(path).iter_batches(batch_size=100)

# Iterate through the batches, tracking the high-water mark of Arrow allocations.
max_alloc = 0
for batch in batches:
    alloc = pa.total_allocated_bytes()
    if alloc > max_alloc:
        max_alloc = alloc
        print("New max total_allocated_bytes", max_alloc)

del batches
print("Final:", pa.total_allocated_bytes())
```

With pyarrow 16.1.0 and Python 3.11.6 I see the following numbers:
```
./pyarrow_39808_repro.py Dataset
Using Dataset.to_batches API
New max total_allocated_bytes 549322688
New max total_allocated_bytes 794176064
New max total_allocated_bytes 1094879616
New max total_allocated_bytes 1340710656
New max total_allocated_bytes 3732962048
New max total_allocated_bytes 4236869504
New max total_allocated_bytes 4849658496
New max total_allocated_bytes 6188106432
Final: 185074688
```
i.e. with the Dataset.to_batches API the peak allocation climbs past 6 GB during iteration, even though the final allocation after the loop drops back to ~185 MB.
Is it possible to ask for an update on this issue? I am currently running into the same memory leak in my project, which is mitigated when using ParquetFile.iter_batches instead.
Is there any update on this issue? I am also running into the same memory leak and would like to stick with the Dataset API.
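For anyone who needs a stop-gap in the meantime, one possible workaround (a rough sketch assuming a FileSystemDataset over local files and a placeholder directory path, not an official recommendation) is to keep using a dataset for file discovery but stream each file through ParquetFile.iter_batches:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Placeholder path to a local directory of Parquet files.
dataset = ds.dataset("path/to/parquet_dir", format="parquet")

def iter_all_batches(dataset, batch_size=100):
    # Walk each file backing the dataset and stream its record batches;
    # with ParquetFile.iter_batches each batch is freed once it goes out of scope.
    for file_path in dataset.files:
        pf = pq.ParquetFile(file_path)
        yield from pf.iter_batches(batch_size=batch_size)

for batch in iter_all_batches(dataset):
    pass  # process each batch here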
Describe the bug, including details regarding any error messages, version, and platform.
When you read a large Parquet file (or a series of Parquet files) with the dataset reader, it accumulates the memory it allocates as you iterate through the batches.
To recreate:
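A minimal sketch of the reproduction, assuming a sufficiently large local Parquet file at a placeholder path:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder path: point this at any large Parquet file (or a directory of them).
dataset = ds.dataset("large_file.parquet", format="parquet")

# Arrow's allocation counter keeps growing while iterating over small batches,
# instead of staying roughly flat as each batch goes out of scope.
for i, batch in enumerate(dataset.to_batches(batch_size=100)):
    if i % 1_000 == 0:
        print(f"batch {i}: total_allocated_bytes = {pa.total_allocated_bytes():,}")
```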
I am running this on OSX, which I believe uses the mimalloc backend by default (a quick way to verify this is sketched after the component list below). It's worth noting that this is not the behavior that ParquetFile.iter_batches has. If you swap in that iterator, it de-allocates the memory as soon as the batch leaves scope.

Component(s)
Parquet, Python
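As an aside, which allocator backend Arrow is actually using can be checked directly; a small sketch:

```python
import pyarrow as pa

# Prints e.g. "mimalloc", "jemalloc", or "system", depending on how pyarrow
# was built and on the ARROW_DEFAULT_MEMORY_POOL environment variable.
print(pa.default_memory_pool().backend_name)
```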