[Python] Dataset.to_batches accumulates memory usage and leaks #39808

Open

akoumjian opened this issue Jan 26, 2024 · 4 comments

Comments

@akoumjian

Describe the bug, including details regarding any error messages, version, and platform.

When reading a large Parquet file, or a series of Parquet files, through the dataset reader, the memory it allocates accumulates as you iterate through the batches.

To reproduce:

import pyarrow as pa
import pyarrow.dataset as ds

# Create dataset from a large parquet file or files
dataset = ds.dataset('data/parquet', format='parquet')

# Iterate through batches; allocated bytes keep growing even with readahead disabled
for batch in dataset.to_batches(batch_size=1000, batch_readahead=0, fragment_readahead=0):
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())

I am running this on macOS, which I believe uses the mimalloc backend by default. It's worth noting that this is not the behavior of ParquetFile.iter_batches: if you swap in that iterator, it deallocates the memory as soon as the batch leaves scope.
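
For comparison, here is a minimal sketch of the ParquetFile.iter_batches variant referred to above; the single-file path is a hypothetical stand-in for the OP's data, and pa.default_memory_pool().backend_name can be printed to confirm which allocator is actually active:

import pyarrow as pa
import pyarrow.parquet as pq

# Confirm which allocator backend is active (e.g. mimalloc, jemalloc, system)
print(pa.default_memory_pool().backend_name)

# Hypothetical single-file path; adjust to your data
pf = pq.ParquetFile('data/parquet/part-0.parquet')

# Memory for each batch is released once it goes out of scope
for batch in pf.iter_batches(batch_size=1000):
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())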

Component(s)

Parquet, Python

@kou changed the title from "Dataset.to_batches accumulates memory usage and leaks" to "[Python] Dataset.to_batches accumulates memory usage and leaks" on Jan 27, 2024
@jorisvandenbossche
Member

Can you share the output of what you see locally? (And ideally a full reproducer that includes creating a dummy Parquet file which triggers the issue for you, in case it depends on the content or characteristics of the Parquet file.)
I am trying to reproduce this on my Linux laptop, but I don't see a continuous increase in allocated bytes.
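
For reference, a minimal sketch of generating a dummy Parquet file so the reproducer is self-contained; the row count, schema, row group size, and output path here are arbitrary assumptions and may need to resemble the real data to trigger the issue:

import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Arbitrary dummy data; adjust the row count and schema to resemble the real file
n_rows = 10_000_000
table = pa.table({
    'id': np.arange(n_rows, dtype=np.int64),
    'value': np.random.random(n_rows),
})

os.makedirs('data/parquet', exist_ok=True)
pq.write_table(table, 'data/parquet/dummy.parquet', row_group_size=100_000)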

@daverigby

I am observing the same thing with a single parquet file of 10M records (https://storage.googleapis.com/pinecone-datasets-dev/yfcc-10M-filter-euclidean-formatted/passages/part-0.parquet - 2.3GB).

Using code equivalent to the OP's, I see total_allocated_bytes increase consistently over the run, requiring over 5.9 GB to iterate the file.

Using ParquetFile.iter_batches as suggested, memory usage is much more stable (although it increases a little over the duration).

A variation of the OP's reproducer code which includes the two modes:

#!/usr/bin/env python3

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import sys

path="part-0.parquet"

if sys.argv[1] == "Dataset":
    print("Using Dataset.to_batches API")    
    batches = ds.dataset(path).to_batches(batch_size=100,
                                          batch_readahead=0,
                                          fragment_readahead=0)
else:
    print("Using ParquetFile.iter_batches API")
    batches = pq.ParquetFile(path).iter_batches(batch_size=100)

# Iterate through batches
max_alloc = 0
for batch in batches:
    alloc = pa.total_allocated_bytes()
    if alloc > max_alloc:
        max_alloc = alloc
        print("New max total_allocated_bytes", max_alloc)
del batches
print("Final:", pa.total_allocated_bytes())

I see the following numbers (pyarrow 16.1.0, python 3.11.6):

  • Dataset.to_batches:
./pyarrow_39808_repro.py Dataset
Using Dataset.to_batches API
New max total_allocated_bytes 549322688
New max total_allocated_bytes 794176064
New max total_allocated_bytes 1094879616
New max total_allocated_bytes 1340710656
New max total_allocated_bytes 3732962048
New max total_allocated_bytes 4236869504
New max total_allocated_bytes 4849658496
New max total_allocated_bytes 6188106432
Final: 185074688
  • ParquetFile.iter_batches:
./pyarrow_39808_repro.py ParquetFile
Using ParquetFile.iter_batches API     
New max total_allocated_bytes 252396608
New max total_allocated_bytes 252403456
<cut>
New max total_allocated_bytes 274519744
New max total_allocated_bytes 274521024
Final: 35072

i.e. ParquetFile.iter_batches requires at most ~261 MB, whereas Dataset.to_batches requires ~5901 MB, roughly 22x more (!), plus an additional 176 MB still allocated after iteration completes.

@PatrikBernhard

Is it possible to ask for an update on this issue? I am currently running into the same memory leak in my project, which is mitigated by using ParquetFile.iter_batches instead of Dataset.to_batches. However, I'd like to use the Hive partition discovery within Dataset.
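
If it helps in the meantime, one possible workaround, sketched below as an assumption rather than a confirmed fix: let ds.dataset() do the Hive partition discovery only, then iterate each discovered fragment's file with ParquetFile.iter_batches. Note that this yields only the file's own columns and does not attach the partition key values to the batches; the directory path is hypothetical.

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical Hive-partitioned directory; partition keys are inferred from the paths
dataset = ds.dataset('data/partitioned', format='parquet', partitioning='hive')

for fragment in dataset.get_fragments():
    # fragment.path is the Parquet file backing this fragment
    pf = pq.ParquetFile(fragment.path)
    for batch in pf.iter_batches(batch_size=1000):
        ...  # process batch; memory is released as each batch goes out of scope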

@mats-werrebrouck

Is there any update on this issue? I am also running into the same memory leak and would like to stick with Dataset.to_batches because of the Hive partition discovery within PyArrow Dataset.
