[Python] Dataset.to_batches accumulates memory usage and leaks #39808

Open

akoumjian opened this issue Jan 26, 2024 · 4 comments

Comments

@akoumjian

Describe the bug, including details regarding any error messages, version, and platform.

When reading a large Parquet file, or a series of Parquet files, through the dataset reader, the memory it allocates accumulates as you iterate through the batches.

To reproduce:

import pyarrow as pa
import pyarrow.dataset as ds

# Create dataset from a large parquet file or files
dataset = ds.dataset('data/parquet', format='parquet')

# Iterate through batches; allocated bytes keep growing even with readahead disabled
for batch in dataset.to_batches(batch_size=1000, batch_readahead=0, fragment_readahead=0):
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())

I am running this on macOS, which I believe uses the mimalloc backend by default. It's worth noting that this is not the behavior of ParquetFile.iter_batches: if you swap in that iterator, it deallocates the memory as soon as the batch leaves scope.
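
For comparison, here is a minimal sketch of the ParquetFile.iter_batches variant referred to above; the single-file path is a hypothetical stand-in for the OP's data, and pa.default_memory_pool().backend_name can be printed to confirm which allocator is actually active:

import pyarrow as pa
import pyarrow.parquet as pq

# Confirm which allocator backend is active (e.g. mimalloc, jemalloc, system)
print(pa.default_memory_pool().backend_name)

# Hypothetical single-file path; adjust to your data
pf = pq.ParquetFile('data/parquet/part-0.parquet')

# Memory for each batch is released once it goes out of scope
for batch in pf.iter_batches(batch_size=1000):
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())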

Component(s)

Parquet, Python

@kou changed the title from "Dataset.to_batches accumulates memory usage and leaks" to "[Python] Dataset.to_batches accumulates memory usage and leaks" on Jan 27, 2024
@jorisvandenbossche
Member

Can you share the output of what you see locally? (And ideally a full reproducer that includes creating a dummy Parquet file which triggers the issue for you, in case it depends on the content or characteristics of the Parquet file.)
I am trying to reproduce this on my Linux laptop, but I don't see a continuous increase in allocated bytes.
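
For reference, a minimal sketch of generating a dummy Parquet file so the reproducer is self-contained; the row count, schema, row group size, and output path here are arbitrary assumptions and may need to resemble the real data to trigger the issue:

import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Arbitrary dummy data; adjust the row count and schema to resemble the real file
n_rows = 10_000_000
table = pa.table({
    'id': np.arange(n_rows, dtype=np.int64),
    'value': np.random.random(n_rows),
})

os.makedirs('data/parquet', exist_ok=True)
pq.write_table(table, 'data/parquet/dummy.parquet', row_group_size=100_000)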

@daverigby

I am observing the same thing with a single parquet file of 10M records (https://storage.googleapis.com/pinecone-datasets-dev/yfcc-10M-filter-euclidean-formatted/passages/part-0.parquet - 2.3GB).

Using code equivalent to the OP's, I see total_allocated_bytes increase consistently over the run, requiring over 5.9 GB to iterate the file.

Using ParquetFile.iter_batches as suggested, memory usage is much more stable (although it increases a little over the duration).

A variation of the OP's reproducer code which includes the two modes:

#!/usr/bin/env python3

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import sys

path="part-0.parquet"

if sys.argv[1] == "Dataset":
    print("Using Dataset.to_batches API")    
    batches = ds.dataset(path).to_batches(batch_size=100,
                                          batch_readahead=0,
                                          fragment_readahead=0)
else:
    print("Using ParquetFile.iter_batches API")
    batches = pq.ParquetFile(path).iter_batches(batch_size=100)

# Iterate through batches
max_alloc = 0
for batch in batches:
    alloc = pa.total_allocated_bytes()
    if alloc > max_alloc:
        max_alloc = alloc
        print("New max total_allocated_bytes", max_alloc)
del batches
print("Final:", pa.total_allocated_bytes())

I see the following numbers (pyarrow 16.1.0, python 3.11.6):

  • Dataset.to_batches:
./pyarrow_39808_repro.py Dataset
Using Dataset.to_batches API
New max total_allocated_bytes 549322688
New max total_allocated_bytes 794176064
New max total_allocated_bytes 1094879616
New max total_allocated_bytes 1340710656
New max total_allocated_bytes 3732962048
New max total_allocated_bytes 4236869504
New max total_allocated_bytes 4849658496
New max total_allocated_bytes 6188106432
Final: 185074688
  • ParquetFile.iter_batches:
./pyarrow_39808_repro.py ParquetFile
Using ParquetFile.iter_batches API     
New max total_allocated_bytes 252396608
New max total_allocated_bytes 252403456
<cut>
New max total_allocated_bytes 274519744
New max total_allocated_bytes 274521024
Final: 35072

i.e. ParquetFile.iter_batches requires at most ~261 MB, whereas Dataset.to_batches requires ~5901 MB, roughly 22x more (!), plus an additional 176 MB still allocated after iteration completes.

@PatrikBernhard

Is it possible to ask for an update on this issue? I am currently running into the same memory leak in my project, which is mitigated by using ParquetFile.iter_batches instead of Dataset.to_batches. However, I'd like to use the Hive partition discovery within Dataset.
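
If it helps in the meantime, one possible workaround, sketched below as an assumption rather than a confirmed fix: let ds.dataset() do the Hive partition discovery only, then iterate each discovered fragment's file with ParquetFile.iter_batches. Note that this yields only the file's own columns and does not attach the partition key values to the batches; the directory path is hypothetical.

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical Hive-partitioned directory; partition keys are inferred from the paths
dataset = ds.dataset('data/partitioned', format='parquet', partitioning='hive')

for fragment in dataset.get_fragments():
    # fragment.path is the Parquet file backing this fragment
    pf = pq.ParquetFile(fragment.path)
    for batch in pf.iter_batches(batch_size=1000):
        ...  # process batch; memory is released as each batch goes out of scope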

@mats-werrebrouck

Is there any update on this issue? I am also running into the same memory leak and would like to stick with Dataset.to_batches because of the Hive partition discovery within PyArrow Dataset.
