
Feature request: read a large DataFrame in chunks #1709

Open
River-Shi opened this issue Jul 24, 2024 · 7 comments
Labels: enhancement (New feature or request)

Comments

@River-Shi

Is your feature request related to a problem? Please describe.
If we have a really large DataFrame that exceeds memory and we need to process it piece by piece, Parquet supports reading in batches via batch_size. I'm wondering if lib.read has similar functionality.

import pyarrow.parquet as pq

def read_parquet_in_batches(file_path, batch_size=10000):
    # Stream record batches from the Parquet file instead of loading it all at once
    parquet_file = pq.ParquetFile(file_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield batch.to_pandas()
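
For illustration, a generator like this is consumed chunk by chunk; the file path below is just a placeholder:

for chunk in read_parquet_in_batches("trades.parquet", batch_size=50_000):
    print(chunk.shape)  # replace with per-chunk processing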
River-Shi added the enhancement (New feature or request) label on Jul 24, 2024
@DrNickClarke
Contributor

Hi. You can do this using the row_range argument of the read function.

We are planning various future improvements that will make this easier to use and faster.
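
For example, here is a minimal sketch of chunked reading built on row_range. It assumes row_range takes a (start, end) tuple of row offsets, half-open at the end like a Python slice, and that get_description reports the symbol's row_count; the URI, library, and symbol names are placeholders:

import arcticdb as adb

def read_in_row_chunks(lib, symbol, chunk_rows=100_000):
    # Assumption: get_description exposes the total number of rows stored for the symbol
    total_rows = lib.get_description(symbol).row_count
    for start in range(0, total_rows, chunk_rows):
        end = min(start + chunk_rows, total_rows)
        # row_range pulls only rows [start, end) into memory
        yield lib.read(symbol, row_range=(start, end)).data

# Placeholder URI, library, and symbol, for illustration only
ac = adb.Arctic("lmdb://crypto_database.lmdb")
lib = ac["binance"]
for chunk in read_in_row_chunks(lib, "BTCUSDT", chunk_rows=50_000):
    print(chunk.shape)  # replace with per-chunk processing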

@River-Shi
Author

River-Shi commented Jul 26, 2024

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Yes, it can indeed be done using row_range or date_range. Could ArcticDB implement a helper like this, making it easier for users to operate on each batch instead of the entire DataFrame? That would save a lot of memory.

import arcticdb as adb
import pandas as pd
from datetime import timedelta

def fetch_batch_from_arcticdb(
    symbol: str,
    start: str,
    end: str,
    batch_size: int = 1440,
    uri: str = 'lmdb://crypto_database.lmdb',
    library: str = 'binance',
):
    ac = adb.Arctic(uri)
    lib = ac[library]

    start_date = pd.Timestamp(start)
    end_date = pd.Timestamp(end)

    while start_date < end_date:
        # Advance in fixed windows of batch_size minutes, clipped to the overall end
        batch_end = min(start_date + timedelta(minutes=batch_size), end_date)

        # Only this date window is read into memory
        df = lib.read(symbol, date_range=(start_date, batch_end)).data

        yield df

        start_date = batch_end
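
As a usage sketch (the symbol and dates are placeholders), the generator would be consumed like this:

for batch in fetch_batch_from_arcticdb('BTCUSDT', '2024-01-01', '2024-02-01', batch_size=1440):
    print(batch.shape)  # replace with per-batch processing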

@River-Shi
Author

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Hi, any new updates?

@vasil-pashov
Collaborator

> > Hi. You can do this using the row_range argument of the read function.
> > We are planning various future improvements that will make this easier to use and faster.
>
> Hi, any new updates?

Hi, the roadmap is not completely sorted out yet. @DrNickClarke will get back to you with more info.

@DrNickClarke
Contributor

Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

@River-Shi
Author

> Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

Dask has some great features for processing larger-than-memory DataFrames in chunks. It would be great if ArcticDB could implement something similar.
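
For context, this is the kind of chunked, lazy processing I mean (a minimal sketch; the file path and column names are placeholders, assuming the data is already stored as Parquet):

import dask.dataframe as dd

# Dask loads the Parquet dataset lazily as a collection of partitions
ddf = dd.read_parquet("trades.parquet")

# Work is expressed per partition and only materialised on compute(),
# so no single step needs the whole DataFrame in memory
result = ddf.groupby("symbol")["price"].mean().compute()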

@DrNickClarke
Contributor

Hi. We managed to find time to add a basic chunking API. Here is the PR with the new API and tests; we would value your feedback.

#1853
