
Feature request: read a large DataFrame in chunks #1709

Open
River-Shi opened this issue Jul 24, 2024 · 7 comments
Labels: enhancement (New feature or request)

Comments

@River-Shi

Is your feature request related to a problem? Please describe.
If we have a really large DataFrame that exceeds memory and we need to process it piece by piece, Parquet supports reading in batches via batch_size. I'm wondering if lib.read has similar functionality.

import pyarrow.parquet as pq

def read_parquet_in_batches(file_path, batch_size=10000):
    # Stream record batches from the Parquet file instead of loading it all at once
    parquet_file = pq.ParquetFile(file_path)
    for batch in parquet_file.iter_batches(batch_size=batch_size):
        yield batch.to_pandas()
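
For illustration, a generator like this is consumed chunk by chunk; the file path below is just a placeholder:

for chunk in read_parquet_in_batches("trades.parquet", batch_size=50_000):
    print(chunk.shape)  # replace with per-chunk processing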
River-Shi added the enhancement (New feature or request) label on Jul 24, 2024
@DrNickClarke
Contributor

Hi. You can do this using the row_range argument of the read function.

We are planning various future improvements that will make this easier to use and faster.
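
For example, here is a minimal sketch of chunked reading built on row_range. It assumes row_range takes a (start, end) tuple of row offsets, half-open at the end like a Python slice, and that get_description reports the symbol's row_count; the URI, library, and symbol names are placeholders:

import arcticdb as adb

def read_in_row_chunks(lib, symbol, chunk_rows=100_000):
    # Assumption: get_description exposes the total number of rows stored for the symbol
    total_rows = lib.get_description(symbol).row_count
    for start in range(0, total_rows, chunk_rows):
        end = min(start + chunk_rows, total_rows)
        # row_range pulls only rows [start, end) into memory
        yield lib.read(symbol, row_range=(start, end)).data

# Placeholder URI, library, and symbol, for illustration only
ac = adb.Arctic("lmdb://crypto_database.lmdb")
lib = ac["binance"]
for chunk in read_in_row_chunks(lib, "BTCUSDT", chunk_rows=50_000):
    print(chunk.shape)  # replace with per-chunk processing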

@River-Shi
Author

River-Shi commented Jul 26, 2024

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Yes, it can indeed be done using row_range or date_range. Could ArcticDB implement a helper like this, making it easier for users to operate on each batch instead of the entire DataFrame? That would save a lot of memory.

import arcticdb as adb
import pandas as pd
from datetime import timedelta

def fetch_batch_from_arcticdb(
    symbol: str,
    start: str,
    end: str,
    batch_size: int = 1440,
    uri: str = 'lmdb://crypto_database.lmdb',
    library: str = 'binance',
):
    ac = adb.Arctic(uri)
    lib = ac[library]

    start_date = pd.Timestamp(start)
    end_date = pd.Timestamp(end)

    while start_date < end_date:
        # Advance in fixed windows of batch_size minutes, clipped to the overall end
        batch_end = min(start_date + timedelta(minutes=batch_size), end_date)

        # Only this date window is read into memory
        df = lib.read(symbol, date_range=(start_date, batch_end)).data

        yield df

        start_date = batch_end
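
As a usage sketch (the symbol and dates are placeholders), the generator would be consumed like this:

for batch in fetch_batch_from_arcticdb('BTCUSDT', '2024-01-01', '2024-02-01', batch_size=1440):
    print(batch.shape)  # replace with per-batch processing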

@River-Shi
Author

> Hi. You can do this using the row_range argument of the read function.
>
> We are planning various future improvements that will make this easier to use and faster.

Hi, any new updates?

@vasil-pashov
Collaborator

> > Hi. You can do this using the row_range argument of the read function.
> > We are planning various future improvements that will make this easier to use and faster.
>
> Hi, any new updates?

Hi, the roadmap is not completely sorted out yet. @DrNickClarke will get back to you with more info.

@DrNickClarke
Contributor

Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

@River-Shi
Author

> Hi. Sorry for the delay coming back on this. Thank you for your suggestion. We definitely have plans to make chunking easier going forward. It has not reached the top of the priority list at this time but I hope you will be pleased to see the announcements we will be making in the near future.

Dask has some great features for processing larger-than-memory DataFrames in chunks. It would be great if ArcticDB could implement something similar.
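
For context, this is the kind of chunked, lazy processing I mean (a minimal sketch; the file path and column names are placeholders, assuming the data is already stored as Parquet):

import dask.dataframe as dd

# Dask loads the Parquet dataset lazily as a collection of partitions
ddf = dd.read_parquet("trades.parquet")

# Work is expressed per partition and only materialised on compute(),
# so no single step needs the whole DataFrame in memory
result = ddf.groupby("symbol")["price"].mean().compute()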

@DrNickClarke
Contributor

Hi. We managed to find time to add a basic chunking API. Here is the PR with the new API and tests; we would value your feedback.

#1853
