Does Ballista have any form of caching for queries on S3 files? #822

collimarco · 2023-06-28T09:39:32Z

collimarco
Jun 28, 2023

Can Ballista cache, in memory or on disk, the files (or parts of files) that it downloads from S3?

For example, if I run the same query or a similar query on the same dataset of Parquet files stored on S3, does Ballista download the files every time, or does it have some form of caching (so that subsequent queries are faster and require fewer downloads)?
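
To make the question concrete, here is a minimal sketch of the kind of behavior I have in mind: a read-through cache keyed by object path and byte range. It is only an illustration in Python, not Ballista code. `ReadThroughCache`, `fake_s3_fetch`, and the `downloads` counter are hypothetical names, the fetch function is a stand-in for a real S3 ranged GET, and there is no eviction or on-disk tier.

```python
from typing import Callable, Dict, Tuple

# Hypothetical names throughout: this is an illustration, not Ballista's API.
RangeKey = Tuple[str, int, int]  # (object path, start offset, length)


class ReadThroughCache:
    """Keeps fetched byte ranges in memory so repeated reads skip the download."""

    def __init__(self, fetch_range: Callable[[str, int, int], bytes]) -> None:
        self._fetch_range = fetch_range  # stand-in for an S3 ranged GET
        self._store: Dict[RangeKey, bytes] = {}
        self.downloads = 0  # number of times the backing store was actually hit

    def read(self, path: str, start: int, length: int) -> bytes:
        key = (path, start, length)
        if key not in self._store:
            # Cache miss: fetch once and remember the result.
            self._store[key] = self._fetch_range(path, start, length)
            self.downloads += 1
        return self._store[key]


def fake_s3_fetch(path: str, start: int, length: int) -> bytes:
    """Pretend download: returns zero bytes instead of calling S3."""
    return bytes(length)


cache = ReadThroughCache(fake_s3_fetch)
first = cache.read("s3://bucket/data.parquet", 0, 8)   # miss, triggers a fetch
second = cache.read("s3://bucket/data.parquet", 0, 8)  # hit, served from memory
```

With something like this in place, the second query over the same Parquet byte ranges would not go back to S3 at all.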

mingmwang · 2023-06-28T23:36:52Z

mingmwang
Jun 28, 2023
Collaborator

@collimarco
Currently there is no caching in the main branch.
@yahoNanJing has implemented an adaptive data cache in his own branch, and I think he will contribute it to the main branch soon.


yahoNanJing · 2023-06-29T01:27:52Z

yahoNanJing
Jun 29, 2023
Collaborator

Hi @collimarco, we have implemented a data cache for Ballista on my personal branch. It depends heavily on another PR, #823, which adds cache-aware task scheduling. After #823 is merged, I will raise another PR to contribute the data cache to the main branch.

If you are interested in this feature, you can follow #645.

