One thing you could do is to create a You would have to run this for each row group of course, but if you have all the data and metadata in memory anyway, it probably isn't that bad (and you could scan them all in parallel) 🤔
-
Hi folks,
I'm essentially trying to implement an external bloom filter for columns with randomly distributed values, where min/max stats don't help. Reading bloom filter information from all files may be too expensive (100s of high-latency object store requests), which is why I want to store this information externally.
The roadblock I'm running into is getting access to the values per row group, the way bloom filters do.
Given that I have a DataFrame that I'm going to write out, as I see it my options are:
Any suggestions on APIs or hook points I may be overlooking?
Thanks!
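
A minimal sketch of the external per-row-group index described above, in plain Python with no Parquet library involved. It assumes the column values are available in write order and that the writer cuts row groups at a fixed row count (`rows_per_group`). A real writer may split on other criteria, so that alignment is an assumption to verify. `BloomFilter`, `build_row_group_filters` and `candidate_row_groups` are hypothetical names for illustration only.

```python
import hashlib
from typing import Iterable, List, Sequence


class BloomFilter:
    """Tiny Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: str) -> Iterable[int]:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        # Force the step to be odd so it is never zero.
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def build_row_group_filters(
    values: Sequence[str], rows_per_group: int
) -> List[BloomFilter]:
    """Build one filter per chunk of `rows_per_group` consecutive values.

    Assumption: chunk boundaries match the row groups the writer produces.
    """
    filters = []
    for start in range(0, len(values), rows_per_group):
        bloom = BloomFilter()
        for value in values[start:start + rows_per_group]:
            bloom.add(value)
        filters.append(bloom)
    return filters


def candidate_row_groups(filters: Sequence[BloomFilter], value: str) -> List[int]:
    """Return indices of row groups that may contain `value`."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(value)]
```

The filters could then be serialized and stored outside the data files, so that a lookup only touches the row groups returned by `candidate_row_groups`. Bloom filters can report false positives but never false negatives, so a row group that actually holds the value is always included.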