[Lake Info] Update Overview and HTML to not break #1460

Open
4 tasks
idiom-bytes opened this issue Jul 25, 2024 · 0 comments
Assignees
Labels
Type: Enhancement New feature or request

Comments

@idiom-bytes
Member

idiom-bytes commented Jul 25, 2024

Background / motivation

Issue 1 - Lake Info should use summary queries, sampling, and similar approaches to handle larger datasets, so that it does not select * from the lake yet still gives the user the insights they need.
(a) overview.py fetches all data from all tables and caches it in memory
(b) HTML uses a _filter() function that, rather than looking at the cached data, fetches again from the DB
(c) Issue 3 takes this further: data is fetched many times, instead of being fetched once, cached, and displayed everywhere
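
A minimal sketch of what a "summary + sample" overview could look like, instead of loading every row. It uses Python's built-in sqlite3 purely as a stand-in for the lake's DB; table_overview, the predictions table, and its columns are hypothetical names for illustration, not the existing overview.py API.

```python
import sqlite3


def table_overview(conn, table, sample_size=5):
    """Return the row count and a small sample for one table.

    Hypothetical helper: only an aggregate and a LIMITed query hit the DB,
    so memory use does not grow with table size.
    """
    # Table names cannot be bound as SQL parameters, so check the catalog first.
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")

    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    sample = conn.execute(
        f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
    ).fetchall()
    return {"table": table, "row_count": row_count, "sample": sample}


# Demo with made-up data (assumption: a table with a user column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (user TEXT, payout REAL)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [(f"0x{i % 3}", float(i)) for i in range(1000)],
)
overview = table_overview(conn, "predictions", sample_size=5)
print(overview["row_count"], len(overview["sample"]))
```

The same idea should carry over to whatever engine backs the lake: ask the DB for counts and aggregates, and only pull a bounded number of rows for display.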


Issue 2 - _filter_table is implemented in a "generic manner", but the underlying logic in get_filtered_result doesn't actually use what's cached in memory. Instead, it runs another query against the DB, with a filter applied only to the user column.
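
As a rough illustration of the alternative, here is a sketch of a filter that works only on rows already held in memory and accepts any column, not just user. filter_cached_rows and the row layout are assumptions for the example; the real get_filtered_result signature is not shown in this issue.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def filter_cached_rows(rows: List[Row], filters: Dict[str, Any]) -> List[Row]:
    """Keep rows where every (column, value) pair in `filters` matches.

    Hypothetical helper: it never touches the DB, it only scans `rows`.
    """
    return [
        row
        for row in rows
        if all(col in row and row[col] == val for col, val in filters.items())
    ]


# Made-up cached data standing in for whatever was fetched earlier.
cached = [
    {"user": "0xabc", "contract": "c1", "payout": 1.5},
    {"user": "0xdef", "contract": "c1", "payout": 0.0},
    {"user": "0xabc", "contract": "c2", "payout": 2.0},
]
by_user = filter_cached_rows(cached, {"user": "0xabc"})
by_user_and_contract = filter_cached_rows(cached, {"user": "0xabc", "contract": "c2"})
print(len(by_user), len(by_user_and_contract))
```

If the cache only holds a sample or a page (per Issue 1), the filter would need to be pushed down into the query instead; this sketch only covers the in-memory case.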


Issue 3 - html.py has to call _filter_table many times, so the same fetch/compute is done repeatedly rather than once.
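
One possible way to get "compute once, reuse everywhere" is plain memoization. The sketch below uses functools.lru_cache from the standard library; filtered_result and _expensive_fetch are hypothetical stand-ins, not the current html.py code, and the counter exists only to show how often the expensive path runs.

```python
from functools import lru_cache

fetch_calls = 0  # counts how many times the expensive path actually runs


def _expensive_fetch(table: str, user: str):
    """Stand-in for a DB query (hypothetical); returns a hashable tuple of rows."""
    global fetch_calls
    fetch_calls += 1
    return ((table, user, 1.0),)


@lru_cache(maxsize=128)
def filtered_result(table: str, user: str):
    """Memoized wrapper: identical (table, user) arguments hit the cache."""
    return _expensive_fetch(table, user)


# Three page sections asking for the same data trigger a single fetch.
for _ in range(3):
    rows = filtered_result("predictions", "0xabc")

print(fetch_calls, filtered_result.cache_info().hits)
```

The cache would need to be cleared when the lake is updated, for example with filtered_result.cache_clear().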

TODOs / DoD

  1. Fix the lake implementation to be more memory- and compute-aware.
  • overview.py should not fetch all data from all tables in the lake
  • html/frontend->lake should implement basic practices such as pagination, sampling, row_count, and basic summaries, so that it accounts for data size
  2. get_filtered_result should be implemented properly.
  • get_filtered_result should work as expected
  3. html.py calls _filter_table multiple times to build the same overview, which costs n times the memory and computation instead of once.
  • _filter_table should run only once across the whole page, or perhaps across the whole app, by caching the result somewhere it can be accessed
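
For the pagination item in TODO 1, a minimal sketch using LIMIT/OFFSET. Again sqlite3 is only a stand-in for the lake's DB, and fetch_page, the predictions table, and its columns are assumed names for illustration.

```python
import sqlite3


def fetch_page(conn, page: int, page_size: int = 50):
    """Return one page of rows (hypothetical helper).

    A stable ORDER BY is needed so consecutive pages do not overlap or skip rows.
    """
    offset = page * page_size
    return conn.execute(
        "SELECT id, user FROM predictions ORDER BY id LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()


# Demo with 120 made-up rows, ids 0..119.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (id INTEGER PRIMARY KEY, user TEXT)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [(i, f"0x{i % 4}") for i in range(120)],
)
first_page = fetch_page(conn, page=0)
last_page = fetch_page(conn, page=2)
print(len(first_page), len(last_page))
```

The frontend would then only ever hold page_size rows at a time, combined with a row count to render the pager.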
@idiom-bytes idiom-bytes added the Type: Enhancement New feature or request label Jul 25, 2024
@idiom-bytes idiom-bytes changed the title [Lake Info] Overview + HTML fetch all data from lake without any pagination, sampling, or limit considerations [Lake Info] Overview and HTML is implemented using many assumptions that will eventually break Jul 25, 2024
@idiom-bytes idiom-bytes changed the title [Lake Info] Overview and HTML is implemented using many assumptions that will eventually break [Lake Info] Improve Overview and HTML by using summaries, samples, and other methods such that it works with larger datasets Jul 29, 2024
@idiom-bytes idiom-bytes changed the title [Lake Info] Improve Overview and HTML by using summaries, samples, and other methods such that it works with larger datasets [Lake Info] Update Overview and HTML to work with larger datasets Jul 29, 2024
@idiom-bytes idiom-bytes changed the title [Lake Info] Update Overview and HTML to work with larger datasets [Lake Info] Update Overview and HTML to not break Jul 30, 2024
@calina-c calina-c self-assigned this Aug 6, 2024
Projects
None yet
Development

No branches or pull requests

2 participants