feat(hogql): implement basic caching for hogql insight queries #17483
# Conflicts: # frontend/__snapshots__/scenes-app-recordings--recordings-play-list-no-pinned-recordings.png
This is ready for a review. I'm going to add some more tests for the caching behaviour later.
This looks great, and the game plan sounds solid to me. However, a thought appeared: what if we moved the caching to the AST/HogQL/SQL level? Would that make anything simpler (more predictable), or would that cause problems with future plans such as partial reloading? Currently we'd lose time, as the query still needs to be parsed and generated, but assuming those things get taken care of, would there be any point in moving this caching up (or down? 🙃) a layer?
@@ -16,15 +20,12 @@


class LifecycleQueryRunner(QueryRunner):
    query: LifecycleQuery
    query_type = LifecycleQuery
nice
    def _is_stale(self, cached_result_package):
        date_to = self.query_date_range.date_to()
        interval = self.query_date_range.interval_name
        return is_stale(self.team, date_to, interval, cached_result_package)

    def _refresh_frequency(self):
        date_to = self.query_date_range.date_to()
        date_from = self.query_date_range.date_from()
        interval = self.query_date_range.interval_name

        delta_days: Optional[int] = None
        if date_from and date_to:
            delta = date_to - date_from
            delta_days = ceil(delta.total_seconds() / timedelta(days=1).total_seconds())

        refresh_frequency = BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL
        if interval == "hour" or (delta_days is not None and delta_days <= 7):
            # The interval is shorter for short-term insights
            refresh_frequency = REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL

        return refresh_frequency
This feels like something we could standardise across all queries into query_runner.py?
Yep. There are a few special cases, e.g. RetentionFilter uses period instead of interval, so I thought it's best to look at each query individually first and then unify the handling. Easy to forget this place otherwise.
Also, should we cache other HogQL queries e.g. those based on a date range?
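One way the standardisation suggested above could look, as a rough sketch only: the threshold logic from the `_refresh_frequency` method shown earlier in this thread moves into a shared helper, and each runner overrides a single hook that says where its interval lives. The class names, the `_interval_name` hook, the dict-based query, and the two constant values are all assumptions made for illustration, not code from this PR.

```python
from datetime import datetime, timedelta
from math import ceil
from typing import Any, Dict, Optional

# Assumed values, for illustration only; the real constants are defined elsewhere.
BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL = timedelta(minutes=15)
REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL = timedelta(minutes=3)


def refresh_frequency(
    interval: Optional[str],
    date_from: Optional[datetime],
    date_to: Optional[datetime],
) -> timedelta:
    """Shared version of the per-runner threshold logic."""
    delta_days: Optional[int] = None
    if date_from and date_to:
        # Dividing two timedeltas yields a float number of days
        delta_days = ceil((date_to - date_from) / timedelta(days=1))

    if interval == "hour" or (delta_days is not None and delta_days <= 7):
        # The interval is shorter for short-term insights
        return REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL
    return BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL


class RefreshAwareRunner:
    """Hypothetical base class; a plain dict stands in for the query model."""

    def __init__(self, query: Dict[str, Any]):
        self.query = query

    def _interval_name(self) -> Optional[str]:
        # Default: the query carries an "interval" field
        return self.query.get("interval")

    def _refresh_frequency(self) -> timedelta:
        return refresh_frequency(
            self._interval_name(),
            self.query.get("date_from"),
            self.query.get("date_to"),
        )


class RetentionStyleRunner(RefreshAwareRunner):
    """Special case: the interval is stored as "period" (e.g. "Hour")."""

    def _interval_name(self) -> Optional[str]:
        period = self.query.get("period")
        return period.lower() if period else None
```

With this shape, a special case such as period versus interval is confined to one small override, which makes it harder to forget when a new query type is added.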
def is_stale_filter(
    team: Team, filter: Filter | RetentionFilter | StickinessFilter | PathFilter, cached_result: Any
) -> bool:
    interval = filter.period.lower() if isinstance(filter, RetentionFilter) else filter.interval
😬
You mean cache the ClickHouse response instead of the serialized output? I've thought about it before as a way to make persons modal responses more reliable (by caching the event query base - doesn't work, cache gets blown up). We could cache the final query output, but I don't see where that would improve things (I imagine we almost have a 1-to-1 mapping of query node to ClickHouse query). Wouldn't be difficult if there's a good reason to do it though.
Yup, I imagine this as well. The case where it's not 1:1 will be when different query nodes with
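To make the two caching layers discussed in this thread concrete, here is a toy comparison of a key derived from the serialized query node against a key derived from the generated SQL. Everything in it is an assumption for illustration: `toy_compile` is a made-up stand-in for parsing and SQL generation, and neither key function is taken from this PR.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def node_level_key(query_node: Dict[str, Any]) -> str:
    """Key derived from the serialized query node."""
    payload = json.dumps(query_node, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def sql_level_key(sql: str, params: Dict[str, Any]) -> str:
    """Key derived from the generated SQL plus its parameters."""
    payload = json.dumps({"sql": sql, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def toy_compile(query_node: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical compiler: a missing and an empty filter list produce the same SQL."""
    filters = query_node.get("properties") or []
    where = " AND ".join(f"{f['key']} = %({f['key']})s" for f in filters) or "1"
    params = {f["key"]: f["value"] for f in filters}
    return f"SELECT count() FROM events WHERE {where}", params


# Two nodes that differ only in how "no filters" is spelled
a = {"kind": "LifecycleQuery", "properties": None}
b = {"kind": "LifecycleQuery", "properties": []}

same_at_node_level = node_level_key(a) == node_level_key(b)
same_at_sql_level = sql_level_key(*toy_compile(a)) == sql_level_key(*toy_compile(b))
```

In this toy setup the two nodes get different node-level keys but share one SQL-level key. The cost is that the SQL-level key is only available after parsing and generation have already run.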
Problem
We're not caching calls to /query.

Changes
This PR implements a basic caching mechanism for insight queries on the /query endpoint.

Compared to current implementation:
- […] InsightCachingState) - tbd if we still want background refreshes going forward.
- […] properties: null and properties: [] have a different cache key. We can improve on that later on and the best way to do so would be to move the schema generation to the backend, so that we can use Pydantic validators.

In detail, this PR:
- […] refresh param with api.query for HogQL queries
- […] QueryResponse interface with is_cached and last_refresh attributes
- […] QueryRunner an abstract base class, adds methods for caching and adds tests
- […] model_dump_json
- […] process_query accept an optional request for determining whether we want to refresh

Todos:
How did you test this code?
Added tests
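As a rough sketch of the flow described above (an abstract runner with caching methods, a refresh flag, and is_cached / last_refresh on the response), the following may help reviewers picture the moving parts. It is not the code from this PR: the class names, the `run` signature, the dict-based response, and the in-memory dict used as a cache backend are all assumptions.

```python
import hashlib
import json
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from typing import Any, Dict


class CachingRunnerSketch(ABC):
    """Hypothetical, simplified stand-in for an abstract query runner."""

    def __init__(self, query: Dict[str, Any], cache: Dict[str, Dict[str, Any]]):
        self.query = query
        self.cache = cache  # a plain dict stands in for the real cache backend

    @abstractmethod
    def calculate(self) -> Any:
        """Execute the query and return fresh results."""

    @abstractmethod
    def _is_stale(self, cached_response: Dict[str, Any]) -> bool:
        """Decide whether a cached response is too old to serve."""

    def _cache_key(self) -> str:
        # Key is derived from the serialized query, so any difference in
        # the JSON (even null vs. []) yields a different key.
        payload = json.dumps(self.query, sort_keys=True)
        return "cache_" + hashlib.sha256(payload.encode()).hexdigest()

    def run(self, refresh_requested: bool = False) -> Dict[str, Any]:
        key = self._cache_key()
        cached = self.cache.get(key)
        if cached is not None and not refresh_requested and not self._is_stale(cached):
            return {**cached, "is_cached": True}

        fresh = {
            "results": self.calculate(),
            "is_cached": False,
            "last_refresh": datetime.now(timezone.utc),
        }
        self.cache[key] = fresh
        return fresh


class CountingRunner(CachingRunnerSketch):
    """Toy subclass: counts how often the query is actually executed."""

    calls = 0

    def calculate(self) -> Any:
        self.calls += 1
        return [self.calls]

    def _is_stale(self, cached_response: Dict[str, Any]) -> bool:
        return False  # never stale in this toy example


runner = CountingRunner({"kind": "LifecycleQuery"}, cache={})
first = runner.run()                         # computed, is_cached False
second = runner.run()                        # served from cache, is_cached True
third = runner.run(refresh_requested=True)   # recomputed on explicit refresh
```

Keeping `_is_stale` abstract mirrors the idea that each query type decides staleness from its own date range and interval, while the lookup, refresh, and response flags live in one place.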