
feat(hogql): implement basic caching for hogql insight queries #17483

Merged
merged 26 commits into master from hogql-query-cache
Sep 19, 2023

Conversation

thmsobrmlr
Contributor

@thmsobrmlr thmsobrmlr commented Sep 16, 2023

Problem

We're not caching calls to /query.

Changes

This PR implements a basic caching mechanism for insight queries on the /query endpoint.

Compared to the current implementation:

  • Does not handle insight results. At the moment these still respond with a filter-based calculation, since we do not have the query on the backend. The game plan here is to (1) write a backend-side conversion function from filters to a query and (2) make an internal request to the query endpoint with that query to obtain the cached or fresh result (the current implementation behaves differently depending on which API endpoint you call, and I'd like to avoid that).
  • Does not sync the caching state to Postgres (InsightCachingState); TBD whether we still want background refreshes going forward.
  • Adds the team's timezone to the cache key payload, so that results are invalidated when the timezone changes.
  • Is not as robust against cache key changes as it could be. For example, properties: null and properties: [] produce different cache keys. We can improve on that later; the best way would be to move schema generation to the backend so that we can use Pydantic validators (see the sketch right after this list).
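
For illustration of that last point, here's a minimal, hypothetical sketch (not code from this PR) of how a Pydantic v2 validator could normalize the properties field so that null and [] serialize identically and therefore hash to the same cache key:

from typing import Any, Optional

from pydantic import BaseModel, field_validator


class ExampleQuery(BaseModel):
    kind: str = "ExampleQuery"
    properties: Optional[list[Any]] = None

    @field_validator("properties", mode="before")
    @classmethod
    def normalize_properties(cls, value: Optional[list[Any]]) -> list[Any]:
        # Treat a missing property list the same as an empty one.
        return value or []


# Both variants serialize to the same JSON, so a cache key derived from
# model_dump_json() would be identical for them.
assert ExampleQuery(properties=None).model_dump_json() == ExampleQuery(properties=[]).model_dump_json()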

In detail, this PR:

  • Sends a refresh param with api.query for HogQL queries
  • Adds a QueryResponse interface with is_cached and last_refresh attributes
  • Makes QueryRunner an abstract base class, adds methods for caching, and adds tests (a rough sketch of this shape follows this list)
    • The cache key is based on Pydantic v2 model_dump_json
    • Instruments cache reads and writes with Prometheus
    • Makes process_query accept an optional request to determine whether we want to refresh
    • Implements caching behaviour similar to what we have for insights, minus everything that needs the insight context, e.g. a longer TTL for shared insights
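
A rough, hypothetical sketch of that shape, with a plain dict standing in for Redis and the Prometheus counters left out (names are illustrative, not the exact code in this PR):

from abc import ABC, abstractmethod
from datetime import datetime, timezone
from hashlib import md5
from typing import Any, Optional

from pydantic import BaseModel


class QueryResponse(BaseModel):
    results: Any = None
    is_cached: bool = False
    last_refresh: Optional[str] = None


_CACHE: dict[str, QueryResponse] = {}  # stand-in for Redis


class QueryRunner(ABC):
    def __init__(self, query: BaseModel, team_id: int, team_timezone: str):
        self.query = query
        self.team_id = team_id
        self.team_timezone = team_timezone

    @abstractmethod
    def calculate(self) -> QueryResponse:
        """Run the query and build a fresh response."""

    def cache_key(self) -> str:
        # Serialize the query with Pydantic v2 and include the team and its
        # timezone, so changing the timezone invalidates cached results.
        payload = f"query_{self.query.model_dump_json()}_{self.team_id}_{self.team_timezone}"
        return f"cache_{md5(payload.encode('utf-8')).hexdigest()}"

    def run(self, refresh: bool = False) -> QueryResponse:
        key = self.cache_key()
        if not refresh and key in _CACHE:
            return _CACHE[key].model_copy(update={"is_cached": True})
        response = self.calculate()
        response.last_refresh = datetime.now(timezone.utc).isoformat()
        _CACHE[key] = response
        return response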

Todos:

  • Add more tests

How did you test this code?

Added tests

@mariusandra mariusandra mentioned this pull request Sep 18, 2023
Base automatically changed from hogql-extend-schema to master September 18, 2023 13:05
@thmsobrmlr thmsobrmlr marked this pull request as ready for review September 19, 2023 14:41
@thmsobrmlr
Contributor Author

This is ready for a review. I'm going to add some more tests for the caching behaviour later.

@thmsobrmlr thmsobrmlr changed the title from "feat(hogql): add cache_key for hogql queries" to "feat(hogql): implement basic caching for hogql insight queries" on Sep 19, 2023
Collaborator

@mariusandra mariusandra left a comment


This looks great, and the game plan sounds solid to me. However, a thought occurred to me: what if we moved the caching to the AST/HogQL/SQL level? Would that make anything simpler (more predictable), or would that cause problems with future plans such as partial reloading? Currently we'd lose time as the query still needs to be parsed and generated, but assuming those things get taken care of, would there be any point in moving this caching up (or down? 🙃) a layer?
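
Purely as an illustration of the question being raised here (not something this PR does), keying the cache one layer down on the generated ClickHouse SQL could look roughly like this; the function is hypothetical:

from hashlib import sha256
from typing import Any


def sql_cache_key(team_id: int, clickhouse_sql: str, params: dict[str, Any]) -> str:
    # Query nodes that only differ cosmetically (e.g. properties: null vs
    # properties: []) would produce identical SQL and share one key.
    payload = f"{team_id}:{clickhouse_sql}:{sorted(params.items())}"
    return f"sql_cache_{sha256(payload.encode('utf-8')).hexdigest()}"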

@@ -16,15 +20,12 @@

class LifecycleQueryRunner(QueryRunner):
    query: LifecycleQuery
    query_type = LifecycleQuery
Collaborator


nice

Comment on lines +255 to +275
def _is_stale(self, cached_result_package):
date_to = self.query_date_range.date_to()
interval = self.query_date_range.interval_name
return is_stale(self.team, date_to, interval, cached_result_package)

def _refresh_frequency(self):
date_to = self.query_date_range.date_to()
date_from = self.query_date_range.date_from()
interval = self.query_date_range.interval_name

delta_days: Optional[int] = None
if date_from and date_to:
delta = date_to - date_from
delta_days = ceil(delta.total_seconds() / timedelta(days=1).total_seconds())

refresh_frequency = BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL
if interval == "hour" or (delta_days is not None and delta_days <= 7):
# The interval is shorter for short-term insights
refresh_frequency = REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL

return refresh_frequency
Collaborator


This feels like something we could standardise across all queries in query_runner.py?
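
As a hypothetical sketch (not part of this PR), a shared helper in query_runner.py could look something like this, with illustrative constant values:

from datetime import datetime, timedelta
from math import ceil
from typing import Optional

# Values are illustrative only.
BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL = timedelta(minutes=15)
REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL = timedelta(minutes=3)


def refresh_frequency_for_range(
    date_from: Optional[datetime], date_to: Optional[datetime], interval: Optional[str]
) -> timedelta:
    delta_days: Optional[int] = None
    if date_from and date_to:
        delta = date_to - date_from
        delta_days = ceil(delta.total_seconds() / timedelta(days=1).total_seconds())

    if interval == "hour" or (delta_days is not None and delta_days <= 7):
        # Short-range or hourly insights can be refreshed more often.
        return REDUCED_MINIMUM_INSIGHT_REFRESH_INTERVAL
    return BASE_MINIMUM_INSIGHT_REFRESH_INTERVAL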

Contributor Author


Yep. There are a few special cases, e.g. RetentionFilter uses period instead of interval, so I thought it's best to look at each query individually first and then unify the handling. Easy to forget this place otherwise.

Also, should we cache other HogQL queries e.g. those based on a date range?

def is_stale_filter(
team: Team, filter: Filter | RetentionFilter | StickinessFilter | PathFilter, cached_result: Any
) -> bool:
interval = filter.period.lower() if isinstance(filter, RetentionFilter) else filter.interval
Collaborator


😬

@thmsobrmlr
Contributor Author

This looks great, and the game plan sounds solid to me. However, a thought occurred to me: what if we moved the caching to the AST/HogQL/SQL level?

You mean cache the ClickHouse response instead of the serialized output? I've thought about it before as a way to make persons modal responses more reliable (by caching the event query base - doesn't work, cache gets blown up). We could cache the final query output, but I don't see where that would improve things (I imagine we almost have a 1-to-1 mapping of query node to ClickHouse query). Wouldn't be difficult if there's a good reason to do it though.

@thmsobrmlr thmsobrmlr merged commit c61a9d0 into master Sep 19, 2023
81 checks passed
@thmsobrmlr thmsobrmlr deleted the hogql-query-cache branch September 19, 2023 18:59
@mariusandra
Collaborator

I imagine we almost have a 1-to-1 mapping of query node to ClickHouse query

Yup, I imagine this as well. The case where it's not 1:1 is when different query nodes, e.g. one with properties: null and one with properties: [], both map to the same ClickHouse query. Caching at the query level could help unify these. 🤷
