feat(batch-exports): Use events_recent for more exports #27471
base: master
Conversation
Force-pushed from ddb00e6 to 7e7e57b
Title changed from "Use events_recent for all exports" to "Use events_recent for more exports"
Hey @rossgray! 👋
Force-pushed from 33aa02d to 1f84853
Force-pushed from d1640e4 to dc78296
posthog/batch_exports/sql.py (Outdated)

@@ -289,6 +289,37 @@
)
"""

CREATE_EVENTS_BATCH_EXPORT_VIEW_RECENT_DISTRIBUTED = f"""
Like we discussed, this could be a good chance to start not doing these views anymore.
ok, will try to see if I can remove it
ok, I've replaced it with a single query now :)
@@ -449,6 +476,7 @@ def __init__(self, task: str):
        super().__init__(f"Expected task '{task}' to be done by now")


# TODO - not sure this is being used anymore?
Nope, should be fine to remove.
Cool, have removed it. I think there's quite a lot of other code that can be removed from this file too, but I'll save that for a future PR.
# Data can sometimes take a while to settle, so for 5 min batch exports we wait several seconds just to be safe.
# For all other batch exports we wait for 1 minute, since we're querying the events_recent table using a
# distributed table and setting `max_replica_delay_for_distributed_queries` to 1 minute.
I wonder if we couldn't just use `max_replica_delay_for_distributed_queries=60` seconds for everyone.
Well, once we have rolled this out to 100% of teams, there should only be 3 different queries:

- `SELECT_FROM_EVENTS_VIEW_RECENT` for 5 min batch exports
- `SELECT_FROM_EVENTS_VIEW_BACKFILL` for backfills
- `SELECT_FROM_DISTRIBUTED_EVENTS_RECENT` for all other batch exports

i.e. we're going to be removing the following two (at least I think this is the purpose of these changes):

- `SELECT_FROM_EVENTS_VIEW_UNBOUNDED`
- `SELECT_FROM_EVENTS_VIEW`

`SELECT_FROM_EVENTS_VIEW_RECENT` already uses `max_replica_delay_for_distributed_queries=1`. It doesn't use `fallback_to_stale_replicas_for_distributed_queries=0` though, which I found I needed when testing: the default value is `1`, which means the query can fall back to a stale replica. However, since we're always hitting a single node, I'm not sure this matters?

We could add it to `SELECT_FROM_EVENTS_VIEW_BACKFILL`, though I don't think it matters too much for backfills, unless it's backfilling right up to the present time. But then again, it uses `timestamp` rather than `inserted_at` when querying anyway, so I'm not sure it will make much difference? 🤷
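For reference, a minimal sketch of how these two settings can be attached to the distributed query via ClickHouse's `SETTINGS` clause; the column list, filters, and parameter names here are illustrative, not the actual constant from posthog/batch_exports/sql.py:

```python
# Illustrative sketch only: the real query in posthog/batch_exports/sql.py
# selects specific columns and applies more filters than shown here.
SELECT_FROM_DISTRIBUTED_EVENTS_RECENT = """
SELECT *
FROM distributed_events_recent
WHERE team_id = {team_id:Int64}
  AND inserted_at >= {interval_start:DateTime64}
  AND inserted_at < {interval_end:DateTime64}
SETTINGS
    max_replica_delay_for_distributed_queries = 60,
    fallback_to_stale_replicas_for_distributed_queries = 0
"""
```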
> I haven't been able to test what happens if no replica is found. I believe we would raise a `ClickHouseError` like we do for other such errors, such as when we get a `TOO_MANY_SIMULTANEOUS_QUERIES` error back from ClickHouse. If this is the case, we should retry the Temporal activity.
I think it's critical that we understand this failure mode before merging this PR. It doesn't necessarily mean we need a unit test to reproduce the error, but we should confirm exactly what would happen by querying production ourselves.
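If the failure does surface as a `ClickHouseError`, Temporal's retry policy on the activity invocation is the natural place to handle it. A minimal sketch, assuming a placeholder workflow and activity name rather than the actual PostHog ones:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy


@workflow.defn
class ExportEventsWorkflow:  # illustrative name, not the actual PostHog workflow
    @workflow.run
    async def run(self, inputs) -> None:
        # If ClickHouse errors (e.g. no replica within the allowed lag, or
        # TOO_MANY_SIMULTANEOUS_QUERIES), the activity fails and Temporal
        # retries it according to this policy.
        await workflow.execute_activity(
            "insert_into_destination_activity",  # placeholder activity name
            inputs,
            start_to_close_timeout=timedelta(hours=1),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=10),
                maximum_interval=timedelta(minutes=2),
                maximum_attempts=5,
            ),
        )
```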
end_at = full_range[1]
await wait_for_delta_past_data_interval_end(end_at, delta)

async with get_client(team_id=team_id, clickhouse_url=clickhouse_url) as client:
I like this move, and there's more configuration stuff we can shuffle around too. Good work.
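To make the shape of the refactor concrete, here's a rough sketch of the SPMC `Producer` owning client creation instead of each activity; the class body and method names are illustrative, not the actual PostHog interface:

```python
# Illustrative sketch: the Producer owns the ClickHouse client lifecycle, so
# activities construct a Producer instead of opening their own connection.
class Producer:
    def __init__(self, clickhouse_url: str | None = None):
        self.clickhouse_url = clickhouse_url

    async def start(self, queue, team_id: int, **query_parameters):
        # `get_client` is the helper seen in the diff above.
        async with get_client(team_id=team_id, clickhouse_url=self.clickhouse_url) as client:
            ...  # stream query results into the queue for consumers
```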
Force-pushed from 860e0e2 to 5082902
Problem

We currently use `events_recent` for 5 min batch exports but `events` everywhere else. Since moving to `events_recent` we haven't seen any missed events in normal operation, so it makes sense to use it everywhere.

Changes
After discussing internally, the overall approach here is:

- For 5 min batch exports, keep querying `events_recent` directly.
- For all other batch exports, query the `distributed_events_recent` table, which sits in front of the `events_recent` table.
- Set `max_replica_delay_for_distributed_queries=60` and `fallback_to_stale_replicas_for_distributed_queries=0` to ensure we only query a node with at most 60s of replication lag; otherwise we fail.

I haven't been able to test what happens if no replica is found. I believe we would raise a `ClickHouseError` like we do for other such errors, such as when we get a `TOO_MANY_SIMULTANEOUS_QUERIES` error back from ClickHouse. If this is the case, we should retry the Temporal activity.

In order to roll this out gradually, I have added a `BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT` setting. This defaults to 0, so by default it won't be used by any batch exports. We can then increase it in 0.1 increments to roll it out to more teams. This means the current behaviour should remain unchanged as soon as this PR is merged.
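A minimal sketch of how a fractional rollout setting like this can deterministically gate teams (the helper and hashing scheme are hypothetical, not necessarily the selection logic in this PR):

```python
import hashlib

from django.conf import settings


# Hypothetical helper: the actual gating logic in this PR may differ.
def use_distributed_events_recent(team_id: int) -> bool:
    rollout = settings.BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT  # 0.0 to 1.0
    # Hash the team id into a stable bucket so the same teams stay opted in
    # as the rollout fraction increases.
    bucket = int(hashlib.md5(str(team_id).encode()).hexdigest(), 16) % 100
    return bucket < rollout * 100
```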
Also, the way we instantiate the ClickHouse client in batch exports has been refactored (moved from individual activities into the SPMC `Producer`) to make the code simpler.

I decided to create a view for querying `distributed_events_recent`, just to be consistent with other queries and to prevent any bugs from trying to remove the need for a view. Perhaps it's worth moving the creation of the view into a separate PR?

Does this work well for both Cloud and self-hosted?
Should do.
How did you test this code?
Using our automated tests