feat(batch-exports): Use events_recent for more exports #27471

Open · rossgray wants to merge 12 commits into master from use-events-recent-for-all-be

Conversation
Conversation

@rossgray (Contributor) commented on Jan 13, 2025

Problem

We currently use events_recent for 5 min batch exports but events everywhere else. Since moving to events_recent we haven't seen any missed events in normal operation, so it makes sense to use it everywhere.

Changes

After discussing internally, the overall approach here is:

  • For 5 min batch exports:
    • Continue to query events_recent directly.
    • Connect directly to the node where events are inserted, to ensure replication lag is zero.
  • For all other batch exports, use the distributed_events_recent table, which sits in front of the events_recent table (see the query sketch below).
    • These queries could hit a node with replication lag, so we use max_replica_delay_for_distributed_queries=60 and fallback_to_stale_replicas_for_distributed_queries=0 to ensure we only query a replica with at most 60s of lag; otherwise the query fails.
    • At the start of the Temporal workflow we add a conditional wait so that we don't query the table until at least 60s has passed since the interval end.
    • I haven't been able to test what happens if no replica is found. I believe we would raise a ClickHouseError, as we do for other errors such as TOO_MANY_SIMULTANEOUS_QUERIES coming back from ClickHouse. If that is the case, we should retry the Temporal activity.
In order to roll this out gradually, I have added a BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT setting. This defaults to 0, so by default it won't be used by any batch exports. We can then increase it in 0.1 increments to roll it out to more teams (a sketch of how this gating could work is shown below). This means the current behaviour should remain unchanged when this PR is merged.
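
For illustration, this is roughly how a 0-to-1 rollout fraction is typically applied per team; the helper name and hashing scheme below are assumptions for the sketch, not necessarily what this PR implements:

```python
import hashlib

from django.conf import settings  # where BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT would live


def use_distributed_events_recent(team_id: int) -> bool:
    """Hypothetical helper: decide whether a team is included in the rollout.

    Hashing the team id gives a stable bucket in [0, 1), so raising the setting
    from 0.0 to 1.0 in 0.1 increments progressively includes more teams without
    the decision flapping between workflow runs.
    """
    rollout = settings.BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT
    if rollout <= 0:
        return False
    bucket = int(hashlib.sha256(str(team_id).encode()).hexdigest(), 16) % 100 / 100
    return bucket < rollout
```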

Also, the way we instantiate the ClickHouse client in batch exports has been refactored (moved from the individual activities into the SPMC Producer) to simplify the code; see the sketch below.
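
Roughly what that looks like from the Producer's point of view (a sketch with simplified names; get_client is the existing helper, while the import path and streaming call are assumptions):

```python
import asyncio
from typing import Any

from posthog.temporal.common.clickhouse import get_client  # assumed import path


class Producer:
    """Sketch: the SPMC producer owns the ClickHouse client for the whole export."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue[Any] = asyncio.Queue()

    async def start(self, team_id: int, clickhouse_url: str | None = None) -> None:
        # Previously each activity created its own client; now the producer opens
        # one client for the duration of the export and consumers only read from
        # the queue.
        async with get_client(team_id=team_id, clickhouse_url=clickhouse_url) as client:
            async for record_batch in client.stream_query(...):  # hypothetical streaming call
                await self.queue.put(record_batch)
```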

I decided to create a view for querying distributed_events_recent, both to be consistent with the other queries and to avoid introducing bugs by trying to remove the need for a view at the same time.

Perhaps it's worth moving the creation of the view into a separate PR?

Does this work well for both Cloud and self-hosted?

Should do.

How did you test this code?

Using our automated tests.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from ddb00e6 to 7e7e57b Compare January 14, 2025 13:37
@rossgray rossgray changed the title feat(batch-exports): Optionally use events_recent for all exports feat(batch-exports): Use events_recent for more exports Jan 14, 2025
@rossgray rossgray marked this pull request as ready for review January 14, 2025 15:26
@posthog-bot (Contributor):

Hey @rossgray! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from 33aa02d to 1f84853 Compare January 16, 2025 14:06
@rossgray rossgray requested a review from a team as a code owner January 16, 2025 14:06
@rossgray rossgray requested a review from tomasfarias January 16, 2025 14:39
@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from d1640e4 to dc78296 Compare January 16, 2025 16:49
@@ -289,6 +289,37 @@
)
"""

CREATE_EVENTS_BATCH_EXPORT_VIEW_RECENT_DISTRIBUTED = f"""
Contributor (reviewer):

Like we discussed, this could be a good chance to start not doing these views anymore.

Contributor Author (@rossgray):

ok, will try to see if I can remove it

Contributor Author (@rossgray):

ok, I've replaced it with a single query now :)

@@ -449,6 +476,7 @@ def __init__(self, task: str):
super().__init__(f"Expected task '{task}' to be done by now")


# TODO - not sure this is being used anymore?
Contributor (reviewer):

Nope, should be fine to remove.

Contributor Author (@rossgray):

cool, have removed it. I think there's quite a lot of other code that can be removed from this file too, but I'll save that for a future PR.

Comment on lines +672 to +674
# Data can sometimes take a while to settle, so for 5 min batch exports we wait several seconds just to be safe.
# For all other batch exports we wait for 1 minute since we're querying the events_recent table using a
# distributed table and setting `max_replica_delay_for_distributed_queries` to 1 minute
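
For context, a minimal sketch of the conditional wait described in the comment above; the real wait_for_delta_past_data_interval_end helper in this PR may differ in its details:

```python
import asyncio
import datetime as dt


async def wait_for_delta_past_data_interval_end(
    data_interval_end: dt.datetime, delta: dt.timedelta
) -> None:
    """Sketch: sleep until at least `delta` has passed since the interval end.

    For 5 min batch exports `delta` is a few seconds; for exports going through
    distributed_events_recent it is 60s, matching
    max_replica_delay_for_distributed_queries.
    """
    target = data_interval_end + delta
    now = dt.datetime.now(dt.timezone.utc)
    if now < target:
        await asyncio.sleep((target - now).total_seconds())
```
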
Contributor (reviewer):

I wonder if we couldn't just use max_replica_delay_for_distributed_queries=60 seconds for everyone.

Contributor Author (@rossgray):

well once we have rolled this out to 100% of teams, there should only be 3 different queries:

  • SELECT_FROM_EVENTS_VIEW_RECENT for 5 min batch exports
  • SELECT_FROM_EVENTS_VIEW_BACKFILL for backfills
  • SELECT_FROM_DISTRIBUTED_EVENTS_RECENT for all other batch exports

i.e. we're going to be removing the following two (at least I think this is the purpose of these changes)

  • SELECT_FROM_EVENTS_VIEW_UNBOUNDED
  • SELECT_FROM_EVENTS_VIEW

SELECT_FROM_EVENTS_VIEW_RECENT already uses max_replica_delay_for_distributed_queries=1. It doesn't use fallback_to_stale_replicas_for_distributed_queries=0 though (which I found I needed when testing, otherwise the default value of 1 means it can fall back to a stale replica). However, since we're always hitting a single node, I'm not sure this matters.

We could add it to SELECT_FROM_EVENTS_VIEW_BACKFILL, though I don't think it matters too much for backfills unless we're backfilling right up to the present time. Then again, that query filters on timestamp rather than inserted_at anyway, so I'm not sure it would make much difference. 🤷

@tomasfarias (Contributor) left a comment:

I haven't been able to test what happens if no replica is found. I believe we would raise a ClickHouseError like we do for other such errors such as when we get a TOO_MANY_SIMULTANEOUS_QUERIES error back from ClickHouse. If this is the case, we should retry the Temporal activity.

I think it's critical that we understand this failure mode before merging this PR. It doesn't necessarily mean we need a unit test to reproduce the error, but we should confirm exactly what would happen by querying production ourselves.
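
If it does surface as a ClickHouseError, one way the retry could be expressed is through the Temporal retry policy on the activity call. This is only a sketch of the general mechanism with placeholder names, not what the PR currently does:

```python
import datetime as dt

from temporalio import workflow
from temporalio.common import RetryPolicy

# "insert_into_destination_activity" and `activity_inputs` are placeholders for
# the real activity and its inputs; only the retry policy shape matters here.


@workflow.defn
class BatchExportWorkflowSketch:
    @workflow.run
    async def run(self, activity_inputs: dict) -> None:
        await workflow.execute_activity(
            "insert_into_destination_activity",  # placeholder activity name
            activity_inputs,
            start_to_close_timeout=dt.timedelta(hours=1),
            retry_policy=RetryPolicy(
                initial_interval=dt.timedelta(seconds=10),
                maximum_attempts=0,  # 0 means unlimited attempts in Temporal
                # Keep ClickHouseError retryable (i.e. not listed here) so a
                # "no fresh replica within 60s" failure is retried.
                non_retryable_error_types=["NonRetryableClickHouseError"],
            ),
        )
```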

end_at = full_range[1]
await wait_for_delta_past_data_interval_end(end_at, delta)

async with get_client(team_id=team_id, clickhouse_url=clickhouse_url) as client:
Contributor (reviewer):

I like this move; there's more configuration stuff we could shuffle around as well. Good work.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from 860e0e2 to 5082902 Compare January 17, 2025 12:52
@rossgray rossgray requested a review from tomasfarias January 17, 2025 12:55