feat(batch-exports): Use events_recent for more exports #27471

Open · rossgray wants to merge 12 commits into master from use-events-recent-for-all-be

Conversation
Conversation

@rossgray (Contributor) commented on Jan 13, 2025

Problem

We currently use events_recent for 5 min batch exports but events everywhere else. Since moving to events_recent we haven't seen any missed events in normal operation, so it makes sense to use it everywhere.

Changes

After discussing internally, the overall approach here is:

  • For 5 min batch exports:
    • Continue to query events_recent directly.
    • Connect directly to the node where events are inserted, to ensure replication lag is zero.
  • For all other batch exports, use the distributed_events_recent table, which sits in front of the events_recent table (see the query sketch below).
    • These queries could hit a node with replication lag, so we use max_replica_delay_for_distributed_queries=60 and fallback_to_stale_replicas_for_distributed_queries=0 to ensure we only query a replica with at most 60s of lag; otherwise the query fails.
    • At the start of the Temporal workflow we add a conditional wait so that we don't query the table until at least 60s has passed since the interval end.
    • I haven't been able to test what happens if no replica is found. I believe we would raise a ClickHouseError, as we do for other errors such as TOO_MANY_SIMULTANEOUS_QUERIES coming back from ClickHouse. If that is the case, we should retry the Temporal activity.
In order to roll this out gradually, I have added a BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT setting. This defaults to 0, so by default it won't be used by any batch exports. We can then increase it in 0.1 increments to roll it out to more teams (a sketch of how this gating could work is shown below). This means the current behaviour should remain unchanged when this PR is merged.
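
For illustration, this is roughly how a 0-to-1 rollout fraction is typically applied per team; the helper name and hashing scheme below are assumptions for the sketch, not necessarily what this PR implements:

```python
import hashlib

from django.conf import settings  # where BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT would live


def use_distributed_events_recent(team_id: int) -> bool:
    """Hypothetical helper: decide whether a team is included in the rollout.

    Hashing the team id gives a stable bucket in [0, 1), so raising the setting
    from 0.0 to 1.0 in 0.1 increments progressively includes more teams without
    the decision flapping between workflow runs.
    """
    rollout = settings.BATCH_EXPORT_DISTRIBUTED_EVENTS_RECENT_ROLLOUT
    if rollout <= 0:
        return False
    bucket = int(hashlib.sha256(str(team_id).encode()).hexdigest(), 16) % 100 / 100
    return bucket < rollout
```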

Also, the way we instantiate the ClickHouse client in batch exports has been refactored (moved from the individual activities into the SPMC Producer) to simplify the code; see the sketch below.
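
Roughly what that looks like from the Producer's point of view (a sketch with simplified names; get_client is the existing helper, while the import path and streaming call are assumptions):

```python
import asyncio
from typing import Any

from posthog.temporal.common.clickhouse import get_client  # assumed import path


class Producer:
    """Sketch: the SPMC producer owns the ClickHouse client for the whole export."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue[Any] = asyncio.Queue()

    async def start(self, team_id: int, clickhouse_url: str | None = None) -> None:
        # Previously each activity created its own client; now the producer opens
        # one client for the duration of the export and consumers only read from
        # the queue.
        async with get_client(team_id=team_id, clickhouse_url=clickhouse_url) as client:
            async for record_batch in client.stream_query(...):  # hypothetical streaming call
                await self.queue.put(record_batch)
```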

I decided to create a view for querying distributed_events_recent, both to be consistent with the other queries and to avoid introducing bugs by trying to remove the need for a view at the same time.

Perhaps it's worth moving the creation of the view into a separate PR?

Does this work well for both Cloud and self-hosted?

Should do.

How did you test this code?

Using our automated tests.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from ddb00e6 to 7e7e57b Compare January 14, 2025 13:37
@rossgray rossgray changed the title feat(batch-exports): Optionally use events_recent for all exports feat(batch-exports): Use events_recent for more exports Jan 14, 2025
@rossgray rossgray marked this pull request as ready for review January 14, 2025 15:26
@posthog-bot (Contributor):

Hey @rossgray! 👋
This pull request seems to contain no description. Please add useful context, rationale, and/or any other information that will help make sense of this change now and in the distant Mars-based future.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from 33aa02d to 1f84853 Compare January 16, 2025 14:06
@rossgray rossgray requested a review from a team as a code owner January 16, 2025 14:06
@rossgray rossgray requested a review from tomasfarias January 16, 2025 14:39
@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from d1640e4 to dc78296 Compare January 16, 2025 16:49
@@ -289,6 +289,37 @@
)
"""

CREATE_EVENTS_BATCH_EXPORT_VIEW_RECENT_DISTRIBUTED = f"""
Contributor (reviewer):

Like we discussed, this could be a good chance to start not doing these views anymore.

Contributor Author (@rossgray):

ok, will try to see if I can remove it

Contributor Author (@rossgray):

ok, I've replaced it with a single query now :)

@@ -449,6 +476,7 @@ def __init__(self, task: str):
super().__init__(f"Expected task '{task}' to be done by now")


# TODO - not sure this is being used anymore?
Contributor (reviewer):

Nope, should be fine to remove.

Contributor Author (@rossgray):

cool, have removed it. I think there's quite a lot of other code that can be removed from this file too, but I'll save that for a future PR.

Comment on lines +672 to +674
# Data can sometimes take a while to settle, so for 5 min batch exports we wait several seconds just to be safe.
# For all other batch exports we wait for 1 minute since we're querying the events_recent table using a
# distributed table and setting `max_replica_delay_for_distributed_queries` to 1 minute
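
For context, a minimal sketch of the conditional wait described in the comment above; the real wait_for_delta_past_data_interval_end helper in this PR may differ in its details:

```python
import asyncio
import datetime as dt


async def wait_for_delta_past_data_interval_end(
    data_interval_end: dt.datetime, delta: dt.timedelta
) -> None:
    """Sketch: sleep until at least `delta` has passed since the interval end.

    For 5 min batch exports `delta` is a few seconds; for exports going through
    distributed_events_recent it is 60s, matching
    max_replica_delay_for_distributed_queries.
    """
    target = data_interval_end + delta
    now = dt.datetime.now(dt.timezone.utc)
    if now < target:
        await asyncio.sleep((target - now).total_seconds())
```
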
Contributor (reviewer):

I wonder if we couldn't just use max_replica_delay_for_distributed_queries=60 seconds for everyone.

Contributor Author (@rossgray):

well once we have rolled this out to 100% of teams, there should only be 3 different queries:

  • SELECT_FROM_EVENTS_VIEW_RECENT for 5 min batch exports
  • SELECT_FROM_EVENTS_VIEW_BACKFILL for backfills
  • SELECT_FROM_DISTRIBUTED_EVENTS_RECENT for all other batch exports

i.e. we're going to be removing the following two (at least I think this is the purpose of these changes)

  • SELECT_FROM_EVENTS_VIEW_UNBOUNDED
  • SELECT_FROM_EVENTS_VIEW

SELECT_FROM_EVENTS_VIEW_RECENT already uses max_replica_delay_for_distributed_queries=1. It doesn't use fallback_to_stale_replicas_for_distributed_queries=0 though (which I found I needed when testing, otherwise the default value of 1 means it can fall back to a stale replica). However, since we're always hitting a single node, I'm not sure this matters.

We could add it to SELECT_FROM_EVENTS_VIEW_BACKFILL, though I don't think it matters too much for backfills unless we're backfilling right up to the present time. Then again, that query filters on timestamp rather than inserted_at anyway, so I'm not sure it would make much difference. 🤷

@tomasfarias (Contributor) left a comment:

I haven't been able to test what happens if no replica is found. I believe we would raise a ClickHouseError like we do for other such errors such as when we get a TOO_MANY_SIMULTANEOUS_QUERIES error back from ClickHouse. If this is the case, we should retry the Temporal activity.

I think it's critical that we understand this failure mode before merging this PR. It doesn't necessarily mean we need a unit test to reproduce the error, but we should confirm exactly what would happen by querying production ourselves.
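
If it does surface as a ClickHouseError, one way the retry could be expressed is through the Temporal retry policy on the activity call. This is only a sketch of the general mechanism with placeholder names, not what the PR currently does:

```python
import datetime as dt

from temporalio import workflow
from temporalio.common import RetryPolicy

# "insert_into_destination_activity" and `activity_inputs` are placeholders for
# the real activity and its inputs; only the retry policy shape matters here.


@workflow.defn
class BatchExportWorkflowSketch:
    @workflow.run
    async def run(self, activity_inputs: dict) -> None:
        await workflow.execute_activity(
            "insert_into_destination_activity",  # placeholder activity name
            activity_inputs,
            start_to_close_timeout=dt.timedelta(hours=1),
            retry_policy=RetryPolicy(
                initial_interval=dt.timedelta(seconds=10),
                maximum_attempts=0,  # 0 means unlimited attempts in Temporal
                # Keep ClickHouseError retryable (i.e. not listed here) so a
                # "no fresh replica within 60s" failure is retried.
                non_retryable_error_types=["NonRetryableClickHouseError"],
            ),
        )
```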

end_at = full_range[1]
await wait_for_delta_past_data_interval_end(end_at, delta)

async with get_client(team_id=team_id, clickhouse_url=clickhouse_url) as client:
Contributor (reviewer):

I like this move; there's more configuration stuff we could shuffle around as well. Good work.

@rossgray rossgray force-pushed the use-events-recent-for-all-be branch from 860e0e2 to 5082902 Compare January 17, 2025 12:52
@rossgray rossgray requested a review from tomasfarias January 17, 2025 12:55