
feat: Add a debug table for clickhouse_events_json #23377

Merged · 23 commits merged into master from debug_events on Jul 2, 2024
Conversation

fuziontech (Member)

Problem

We currently don't have any metrics to compare the number of events coming into Kafka with the number of events at rest in the sharded_events tables.

Changes

This adds a Kafka debug table that:

  1. Does not deduplicate (important for keeping tabs on duplicates and where they are coming from: the topic or the CH Kafka consumer)
  2. Stores the raw event payload as a string. As long as we are JSON encoding events, we run the risk of having malformed JSON payloads. This lets us introspect and replay straight from CH if needed. (A sketch of the overall pattern follows below.)
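
For orientation, the sketch below shows the usual ClickHouse shape this implies: a Kafka engine table that reads each message into a single String column, a plain MergeTree sink, and a materialized view copying rows (plus the Kafka virtual columns) across. This is illustrative only, not the PR's generated DDL: the broker address, table names, and the JSONAsString format are assumptions (the PR's actual engine definition, quoted in the review threads below, uses JSONEachRow).

```sql
-- Illustrative sketch; names and the JSONAsString format are assumptions,
-- not the DDL generated by posthog/models/kafka_debug/sql.py.
CREATE TABLE kafka_events_json_debug
(
    payload String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',            -- placeholder broker
         kafka_topic_list = 'clickhouse_events_json',
         kafka_group_name = 'clickhouse_events_json_debug',
         kafka_format = 'JSONAsString';               -- whole message lands in `payload`

-- Sink table: plain MergeTree, so nothing is deduplicated.
CREATE TABLE events_json_debug
(
    payload    String,
    _timestamp DateTime,
    _partition UInt64,
    _offset    UInt64
)
ENGINE = MergeTree
ORDER BY (_partition, _offset);

-- Materialized view forwards rows and the Kafka engine's virtual columns.
CREATE MATERIALIZED VIEW events_json_debug_mv TO events_json_debug AS
SELECT payload, _timestamp, _partition, _offset
FROM kafka_events_json_debug;
```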

Does this work well for both Cloud and self-hosted?

How did you test this code?

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

github-actions bot (Contributor) commented Jul 2, 2024

Size Change: 0 B

Total Size: 1.06 MB

frontend/dist/toolbar.js: 1.06 MB (unchanged)

compressed-size-action

@posthog-bot (Contributor)

📸 UI snapshots have been updated

3 snapshot changes in total. 0 added, 3 modified, 0 deleted.

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

4 snapshot changes in total. 0 added, 4 modified, 0 deleted.

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@Daesgar (Contributor) left a comment


The kafka_handle_error_mode is just a suggestion.

I'd just review the ver field, which I don't think is needed in this case!

Great having this debug table available from now on 🙏 🚀

payload String
)
ENGINE={kafka_engine(kafka_host=",".join(self.brokers), topic=self.topic, group=self.consumer_group)}
SETTINGS input_format_values_interpret_expressions=0, kafka_skip_broken_messages = 100
Daesgar (Contributor):

Do you think we can enable kafka_handle_error_mode='stream' to store messages that fail to parse in the _error and _raw_message columns? That way we don't need to skip messages.

fuziontech (Member, Author):

I don't think we should ever skip messages here since we are just grabbing the payload as a string (no JSON deserialization). I'll remove the kafka_skip_broken_messages setting.
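
For reference, here is roughly what the suggested setting looks like on the Kafka engine table. With kafka_handle_error_mode = 'stream', ClickHouse exposes _raw_message and _error virtual columns for rows that fail to parse, instead of skipping or blocking on them. The broker, topic, and group values below are placeholders, not the PR's generated settings.

```sql
-- Sketch only; connection values are placeholders.
CREATE TABLE kafka_events_json_debug
(
    payload String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickhouse_events_json',
         kafka_group_name = 'clickhouse_events_json_debug',
         kafka_format = 'JSONEachRow',
         kafka_handle_error_mode = 'stream';  -- bad messages surface via _error / _raw_message
```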

return f"{self.topic}_debug"

def get_create_table_sql(self) -> str:
engine = MergeTreeEngine(self.table_name, ver="timestamp")
Daesgar (Contributor):

I think we don't need the ver field if we go for a MergeTree table.

Aren't we replicating this table data to all the nodes? Or do you want to keep just the consumed data inside every instance?

fuziontech (Member, Author):

oh lol. This is what I get for doing copilot too fast
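
For context, ver is a ReplacingMergeTree parameter: on merges, rows sharing a sorting key are collapsed and the row with the highest ver value survives. A plain MergeTree takes no such argument and keeps every row, which is exactly what a duplicate-counting debug table wants. A rough illustration (table and column names are placeholders):

```sql
-- Plain MergeTree: no ver argument, every row is kept.
CREATE TABLE events_json_debug
(
    payload   String,
    timestamp DateTime
)
ENGINE = MergeTree
ORDER BY timestamp;

-- ReplacingMergeTree is where ver belongs: duplicates by sorting key are
-- collapsed on merge, keeping the row with the highest `timestamp`.
CREATE TABLE events_json_dedup
(
    payload   String,
    timestamp DateTime
)
ENGINE = ReplacingMergeTree(timestamp)
ORDER BY payload;
```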

_timestamp DateTime,
_timestamp_ms Nullable(DateTime64(3)),
_partition UInt64,
_offset UInt64
Daesgar (Contributor):

Depending on whether we end up using kafka_handle_error_mode='stream', we would need two additional columns here for _error and _raw_message.

fuziontech (Member, Author):

That shouldn't be required here since we are just grabbing the payload, right? It's not actually going to try to deserialize the JSON.

Daesgar (Contributor):

I'm not sure, to be honest, since we are defining the Kafka engine with the JSONEachRow format. I don't know whether ClickHouse will try to validate that every row is valid JSON before ingesting it.

fuziontech (Member, Author):

I'm pretty sure it is safe because I've used it to debug bad payloads before. But let's just enable streaming for fun!

Daesgar (Contributor):

Oh ok! 💯 In that case, no problem; I just wasn't completely sure.
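
Since the thread above settles on enabling streaming error mode, the sink table and materialized view would carry the two extra virtual columns. A hedged sketch (the column names are the ones ClickHouse provides for stream error mode; table names are placeholders, not the PR's generated DDL):

```sql
-- Sink table with parse-diagnostic columns added.
CREATE TABLE events_json_debug
(
    payload       String,
    _timestamp    DateTime,
    _timestamp_ms Nullable(DateTime64(3)),
    _partition    UInt64,
    _offset       UInt64,
    _error        String,
    _raw_message  String
)
ENGINE = MergeTree
ORDER BY (_partition, _offset);

-- The materialized view forwards the Kafka engine's virtual columns as well.
CREATE MATERIALIZED VIEW events_json_debug_mv TO events_json_debug AS
SELECT payload, _timestamp, _timestamp_ms, _partition, _offset, _error, _raw_message
FROM kafka_events_json_debug;
```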

posthog/models/kafka_debug/sql.py (outdated review thread, resolved)
* master:
  chore(data-warehouse): make sure exception is passed through at workflow step (#23409)
  feat: add a launch compound for posthog with local billing (#23410)
  chore: add backfill_personless_distinct_ids command (#23404)
  fix: add missing billing tests (#23408)
  Schema-Enforcer plugin not global (#23412)
  chore: Enable person batch exports only on supported destinations (#23354)
  fix: allow entering a custom value while property values load (#23405)
  perf: Materialize elements_chain (#23170)
  fix(experiments): provide `required_scope` for experiments API (#23385)
  feat(survey): Allow events to repeatedly activate surveys (#23238)
  chore: maybe this will stop them flapping (#23401)
  chore(data-warehouse): Added number formatting for source settings (#23221)
  fix(multi project flags): remove flag id from URL when switching projects (#23394)
@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@fuziontech fuziontech merged commit 3813df9 into master Jul 2, 2024
85 checks passed
@fuziontech fuziontech deleted the debug_events branch July 2, 2024 21:49