
feat: Add a debug table for clickhouse_events_json #23377

Merged · 23 commits merged into master from debug_events on Jul 2, 2024
Conversation

fuziontech (Member)

Problem

We currently don't have any metrics to compare the number of events coming into Kafka with the number of events at rest in the sharded_events tables.

Changes

This adds a Kafka debug table that:

  1. Does not deduplicate (important for keeping tabs on duplicates and where they are coming from: the topic or the CH Kafka consumer)
  2. Stores the raw event payload as a string. As long as we are JSON encoding events, we run the risk of having malformed JSON payloads. This lets us introspect and replay straight from CH if needed. (A sketch of the overall pattern follows below.)
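
For orientation, the sketch below shows the usual ClickHouse shape this implies: a Kafka engine table that reads each message into a single String column, a plain MergeTree sink, and a materialized view copying rows (plus the Kafka virtual columns) across. This is illustrative only, not the PR's generated DDL: the broker address, table names, and the JSONAsString format are assumptions (the PR's actual engine definition, quoted in the review threads below, uses JSONEachRow).

```sql
-- Illustrative sketch; names and the JSONAsString format are assumptions,
-- not the DDL generated by posthog/models/kafka_debug/sql.py.
CREATE TABLE kafka_events_json_debug
(
    payload String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',            -- placeholder broker
         kafka_topic_list = 'clickhouse_events_json',
         kafka_group_name = 'clickhouse_events_json_debug',
         kafka_format = 'JSONAsString';               -- whole message lands in `payload`

-- Sink table: plain MergeTree, so nothing is deduplicated.
CREATE TABLE events_json_debug
(
    payload    String,
    _timestamp DateTime,
    _partition UInt64,
    _offset    UInt64
)
ENGINE = MergeTree
ORDER BY (_partition, _offset);

-- Materialized view forwards rows and the Kafka engine's virtual columns.
CREATE MATERIALIZED VIEW events_json_debug_mv TO events_json_debug AS
SELECT payload, _timestamp, _partition, _offset
FROM kafka_events_json_debug;
```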

Does this work well for both Cloud and self-hosted?

How did you test this code?

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

github-actions bot (Contributor) commented Jul 2, 2024

Size Change: 0 B

Total Size: 1.06 MB

frontend/dist/toolbar.js: 1.06 MB (unchanged)

compressed-size-action

@posthog-bot (Contributor)

📸 UI snapshots have been updated

3 snapshot changes in total. 0 added, 3 modified, 0 deleted.

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

4 snapshot changes in total. 0 added, 4 modified, 0 deleted.

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@Daesgar (Contributor) left a comment


The kafka_handle_error_mode is just a suggestion.

I'd just review the ver field, which I don't think is needed in this case!

Great having this debug table available from now on 🙏 🚀

payload String
)
ENGINE={kafka_engine(kafka_host=",".join(self.brokers), topic=self.topic, group=self.consumer_group)}
SETTINGS input_format_values_interpret_expressions=0, kafka_skip_broken_messages = 100
Daesgar (Contributor):

Do you think we can enable kafka_handle_error_mode='stream' to store messages that fail to parse in the _error and _raw_message columns? That way we don't need to skip messages.

fuziontech (Member, Author):

I don't think we should ever skip messages here since we are just grabbing the payload as a string (no JSON deserialization). I'll remove the kafka_skip_broken_messages setting.
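
For reference, here is roughly what the suggested setting looks like on the Kafka engine table. With kafka_handle_error_mode = 'stream', ClickHouse exposes _raw_message and _error virtual columns for rows that fail to parse, instead of skipping or blocking on them. The broker, topic, and group values below are placeholders, not the PR's generated settings.

```sql
-- Sketch only; connection values are placeholders.
CREATE TABLE kafka_events_json_debug
(
    payload String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'clickhouse_events_json',
         kafka_group_name = 'clickhouse_events_json_debug',
         kafka_format = 'JSONEachRow',
         kafka_handle_error_mode = 'stream';  -- bad messages surface via _error / _raw_message
```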

return f"{self.topic}_debug"

def get_create_table_sql(self) -> str:
engine = MergeTreeEngine(self.table_name, ver="timestamp")
Daesgar (Contributor):

I think we don't need the ver field if we go for a MergeTree table.

Aren't we replicating this table data to all the nodes? Or do you want to keep just the consumed data inside every instance?

fuziontech (Member, Author):

oh lol. This is what I get for doing copilot too fast
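
For context, ver is a ReplacingMergeTree parameter: on merges, rows sharing a sorting key are collapsed and the row with the highest ver value survives. A plain MergeTree takes no such argument and keeps every row, which is exactly what a duplicate-counting debug table wants. A rough illustration (table and column names are placeholders):

```sql
-- Plain MergeTree: no ver argument, every row is kept.
CREATE TABLE events_json_debug
(
    payload   String,
    timestamp DateTime
)
ENGINE = MergeTree
ORDER BY timestamp;

-- ReplacingMergeTree is where ver belongs: duplicates by sorting key are
-- collapsed on merge, keeping the row with the highest `timestamp`.
CREATE TABLE events_json_dedup
(
    payload   String,
    timestamp DateTime
)
ENGINE = ReplacingMergeTree(timestamp)
ORDER BY payload;
```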

_timestamp DateTime,
_timestamp_ms Nullable(DateTime64(3)),
_partition UInt64,
_offset UInt64
Daesgar (Contributor):

Depending on whether we end up using kafka_handle_error_mode='stream', we would need two additional columns here for _error and _raw_message.

fuziontech (Member, Author):

That shouldn't be required here since we are just grabbing the payload, right? It's not actually going to try to deserialize the JSON.

Daesgar (Contributor):

I'm not sure, to be honest, since we are defining the Kafka engine with the JSONEachRow format. I don't know whether ClickHouse will try to validate that every row is valid JSON before ingesting it.

fuziontech (Member, Author):

I'm pretty sure it is safe because I've used it to debug bad payloads before. But let's just enable streaming for fun!

Daesgar (Contributor):

Oh ok! 💯 In that case, no problem; I just wasn't completely sure.
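
Since the thread above settles on enabling streaming error mode, the sink table and materialized view would carry the two extra virtual columns. A hedged sketch (the column names are the ones ClickHouse provides for stream error mode; table names are placeholders, not the PR's generated DDL):

```sql
-- Sink table with parse-diagnostic columns added.
CREATE TABLE events_json_debug
(
    payload       String,
    _timestamp    DateTime,
    _timestamp_ms Nullable(DateTime64(3)),
    _partition    UInt64,
    _offset       UInt64,
    _error        String,
    _raw_message  String
)
ENGINE = MergeTree
ORDER BY (_partition, _offset);

-- The materialized view forwards the Kafka engine's virtual columns as well.
CREATE MATERIALIZED VIEW events_json_debug_mv TO events_json_debug AS
SELECT payload, _timestamp, _timestamp_ms, _partition, _offset, _error, _raw_message
FROM kafka_events_json_debug;
```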

posthog/models/kafka_debug/sql.py (outdated review thread, resolved)
* master:
  chore(data-warehouse): make sure exception is passed through at workflow step (#23409)
  feat: add a launch compound for posthog with local billing (#23410)
  chore: add backfill_personless_distinct_ids command (#23404)
  fix: add missing billing tests (#23408)
  Schema-Enforcer plugin not global (#23412)
  chore: Enable person batch exports only on supported destinations (#23354)
  fix: allow entering a custom value while property values load (#23405)
  perf: Materialize elements_chain (#23170)
  fix(experiments): provide `required_scope` for experiments API (#23385)
  feat(survey): Allow events to repeatedly activate surveys (#23238)
  chore: maybe this will stop them flapping (#23401)
  chore(data-warehouse): Added number formatting for source settings (#23221)
  fix(multi project flags): remove flag id from URL when switching projects (#23394)
@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 1)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@fuziontech fuziontech merged commit 3813df9 into master Jul 2, 2024
85 checks passed
@fuziontech fuziontech deleted the debug_events branch July 2, 2024 21:49