feat: HogWatcher #23053

Closed
wants to merge 386 commits into from

Conversation

@benjackwhite benjackwhite (Contributor) commented Jun 18, 2024

Problem

We predict (based on existing pipeline work) that there will be cases of rogue functions or teams that clog up the execution pipeline.

To get ahead of this, we want a way of detecting and marking functions or teams as inefficient so that they are moved to the "slow lane", temporarily disabled, and eventually permanently disabled. At first I was going to do the "simple" thing and have a manual button for disabling functions, but where's the fun in that...

Changes

  • Adds an overflow consumer that can handle both events and callbacks for functions that have been determined to be behaving badly
  • Adds HogWatcher - a somewhat complex but hopefully useful service. It:
    • Observes all of the results from functions and async responses, counting up failures and successes
    • Calculates ratings over different time periods
    • Based on these ratings and the function's previous states, sets the "state" of the function, which can be one of (a rough sketch follows this list):
        1. Healthy - normal execution
        2. Overflowed - both async responses and initial invocations are moved to the "overflow" topic
        3. Temporarily disabled - the function is disabled for a temporary period, after which it is moved back to overflowed
        4. Disabled permanently - the function repeatedly fails to stay out of the temporarily disabled state, so it is permanently disabled
  • Crucially, all of this is done in a way that should work across multiple consumers, using Redis both as a state broadcaster and for persisting the overall states of all the functions
  • One worker locks itself as the leader. The tradeoff of having a lock is much simpler reasoning about how to compact observations and trigger state changes, with pubsub used to keep everything else in sync
  • Adds UI for displaying the current state of the function
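
As a rough illustration (referenced in the list above), here is a sketch of how the rating-based state transitions might look in the plugin server's TypeScript. The state names mirror the four states listed above, but the threshold values and the nextState helper are assumptions for illustration, not the exact code in this PR.

// Hypothetical sketch of the HogWatcher states and a rating-based transition
enum HogWatcherState {
    healthy = 1,
    overflowed = 2,
    disabledForPeriod = 3,
    disabledIndefinitely = 4,
}

// Assumed thresholds - the real values are configuration in the PR
const OVERFLOW_THRESHOLD = 0.8 // below this success rating, move invocations to the overflow topic
const DISABLE_THRESHOLD = 0.5 // below this, temporarily disable the function

function nextState(current: HogWatcherState, rating: number): HogWatcherState {
    if (current === HogWatcherState.disabledIndefinitely) {
        // Only the user modifying the function clears this state (see "Follow up" below)
        return current
    }
    if (rating < DISABLE_THRESHOLD) {
        return HogWatcherState.disabledForPeriod
    }
    if (rating < OVERFLOW_THRESHOLD) {
        return HogWatcherState.overflowed
    }
    return HogWatcherState.healthy
}

The escalation path described above - temporarily disabled functions returning to overflowed, and repeat offenders eventually being disabled permanently - would layer on top of this helper using the previous states the leader has persisted.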

TODO

  • Detect "bad" functions and mark them as such so that the incoming events are moved to overflow
  • Ensure that only the "bad" functions are skipped and put on the overflow list (good ones should continue to be processed in the standard queue)
  • Make this work with distributed workers (redis?) - see the pubsub sketch after this list
  • Load the initial state from redis
  • Move state transition into the same loop as checking observation (makes it easier to manage)
  • Change it so that we only have one instance responsible for writing
    • One nominated instance (probably using a lock) that is responsible for gathering all the data and deciding when something is blocked or not
    • For now I can just nominate a process; later we can make it its own thing
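
A minimal sketch of the pubsub sync referenced in the distributed-workers item above, assuming ioredis v5-style typings and one connection each for publishing and subscribing; the channel name and message shape are illustrative, not taken from this PR.

import Redis from 'ioredis'

declare const BASE_REDIS_KEY: string // the key prefix used elsewhere in this PR

// Hypothetical channel the leader publishes compacted state changes on
const STATE_CHANGES_CHANNEL = `${BASE_REDIS_KEY}/state-changes`

// Leader side: broadcast a decision so every consumer hears about it
async function broadcastStateChange(pub: Redis, functionId: string, state: number): Promise<void> {
    await pub.publish(STATE_CHANGES_CHANNEL, JSON.stringify({ functionId, state }))
}

// Worker side: keep a local cache of function states in sync, so the hot path
// never has to read Redis before deciding which lane an invocation goes to
async function subscribeToStateChanges(sub: Redis, localStates: Map<string, number>): Promise<void> {
    await sub.subscribe(STATE_CHANGES_CHANNEL)
    sub.on('message', (channel, message) => {
        if (channel !== STATE_CHANGES_CHANNEL) {
            return
        }
        const { functionId, state } = JSON.parse(message)
        localStates.set(functionId, state)
    })
}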

Follow up

  • Users can remove the permanently disabled state by modifying the function

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

github-actions bot and others added 30 commits June 13, 2024 11:02
benjackwhite and others added 4 commits June 25, 2024 19:53
@benjackwhite benjackwhite requested a review from mariusandra June 26, 2024 07:05
@mariusandra mariusandra (Collaborator) left a comment

Dumping thoughts mid-review. I effectively only looked through the frontend and Django parts... so just the plugin server part to go... 😅

@benjackwhite benjackwhite requested a review from mariusandra June 27, 2024 08:53
@mariusandra mariusandra (Collaborator) left a comment

Overall very solid 👍 and I'm coming around to the separate watcher idea. However, why not take it one step further and isolate the watcher into its own pod/service, especially if it's going to get more cleanup duties soon? I left a longer inline comment about it.

plugin-server/src/cdp/cdp-consumers.ts (outdated, resolved)
plugin-server/src/cdp/hog-executor.ts (resolved)
Comment on lines +166 to +187
private async checkIsLeader() {
    const leaderId = await runRedis(this.hub.redisPool, 'getLeader', async (client) => {
        // Set the leader to this instance if it is not set, and add an expiry of three observation periods
        const pipeline = client.pipeline()

        // TODO: This can definitely be done in a single command - just need to make sure the ttl is always extended if the ID is the same

        // @ts-expect-error - IORedis types don't allow for NX and EX in the same command
        pipeline.set(`${BASE_REDIS_KEY}/leader`, this.instanceId, 'NX', 'EX', (OBSERVATION_PERIOD * 3) / 1000)
        pipeline.get(`${BASE_REDIS_KEY}/leader`)
        const [_, res] = await pipeline.exec()

        // NOTE: IORedis types don't allow for NX and GET in the same command so we have to cast the result
        return res[1] as string
    })

    this.isLeader = leaderId === this.instanceId

    if (this.isLeader) {
        status.info('👀', '[HogWatcher] I am the leader')
    }
}
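
For what it's worth, a minimal sketch of the single-command approach the TODO mentions: a small Lua script run via EVAL claims the key if unset, extends the TTL if this instance already holds it, and returns the current holder, all in one atomic round trip. The script and its placement here are assumptions, not code from this PR.

// Hypothetical single-round-trip variant of the pipeline above
const LEADER_SCRIPT = `
    local current = redis.call('GET', KEYS[1])
    if not current then
        redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[2])
        return ARGV[1]
    end
    if current == ARGV[1] then
        redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return current
`

const leaderId = (await client.eval(
    LEADER_SCRIPT,
    1, // number of keys
    `${BASE_REDIS_KEY}/leader`,
    this.instanceId,
    Math.ceil((OBSERVATION_PERIOD * 3) / 1000) // same TTL as the pipeline version
)) as string

This would also sidestep the @ts-expect-error cast, and it guarantees the TTL is only ever extended by the instance that currently holds the lock.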
Collaborator

I wonder... if we should not just have an explicitly deployed leader? This would remove all that coordination noise, and give it breathing room from running hog code as well.

This happened before. The plugin server has a scheduler, which used something called redlock to make sure there's only one pod in the fleet running the scheduling commands. This made sense when there were 2 scheduled tasks running per hour on a self-hosted instance, but the Cloud required a different approach, as these scheduled bursts on ingestion nodes were causing problems.

Now we have a lot of schedulers:

[screenshot of the scheduler deployments omitted]

Thus, I think we'd make this system more robust if we removed the Redis lock and made it just a single-node service. We would immediately buy some vertical scaling room as well.

Contributor Author

I think this is a good idea tbh. I didn't do it as there was pushback on the leader idea in general, and I didn't want to go too far down that path if there might have been an alternative I realised along the way.

Should be easy enough to set up, but again I might do that in a follow-up as it's more of an improvement.

Comment on lines 225 to 233
const pipeline = client.pipeline()

changes.observations.forEach(({ id, observation }) => {
    // We key the observations by observerId and timestamp with a ttl of the max period we want to keep the data for
    const subKey = `observation:${id}:${this.instanceId}:${observation.timestamp}`
    pipeline.hset(`${BASE_REDIS_KEY}/state`, subKey, JSON.stringify(observation))
})

return pipeline.exec()
Collaborator

Where do we set the TTL for these keys... or is that the TODO: Implement this part? Isn't the easy solution to just use normal TTLs if this moves from hset to set?

Contributor Author

No, the TODO is just noise I removed in the follow-up PR.

The issue is Redis 6 (what we use) doesn't support TTLs on hash fields, only on the whole hash.
We could do this as a separate set, but for now it just felt easier to have it all in one hash so we can load the whole thing in one go and clean up after.

In practice it functions the same, so let's try it and see if it cleans up properly, and then we can move it out afterwards if it still makes sense.
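
For reference, a minimal sketch of the per-key alternative discussed above, where each observation gets its own key and Redis expires it natively. The key layout and TTL constant are assumptions for illustration, not code from this PR.

// Hypothetical: one plain key per observation with its own TTL, instead of
// fields inside a single hash (which Redis 6 cannot expire individually)
const OBSERVATION_TTL_SECONDS = 600 // assumed retention window

changes.observations.forEach(({ id, observation }) => {
    const key = `${BASE_REDIS_KEY}/observation/${id}/${this.instanceId}/${observation.timestamp}`
    // SET with EX gives each observation its own expiry, so no manual cleanup is needed
    pipeline.set(key, JSON.stringify(observation), 'EX', OBSERVATION_TTL_SECONDS)
})

return pipeline.exec()

The tradeoff the author points out still applies: loading the current state back would then need a SCAN over the key pattern rather than a single read of the state hash.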

@mariusandra mariusandra (Collaborator) left a comment

Let's rock'n'roll

@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.
