feat: HogWatcher #23053

Closed
wants to merge 386 commits into from

Conversation

@benjackwhite benjackwhite (Contributor) commented Jun 18, 2024

Problem

We predict (based on existing pipeline work) that there will be cases of rogue functions or teams that clog up the execution pipeline.

To get ahead of this, we want a way of detecting and marking functions or teams as inefficient so that they are moved to the "slow lane", temporarily disabled, and eventually permanently disabled. At first I was going to do the "simple" thing and have a manual button for disabling functions, but where's the fun in that...

Changes

  • Adds an overflow consumer that can handle both events and callbacks for functions that have been determined to be behaving badly
  • Adds HogWatcher - a somewhat complex but hopefully useful service. It:
    • Observes all of the results from functions and async responses, counting up failures and successes
    • Calculates ratings over different time periods
    • Based on these ratings and the function's previous states, sets the "state" of the function, which can be one of (a rough sketch follows this list):
        1. Healthy - normal execution
        2. Overflowed - both async responses and initial invocations are moved to the "overflow" topic
        3. Temporarily disabled - the function is disabled for a temporary period, after which it is moved back to overflowed
        4. Disabled permanently - the function repeatedly fails to stay out of the temporarily disabled state, so it is permanently disabled
  • Crucially, all of this is done in a way that should work across multiple consumers, using Redis both as a state broadcaster and for persisting the overall states of all the functions
  • One worker locks itself as the leader. The tradeoff of having a lock is much simpler reasoning about how to compact observations and trigger state changes, with pubsub used to keep everything else in sync
  • Adds UI for displaying the current state of the function
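
As a rough illustration (referenced in the list above), here is a sketch of how the rating-based state transitions might look in the plugin server's TypeScript. The state names mirror the four states listed above, but the threshold values and the nextState helper are assumptions for illustration, not the exact code in this PR.

// Hypothetical sketch of the HogWatcher states and a rating-based transition
enum HogWatcherState {
    healthy = 1,
    overflowed = 2,
    disabledForPeriod = 3,
    disabledIndefinitely = 4,
}

// Assumed thresholds - the real values are configuration in the PR
const OVERFLOW_THRESHOLD = 0.8 // below this success rating, move invocations to the overflow topic
const DISABLE_THRESHOLD = 0.5 // below this, temporarily disable the function

function nextState(current: HogWatcherState, rating: number): HogWatcherState {
    if (current === HogWatcherState.disabledIndefinitely) {
        // Only the user modifying the function clears this state (see "Follow up" below)
        return current
    }
    if (rating < DISABLE_THRESHOLD) {
        return HogWatcherState.disabledForPeriod
    }
    if (rating < OVERFLOW_THRESHOLD) {
        return HogWatcherState.overflowed
    }
    return HogWatcherState.healthy
}

The escalation path described above - temporarily disabled functions returning to overflowed, and repeat offenders eventually being disabled permanently - would layer on top of this helper using the previous states the leader has persisted.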

TODO

  • Detect "bad" functions and mark them as such so that the incoming events are moved to overflow
  • Ensure that only the "bad" functions are skipped and put on the overflow list (good ones should continue to be processed in the standard queue)
  • Make this work with distributed workers (redis?) - see the pubsub sketch after this list
  • Load the initial state from redis
  • Move state transition into the same loop as checking observation (makes it easier to manage)
  • Change it so that we only have one instance responsible for writing
    • One nominated instance (probably using a lock) that is responsible for gathering all the data and deciding when something is blocked or not
    • For now I can just nominate a process; later we can make it its own thing
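
A minimal sketch of the pubsub sync referenced in the distributed-workers item above, assuming ioredis v5-style typings and one connection each for publishing and subscribing; the channel name and message shape are illustrative, not taken from this PR.

import Redis from 'ioredis'

declare const BASE_REDIS_KEY: string // the key prefix used elsewhere in this PR

// Hypothetical channel the leader publishes compacted state changes on
const STATE_CHANGES_CHANNEL = `${BASE_REDIS_KEY}/state-changes`

// Leader side: broadcast a decision so every consumer hears about it
async function broadcastStateChange(pub: Redis, functionId: string, state: number): Promise<void> {
    await pub.publish(STATE_CHANGES_CHANNEL, JSON.stringify({ functionId, state }))
}

// Worker side: keep a local cache of function states in sync, so the hot path
// never has to read Redis before deciding which lane an invocation goes to
async function subscribeToStateChanges(sub: Redis, localStates: Map<string, number>): Promise<void> {
    await sub.subscribe(STATE_CHANGES_CHANNEL)
    sub.on('message', (channel, message) => {
        if (channel !== STATE_CHANGES_CHANNEL) {
            return
        }
        const { functionId, state } = JSON.parse(message)
        localStates.set(functionId, state)
    })
}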

Follow up

  • Users can remove the permanently disabled state by modifying the function

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

How did you test this code?

github-actions bot and others added 30 commits June 13, 2024 11:02
benjackwhite and others added 4 commits June 25, 2024 19:53
@benjackwhite benjackwhite requested a review from mariusandra June 26, 2024 07:05
@mariusandra mariusandra (Collaborator) left a comment

Dumping thoughts mid-review. I effectively only looked through the frontend and Django parts... so just the plugin server part to go... 😅

@benjackwhite benjackwhite requested a review from mariusandra June 27, 2024 08:53
@mariusandra mariusandra (Collaborator) left a comment

Overall very solid 👍 and I'm coming around to the separate watcher idea. However, why not take it one step further and isolate the watcher into its own pod/service, especially if it's going to get more cleanup duties soon? I left a longer inline comment about it.

plugin-server/src/cdp/cdp-consumers.ts (outdated, resolved)
plugin-server/src/cdp/hog-executor.ts (resolved)
Comment on lines +166 to +187
private async checkIsLeader() {
    const leaderId = await runRedis(this.hub.redisPool, 'getLeader', async (client) => {
        // Set the leader to this instance if it is not set, and add an expiry of three observation periods
        const pipeline = client.pipeline()

        // TODO: This can definitely be done in a single command - just need to make sure the ttl is always extended if the ID is the same

        // @ts-expect-error - IORedis types don't allow for NX and EX in the same command
        pipeline.set(`${BASE_REDIS_KEY}/leader`, this.instanceId, 'NX', 'EX', (OBSERVATION_PERIOD * 3) / 1000)
        pipeline.get(`${BASE_REDIS_KEY}/leader`)
        const [_, res] = await pipeline.exec()

        // NOTE: IORedis types don't allow for NX and GET in the same command so we have to cast the result
        return res[1] as string
    })

    this.isLeader = leaderId === this.instanceId

    if (this.isLeader) {
        status.info('👀', '[HogWatcher] I am the leader')
    }
}
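
For what it's worth, a minimal sketch of the single-command approach the TODO mentions: a small Lua script run via EVAL claims the key if unset, extends the TTL if this instance already holds it, and returns the current holder, all in one atomic round trip. The script and its placement here are assumptions, not code from this PR.

// Hypothetical single-round-trip variant of the pipeline above
const LEADER_SCRIPT = `
    local current = redis.call('GET', KEYS[1])
    if not current then
        redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[2])
        return ARGV[1]
    end
    if current == ARGV[1] then
        redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return current
`

const leaderId = (await client.eval(
    LEADER_SCRIPT,
    1, // number of keys
    `${BASE_REDIS_KEY}/leader`,
    this.instanceId,
    Math.ceil((OBSERVATION_PERIOD * 3) / 1000) // same TTL as the pipeline version
)) as string

This would also sidestep the @ts-expect-error cast, and it guarantees the TTL is only ever extended by the instance that currently holds the lock.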
Collaborator

I wonder... if we should not just have an explicitly deployed leader? This would remove all that coordination noise, and give it breathing room from running hog code as well.

This happened before. The plugin server has a scheduler, which used something called redlock to make sure there's only one pod in the fleet running the scheduling commands. This made sense when there were 2 scheduled tasks running per hour on a self-hosted instance, but the Cloud required a different approach, as these scheduled bursts on ingestion nodes were causing problems.

Now we have a lot of schedulers:

[screenshot of the scheduler deployments omitted]

Thus, I think we'd make this system more robust if we removed the Redis lock and made it just a single-node service. We would immediately buy some vertical scaling room as well.

Contributor Author

I think this is a good idea tbh. I didn't do it as there was pushback on the leader idea in general, and I didn't want to go too far down that path if there might have been an alternative I realised along the way.

Should be easy enough to set up, but again I might do that in a follow-up as it's more of an improvement.

Comment on lines 225 to 233
const pipeline = client.pipeline()

changes.observations.forEach(({ id, observation }) => {
    // We key the observations by observerId and timestamp with a ttl of the max period we want to keep the data for
    const subKey = `observation:${id}:${this.instanceId}:${observation.timestamp}`
    pipeline.hset(`${BASE_REDIS_KEY}/state`, subKey, JSON.stringify(observation))
})

return pipeline.exec()
Collaborator

Where do we set the TTL for these keys... or is that the TODO: Implement this part? Isn't the easy solution to just use normal TTLs if this moves from hset to set?

Contributor Author

No, the TODO is just noise I removed in the follow-up PR.

The issue is Redis 6 (what we use) doesn't support TTLs on hash fields, only on the whole hash.
We could do this as a separate set, but for now it just felt easier to have it all in one hash so we can load the whole thing in one go and clean up after.

In practice it functions the same, so let's try it and see if it cleans up properly, and then we can move it out afterwards if it still makes sense.
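
For reference, a minimal sketch of the per-key alternative discussed above, where each observation gets its own key and Redis expires it natively. The key layout and TTL constant are assumptions for illustration, not code from this PR.

// Hypothetical: one plain key per observation with its own TTL, instead of
// fields inside a single hash (which Redis 6 cannot expire individually)
const OBSERVATION_TTL_SECONDS = 600 // assumed retention window

changes.observations.forEach(({ id, observation }) => {
    const key = `${BASE_REDIS_KEY}/observation/${id}/${this.instanceId}/${observation.timestamp}`
    // SET with EX gives each observation its own expiry, so no manual cleanup is needed
    pipeline.set(key, JSON.stringify(observation), 'EX', OBSERVATION_TTL_SECONDS)
})

return pipeline.exec()

The tradeoff the author points out still applies: loading the current state back would then need a SCAN over the key pattern rather than a single read of the state hash.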

@mariusandra mariusandra (Collaborator) left a comment

Let's rock'n'roll

@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

1 snapshot change in total. 0 added, 1 modified, 0 deleted:

  • chromium: 0 added, 1 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.

@posthog-bot (Contributor)

📸 UI snapshots have been updated

2 snapshot changes in total. 0 added, 2 modified, 0 deleted:

  • chromium: 0 added, 2 modified, 0 deleted (diff for shard 2)
  • webkit: 0 added, 0 modified, 0 deleted

Triggered by this commit.

👉 Review this PR's diff of snapshots.
