feat(alerts): add a regular job to detect anomalies #22762
Conversation
Hi @mariusandra, PTAL!
@@ -288,7 +288,7 @@ export const insightNavLogic = kea<insightNavLogicType>([
    },
})),
urlToAction(({ actions }) => ({
-    '/insights/:shortId(/:mode)(/:subscriptionId)': (
+    '/insights/:shortId(/:mode)(/:itemId)': (
As per #22554 (comment)
if not insight.query:
    insight.query = filter_to_query(insight.filters)
This looks a bit dirty; I wonder if there's a better way to do what I want here. I just want to get the aggregated_value for an insight.
IIUC there are two ways to represent an insight: one through filters (old) and one through query (new). When I create an insight locally, the old way is used. But I think it's better to use the new approach, so I convert the filters to a query. This is mainly based on the compare_hogql_insights.py file.
Yeah, this is correct 👍 . Currently we still have several insights floating around that only have filters (and no query), but the plan is to migrate everything over eventually.
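For context, a minimal self-contained sketch of the conversion being discussed, modelled on compare_hogql_insights.py; the import path is an assumption rather than something taken from this PR:

from posthog.hogql_queries.legacy_compatibility.filter_to_query import filter_to_query  # path assumed

def ensure_insight_has_query(insight) -> None:
    # Older insights only carry `filters`; newer ones carry `query`. Converting the
    # legacy filters lets the rest of the code read aggregated_value from one path.
    if not insight.query:
        insight.query = filter_to_query(insight.filters)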
The logic seems reasonable to me, and since it's behind a flag I think we can merge it as is.
However, I do have some concerns about the longer-term plan and would like to get a second opinion from @PostHog/team-product-analytics, and also from @benjackwhite (hog question below) and @pauldambra (loves dashboard reload cron jobs):
- Currently this will run once per hour at x:20 and schedule a query to run for each alert immediately. Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time. We will need to stagger them somehow (see the sketch after this list). For example, cohort and dashboard reloads run more frequently, but then only run the oldest items on each pass, leading to eventual "good enough" consistency.
- The problem with dashboard and cohort calculations is that nobody checks in on them. We periodically discover things have gotten worse only when users complain. This will be worse if users start to rely on alerts for their business. We'd need to establish some practices around this, hence all the @ tagging above.
- Finally, we're hard at work on Hog and our CDP. It would be really cool to hook alerts into this system. @benjackwhite any thoughts on how to build the bridge?
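To make the staggering idea in the first bullet concrete, here is a rough sketch of one possible approach; it is not the PR's actual behaviour, and the import paths, spread interval and batching are all assumptions based on the snippets in this thread:

from celery import shared_task

from posthog.models import Alert  # model used in this PR; exact import path assumed
from posthog.tasks.alerts.checks import check_alert_task  # per-alert task from this PR; path assumed

SPREAD_SECONDS = 3600  # spread the hourly batch across the whole hour

@shared_task
def check_all_alerts_staggered() -> None:
    alert_ids = list(Alert.objects.values_list("id", flat=True))
    if not alert_ids:
        return
    step = SPREAD_SECONDS / len(alert_ids)
    for i, alert_id in enumerate(alert_ids):
        # countdown delays each task so ClickHouse sees a steady trickle, not an hourly burst
        check_alert_task.apply_async(args=[alert_id], countdown=int(i * step))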
-<LemonField name="upper" label="Upper threshold">
+<LemonField
+    name="upper"
+    label="Upper threshold "
nit:
-    label="Upper threshold "
+    label="Upper threshold"
Oh, thanks, done
Love the work here, but given how important and tricky this will be I'd like to consider a more minimal solution with input from @PostHog/team-product-analytics to make sure this is something we can actually scale.
campaign_key = f"alert-anomaly-notification-{alert.id}-{timezone.now().timestamp()}"
insight_url = f"/project/{alert.team.pk}/insights/{alert.insight.short_id}"
alert_url = f"{insight_url}/alerts/{alert.id}"
message = EmailMessage(
I think this is definitely not what we should do for a bunch of reasons:
- No way to configure rate of delivery, backoffs, etc.
- Email only is not the typical way people want to get alerted of this
We are building a new generic delivery system for the CDP (webhooks etc.), which would be the right place to have a destination, and I think this could play into that.
I don't want to pour water on the fire that is getting this work done, as it's super cool 😅, but I know that we will immediately have configuration and scaling issues here that I'm not sure we want to support.
I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?
Hmm, I'd argue here.
No way to configure rate of delivery, backoffs, etc.
It's in my plans to allow changing the frequency of the notifications. You can check the TODO list here.
Email only is not the typical way people want to get alerted of this
1. Users want email, Slack and webhooks, so why not start with email?
2. Mixpanel provides email + Slack, Amplitude provides email and webhooks.
3. In my commercial experience, email was the way alerts were delivered.
IMO email is a good starting point; it's cheap af, and it's a necessary communication channel for this anyway.
Ok, I misinterpreted this at first: you're suggesting that email-only is not the typical way. Can't agree or disagree here, I don't know.
I'm wondering if instead we could have an in-app only alert for now which we can then later hook up to the delivery service instead?
Don't quite understand, wdym here? A screen of ongoing alerts? I'd argue that notifications are the most important part of the alerts module, and honestly I really wouldn't want to be blocked on the CDP development, especially given how cheap sending emails is. Once the CDP is launched, I don't think it'd be difficult to migrate, right? I'll do it myself when needed. OTOH, if it's planned to launch soon (this month), I could wait.
I don't want to pour water on the fire that is getting this work done as its super cool
No worries at all, thanks for looking at this!
def check_all_alerts() -> None:
    alerts = Alert.objects.all().only("id")
    for alert in alerts:
I don't know for sure, but this also feels like a scaling nightmare... We sometimes struggle to keep up with dashboard / insight refreshes in general, and this is another form of refresh, just with a higher demand on reliability. I think this would require strong co-ordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing, otherwise this will hit scaling issues fast.
I don't know the internals of PostHog, but in my experience this is the way to do this. I don't have experience with Celery, but I do with similar tools; it should scale horizontally pretty easily: add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.
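As a rough sketch of the "separate queue" part, Celery task routing could look something like this; the queue name, task path and worker flags are illustrative, not PostHog's actual configuration:

from celery import Celery

app = Celery("posthog")
# Route only the alert checks to a dedicated queue so they can be scaled or
# throttled independently of the rest of the Celery workload.
app.conf.task_routes = {
    "posthog.tasks.alerts.checks.check_alert_task": {"queue": "alerts"},
}
# A separate worker pool then consumes just that queue, e.g.:
#   celery -A posthog worker -Q alerts --concurrency=4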
I think this would require strong co-ordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing otherwise this will hit scaling issues fast.
Just wanted to chime in here. I can take a look at this, but am currently busy being on support for this sprint. I'll see what we can do.
Scaling celery is not the issue, but ClickHouse will struggle and ultimately go down if suddenly 1000 simultaneous queries appear.
should scale horizontally pretty easily - add a separate queue for these events, increase the number of parallel tasks in flight and add more servers if needed.
yep, was going to add that "should" is doing a lot of work in that sentence 😅
@webjunkie I'm too far removed from how the query code and caching interact here
we already have one set of jobs that (is|should be) staying on top of having insight results readily available. does this use that cache? we should really overlap them so we have one set of tasks keeping a cache warm and then another that reads the fast access data in that cache for anomaly detection
humans aren't visiting insights once a minute so we know this will generate sustained load.
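Purely as an illustration of the "read the warm cache" question above, a check could consume cached results instead of querying ClickHouse directly; the cache key format and helper below are hypothetical, not PostHog's actual caching layer:

from typing import Any, Optional

from django.core.cache import cache

def get_cached_aggregated_value(insight_id: int) -> Optional[float]:
    # Hypothetical key format: whatever the existing refresh jobs use to keep results warm.
    cached: Optional[dict[str, Any]] = cache.get(f"insight_result_{insight_id}")
    if cached is None:
        return None  # cache is cold: skip this run or fall back, rather than hit ClickHouse
    return cached.get("aggregated_value")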
we should totally, totally build this feature - it's long overdue
i'm not opposed to getting a simple version in just for our team or select beta testers so we can validate the flow, but this 100% needs an internal sponsor since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈)
i would love to be the internal sponsor but it's both not possible and completely outside of my current wheelhouse
(these concerns might be addressed elsewhere - i've not dug in here at all 🙈)
but ClickHouse will struggle and ultimately go down if suddenly 1000 simultaneous queries appear
Can't I limit the number of Celery queries in flight? I understand this will introduce a throughput problem, but then, if the servers can't process N alerts each hour, maybe more read replicas or more servers are needed. I don't have much experience with column-oriented databases though, so this is just speculation.
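One way to cap how many alert queries are in flight, as asked above, is a per-task Celery rate limit combined with a low worker concurrency on the alerts queue; the declaration below is only a sketch of how the PR's task could be annotated, with an illustrative number:

from celery import shared_task

@shared_task(rate_limit="60/m")  # enforced per worker process: at most ~1 task start per second
def check_alert_task(alert_id: int) -> None:
    # check_alert is the per-alert function defined alongside this task in the PR
    check_alert(alert_id)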
we already have one set of jobs that (is|should be) staying on top of having insight results readily available. does this use that cache?
🤷 Well, the query_runner has some "cache" substrings in its code, so one could assume... but I don't know.
humans aren't visiting insights once a minute so we know this will generate sustained load.
Just to clarify, it's once an hour
but this 100% needs an internal sponsor since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈)
I completely agree and I would be really happy to have a mentor on this task.
BTW, an interesting data point: Mixpanel limits the number of alerts to 50 per project.
We will talk among @PostHog/team-product-analytics next week and discuss ownership and so on.
humans aren't visiting insights once a minute so we know this will generate sustained load.
Just to clarify, it's once an hour
👍
(same point but thanks for clarification :))
Thanks for looking at it!
Assuming we have 1000 alerts set up, that's 1000 simultaneous queries every hour at the same time.
There's a way to set the maximum number of parallel tasks for Celery; I think that should help spread the load here, no?
This will be worse if users start to rely on alerts for their business. We'd need to establish some practices around this.
I completely agree with that; it's not the final solution, just a skeleton. I'll need some help with this, but we need metrics and alerts about the job execution time so we notice problems. I understand people will rely on alerts, and it should be reliable.
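As a sketch of the "metrics about job execution time" point, the per-alert check could be wrapped in a timing metric; whether PostHog exposes task metrics through prometheus_client like this is an assumption, and the metric name is made up:

from prometheus_client import Histogram

ALERT_CHECK_DURATION = Histogram(
    "alert_check_duration_seconds",
    "Time spent checking a single alert",
)

def check_alert_instrumented(alert_id: int) -> None:
    # Histogram.time() records how long the wrapped block took, so slow or failing
    # alert checks show up on a dashboard before users complain.
    with ALERT_CHECK_DURATION.time():
        check_alert(alert_id)  # per-alert check from this PR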
Thanks for the contribution!
I think the general direction and scope of the PR are valid and something we can work with. The areas that need to be improved before this can be merged are the workings of the Celery task and the additional models and fields we need to sufficiently guide the execution.
I wrote up an RFC draft with how the Celery and model architecture could work:
PostHog/meta#216
Let me know if this helps or needs discussion. (Either here or in Slack).
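For readers without access to the RFC, one possible shape of the "additional models and fields" idea is a per-run record like the sketch below; this is purely a hypothetical illustration, not the RFC's actual proposal:

from django.db import models

class AlertCheck(models.Model):
    # Hypothetical model: one row per executed check, so runs can be staggered,
    # retried and audited. Field names and the app label are assumptions.
    alert = models.ForeignKey("posthog.Alert", on_delete=models.CASCADE, related_name="checks")
    created_at = models.DateTimeField(auto_now_add=True)
    calculated_value = models.FloatField(null=True, blank=True)
    anomaly_condition_met = models.BooleanField(default=False)
    error_message = models.TextField(null=True, blank=True)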
@@ -24,7 +23,6 @@
 @patch("ee.tasks.subscriptions.generate_assets")
 @freeze_time("2022-02-02T08:55:00.000Z")
 class TestSubscriptionsTasks(APIBaseTest):
-    subscriptions: list[Subscription] = None  # type: ignore
Just a redundant field
This looks fine now considering the scope, but needs work in subsequent PRs as discussed.
posthog/tasks/alerts/checks.py
# Note, check_alert_task is used in Celery chains. Celery chains pass the previous
# function call result to the next function as an argument, hence args and kwargs.
You can do check_alert_task.si (for immutable) above, then this doesn't happen/matter.
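For illustration, the difference the suggestion points at; some_setup_task is a hypothetical stand-in for whatever precedes the check in the chain, and check_alert_task is simplified from the PR:

from celery import chain, shared_task

@shared_task
def some_setup_task() -> int:  # hypothetical stand-in for the preceding task in the chain
    return 42

@shared_task
def check_alert_task(alert_id: int) -> None:  # simplified stand-in for the PR's task
    ...

# .s(): the parent task's return value (42 here) is prepended to check_alert_task's
# arguments, which is why the real task needed to accept *args/**kwargs.
chain(some_setup_task.s(), check_alert_task.s(1)).apply_async()

# .si(): immutable signature; the parent result is ignored, so the task keeps a
# plain (alert_id) signature.
chain(some_setup_task.si(), check_alert_task.si(1)).apply_async()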
Yeah, it worked, thanks!
@webjunkie Could you please merge this?
@webjunkie, gentle reminder, could you please merge this?
# Conflicts:
#   frontend/src/scenes/insights/insightSceneLogic.tsx
A lot of hanging TODOs that I believe should be removed and a couple of smaller comments
frontend/src/types.ts
@@ -4370,6 +4370,7 @@ export type HogFunctionInvocationGlobals = {
     >
 }

+// TODO: move to schema.ts
TODO?
I was planning to do it in a follow up PR, but yeah, I can fix it here, done
@@ -0,0 +1,10 @@
+{% extends "email/base.html" %} {% load posthog_assets %} {% block section %}
+<p>
+    Uh-oh, the <a href="{% absolute_uri alert_url %}">{{ alert_name }}</a> alert detected following anomalies for <a href="{% absolute_uri insight_url %}">{{ insight_name }}</a>:
Not blocking but "uh-oh" feels unnecessarily negative. The alert could be a positive thing.
Removed 👍
posthog/tasks/alerts/checks.py
check_alert(alert_id)

# TODO: make it a task
I don't think this needs to be a task. The .send function by default will queue a Celery task for the actual sending.
It makes sense, thanks, removed
@benjackwhite PTAL!
@webjunkie a gentle reminder, could you please take a look?
@webjunkie do you know why all E2E tests might fail with the "Error: missing API token, please run
@benjackwhite could you please review this?
Dismissing after Slack discussion (and Ben OOO today)
Problem
#14331
Changes
This PR adds an initial version of the alerts notifications job. In the next PRs I'll introduce
Does this work well for both Cloud and self-hosted?
Probably
How did you test this code?
Automatic + manual testing