feat(alerts): add a regular job to detect anomalies #22762
In insightNavLogic, the optional third URL segment handled by `urlToAction` is renamed from `:subscriptionId` to `:itemId`:

```diff
@@ -274,7 +274,7 @@ export const insightNavLogic = kea<insightNavLogicType>([
         },
     })),
     urlToAction(({ actions }) => ({
-        '/insights/:shortId(/:mode)(/:subscriptionId)': (
+        '/insights/:shortId(/:mode)(/:itemId)': (
             _, // url params
             { dashboard, ...searchParams }, // search params
             { filters: _filters } // hash params
```

Inline review comment on this change: As per #22554 (comment).
In the frontend types, an `AnomalyCondition` interface is added next to the existing HogFunction types, with a TODO to move it to schema.ts (the hunk shown here is truncated):

```diff
@@ -4273,6 +4273,7 @@ export type HogFunctionStatus = {
     }[]
 }

+// TODO: move to schema.ts
+export interface AnomalyCondition {
+    absoluteThreshold: {
+        lower?: number
```

Reviewer: TODO?

Author: I was planning to do it in a follow-up PR, but yeah, I can fix it here. Done.
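For orientation: the Python task below validates the stored `alert.anomaly_condition` JSON with `AnomalyCondition.model_validate(...)` and reads `absoluteThreshold.lower` / `absoluteThreshold.upper`, so the model generated into `posthog.schema` should mirror the interface above. A minimal sketch of what such a pydantic model could look like — the nested class name and the numeric field types are assumptions, not the generated code:

```python
# Hypothetical sketch of the pydantic model backing the AnomalyCondition
# interface above; names other than AnomalyCondition and absoluteThreshold
# are assumptions.
from typing import Optional

from pydantic import BaseModel


class AbsoluteThreshold(BaseModel):
    lower: Optional[float] = None
    upper: Optional[float] = None


class AnomalyCondition(BaseModel):
    absoluteThreshold: AbsoluteThreshold


# Example: parse the JSON blob stored on an Alert row.
condition = AnomalyCondition.model_validate({"absoluteThreshold": {"lower": 10}})
assert condition.absoluteThreshold.lower == 10
assert condition.absoluteThreshold.upper is None
```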
A new 90-line task module is added (@@ -0,0 +1,90 @@):

```python
import structlog
from celery import shared_task
from django.utils import timezone

from posthog.api.services.query import ExecutionMode
from posthog.caching.calculate_results import calculate_for_query_based_insight
from posthog.email import EmailMessage
from posthog.hogql_queries.legacy_compatibility.flagged_conversion_manager import (
    conversion_to_query_based,
)
from posthog.models import Alert
from posthog.schema import AnomalyCondition

logger = structlog.get_logger(__name__)


def check_all_alerts() -> None:
    alerts = Alert.objects.all().only("id")
    for alert in alerts:
```
Reviewer: I don't know for sure, but this also feels like a scaling nightmare... We struggle sometimes to keep up with dashboard / insight refreshes in general, and this is another form of refresh, just with a higher demand on reliability. I think this would require strong co-ordination with @PostHog/team-product-analytics to make sure this fits in with their existing plans for improving background refreshing, otherwise this will hit scaling issues fast.

Author: I don't know the internals of PostHog, but in my experience this is the way to do this. I don't have experience with Celery, but I have experience with similar tools; it should scale horizontally pretty easily - add a separate queue for these events, increase the number of parallel tasks in flight, and add more servers if needed.

Reviewer: Just wanted to chime in here. I can take a look at this, but am currently busy with being on support for this sprint. I'll see what we can do.

Reviewer: Scaling Celery is not the issue, but ClickHouse will struggle and ultimately go down if suddenly 1000 simultaneous queries appear.

Reviewer: yep, was going to add: "should" is doing a lot of work in this sentence 😅

@webjunkie I'm too far removed from how query code and caching interact here. we already have one set of jobs that (is|should be) staying on top of having insight results readily available. does this use that cache? we should really overlap them, so we have one set of tasks keeping a cache warm and then another that reads the fast-access data in that cache for anomaly detection. humans aren't visiting insights once a minute, so we know this will generate sustained load.

we should totally, totally build this feature - it's long overdue. i'm not opposed to getting a simple version in just for our team or select beta testers so we can validate the flow, but this 100% needs an internal sponsor, since the work of rolling this out and scaling it simply can't be given to an external contributor (it wouldn't be fair or possible 🙈). i would love to be the internal sponsor, but it's both not possible and completely outside of my current wheelhouse. (these concerns might be addressed elsewhere - i've not dug in here at all 🙈)

Author: Can't I limit the number of Celery queries in flight? I understand this will introduce a throughput problem, but then, if the servers can't process N alerts each hour, maybe more read replicas or more servers are needed. I don't have much experience with column-oriented databases though, so it's just speculation.

Author: 🤷 Well, the query_runner has some "cache" substrings in its code, so one could assume... but I don't know.

Author: Just to clarify, it's once an hour.

Author: I completely agree, and I would be really happy to have a mentor on this task. BTW, an interesting data point - Mixpanel limits the number of alerts to 50 per project.

Reviewer: We will talk among @PostHog/team-product-analytics next week and discuss this regarding ownership and so on.

Reviewer: 👍 (same point, but thanks for the clarification :))
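To make the "separate queue / limited parallelism" idea from this thread concrete, here is a rough sketch of how the per-alert task could be throttled so ClickHouse never sees an unbounded burst of simultaneous queries. The queue name, rate limit, and worker invocation are illustrative assumptions, not what this PR implements:

```python
# Hypothetical variant of the fan-out in this module: route alert checks onto
# a dedicated Celery queue and cap their rate. Queue name and rate are assumptions;
# Alert and _check_alert are the model and helper defined in this file.
from celery import shared_task


@shared_task(ignore_result=True, rate_limit="120/h")  # per-worker rate cap
def _check_alert_task(id: int) -> None:
    _check_alert(id)


def check_all_alerts() -> None:
    for alert in Alert.objects.all().only("id"):
        # Send to a queue consumed by a small, fixed worker pool, e.g.:
        #   celery -A posthog worker -Q alert_checks --concurrency=4
        _check_alert_task.apply_async(args=[alert.id], queue="alert_checks")
```

Bounding the consumer pool (rather than the producer) keeps the hourly fan-out cheap while capping how many insight queries hit ClickHouse at once.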
```python
        logger.info("scheduling alert", alert_id=alert.id)
        _check_alert_task.delay(alert.id)


@shared_task(ignore_result=True)
def _check_alert_task(id: int) -> None:
    _check_alert(id)


def _check_alert(id: int) -> None:
    alert = Alert.objects.get(pk=id)

    insight = alert.insight
    with conversion_to_query_based(insight):
        calculation_result = calculate_for_query_based_insight(
            insight,
            execution_mode=ExecutionMode.RECENT_CACHE_CALCULATE_BLOCKING_IF_STALE,
            user=None,
        )

    if not calculation_result.result:
        raise RuntimeError(f"no results for alert {alert.id}")

    anomaly_condition = AnomalyCondition.model_validate(alert.anomaly_condition)
    thresholds = anomaly_condition.absoluteThreshold

    result = calculation_result.result[0]
    aggregated_value = result["aggregated_value"]
    anomalies_descriptions = []

    if thresholds.lower is not None and aggregated_value < thresholds.lower:
        anomalies_descriptions += [
            f"The trend value ({aggregated_value}) is below the lower threshold ({thresholds.lower})"
        ]
    if thresholds.upper is not None and aggregated_value > thresholds.upper:
        anomalies_descriptions += [
            f"The trend value ({aggregated_value}) is above the upper threshold ({thresholds.upper})"
        ]

    if not anomalies_descriptions:
        logger.info("no anomalies", alert_id=alert.id)
        return

    _send_notifications(alert, anomalies_descriptions)


# TODO: make it a task
def _send_notifications(alert: Alert, anomalies_descriptions: list[str]) -> None:
    subject = f"PostHog alert {alert.name} has anomalies"

    campaign_key = f"alert-anomaly-notification-{alert.id}-{timezone.now().timestamp()}"
    insight_url = f"/project/{alert.team.pk}/insights/{alert.insight.short_id}"
    alert_url = f"{insight_url}/alerts/{alert.id}"
    message = EmailMessage(
```
Reviewer: I think this is definitely not what we should do, for a bunch of reasons. We are building a new generic delivery system for the CDP (webhooks etc.), which would be the right place to have a destination, and I think this could play into that. I don't want to pour water on the fire that is getting this work done, as it's super cool 😅, but I know that immediately we will have configuration and scaling issues here that I'm not sure we want to support. I'm wondering if instead we could have an in-app-only alert for now, which we can then later hook up to the delivery service?

Author: Hmm, I'd argue here.
- It's in my plans to allow changing the frequency of the notifications. You can check the TODO list here.
- Ok, I misinterpreted this in the first place; you suggest email-only is not a typical way. Can't agree or disagree here, I don't know.
- Don't quite understand, wdym here? A screen of ongoing alerts?

I'd argue that notifications are the most important part of the alerts module, and honestly I really wouldn't want to be blocked on the CDP development, especially given how cheap sending emails is. Once the CDP is launched, I don't think it'd be difficult to migrate, right? I'll do it myself when needed. OTOH, if it's planned to launch soon (this month), I could wait.

Reviewer: No worries at all, thanks for looking at this!
```python
        campaign_key=campaign_key,
        subject=subject,
        template_name="alert_anomaly",
        template_context={
            "anomalies_descriptions": anomalies_descriptions,
            "insight_url": insight_url,
            "insight_name": alert.insight.name,
            "alert_url": alert_url,
            "alert_name": alert.name,
        },
    )
    targets = list(filter(len, alert.target_value.split(",")))
    if not targets:
        raise RuntimeError(f"no targets configured for the alert {alert.id}")
    for target in targets:
        message.add_recipient(email=target)

    logger.info(f"Send notifications about {len(anomalies_descriptions)} anomalies", alert_id=alert.id)
    message.send()
```
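The module carries a `# TODO: make it a task` note above `_send_notifications`. A hedged sketch of that follow-up, assuming notifications should be queued rather than sent inline from `_check_alert` — the task name and signature are assumptions:

```python
# Hypothetical follow-up for the "# TODO: make it a task" note above.
# Only plain values cross the task boundary; the Alert row is re-fetched
# inside the worker instead of serializing the ORM object into the broker.
from celery import shared_task


@shared_task(ignore_result=True)
def send_notifications_task(alert_id: int, anomalies_descriptions: list[str]) -> None:
    alert = Alert.objects.get(pk=alert_id)
    _send_notifications(alert, anomalies_descriptions)


# _check_alert would then enqueue instead of calling directly:
#     send_notifications_task.delay(alert.id, anomalies_descriptions)
```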
A further inline review comment on the diff: Just a redundant field.
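Finally, the diff itself does not show how `check_all_alerts` is triggered; per the discussion above it is meant to run once an hour. A minimal sketch of that wiring with Celery beat, using a standalone app for illustration — the schedule entry name, task path, and module layout are assumptions, and PostHog's actual scheduler setup may differ:

```python
# Hypothetical sketch: run check_all_alerts at the top of every hour via Celery beat.
from celery import Celery, shared_task
from celery.schedules import crontab

app = Celery("posthog")


@shared_task(ignore_result=True)
def check_all_alerts_task() -> None:
    # check_all_alerts (from the module above) fans out one _check_alert_task per alert.
    check_all_alerts()


app.conf.beat_schedule = {
    "check-all-alerts-hourly": {
        "task": "posthog.tasks.alerts.check_all_alerts_task",  # assumed task path
        "schedule": crontab(minute=0),
    },
}
```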