fix(persons): limit to maximum of 2500 distinct_ids for cross-db join #18414
Conversation
Force-pushed from 561f090 to d89468e
posthog/models/person/person.py (Outdated)

```diff
@@ -34,7 +36,7 @@ def distinct_ids(self) -> List[str]:
             id[0]
             for id in PersonDistinctId.objects.filter(person=self, team_id=self.team_id)
             .order_by("id")
-            .values_list("distinct_id")
+            .values_list("distinct_id")[:MAX_LIMIT_DISTINCT_IDS]
```
So... order by is ascending by default and id is numeric...
So this is the oldest 2500 IDs.
So, what happens if I am interested in ID 2501?
E.g. "load me all events for person X" now becomes "load me all events for the first 2500 ids attached to person X".
I can't think of a good solution for this without person-on-events in place. And since the distinct_id-to-person mapping isn't in ClickHouse (I assume), you can't easily join.
But this affects everyone querying the Person model, even if they want all of the distinct ids.
(it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.
Yeah, I don't like this solution too much. Querying for 10,000+ distinct ids isn't really the issue as far as I can tell; it's when you pass them in somewhere that bad things happen (like into another query, or returning them via the API).
This also leads to an issue of not being able to trust person.distinct_ids to return the right things.
Sooo.. not sure what's best here, but if this is just for the events view for a given person, I'd scope the change to the event_query_runner. And if this query runs only on a given person's page, maybe the join is OK to do, since it doesn't affect the default event query, which has no person ID filter? It wouldn't be as fast, definitely, but correctness seems more important here, since events can be 'orphaned' otherwise. Or, programmatically switch between the optimal & non-optimal version depending on the number of distinct ids: fewer than 2000, go for the fast version, since all ids will be included; more, do the join. Best of both worlds - fast for all regular users, slow only for users doing questionable things, but still correct :P
How many users like this exist? Maybe the programmatic switch is overkill and just choosing the first 2500 is good enough.
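If we did go the switch route, a sketch of what it could look like (the threshold, counting step, and helper names are illustrative, not from the PR):

```python
SWITCH_THRESHOLD = 2500  # illustrative; could also be the 2000 mentioned above

def events_query_for_person(person):
    # Count the mapping rows cheaply first, then pick a strategy.
    num_ids = PersonDistinctId.objects.filter(
        person=person, team_id=person.team_id
    ).count()
    if num_ids <= SWITCH_THRESHOLD:
        # Fast path: inline the full distinct_id list into the events query.
        return build_query_with_inlined_ids(person.distinct_ids)  # hypothetical helper
    # Correct-but-slower path: join against the person mapping instead.
    return build_query_with_person_join(person.pk)  # hypothetical helper
```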
Thanks for the thoughtful responses! It's super helpful that you're keeping an eye out & adding context. The comments totally make sense and I adjusted the PR to apply the change only to the events list queries. I'm thinking it's okay to make the change here since we'd get a 500 for many distinct_ids otherwise. Let me know if you think otherwise.
> So... order by is ascending by default and id is numeric... So this is the oldest 2500 IDs
I initially thought this is what we want here, since many distinct_ids are usually caused by instrumentation issues, and the distinct_ids actually belonging to a person should be among the first ones (later ones not belonging to the person, or being completely random). But as the query is also used for the person events page, we want the last ids for events as well. Therefore I'm now unioning a mix of both.
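Roughly, that union could look like the sketch below (not the PR's exact code; the even first/last split is an assumption):

```python
from typing import List

def limited_distinct_ids(person, limit: int = 2500) -> List[str]:
    # Sketch: spend half the budget on the oldest ids (likely the "real"
    # ones) and half on the newest (needed for recent events), then
    # de-duplicate while preserving order.
    base = PersonDistinctId.objects.filter(person=person, team_id=person.team_id)
    half = limit // 2
    oldest = base.order_by("id").values_list("distinct_id", flat=True)[:half]
    newest = base.order_by("-id").values_list("distinct_id", flat=True)[:half]
    return list(dict.fromkeys([*oldest, *newest]))
```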
> (it is in ClickHouse too, the person_distinct_id2 table) - that's how funnels aggregate people.
I've been told we don't use a CH join here, since it's much slower in this case. A programmatic switch sounds like a good idea, and I heard there are other ideas for mitigating the problem in the future. However, I don't want to spend too much time on this right now and just want to get an intermediate fix in, so that we don't 500 for the persons query.
I also heard we want to get rid of the table in CH, and I'm wondering whether we could use the CH PostgreSQL table engine to make join queries instead. A quick POC seems promising.
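For reference, a join through ClickHouse's `postgresql()` table function could look roughly like this (a sketch of the idea only; host, credentials, and table/column names below are placeholders, not the actual POC):

```python
# Sketch: join events against the Postgres distinct_id mapping straight
# from ClickHouse via the postgresql() table function. All connection
# details and identifiers are placeholders.
POC_EVENTS_FOR_PERSON = """
SELECT e.uuid, e.event, e.timestamp
FROM events AS e
INNER JOIN postgresql(
    'postgres-host:5432', 'posthog', 'posthog_persondistinctid', 'user', 'password'
) AS pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = %(team_id)s
  AND pdi.person_id = %(person_id)s
"""
```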
> How many users like this exist? Maybe programmatic switch is overkill and just choosing the first 2500 is good enough
So, I did dig a bit into the data, and it seems ~1.5% of teams have persons with more than 2,500 distinct_ids, and ~0.001% of persons are affected.
For a sampling of the data, the curve flattens quickly:

| 500 | 1000 | 2000 | 2500 | 3000 | 3500 | 4000 | 4500 |
|---|---|---|---|---|---|---|---|
| 15835 | 8726 | 4143 | 3123 | 2403 | 1934 | 1538 | 1252 |
ace 👍
> However I don't want to spend too much time on this right now and just get an intermediate fix in, so that we don't 500 for the persons query.
🤘
Force-pushed from 6343577 to deee8bc
I'd worry we're swapping one subtle bug for another... but it also seems like you all are aware this could be better.
Totally on board with getting over the initial bug, and we've got the context captured in the comment and this PR 🥇
Problem
See #18253.
Changes
This adds a maximum limit for distinct_ids when they are used in a cross-db subquery. ClickHouse queries are limited to 256 KB in size, so we could fit around 5k distinct_ids (assuming an average size of 46 bytes). This PR limits the distinct_ids to 2500.
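The size budget, roughly (the 46-byte average is the assumption above; 256 KiB matches ClickHouse's default max_query_size):

```python
MAX_QUERY_BYTES = 256 * 1024   # ClickHouse's default max_query_size (256 KiB)
AVG_DISTINCT_ID_BYTES = 46     # assumed average distinct_id size, per above

# ~5.7k ids would fit in theory; capping at 2500 leaves headroom
# for the rest of the query text.
max_ids = MAX_QUERY_BYTES // AVG_DISTINCT_ID_BYTES  # 5698
```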
How did you test this code?
CI run & Django shell