Understanding Persons On Events (PoE) #173

Closed
mariusandra opened this issue Jan 11, 2024 · 25 comments
@mariusandra
Contributor

mariusandra commented Jan 11, 2024

This should move to docs, putting my notes here now for quick access.

PostHog has two operating modes for queries that use person properties, such as "filter by users whose email ends with @gmail.com".

  1. Mode "PoE disabled": Person and event data are kept in separate tables, and JOIN-ed when queried. This is slow, as we need to read and compare a lot of data. We always use the latest properties of a person when querying in this mode.

  2. Mode "PoE enabled": A cached snapshot of the person's properties is stored on the event. When querying, we read the data on the event without making a costly JOIN. The query matches the person's properties at the time of the event, not as they are now.
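To make the difference concrete, here is a minimal Python sketch of the two modes. It is a toy in-memory model, not PostHog's storage or query engine, and the names `persons`, `events`, `capture`, `query_with_join`, and `query_with_poe` are all hypothetical.

```python
# Toy model: one person whose email changes between two events.
persons = {"p1": {"email": "old@gmail.com"}}
events = []


def capture(person_id, name):
    # "PoE enabled" keeps a snapshot of the person's properties on the event.
    events.append({
        "person_id": person_id,
        "event": name,
        "person_properties": dict(persons[person_id]),
    })


def query_with_join(suffix):
    # "PoE disabled": look up the person's latest properties at query time.
    return [e["event"] for e in events
            if persons[e["person_id"]]["email"].endswith(suffix)]


def query_with_poe(suffix):
    # "PoE enabled": read the snapshot stored on the event, no join needed.
    return [e["event"] for e in events
            if e["person_properties"]["email"].endswith(suffix)]


capture("p1", "first visit")
persons["p1"]["email"] = "new@example.com"  # the person later changes email
capture("p1", "second visit")

print(query_with_join("@gmail.com"))  # []
print(query_with_poe("@gmail.com"))   # ['first visit']
```

The join-based query sees only the current email, so neither event matches. The snapshot-based query still matches the first event, because that event recorded the email as it was at the time.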

You can toggle between these modes under project settings:

image

Turning "PoE on" makes queries anywhere from 3x to 10x faster, with larger datasets seeing the biggest wins.

However, you might need to update your code to be compatible with "PoE".

How to send events with PoE on.

Problems arise if you have two types of users: anonymous (logged out) and signed in. You must make sure that the first event made by the signed in user contains a reference to the anonymous user.

If you're only sending events from the frontend, everything is handled for you, provided you call posthog.identify() as soon as you have the new ID of the user.

// Pseudocode: the first argument of each call is the distinct ID the event is recorded under.
case = 'AD_web'

// posthog-js
posthog.capture(`${case}_anon`, '$pageview', {"lib": "web"})
posthog.capture(`${case}_anon`, 'other event', {"lib": "web"})
posthog.capture(`${case}_anon`, 'signup page', {"lib": "web"})

// frontend signup happens here, we get the new ID

// posthog-js
posthog.identify(`${case}_id`, {"lib": "web"})
posthog.capture(`${case}_id`, 'frontend signup', {"lib": "web"})
posthog.capture(`${case}_id`, '$pageview', {"lib": "web"})
image

If your flow demands a backend signup event, the flow above will fail:

case = 'AD_not'

// posthog-js
posthog.capture(`${case}_anon`, '$pageview', {"lib": "web"})
posthog.capture(`${case}_anon`, 'other event', {"lib": "web"})
posthog.capture(`${case}_anon`, 'signup page', {"lib": "web"})

# in the python backend library
posthog.capture(f'{case}_id', 'backend signup', {"lib": "backend"})

// posthog-js
posthog.identify(`${case}_id`, {"lib": "web"})
posthog.capture(`${case}_id`, '$pageview', {"lib": "web"})
Screenshot 2024-01-11 at 13 49 31

The first event sent by {case}_id did not contain the anonymous user's ID, so we could not link the users. By the time we got the ID with the frontend identify event, the users were already created and could not be linked.

To get around this, pass the user's anonymous ID to your backend, and send a backend $identify event.

case = 'AD_ok'

// posthog-js
posthog.capture(`${case}_anon`, '$pageview', {"lib": "web"})
posthog.capture(`${case}_anon`, 'other event', {"lib": "web"})
posthog.capture(`${case}_anon`, 'signup page', {"lib": "web"})

# in the python backend library
posthog.capture(f'{case}_id', '$identify', {"$anon_distinct_id": f"{case}_anon", "lib": "backend"})
posthog.capture(f'{case}_id', 'backend signup', {"lib": "backend"})

// posthog-js
posthog.identify(`${case}_id`, {"lib": "web"})
posthog.capture(`${case}_id`, '$pageview', {"lib": "web"})
image
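The difference between the failing and the working flow above can be reproduced with a toy model of ingestion. This is a deliberate simplification and not PostHog's actual implementation: `ingest` and its linking rule are hypothetical, written only to mirror the behaviour described in this post (an ID is linked only if its first event references an already-known anonymous ID).

```python
def ingest(events):
    """Stamp each event with a person ID at ingestion time (toy model)."""
    person_of = {}  # distinct_id -> person_id
    stamped = []
    for distinct_id, name, props in events:
        anon_id = props.get("$anon_distinct_id")
        if (name == "$identify" and anon_id in person_of
                and distinct_id not in person_of):
            # First event for this ID references a known anonymous ID: link them.
            person_of[distinct_id] = person_of[anon_id]
        elif distinct_id not in person_of:
            # Unknown ID without a usable reference: a new person is created.
            person_of[distinct_id] = f"person_{len(set(person_of.values())) + 1}"
        stamped.append((name, person_of[distinct_id]))
    return stamped


# AD_not: the backend signup arrives before any $identify for the new ID.
not_linked = ingest([
    ("AD_not_anon", "$pageview", {}),
    ("AD_not_id", "backend signup", {}),
    ("AD_not_id", "$identify", {"$anon_distinct_id": "AD_not_anon"}),
    ("AD_not_id", "$pageview", {}),
])

# AD_ok: the backend sends $identify with the anonymous ID first.
linked = ingest([
    ("AD_ok_anon", "$pageview", {}),
    ("AD_ok_id", "$identify", {"$anon_distinct_id": "AD_ok_anon"}),
    ("AD_ok_id", "backend signup", {}),
    ("AD_ok_id", "$pageview", {}),
])

print(sorted({person for _, person in not_linked}))  # ['person_1', 'person_2']
print(sorted({person for _, person in linked}))      # ['person_1']
```

In the first flow the events end up stamped with two different persons. In the second, every event carries the same person.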

To get the anonymous user's ID on the frontend, call:

const anonDistinctId = posthog.get_distinct_id()

Then send this value to your backend, and submit an $identify event with it as the $anon_distinct_id property.
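On the backend, the handoff could look roughly like the sketch below. It only builds the event as a `(distinct_id, event, properties)` tuple, in the same argument order as the `posthog.capture(...)` calls in this post, and leaves the actual sending to your client library, whose signature may differ between versions. The helper `build_backend_identify` and the `anon_distinct_id` form field are hypothetical names.

```python
def build_backend_identify(user_id, anon_distinct_id):
    """Return (distinct_id, event, properties) for the backend $identify event."""
    if not anon_distinct_id:
        # Without the anonymous ID there is nothing to link the new user to.
        raise ValueError("anonymous distinct ID is required to link the users")
    return (
        user_id,
        "$identify",
        {"$anon_distinct_id": anon_distinct_id, "lib": "backend"},
    )


# Assumption: the frontend posted its posthog.get_distinct_id() value
# together with the signup form, under a field named "anon_distinct_id".
signup_form = {"email": "jane@example.com", "anon_distinct_id": "AD_ok_anon"}
identify_event = build_backend_identify("AD_ok_id", signup_form["anon_distinct_id"])
print(identify_event)
```

Send this event before the "backend signup" event, so that the first event recorded under the new ID carries the reference to the anonymous user.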

Note about querying PoE special fields

The following table might be helpful when debugging your events. These are all fields you can select on the events table:

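| Data              | Project setting       | From event         | Via join                  |
|-------------------|-----------------------|--------------------|---------------------------|
| Distinct ID       | distinct_id           | distinct_id        | distinct_id               |
| Person ID         | person_id             | poe.id             | pdi.person_id             |
| Person properties | person.properties.foo | poe.properties.foo | pdi.person.properties.foo |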
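As a debugging aid, the fields in the table can be selected side by side to see where the three resolution paths disagree. The sketch below only builds the query text. The helper `build_debug_query` is hypothetical, and the `SELECT ... FROM events LIMIT n` shape is an assumption that has not been checked against a live instance.

```python
def build_debug_query(property_name, limit=10):
    """Build a SELECT that compares the three ways of resolving a person."""
    if not property_name.isidentifier():
        # Keeps the sketch simple: other property names would need quoting.
        raise ValueError(f"unsupported property name: {property_name!r}")
    columns = [
        "distinct_id",
        "person_id",      # resolved according to the project setting
        "poe.id",         # person ID stored on the event
        "pdi.person_id",  # person ID resolved via join
        f"person.properties.{property_name}",
        f"poe.properties.{property_name}",
        f"pdi.person.properties.{property_name}",
    ]
    return f"SELECT {', '.join(columns)} FROM events LIMIT {int(limit)}"


print(build_debug_query("email"))
```

Rows where `poe.id` and `pdi.person_id` differ are events whose stored person does not match the person found via the join.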

What does not work with PoE enabled?

Currently, the only thing that really doesn't work is tracking the anonymous part of a returning signed-up visitor's session. For example:

User's visit 1: 5 anonymous pages + signup + signed in pages
User's visit 2: 2 anonymous pages + login page + signed in pages

In this case, the 2 anonymous pages from visit 2 wouldn't be associated with the user. They'd remain "stuck" on the anonymous user.

Note about the future of PoE

We're working hard on removing the required workaround with passing the person's details to your backend, and also adding the ability to track the anonymous part of each recurring visit. Stay tuned!

@tiina303
Contributor

This is great. Thanks for writing it up ❤️ Just a nit:

> The first event sent by {case}_id did not contain the anonymous user's ID, so we could not link the users. By the time we got the ID with the frontend identify event, the users were already created and could not be linked.

The "could not be linked" wording could be a bit confusing. I'm not exactly sure what wording is best, but maybe we can just say that, from the PoE perspective, there are now two different users and e.g. funnels wouldn't combine them.

> posthog.capture(f'{case}_id', '$identify', {"$anon_distinct_id": f"{case}_anon", "lib": "backend"})
>
> Then send this value to your backend, and submit an $identify event with it as the $anon_distinct_id property.

Optional: in the backend, we suggest folks use $create_alias events instead. What's important here is that the id is the new backend id (for either alias or identify usage), so that future events in the same session go to the same Kafka bucket and hence can't be processed before the alias event. Something like this:

posthog.capture(f'{case}_id', '$create_alias', {"alias": f"{case}_anon", "lib": "backend"})
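A minimal sketch of that ordering, assuming the same `(distinct_id, event, properties)` shape as the calls in this thread. The helper `backend_signup_events` is a hypothetical name, and the events are only built here, not sent.

```python
def backend_signup_events(user_id, anon_distinct_id):
    """Alias first, then the signup event, both under the new backend ID."""
    return [
        # The anonymous ID goes into the "alias" property, not the distinct ID.
        (user_id, "$create_alias", {"alias": anon_distinct_id, "lib": "backend"}),
        (user_id, "backend signup", {"lib": "backend"}),
    ]


batch = backend_signup_events("AD_ok_id", "AD_ok_anon")
for distinct_id, event, properties in batch:
    print(distinct_id, event, properties)
```

Both events share the new backend ID as their distinct ID, which is the property the comment above relies on to keep later events from being processed before the alias.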

@asteinlein

Will this Just Work™ when using the Segment integration? We're using their JS SDK in the frontend, and their Python lib in the backend (where we send user ID with every event track call).

@mariusandra
Contributor Author

@asteinlein I really don't know, depends on what you're sending over and if it matches what's written above or not 🤷

@corywatilo
Contributor

Who'd like to write this up? This would be a great addition to the docs. =]

cc @PostHog/marketing (did I tag this right?)

@ivanagas
Contributor

I can do it 😄

@ivanagas self-assigned this on Feb 12, 2024

@joshforbes

> Will this Just Work™ when using the Segment integration? We're using their JS SDK in the frontend, and their Python lib in the backend (where we send user ID with every event track call).

If your flow is similar to mine, I don't think it will. Our flow is:

  • run segment js and posthog js on our marketing site
  • anon track all events before sign up
  • on sign up, submit a form to a backend running segment on the server
  • backend creates a user plus organization objects and calls segment.identify with the database user id
  • backend returns the database user id to the marketing site in the form response payload
  • marketing site uses segment js to call identify with the backend id

If I understand correctly, this is precisely the flow that will break PoE. To fix it, we would have to send the anon ID to the backend as part of the signup form and then use that in the segment server identify call. Though tbh... I have no idea how I would include "$anon_distinct_id" in the segment identify call in a way that posthog would use. 🤷‍♂️

@asteinlein

> > Will this Just Work™ when using the Segment integration? We're using their JS SDK in the frontend, and their Python lib in the backend (where we send user ID with every event track call).
>
> If your flow is similar to mine, I don't think it will. Our flow is:
>
>   • run segment js and posthog js on our marketing site
>   • anon track all events before sign up
>   • on sign up, submit a form to a backend running segment on the server
>   • backend creates a user plus organization objects and calls segment.identify with the database user id
>   • backend returns the database user id to the marketing site in the form response payload
>   • marketing site uses segment js to call identify with the backend id
>
> If I understand correctly, this is precisely the flow that will break PoE.

Indeed, that is exactly our use-case as well. And I would think that is a pretty common flow for users of PostHog + Segment?

> To fix it, we would have to send the anon ID to the backend as part of the signup form and then use that in the segment server identify call. Though tbh... I have no idea how I would include "$anon_distinct_id" in the segment identify call in a way that posthog would use. 🤷‍♂️

I haven't been following along here in detail, to be honest, but from afar it seems strange that this couldn't work. When you have a cookie/anon ID and subsequently identify it with a user ID, couldn't this be made to work? What makes this so special for PostHog compared to how Segment associates events with identified users in general?

@joshforbes

> Indeed, that is exactly our use-case as well. And I would think that is a pretty common flow for users of PostHog + Segment?

Yeah agreed. I'm fairly certain that this is the flow that Segment recommends.

This is just conjecture but my reading of the final part of the post makes me think that they aren't going to force the switch to PoE until they have this fixed:

> Note about the future of PoE
>
> We're working hard on removing the required workaround with passing the person's details to your backend, and also adding the ability to track the anonymous part of each recurring visit. Stay tuned!

@tiina303
Contributor

Just FYI, we're actively working on improving the way this works in PostHog/posthog#20460, which should ship by the end of Q1.

The primary goal of this issue is for PoE query mode (in terms of unique users) to return exactly the same results as joins with the person and distinct_id tables.

@jclusso

jclusso commented Mar 21, 2024

@tiina303 is this still on track for end of Q1? Also, will past fired events be fixed?

@MarconLP
Member

> @tiina303 is this still on track for end of Q1? Also, will past fired events be fixed?

The release has been delayed. The current plan is to ship this change in the next couple of weeks.

@jclusso

jclusso commented Mar 27, 2024

> > @tiina303 is this still on track for end of Q1? Also, will past fired events be fixed?
>
> The release has been delayed. The current plan is to ship this change in the next couple of weeks.

What about past fired events? Will events fired in the past be queryable by person attributes, like location, that are set on the initial identify?

@tiina303
Contributor

tiina303 commented Apr 1, 2024

> What about past fired events? Will events fired in the past be queryable by person attributes, like location, that are set on the initial identify?

Yes, we have been writing person properties to events for a while and have backfilled the period before that.
Also, just to clarify: this is for properties at the time of the event.

@jclusso

jclusso commented Apr 1, 2024

> Also, just to clarify: this is for properties at the time of the event.

So if an anonymous user enters your site and then they get identified, you'll be able to filter the identified events by that data from the anonymous user. (ex: initial country)

@tiina303
Contributor

tiina303 commented Apr 1, 2024

> So if an anonymous user enters your site and then they get identified, you'll be able to filter the identified events by that data from the anonymous user. (ex: initial country)

Yes, assuming you have geoIP enabled (and using events with person processing - the default and only option until now), then we'd write the associated location data to the event. The note is more about the fact that if the user did that session in Germany and a later session in Austria, then the filtering would use Germany (i.e. at the time of the event), not Austria (i.e. current location value on the person object).

@jclusso

jclusso commented Apr 21, 2024

> > @tiina303 is this still on track for end of Q1? Also, will past fired events be fixed?
>
> The release has been delayed. The current plan is to ship this change in the next couple of weeks.

Is there any update on when this should be expected?

@amacneil

The beta project setting has disappeared, does that mean POE is enabled by default now?

@jclusso

jclusso commented Apr 22, 2024

> The beta project setting has disappeared, does that mean POE is enabled by default now?

I hope not, because if so, it's not working.

@tkaemming

  1. For Cloud, the setting was removed from the settings panel for now as we determine how to move forward with the rollout. Teams that were previously opted in via the user-controlled setting should have remained in the same state that they were prior to that change. If that didn't happen (or something else looks incorrect), reach out to us via support and we'll get things sorted out.
  2. For self-hosted, … yeah, this got broken accidentally. We'll fix that.

@mariusandra
Contributor Author

Update May 2024

We have now enabled the following section under "Project Settings" -> "Product Analytics"

image

This lets you choose whether you want person properties to be ingestion-time (faster) or current (slower), and whether you care about merged users (anon -> identified) being distinct or not.

@jclusso

jclusso commented May 6, 2024

> This lets you choose whether you want person properties to be ingestion-time (faster) or current (slower), and whether you care about merged users (anon -> identified) being distinct or not.

Any future plans to allow users to choose this option at the Insight level?

@mariusandra
Contributor Author

In some way you already can. Click "..." and "view source" from the top, then click the little "debug" link. The page that opens lets you specify this PoE setting on the insight level. Notice how changing it also changes the query.

Now you can just copy that back into the "view source" view and have the setting be applied per insight.

There's one caveat: we have a bug that prevents the view source dialog from saving. Once this is fixed, you should be able to set this per insight this way.

Whether we want to expose this in the UI or not is a different question 🤔

@andyvan-ph

@ivanagas feel free to re-open, but I'm assuming this is stale for now.

@andyvan-ph closed this as not planned on Jun 28, 2024

@NorfeldtKnowit

NorfeldtKnowit commented Jul 22, 2024

> Update May 2024
>
> We have now enabled the following section under "Project Settings" -> "Product Analytics"
>
> image
>
> This lets you choose whether you want person properties to be ingestion-time (faster) or current (slower), and whether you care about merged users (anon -> identified) being distinct or not.

@mariusandra That does not appear in our project settings?

@MarconLP
Member

Hey @NorfeldtKnowit, that option is unavailable for organizations created since June 2024. You can still access the PoE special fields from the post above.
