feat(data-modeling): Data modelling django models and API #24232

tomasfarias · 2024-08-07T09:12:25Z

Problem

This PR contains the initial django model used to support data modeling as well as a set of initial API methods intended to provide some value while we build on more modeling features.

NOTE: Internally, I have been referring to the feature as data modeling to avoid confusion with django models, but from a user perspective we still want them to be creating "data models".

Data modeling boils down to:

1. A query.
   - Just a SELECT query, no other commands allowed (CTEs, JOINs are fine).
   - This dictates the data and how it is to be modeled.
2. A path to the modeled data through all of its ancestors (all model paths can be used to form a DAG).

With these two objects, we can later build tools to "run" the modeled data, aka materialize it, while resolving each model's dependencies using its paths. Moreover, the paths can be used to present the user with a DAG of their modeling.
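As a rough illustration of how paths can both describe a DAG and drive dependency resolution, the sketch below uses plain Python lists as stand-ins for stored paths. The path values and the helper name `parents_from_paths` are hypothetical, and `graphlib.TopologicalSorter` is used only for demonstration; it is not necessarily what this PR or any later tooling relies on.

```python
from graphlib import TopologicalSorter

# Hypothetical model paths: each path lists a model's ancestors from a root
# table down to the model itself (the last label).
paths = [
    ["events", "daily_events"],
    ["persons", "active_persons"],
    ["events", "daily_events", "daily_active"],
    ["persons", "active_persons", "daily_active"],
]


def parents_from_paths(paths):
    """Map every node to the set of its immediate parents."""
    parents = {}
    for path in paths:
        # Root labels have no parents but must still appear as nodes.
        parents.setdefault(path[0], set())
        for parent, child in zip(path, path[1:]):
            parents.setdefault(child, set()).add(parent)
    return parents


parents = parents_from_paths(paths)
# One valid order in which to materialize models: parents always come first.
run_order = list(TopologicalSorter(parents).static_order())
```

Any order produced this way places `events` before `daily_events` and both parents before `daily_active`, which is the property a future "run" step would need.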

Changes

This first draft contains the django model used to store model paths, as well as a model manager that deals with their creation. Moreover, I've included some API methods to provide some value to users while we work on the rest of the feature set. These API methods can be used to construct a DAG or display parents and children.

The queries that represent models are backed by the already existing DataWarehouseSavedQuery. Importantly, we had to allow nested views to support modeling.

Questions

Should we transform all existing DataWarehouseSavedQuery into models, or should they be separate? An earlier commit used a separate DataWarehouseModel to store model queries, so it's possible to revert to that.
- The main problem is that saved queries are just models that are not materialized, so it could be confusing for users if we have to teach them that there are two different ways to define a query.
- ANSWER: I've opted to go with the existing DataWarehouseSavedQuery, which saves us from duplicating a lot of code. Moving forward, we will create model paths for each new DataWarehouseSavedQuery. However, we do so in a safe fashion for now (without failing DataWarehouseSavedQuery creation).
What format should the DAG be returned as?
- ANSWER: This will likely be dictated by whatever JS library the frontend uses to display the DAG. For the time being, I've coded up a very simple class that holds a set of edges and nodes, and can be serialized to JSON in a straight-forward way. However, this can be changed later depending on what the frontend needs.
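A minimal sketch of what such a container could look like, assuming string node identifiers and `(parent, child)` tuples for edges. The class name `SimpleDAG` and its methods are hypothetical and are not taken from the PR's code.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDAG:
    """Hypothetical container for a DAG as a set of nodes and a set of edges."""

    nodes: set[str] = field(default_factory=set)
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_edge(self, parent: str, child: str) -> None:
        # Adding an edge implicitly registers both endpoints as nodes.
        self.nodes.update((parent, child))
        self.edges.add((parent, child))

    def to_json(self) -> str:
        # The standard json module cannot serialize sets, and sets have no
        # stable order, so convert them to sorted lists first.
        return json.dumps(
            {"nodes": sorted(self.nodes), "edges": [list(edge) for edge in sorted(self.edges)]}
        )


dag = SimpleDAG()
dag.add_edge("events", "daily_events")
dag.add_edge("daily_events", "daily_active")
serialized = dag.to_json()
```

Sorting before serialization keeps the output deterministic, which makes API responses easier to snapshot in tests, whatever shape the frontend ends up needing.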

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Unit tests for the manager are included.
Unit tests for the API are included as well.

tomasfarias · 2024-08-13T10:20:55Z

posthog/hogql/context.py

+    # How many nested views do we support on this query? If `None`, no limit.
+    max_view_depth: int | None = 1


Allow nested views.

Did you mean to make this None/remove this and the check below in the resolver?

Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?

I'd be fine with removing it. Just wanted to keep the check in case we would want to enforce a view depth somewhere else, but for modeling purposes I think there is no limit (or a very high one for sanity reasons).

tomasfarias · 2024-08-13T10:21:27Z

posthog/warehouse/api/modeling.py

+        """Return this team's DAG as a set of edges and nodes"""
+        dag = DataWarehouseModelPath.objects.get_dag(self.team)
+
+        return response.Response({"edges": dag.edges, "nodes": dag.nodes})


DAG serialization is not final; the optimal structure depends on how we would display it.

tomasfarias · 2024-08-13T10:22:12Z

posthog/warehouse/api/saved_query.py

+            try:
+                DataWarehouseModelPath.objects.create_from_saved_query(view)
+            except Exception:
+                # For now, do not fail saved query creation if we cannot model-ize it.
+                # Later, after bugs and errors have been ironed out, we may tie these two
+                # closer together.
+                logger.exception("Failed to create model path when creating view %s", view.name)


Being safe here, didn't want to disrupt saved query functionality in case of bugs.

EDsCODE

Looks good! Postgres ltree type makes sense imo. @Gilbert09 I know you had mentioned something about doing the pathing in hogql. Just mentioning here in case you wanted to hash that out before this merges

posthog/warehouse/models/test/test_modeling.py

EDsCODE · 2024-08-13T19:37:27Z

posthog/warehouse/models/modeling.py

+                        else:
+                            parent_id = parent_query.id.hex
+
+                    cursor.execute(UPDATE_PATHS_QUERY, params={**{"child": label, "parent": parent_id}, **base_params})


posthog/warehouse/models/modeling.py

EDsCODE · 2024-08-13T20:06:02Z

posthog/hogql/context.py

+    # How many nested views do we support on this query? If `None`, no limit.
+    max_view_depth: int | None = 1


Did you mean to make this None/remove this and the check below in the resolver?

EDsCODE · 2024-08-13T20:07:08Z

posthog/hogql/resolver.py

@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr):
            if isinstance(database_table, SavedQuery):
                self.current_view_depth += 1

-                if self.current_view_depth > self.context.max_view_depth:
+                if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth:


If we're supporting these now, we should just remove this completely. We may end up with some heavy queries temporarily. Also related to the handling cycles comment below.

Happy to remove. I think I was just being overly cautious in case there was some use case for limiting query depth sometimes.

Main reason was just performance so no other gotchas
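For readers following this thread, here is a standalone sketch of the optional-limit check being discussed. The `Context` dataclass, `ViewDepthExceeded` exception, and `check_view_depth` function are hypothetical stand-ins; only the condition itself mirrors the diff above.

```python
from dataclasses import dataclass
from typing import Optional


class ViewDepthExceeded(Exception):
    """Raised when views are nested deeper than the configured limit."""


@dataclass
class Context:
    # `None` means there is no limit on nested views.
    max_view_depth: Optional[int] = 1


def check_view_depth(context: Context, current_view_depth: int) -> None:
    """Raise if the current nesting depth exceeds the optional limit."""
    if context.max_view_depth is not None and current_view_depth > context.max_view_depth:
        raise ViewDepthExceeded(f"Nested views are limited to a depth of {context.max_view_depth}")
```

With `Context(max_view_depth=None)` the check never raises, which is why dropping the field and the `if` statement altogether is equivalent once no caller sets a limit.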

Gilbert09

My opinion on the LTree DAG system still stands. I don't see the benefit of running SQL queries to generate a graph of table names when we've already done the heavy lifting of pulling out the table names from HogQL (I also still think that we should be doing this on the fly instead of storing it). Generally, I just think this is too complex and adds overhead. I won't block this going in, but I do think it'll slow us down moving forward.

Gilbert09 · 2024-08-14T14:37:25Z

posthog/warehouse/models/modeling.py

@@ -0,0 +1,506 @@
+import collections.abc


This is all a lot of logic, most of which I'm struggling to understand due to the LTree extension. Considering we already have the logic to pull a series of tables from a HogQL query, I'm not entirely sure why we need to store them as DAGs here. The majority of this file is executing obscure SQL - I just don't think we really need this at all, especially not at the stage we're building this, if ever.

Gilbert09 · 2024-08-14T14:39:34Z

posthog/warehouse/api/modeling.py

+        """Return this team's DAG as a set of edges and nodes"""
+        dag = DataWarehouseModelPath.objects.get_dag(self.team)
+
+        return response.Response({"edges": dag.edges, "nodes": dag.nodes})


I don't think this is a good API to expose - I can't think of any UI that's gonna need to render the DAGs directly as "edges" and "nodes"

A set of nodes and edges is a very common representation of a DAG, but I am happy to leave this out until we decide on the frontend that's going to consume this API.

Gilbert09 · 2024-08-14T14:40:11Z

posthog/hogql/context.py

+    # How many nested views do we support on this query? If `None`, no limit.
+    max_view_depth: int | None = 1


Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?

Gilbert09 · 2024-08-14T14:40:44Z

posthog/warehouse/api/modeling.py

+class DataWarehouseModelPathViewSet(TeamAndOrgViewSetMixin, viewsets.ReadOnlyModelViewSet):
+    scope_object = "INTERNAL"
+
+    queryset = DataWarehouseModelPath.objects.all()
+    serializer_class = DataWarehouseModelPathSerializer


Do we need this too right now?

Gilbert09 · 2024-08-14T14:41:39Z

posthog/warehouse/api/saved_query.py

+            view.save()
+
+            try:
+                DataWarehouseModelPath.objects.create_from_saved_query(view)


I still believe that we shouldn't need to store these; they can be computed on the fly. The more moving parts we have in the system, the more points of failure exist.

Gilbert09 · 2024-08-14T14:42:48Z

posthog/warehouse/api/saved_query.py

+    def ancestors(self, request: request.Request, *args, **kwargs) -> response.Response:
+        """Return the ancestors of this saved query.
+
+        By default, we return the immediate parents. The `level` parameter can be used to
+        look further back into the ancestor tree. If `level` overshoots (i.e. points to only
+        ancestors beyond the root), we return an empty list.
+        """
+        level = request.data.get("level", 1)
+
+        saved_query = self.get_object()
+        saved_query_id = saved_query.id.hex
+        lquery = f"*{{{level},}}.{saved_query_id}"
+
+        paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery)
+
+        if not paths:
+            return response.Response({"ancestors": []})
+
+        ancestors = set()
+        for model_path in paths:
+            offset = len(model_path.path) - level - 1  # -1 corrects for level being 1-indexed
+
+            if offset < 0:
+                continue
+
+            ancestors.add(model_path.path[offset])
+
+        return response.Response({"ancestors": ancestors})
+
+    @action(methods=["POST"], detail=True)
+    def descendants(self, request: request.Request, *args, **kwargs) -> response.Response:
+        """Return the descendants of this saved query.
+
+        By default, we return the immediate children. The `level` parameter can be used to
+        look further ahead into the descendants tree. If `level` overshoots (i.e. points to only
+        descendants further than a leaf), we return an empty list.
+        """
+        level = request.data.get("level", 1)
+
+        saved_query = self.get_object()
+        saved_query_id = saved_query.id.hex
+
+        lquery = f"*.{saved_query_id}.*{{{level},}}"
+        paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery)
+
+        if not paths:
+            return response.Response({"descendants": []})
+
+        descendants = set()
+
+        for model_path in paths:
+            offset = model_path.path.index(saved_query_id) + level
+
+            if offset > len(model_path.path):
+                continue
+
+            descendants.add(model_path.path[offset])
+
+        return response.Response({"descendants": descendants})


I struggle to see a good use case for the frontend requesting certain levels of parents/children just yet. If we do need that, I imagine it'll be embedded in a different request for reading models as a whole. What's the use case for these APIs right now?
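To make the `level` semantics of the two endpoints easier to follow, the offset arithmetic can be illustrated without Postgres. This is a simplified sketch, not the PR's implementation: plain lists stand in for ltree paths, a Python membership test stands in for the `lquery` filter, and the function names and example paths are hypothetical.

```python
def ancestors_at_level(paths, node, level=1):
    """Return the ancestors exactly `level` steps above `node`."""
    found = set()
    for path in paths:
        if node not in path:
            continue
        offset = path.index(node) - level
        # A negative offset means `level` overshoots the root of this path.
        if offset >= 0:
            found.add(path[offset])
    return found


def descendants_at_level(paths, node, level=1):
    """Return the descendants exactly `level` steps below `node`."""
    found = set()
    for path in paths:
        if node not in path:
            continue
        offset = path.index(node) + level
        # An offset at or past the end means `level` overshoots the leaf.
        if offset < len(path):
            found.add(path[offset])
    return found


example_paths = [
    ["events", "daily_events", "daily_active"],
    ["persons", "active_persons", "daily_active"],
]
```

For example, `ancestors_at_level(example_paths, "daily_active")` yields the two immediate parents, while a `level` of 3 overshoots both roots and yields an empty set.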

Gilbert09 · 2024-08-14T14:45:37Z

posthog/warehouse/models/test/test_modeling.py

+@pytest.mark.parametrize(
+    "query,parents",
+    [
+        ("select * from events, persons", {"events", "persons"}),
+        ("select * from some_random_view", {"some_random_view"}),
+        (
+            "with cte as (select * from events), cte2 as (select * from cte), cte3 as (select 1) select * from cte2",
+            {"events"},
+        ),
+        ("select 1", set()),
+    ],
+)


I'd like to see more examples tested here, e.g., using nested subqueries, CTEs, multiple joins, all combined - can this handle the most complex query? Take inspiration from some of the snapshots of insights and possibly use them as a base

EDsCODE · 2024-08-14T15:37:39Z

Ok, given the feedback round above, let's consider an alternative and see if it makes sense.

To recap, we want to track query relationships so that these derived tables can be built. The implementation above establishes the query relationships using Postgres ltrees. It stores these relationships in a model that can later be used to traverse saved query relationships and build intermediary tables as necessary.

Tom's pushback is that most of this traversal can be done on the fly with the hogql system because hogql will resolve tables and can traverse the query tree already.

I think the main thing to clear up here, @tomasfarias: is there anything you have in mind that would necessitate a distinct path model/logic, as opposed to piggybacking completely off the existing traversal logic from hogql parsing?
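For comparison, the on-the-fly alternative could look roughly like the sketch below. It assumes a mapping from each saved query to its direct parents is already available; in the proposal that mapping would come from the table names HogQL resolves, but here it is hand-written, and both `direct_parents` and `all_ancestors` are hypothetical names.

```python
def all_ancestors(direct_parents, name):
    """Collect every ancestor of `name` by walking direct parents iteratively."""
    seen = set()
    stack = list(direct_parents.get(name, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            # Already visited; this also keeps the walk finite if a cycle slips in.
            continue
        seen.add(current)
        stack.extend(direct_parents.get(current, ()))
    return seen


# Hand-written stand-in for "parents extracted from each saved query".
direct_parents = {
    "daily_active": {"daily_events", "active_persons"},
    "daily_events": {"events"},
    "active_persons": {"persons"},
}
```

The trade-off raised in this thread is visible here: nothing is stored, but every lookup re-walks the graph and depends on re-deriving `direct_parents` from the queries.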

posthog-bot · 2024-08-23T07:31:00Z

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

EDsCODE · 2024-08-27T23:27:03Z

posthog/hogql/resolver.py

@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr):
            if isinstance(database_table, SavedQuery):
                self.current_view_depth += 1

-                if self.current_view_depth > self.context.max_view_depth:
+                if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth:


Main reason was just performance so no other gotchas

tomasfarias · 2024-08-28T08:12:25Z

Main reason was just performance so no other gotchas

Cool, I've dropped max_view_depth completely then (and the associated if statement). This simplifies things overall.

tomasfarias · 2024-08-28T08:47:05Z

Had to rebase once again due to conflicts with master. Hopefully this is the last time.

Co-authored-by: Eric Duong <[email protected]>

posthog/warehouse/api/saved_query.py

Co-authored-by: Eric Duong <[email protected]>

tomasfarias marked this pull request as draft August 7, 2024 09:12

tomasfarias changed the title ~~wip: Data modelling django models~~ feat(wip): Data modelling django models Aug 7, 2024

tomasfarias force-pushed the feat/data-modeling-first-steps branch 6 times, most recently from 8019ab5 to d9ff1a5 Compare August 12, 2024 10:17

tomasfarias changed the title ~~feat(wip): Data modelling django models~~ feat(data-models): Data modelling django models and API Aug 13, 2024

tomasfarias force-pushed the feat/data-modeling-first-steps branch from 0029f68 to 3363944 Compare August 13, 2024 10:12

tomasfarias changed the title ~~feat(data-models): Data modelling django models and API~~ feat(data-modeling): Data modelling django models and API Aug 13, 2024

tomasfarias commented Aug 13, 2024

tomasfarias requested a review from a team August 13, 2024 10:24

tomasfarias marked this pull request as ready for review August 13, 2024 10:24

EDsCODE reviewed Aug 13, 2024

EDsCODE mentioned this pull request Aug 14, 2024

Sprint - Aug 19 to Aug 30 #24364

Closed

Gilbert09 reviewed Aug 14, 2024

EDsCODE self-requested a review August 15, 2024 16:50

posthog-bot added the stale label Aug 23, 2024

tomasfarias removed the stale label Aug 23, 2024

tomasfarias force-pushed the feat/data-modeling-first-steps branch 4 times, most recently from 2335656 to eea2ca5 Compare August 27, 2024 14:17

EDsCODE approved these changes Aug 27, 2024

tomasfarias force-pushed the feat/data-modeling-first-steps branch from 863eb1b to 623ba8e Compare August 28, 2024 08:46

tomasfarias and others added 22 commits August 28, 2024 13:52

feat(data-models): Setup initial django models and api

b9898ac

feat(data-models): Setup initial django models

b9981cc

fix: Create extension in migration

d7e3666

feat(data-models): Support for updating a model path

3a71489

fix(data-models): Update migration to defer constraint

1e65233

fix(data-models): Create extension in migration

42c9ce4

fix(data-models): Typing fixes

13cbecc

fix(data-models): Final cleanup update

7b33f0e

fix(data-models): Drop deleted field

0ad82c9

feat(data-models): Add basic API

e7fe79b

fix(data-models): Add reverse statement

0ae3f25

fix(data-models): Create extension in test db

db30f10

feat(data-models): Allow nested save queries

e2eb6f5

feat(data-models): Add ancestors and descendants methods to API

d9a1d79

feat(data-models): Add DAG API

e0e7ab6

fix(data-models): Add missing api routing

7399be8

fix: Update docstring

238e304

Co-authored-by: Eric Duong <[email protected]>

fix: Remove type hints

973431e

fix: Raise on non-existent parents to prevent cycles

aba1e45

test: Add a unit test to cover cycles created via updates

2693ccb

feat: Remove max view depth as it only guarantees performance

f8fb71e

fix: Move update to inside transaction block

aa673c4

tomasfarias force-pushed the feat/data-modeling-first-steps branch from 6ad922a to aa673c4 Compare August 28, 2024 11:54

github-advanced-security bot found potential problems Aug 28, 2024

posthog/warehouse/api/saved_query.py (dismissed)

tomasfarias merged commit 254faa3 into master Aug 28, 2024
86 checks passed

tomasfarias deleted the feat/data-modeling-first-steps branch August 28, 2024 12:22

pauldambra pushed a commit that referenced this pull request Aug 29, 2024

feat(data-modeling): Data modelling django models and API (#24232)

f5792a6

Co-authored-by: Eric Duong <[email protected]>
