Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data-modeling): Data modelling django models and API #24232

Merged
merged 22 commits into from
Aug 28, 2024

Conversation

tomasfarias
Copy link
Contributor

@tomasfarias tomasfarias commented Aug 7, 2024

Problem

This PR contains the initial django model used to support data modeling as well as a set of initial API methods intended to provide some value while we build on more modeling features.

NOTE: Internally, I have been referring to the feature as data modeling to avoid confusion with django models, but we still want that from a user perspective, they are creating "data models".

Data modeling boils down to:

  • A query.
    • Just a SELECT query, no other commands allowed (CTEs, JOINs are fine).
    • This dictates the data and how it is to be modeled.
  • A path to the modeled data through all of its ancestors (all model paths can be used to form a DAG).

With these two objects, we can later build tools to "run" the modeled data, aka materialize it, while resolving their dependencies using their paths. Moreover, the paths can be used to present the user with a DAG of their modeling.

Changes

This first draft contains the django model used to store model paths, as well as a model manager that deals with their creation. Moreover, I've included some API methods to provide some value to users while we work on the rest of the feature set. These API methods can be used to construct a DAG or display parents and children.

The queries that represent models are backed by the already existing DataWarehouseSavedQuery. Importantly, we had to allow nested views to support modeling.

Questions

  • Should we transform all existing DataWarehouseSavedQuery into models? Or should they be separate. An earlier commit used a separate DataWarehouseModel to store model queries, so it's possible to revert to that.

    • The main problem is that saved queries are just not materialized models, so it could be confusing for users if we have to teach them that there's two different ways to define a query.
    • ANSWER: I've opted to go with using existing DataWarehouseSavedQuery, which saves us from duplicating a lot of code. Moving forward, we will create model paths for each new DataWarehouseSavedQuery. However, we do so in a safe fashion for now (without failing DataWarehouseSavedQuery creation.
  • What format should the DAG be returned as?

    • ANSWER: This will likely be dictated by whatever JS library the frontend uses to display the DAG. For the time being, I've coded up a very simple class that holds a set of edges and nodes, and can be serialized to JSON in a straight-forward way. However, this can be changed later depending on what the frontend needs.

👉 Stay up-to-date with PostHog coding conventions for a smoother review.

Does this work well for both Cloud and self-hosted?

Yes

How did you test this code?

Unit tests for the manager are included.
Unit tests for API.

@tomasfarias tomasfarias marked this pull request as draft August 7, 2024 09:12
@tomasfarias tomasfarias changed the title wip: Data modelling django models feat(wip): Data modelling django models Aug 7, 2024
@tomasfarias tomasfarias force-pushed the feat/data-modeling-first-steps branch 6 times, most recently from 8019ab5 to d9ff1a5 Compare August 12, 2024 10:17
@tomasfarias tomasfarias changed the title feat(wip): Data modelling django models feat(data-models): Data modelling django models and API Aug 13, 2024
@tomasfarias tomasfarias force-pushed the feat/data-modeling-first-steps branch from 0029f68 to 3363944 Compare August 13, 2024 10:12
@tomasfarias tomasfarias changed the title feat(data-models): Data modelling django models and API feat(data-modeling): Data modelling django models and API Aug 13, 2024
Comment on lines 39 to 40
# How many nested views do we support on this query? If `None`, no limit.
max_view_depth: int | None = 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow nested views.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to make this None/remove this and the check below in the resolver?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with removing it. Just wanted to keep the check in case we would want to enforce a view depth somewhere else, but for modeling purposes I think there is no limit (or a very high one for sanity reasons).

"""Return this team's DAG as a set of edges and nodes"""
dag = DataWarehouseModelPath.objects.get_dag(self.team)

return response.Response({"edges": dag.edges, "nodes": dag.nodes})
Copy link
Contributor Author

@tomasfarias tomasfarias Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DAG serialization is not final, optimal structure depends on how we would display it.

Comment on lines +72 to +78
try:
DataWarehouseModelPath.objects.create_from_saved_query(view)
except Exception:
# For now, do not fail saved query creation if we cannot model-ize it.
# Later, after bugs and errors have been ironed out, we may tie these two
# closer together.
logger.exception("Failed to create model path when creating view %s", view.name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Being safe here, didn't want to disrupt saved query functionality in case of bugs.

@tomasfarias tomasfarias requested a review from a team August 13, 2024 10:24
@tomasfarias tomasfarias marked this pull request as ready for review August 13, 2024 10:24
Copy link
Member

@EDsCODE EDsCODE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Postgres ltree type makes sense imo. @Gilbert09 I know you had mentioned something about doing the pathing in hogql. Just mentioning here in case you wanted to hash that out before this merges

posthog/warehouse/models/test/test_modeling.py Outdated Show resolved Hide resolved
else:
parent_id = parent_query.id.hex

cursor.execute(UPDATE_PATHS_QUERY, params={**{"child": label, "parent": parent_id}, **base_params})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 🚀

posthog/warehouse/models/modeling.py Show resolved Hide resolved
Comment on lines 39 to 40
# How many nested views do we support on this query? If `None`, no limit.
max_view_depth: int | None = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to make this None/remove this and the check below in the resolver?

@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr):
if isinstance(database_table, SavedQuery):
self.current_view_depth += 1

if self.current_view_depth > self.context.max_view_depth:
if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're supporting these now should just completely remove. May end up with some heavy queries temporarily. Also related to handling cycles comment below

Copy link
Contributor Author

@tomasfarias tomasfarias Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to remove, I think I was just being overly cautious in case there was some use case to limiting query depth sometimes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason was just performance so no other gotchas

Copy link
Member

@Gilbert09 Gilbert09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion that the LTree DAG system still stands. I don't see the benefit of running SQL queries to generate a graph of table names when we've already done the hard lifting of pulling out the table names from HogQL (I also still think that we should be doing this on the fly instead of storing it). Generally, I just think this is too complex and adds overheads. I won't block this going in, but I do think it'll slow us down moving forward.

@@ -0,0 +1,506 @@
import collections.abc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is all a lot of logic, most of which I'm struggling to understand due to the LTree extension. Considering we already have the logic to pull a series of tables from a HogQL query, I'm not entirely sure why we need to store them as DAGs here. The majority of this file is executing obscure SQL - I just don't think we really need this at all, especially not at the stage we're building this, if ever.

"""Return this team's DAG as a set of edges and nodes"""
dag = DataWarehouseModelPath.objects.get_dag(self.team)

return response.Response({"edges": dag.edges, "nodes": dag.nodes})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a good API to expose - I can't think of any UI that's gonna need to render the DAGs directly as "edges" and "nodes"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A set of nodes and edges is a very common representation of a DAG, but I am happy to leave this out until we decide on the frontend that's going to consume this API.

Comment on lines 39 to 40
# How many nested views do we support on this query? If `None`, no limit.
max_view_depth: int | None = 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?

Comment on lines +15 to +19
class DataWarehouseModelPathViewSet(TeamAndOrgViewSetMixin, viewsets.ReadOnlyModelViewSet):
scope_object = "INTERNAL"

queryset = DataWarehouseModelPath.objects.all()
serializer_class = DataWarehouseModelPathSerializer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this too right now?

view.save()

try:
DataWarehouseModelPath.objects.create_from_saved_query(view)
Copy link
Member

@Gilbert09 Gilbert09 Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still believe that we shouldn't need to store these, but they can be computed on the fly. The more moving parts we have in the system, the more points of failure exist

Comment on lines +153 to +214
def ancestors(self, request: request.Request, *args, **kwargs) -> response.Response:
"""Return the ancestors of this saved query.

By default, we return the immediate parents. The `level` parameter can be used to
look further back into the ancestor tree. If `level` overshoots (i.e. points to only
ancestors beyond the root), we return an empty list.
"""
level = request.data.get("level", 1)

saved_query = self.get_object()
saved_query_id = saved_query.id.hex
lquery = f"*{{{level},}}.{saved_query_id}"

paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery)

if not paths:
return response.Response({"ancestors": []})

ancestors = set()
for model_path in paths:
offset = len(model_path.path) - level - 1 # -1 corrects for level being 1-indexed

if offset < 0:
continue

ancestors.add(model_path.path[offset])

return response.Response({"ancestors": ancestors})

@action(methods=["POST"], detail=True)
def descendants(self, request: request.Request, *args, **kwargs) -> response.Response:
"""Return the descendants of this saved query.

By default, we return the immediate children. The `level` parameter can be used to
look further ahead into the descendants tree. If `level` overshoots (i.e. points to only
descendants further than a leaf), we return an empty list.
"""
level = request.data.get("level", 1)

saved_query = self.get_object()
saved_query_id = saved_query.id.hex

lquery = f"*.{saved_query_id}.*{{{level},}}"
paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery)

if not paths:
return response.Response({"descendants": []})

descendants = set()

for model_path in paths:
offset = model_path.path.index(saved_query_id) + level

if offset > len(model_path.path):
continue

descendants.add(model_path.path[offset])

return response.Response({"descendants": descendants})
Copy link
Member

@Gilbert09 Gilbert09 Aug 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I struggle to see a good use case for the frontend requesting certain levels of parents/children just yet - if we do need that, I imagine it'll be embedded in a different request for reading models as a whole - whats the use case for these APIs right now?

Comment on lines +8 to +25
@pytest.mark.parametrize(
"query,parents",
[
("select * from events, persons", {"events", "persons"}),
("select * from some_random_view", {"some_random_view"}),
(
"with cte as (select * from events), cte2 as (select * from cte), cte3 as (select 1) select * from cte2",
{"events"},
),
("select 1", set()),
],
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see more examples tested here, e.g., using nested subqueries, CTEs, multiple joins, all combined - can this handle the most complex query? Take inspiration from some of the snapshots of insights and possibly use them as a base

@EDsCODE
Copy link
Member

EDsCODE commented Aug 14, 2024

Ok, given feedback round above let's consider an alternative and see if it makes sense.

Recap, we want to track query relationships so that these derived tables can be built. The implementation above establishes the query relationships by using Postgres ltrees. It stores these relationships in a model that can be later used to traverse saved query relationships and build intermediary tables as necessary.

Tom's pushback is that most of this traversal can be done on the fly with the hogql system because hogql will resolve tables and can traverse the query tree already.

I think the main thing to clear up here @tomasfarias is there anything you have in mind that would necessitate distinct path model/logic vs completely piggy backing off existing traversal logic from hogql parsing?

@EDsCODE EDsCODE self-requested a review August 15, 2024 16:50
@posthog-bot
Copy link
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week. If you want to permanentely keep it open, use the waiting label.

@tomasfarias tomasfarias removed the stale label Aug 23, 2024
@tomasfarias tomasfarias force-pushed the feat/data-modeling-first-steps branch 4 times, most recently from 2335656 to eea2ca5 Compare August 27, 2024 14:17
@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr):
if isinstance(database_table, SavedQuery):
self.current_view_depth += 1

if self.current_view_depth > self.context.max_view_depth:
if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main reason was just performance so no other gotchas

@tomasfarias
Copy link
Contributor Author

Main reason was just performance so no other gotchas

Cool, I've dropped max_view_depth completely then (and the associated if statement). This simplifies things overall.

@tomasfarias tomasfarias force-pushed the feat/data-modeling-first-steps branch from 863eb1b to 623ba8e Compare August 28, 2024 08:46
@tomasfarias
Copy link
Contributor Author

Had to rebase once again due to conflicts with master. Hopefully this is the last time.

@tomasfarias tomasfarias force-pushed the feat/data-modeling-first-steps branch from 6ad922a to aa673c4 Compare August 28, 2024 11:54
posthog/warehouse/api/saved_query.py Dismissed Show resolved Hide resolved
@tomasfarias tomasfarias merged commit 254faa3 into master Aug 28, 2024
86 checks passed
@tomasfarias tomasfarias deleted the feat/data-modeling-first-steps branch August 28, 2024 12:22
pauldambra pushed a commit that referenced this pull request Aug 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants