-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(data-modeling): Data modelling django models and API #24232
Conversation
8019ab5
to
d9ff1a5
Compare
0029f68
to
3363944
Compare
posthog/hogql/context.py
Outdated
# How many nested views do we support on this query? If `None`, no limit. | ||
max_view_depth: int | None = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow nested views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to make this None/remove this and the check below in the resolver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be fine with removing it. Just wanted to keep the check in case we would want to enforce a view depth somewhere else, but for modeling purposes I think there is no limit (or a very high one for sanity reasons).
"""Return this team's DAG as a set of edges and nodes""" | ||
dag = DataWarehouseModelPath.objects.get_dag(self.team) | ||
|
||
return response.Response({"edges": dag.edges, "nodes": dag.nodes}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DAG serialization is not final, optimal structure depends on how we would display it.
try: | ||
DataWarehouseModelPath.objects.create_from_saved_query(view) | ||
except Exception: | ||
# For now, do not fail saved query creation if we cannot model-ize it. | ||
# Later, after bugs and errors have been ironed out, we may tie these two | ||
# closer together. | ||
logger.exception("Failed to create model path when creating view %s", view.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being safe here, didn't want to disrupt saved query functionality in case of bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Postgres ltree type makes sense imo. @Gilbert09 I know you had mentioned something about doing the pathing in hogql. Just mentioning here in case you wanted to hash that out before this merges
else: | ||
parent_id = parent_query.id.hex | ||
|
||
cursor.execute(UPDATE_PATHS_QUERY, params={**{"child": label, "parent": parent_id}, **base_params}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice 🚀
posthog/hogql/context.py
Outdated
# How many nested views do we support on this query? If `None`, no limit. | ||
max_view_depth: int | None = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to make this None/remove this and the check below in the resolver?
posthog/hogql/resolver.py
Outdated
@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr): | |||
if isinstance(database_table, SavedQuery): | |||
self.current_view_depth += 1 | |||
|
|||
if self.current_view_depth > self.context.max_view_depth: | |||
if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're supporting these now should just completely remove. May end up with some heavy queries temporarily. Also related to handling cycles comment below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to remove, I think I was just being overly cautious in case there was some use case to limiting query depth sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main reason was just performance so no other gotchas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion that the LTree DAG system still stands. I don't see the benefit of running SQL queries to generate a graph of table names when we've already done the hard lifting of pulling out the table names from HogQL (I also still think that we should be doing this on the fly instead of storing it). Generally, I just think this is too complex and adds overheads. I won't block this going in, but I do think it'll slow us down moving forward.
@@ -0,0 +1,506 @@ | |||
import collections.abc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is all a lot of logic, most of which I'm struggling to understand due to the LTree extension. Considering we already have the logic to pull a series of tables from a HogQL query, I'm not entirely sure why we need to store them as DAGs here. The majority of this file is executing obscure SQL - I just don't think we really need this at all, especially not at the stage we're building this, if ever.
"""Return this team's DAG as a set of edges and nodes""" | ||
dag = DataWarehouseModelPath.objects.get_dag(self.team) | ||
|
||
return response.Response({"edges": dag.edges, "nodes": dag.nodes}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good API to expose - I can't think of any UI that's gonna need to render the DAGs directly as "edges" and "nodes"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A set of nodes and edges is a very common representation of a DAG, but I am happy to leave this out until we decide on the frontend that's going to consume this API.
posthog/hogql/context.py
Outdated
# How many nested views do we support on this query? If `None`, no limit. | ||
max_view_depth: int | None = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to remove this? If we're opening ourselves up to many view depths, then why have a limit at all?
class DataWarehouseModelPathViewSet(TeamAndOrgViewSetMixin, viewsets.ReadOnlyModelViewSet): | ||
scope_object = "INTERNAL" | ||
|
||
queryset = DataWarehouseModelPath.objects.all() | ||
serializer_class = DataWarehouseModelPathSerializer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this too right now?
view.save() | ||
|
||
try: | ||
DataWarehouseModelPath.objects.create_from_saved_query(view) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe that we shouldn't need to store these, but they can be computed on the fly. The more moving parts we have in the system, the more points of failure exist
def ancestors(self, request: request.Request, *args, **kwargs) -> response.Response: | ||
"""Return the ancestors of this saved query. | ||
|
||
By default, we return the immediate parents. The `level` parameter can be used to | ||
look further back into the ancestor tree. If `level` overshoots (i.e. points to only | ||
ancestors beyond the root), we return an empty list. | ||
""" | ||
level = request.data.get("level", 1) | ||
|
||
saved_query = self.get_object() | ||
saved_query_id = saved_query.id.hex | ||
lquery = f"*{{{level},}}.{saved_query_id}" | ||
|
||
paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery) | ||
|
||
if not paths: | ||
return response.Response({"ancestors": []}) | ||
|
||
ancestors = set() | ||
for model_path in paths: | ||
offset = len(model_path.path) - level - 1 # -1 corrects for level being 1-indexed | ||
|
||
if offset < 0: | ||
continue | ||
|
||
ancestors.add(model_path.path[offset]) | ||
|
||
return response.Response({"ancestors": ancestors}) | ||
|
||
@action(methods=["POST"], detail=True) | ||
def descendants(self, request: request.Request, *args, **kwargs) -> response.Response: | ||
"""Return the descendants of this saved query. | ||
|
||
By default, we return the immediate children. The `level` parameter can be used to | ||
look further ahead into the descendants tree. If `level` overshoots (i.e. points to only | ||
descendants further than a leaf), we return an empty list. | ||
""" | ||
level = request.data.get("level", 1) | ||
|
||
saved_query = self.get_object() | ||
saved_query_id = saved_query.id.hex | ||
|
||
lquery = f"*.{saved_query_id}.*{{{level},}}" | ||
paths = DataWarehouseModelPath.objects.filter(team=saved_query.team, path__lquery=lquery) | ||
|
||
if not paths: | ||
return response.Response({"descendants": []}) | ||
|
||
descendants = set() | ||
|
||
for model_path in paths: | ||
offset = model_path.path.index(saved_query_id) + level | ||
|
||
if offset > len(model_path.path): | ||
continue | ||
|
||
descendants.add(model_path.path[offset]) | ||
|
||
return response.Response({"descendants": descendants}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I struggle to see a good use case for the frontend requesting certain levels of parents/children just yet - if we do need that, I imagine it'll be embedded in a different request for reading models as a whole - whats the use case for these APIs right now?
@pytest.mark.parametrize( | ||
"query,parents", | ||
[ | ||
("select * from events, persons", {"events", "persons"}), | ||
("select * from some_random_view", {"some_random_view"}), | ||
( | ||
"with cte as (select * from events), cte2 as (select * from cte), cte3 as (select 1) select * from cte2", | ||
{"events"}, | ||
), | ||
("select 1", set()), | ||
], | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see more examples tested here, e.g., using nested subqueries, CTEs, multiple joins, all combined - can this handle the most complex query? Take inspiration from some of the snapshots of insights and possibly use them as a base
Ok, given feedback round above let's consider an alternative and see if it makes sense. Recap, we want to track query relationships so that these derived tables can be built. The implementation above establishes the query relationships by using Postgres ltrees. It stores these relationships in a model that can be later used to traverse saved query relationships and build intermediary tables as necessary. Tom's pushback is that most of this traversal can be done on the fly with the hogql system because hogql will resolve tables and can traverse the query tree already. I think the main thing to clear up here @tomasfarias is there anything you have in mind that would necessitate distinct path model/logic vs completely piggy backing off existing traversal logic from hogql parsing? |
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the |
2335656
to
eea2ca5
Compare
posthog/hogql/resolver.py
Outdated
@@ -306,7 +306,7 @@ def visit_join_expr(self, node: ast.JoinExpr): | |||
if isinstance(database_table, SavedQuery): | |||
self.current_view_depth += 1 | |||
|
|||
if self.current_view_depth > self.context.max_view_depth: | |||
if self.context.max_view_depth is not None and self.current_view_depth > self.context.max_view_depth: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main reason was just performance so no other gotchas
Cool, I've dropped |
863eb1b
to
623ba8e
Compare
Had to rebase once again due to conflicts with |
Co-authored-by: Eric Duong <[email protected]>
6ad922a
to
aa673c4
Compare
Co-authored-by: Eric Duong <[email protected]>
Problem
This PR contains the initial django model used to support data modeling as well as a set of initial API methods intended to provide some value while we build on more modeling features.
NOTE: Internally, I have been referring to the feature as data modeling to avoid confusion with django models, but we still want that from a user perspective, they are creating "data models".
Data modeling boils down to:
SELECT
query, no other commands allowed (CTEs, JOINs are fine).With these two objects, we can later build tools to "run" the modeled data, aka materialize it, while resolving their dependencies using their paths. Moreover, the paths can be used to present the user with a DAG of their modeling.
Changes
This first draft contains the django model used to store model paths, as well as a model manager that deals with their creation. Moreover, I've included some API methods to provide some value to users while we work on the rest of the feature set. These API methods can be used to construct a DAG or display parents and children.
The queries that represent models are backed by the already existing
DataWarehouseSavedQuery
. Importantly, we had to allow nested views to support modeling.Questions
Should we transform all existing
DataWarehouseSavedQuery
into models? Or should they be separate. An earlier commit used a separateDataWarehouseModel
to store model queries, so it's possible to revert to that.DataWarehouseSavedQuery
, which saves us from duplicating a lot of code. Moving forward, we will create model paths for each newDataWarehouseSavedQuery
. However, we do so in a safe fashion for now (without failingDataWarehouseSavedQuery
creation.What format should the DAG be returned as?
👉 Stay up-to-date with PostHog coding conventions for a smoother review.
Does this work well for both Cloud and self-hosted?
Yes
How did you test this code?
Unit tests for the manager are included.
Unit tests for API.