fix(metadata-io): in Neo4j service use proper algorithm to get lineage #8687
Nice job! |
APOC needs to be enabled in Neo4j for this change to work. In particular, for deployment via Helm chart, this PR is needed: The dependency should be communicated via change logs. |
Trying to figure out why |
* this bypasses dependency issues with apoc:all:4.4.0.9
I made a fix and (Needed to use a slightly higher version of neo4j apoc as a test dependency to avoid some dependency issue on snakeyaml.) |
…se-proper-algorithm-to-get-lineage # Conflicts: # metadata-io/build.gradle
Can you please make a note of this breaking change in updating-datahub.md? Thanks! Otherwise LGTM :) |
@RyanHolstien Thanks! I'll update it as soon as acryldata/datahub-helm#353 is merged. Btw, should we also add a breaking change tip line for the replacement of the Neo4j chart (PR: acryldata/datahub-helm#365)? It looks like the existing chart overrides will not automatically work with the new chart structure. |
@RyanHolstien I added a breaking change note to cover this PR change and the change introduced in acryldata/datahub-helm#365. Please check 😋 |
Not sure why the build checks started failing after merging master; the errors do not seem to be relevant to this PR. Any help is appreciated! |
Only test |
Tests are passing. Mind a final check for merge? 😋 @RyanHolstien |
The existing shortestPath algorithm supporting the lineage queries has a performance issue when the graph database contains a large number of dataset nodes (say >100K). According to Neo4j Cypher PROFILE, the shortestPath query appears to enumerate all the nodes in the graph database to compute a shortest path from the start node, which is unnecessary since the nodes reachable from the start node are usually only a very small portion of the whole set of nodes.
As a result, the lineage features in the UI of our DataHub do not work at all. (GraphQL queries searchAcrossLineage or getLineage always time out, but in the background the execution of the query continues until it finishes after a long time.)
After extensive investigation, we found that apoc.path.spanningTree is a better alternative to the existing algorithm. It uses Breadth-First Search to expand the lineage paths from the start node, efficiently returns the collected paths, and does not explore any non-reachable nodes.
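To illustrate why a breadth-first expansion only touches nodes reachable from the start node, here is a minimal Python sketch of the idea. It is not APOC's implementation and not the code in this PR; the function name spanning_tree_paths, the adjacency-dict graph, and the max_hops parameter are all hypothetical and exist only for this example.

```python
from collections import deque
from typing import Dict, Hashable, List

# Hypothetical in-memory graph: node -> list of downstream neighbors.
Graph = Dict[Hashable, List[Hashable]]


def spanning_tree_paths(graph: Graph, start: Hashable, max_hops: int) -> Dict[Hashable, List[Hashable]]:
    """Breadth-first expansion from `start`.

    Returns one path (a list of nodes beginning at `start`) for every node
    reachable within `max_hops` hops. Nodes that are not reachable from
    `start` are never visited.
    """
    paths: Dict[Hashable, List[Hashable]] = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        path = paths[node]
        # Stop expanding once this path already uses max_hops edges.
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in graph.get(node, []):
            # Record only the first (shortest) path found to each node.
            if neighbor not in paths:
                paths[neighbor] = path + [neighbor]
                queue.append(neighbor)
    # The start node is not part of its own lineage result.
    del paths[start]
    return paths


if __name__ == "__main__":
    lineage = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "x": ["y"]}
    for target, path in spanning_tree_paths(lineage, "a", 3).items():
        print(target, path)
```

In this toy graph the nodes x and y are never touched when starting from a, which is the property that matters when the reachable set is a small fraction of the whole graph.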
With the fix in this PR, in our case, the query time for a node with >200 downstream lineages dropped from 90s to roughly 13s or even less, which is a substantial improvement.
In addition, I restructured how the query statements are formatted, which should lead to better readability and easier maintenance.
Checklist
Links to related issues (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub