Aleph Pathfinding #1514
mynameisfiber started this conversation in Ideas
Replies: 1 comment
-
Note: Calculating the component ID of each entity in a dataset is very straightforward and can be done in linear time in memory, or in quasi-linear time out of memory.
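A minimal in-memory sketch of that, assuming we already have the entity IDs and edge pairs of a dataset on hand (union-find with path compression; none of this is existing aleph code):

```python
def label_components(entity_ids, edges):
    """Assign a component ID to every entity using union-find.

    `entity_ids` is an iterable of entity IDs and `edges` an iterable of
    (source_id, target_id) pairs -- both are assumed inputs for this
    sketch, not an existing aleph API.
    """
    parent = {e: e for e in entity_ids}

    def find(x):
        # Path halving keeps the forest nearly flat.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        # Tolerate edge endpoints we have not seen as entities yet.
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        parent[find(src)] = find(dst)

    # The component ID of an entity is the representative of its set.
    return {e: find(e) for e in entity_ids}
```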
-
I've got an idea to help make graph pathfinding on Aleph more performant. I observed that the data inside a dataset generally splits into multiple components (i.e. disjoint subgraphs whose entities are connected through edge nodes), and the number of components is much smaller than the number of entities in the dataset. Also, because entities and edges are isolated within a dataset, each component can be abstracted into its own "search node" which is connected to other datasets using profiles. So, if we add the following lookup tables (a componentID table plus a componentID field on EntitySetItem, as used below):
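A rough illustration of what such lookups might hold (the names and shapes below are assumptions for the sake of the example, not aleph's actual schema):

```python
from dataclasses import dataclass

# Hypothetical lookup rows; field names are illustrative only.

@dataclass
class ComponentRow:
    component_id: str   # stable ID of a connected component within a dataset
    dataset_id: str
    size: int           # number of entities in the component

@dataclass
class ComponentLink:
    # Two components become adjacent when a profile merges an entity
    # from one with an entity from the other.
    profile_id: str
    component_a: str
    component_b: str

# A componentID field on EntitySetItem would map each entity to its
# ComponentRow, so both entity -> component and component -> neighbouring
# components are single lookups.
```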
Then, when doing a path query between entities A and B, we would first find which components they belong to, then iterate on a join between the componentID table and the EntitySetItem.componentID field to step to adjacent entities. As a breadth-first search this gives us a worst-case full-path complexity of O(|C| + m * n^2), where C is the set of components, m is the number of profiles and n is the average number of entities in a profile (or lower if we limit the depth of the search).
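Roughly, the component-level BFS would look like the sketch below; `component_of` and `adjacent_components` stand in for the lookups and the join described above, and are assumptions rather than existing aleph functions.

```python
from collections import deque

def component_path(entity_a, entity_b, component_of, adjacent_components,
                   max_depth=6):
    """Breadth-first search over components instead of entities.

    `component_of(entity)` returns an entity's component ID and
    `adjacent_components(component)` yields (profile_id, neighbour)
    pairs derived from the componentID / EntitySetItem.componentID join.
    Both callables are placeholders for the proposed lookup tables.
    """
    start, goal = component_of(entity_a), component_of(entity_b)
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        component, path = queue.popleft()
        if component == goal:
            return path                      # a path of components, not entities
        if len(path) - 1 >= max_depth:
            continue                         # optional depth limit on hops
        for _profile_id, neighbour in adjacent_components(component):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [neighbour]))
    return None
```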
There are also heuristics we can add to further speed things up. Intuitively, having a measure of how "hub and spoke"-like a component is, and skipping such components, would be useful in some queries (e.g. not routing through KPMG, which has a lot of employees but doesn't give much information). These metrics can be calculated per component during the indexing phase. Also, this structure is very close to the hierarchical graph structure needed for Thorup's pathfinding algorithm on weighted graphs, which would give us O(m * n^2 + |C| log log |C|) with edge weights; that would be nice for specifying the "interestingness" of an edge.
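As a sketch of that heuristic, one could precompute a crude "hubness" score per component at indexing time and skip components above a threshold during the search. The score and the cut-off below are illustrative assumptions, not a proposed metric.

```python
def hubness(component_degrees):
    """Crude hub-and-spoke score for one component.

    `component_degrees` is the list of entity degrees inside the component.
    A component dominated by one high-degree node (a big employer with
    thousands of employees, say) scores close to 1.0.
    """
    total = sum(component_degrees)
    return max(component_degrees) / total if total else 0.0

def should_expand(component_id, metrics, threshold=0.9):
    # `metrics` maps component_id -> precomputed hubness; the mapping and
    # the 0.9 cut-off are both assumptions for illustration.
    return metrics.get(component_id, 0.0) < threshold
```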
One issue with this method is that we don't know the exact length of a path until we are done and have transformed the proposed path of components into a path of entities. For example, one of the components in our path could be very large and add a bunch of actual traversals to the path without us knowing. This doesn't prevent us from finding the shortest path, but it does mean there will be cases where we ask for paths of length X and get back paths much longer than X, and only find out once the expansion is done.
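Concretely, the expansion step could look something like this, with `shortest_path_within` standing in for an ordinary BFS restricted to a single component (a sketch, not aleph code):

```python
def expand_component_path(waypoints, shortest_path_within, max_length=None):
    """Turn a path of components into a path of entities.

    `waypoints` is a list of (component_id, enter_entity, leave_entity)
    triples, where the enter/leave entities are the ones joined across
    components by profiles. `shortest_path_within(component, a, b)` runs
    a normal BFS inside one component. Both are placeholders.
    The true entity-level length is only known after this expansion.
    """
    entity_path = []
    for component_id, enter_entity, leave_entity in waypoints:
        hop = shortest_path_within(component_id, enter_entity, leave_entity)
        entity_path.extend(hop)  # consecutive hops are joined by profile links
    if max_length is not None and len(entity_path) - 1 > max_length:
        return None              # longer than requested, discovered only now
    return entity_path
```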
The next step for this is to do an analysis of the data in Aleph to get a sense of how many components there actually are and what the distribution of their sizes is. This method works well if datasets have a small number of large components, and the exact distribution will make or break it. Anecdotally, I've found that datasets generally have one or two large components and then a long tail. If this follows dynamics close to those of a non-critical random graph, we can expect a logarithmic decrease in complexity over entities (i.e. O(log |V| + m * n^2), where V is the set of entities in Aleph). This is in comparison to the naive method, which would be O(|V| + |E| + m * n^2), where E is the set of edges in Aleph.
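That analysis could start as simply as counting component sizes per dataset, e.g. reusing the labels from the union-find sketch above:

```python
from collections import Counter

def component_size_distribution(component_ids):
    """Histogram of component sizes for one dataset.

    `component_ids` maps entity ID -> component ID (e.g. the output of
    label_components above). Returns {size: number of components of that
    size}, which makes the "one or two large components plus a long tail"
    shape easy to eyeball.
    """
    sizes = Counter(component_ids.values())   # component -> number of entities
    return Counter(sizes.values())            # size -> number of components
```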
tl;dr: we can get O(log |V| + m * n^2) shortest-path queries in Aleph instead of O(|V| + |E| + m * n^2).