-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up CTC edge errors #59
Speed up CTC edge errors #59
Conversation
When calculating edge errors, there is a 1-to-1 mapping between computed graph nodes and GT graph nodes, see details below. It is faster to use a dictionary for doing the node mapping with O(n log(n)) runtime, compared to repeatedly iterating over a list with np.where (O(n^2)). Details: Potential 1-to-many matches, namely computed nodes that match multiple GT nodes (called non-split), are not part of the induced graph. Therefore, all edges incident to such nodes are also not part of the induced graph, leaving us with the desired 1-to-1 mapping for all nodes that are incident to existing edges in the induced graph, which we iterate over in the loop for finding FP edges. Conversely, each GT node is only matched to at most one computed node, directly yielding the 1-1 matching for finding FN edges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bentaculum the logic for the edge errors being 1-to-1 is sound and I agree that iterating through the list is grim. Do you know which recent changes lead to the slowdown observed in main
compared to after merging #37? I'm not getting the expected speed up in this branch - edge errors are being computed for me in similar times to on main
- about 400x slower than with #37.
I noticed you create the dict mapping from the original matching - is this because we can only be certain of the 1-to-1ness after computing the induced graph? I think even if we do want 1-to-many and many-to-1 support, a dictionary mapping node_key: set[node-key]
would still be much faster when vast majority of nodes are 1-to-1 anyway?
Would be nice to have some benchmarking in CI - we really want to be cautious of slowdowns.
@DragaDoncila Thanks for the feedback. In a fresh env I am not able to reproduce the speedup I reported either, sorry for the noise. I noticed that the Since the last sprint the track errors are stored as edge attributes on the I simply removed these two type casts, and tests still pass. Now I run For the |
The original To make this logic more explicit, I now only add an entry to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two small nits but otherwise love it - it's buttery smooth 🎉 Thanks @bentaculum ❤️
src/traccuracy/_tracking_graph.py
Outdated
else: | ||
limited_edges = {_id: data for _id, data in edges if _id in limit_to} | ||
limited_edges = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be self.graph.edge_subgraph(limit_to).edges
and then the typing would be the same for both return statements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, put this in here 7e72be3. networkx.DiGraph.edge_subgraph
and networkx.DiGraph.subgraph
unfortunately don't throw an error when you ask for non-existing things. I do think that this is a reasonable check and added it. This way we also keep the API that you had before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about a good way to make docstrings that include classes from third-party packages, like networkx OutEdgeView
here. Writing networkx.classes.reportviews.OutEdgeView
in the docstring seems excessive.
Might be a good segway into typing traccuracy ;), or at least the core parts of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah would really love to have this package better typed especially once we've sorta settled on the API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it! Thanks @bentaculum !🎉
@bentaculum @DragaDoncila I do object to returning networkx specific objects in our track graph (e.g. NodeView, OutEdgeView). We want to be able to switch away from networkx and not change any of the API. Turning it into a dictionary is slow, I know, but perhaps we can do something else like return an iterator. Would that help with the speedup as well? |
@bentaculum @DragaDoncila I see your point way more after some benchmarking... In general I am totally happy to change the signature of the TrackingGraph (e.g. adding functions, changing the signature of functions), as I sort of guessed what functionality we would need when I wrote the original set of functions. Looking closer at the ctc code, it seems like we just need more flexible ways to access the nodes/edges, e.g. for membership checks, just getting an iterator over ids instead of also including data, etc. I've already added a get_node_attribute and get_edge_attribute function in the access_by_attr branch. Given a look at all places where |
@cmalinmayor thanks for flagging! I know initially we wanted to provide a full API via the As I've been using I guess for me if I was to make any decision, it would be to further expose the underlying networkx graph, rather than add more functionality to the |
Re: the above comments and in the other thread here: #63 (comment) @bentaculum @DragaDoncila Overall, I'm happy to go forward with this PR! However, I do think there are significant advantages to putting limitations on the networkx API in traccuracy. There are some things you can do in networkx that we explicitly want to ban (e.g. changing locations, adding/removing edges) - for power developers like you and Ben who fully understand networkx and traccuracy, you can easily avoid potential pitfalls, but as the user and/or developer base expands, it might be more important to enforce constraints. Additionally, if we retain control of the API we can make assumptions or augmentations to speed up access by attribute or location (linear on size of graph in native networkx, constant with the dictionaries I implemented in the other PR). Finally, if we ever have graphs too large to process in memory, we will want to implement or take advantage of an on-disk solution. I also just personally dislike the networkx API, but I know that's not a good reason to make everyone do a ton more work :). The specific change in this PR is not introducing that big of a dependency, so I say we go forward with it but keep these potential issues with entangling the two libraries further in mind as we continue. If they never become issues, great, full steam ahead! |
Thanks @bentaculum! |
When calculating edge errors, there is a 1-to-1 mapping between computed graph nodes and GT graph nodes, see details below.
It is faster to use a dictionary for doing the node mapping with O(n log(n)) runtime, compared to repeatedly iterating over a list with np.where (O(n^2)).
Details: Potential 1-to-many matches, namely computed nodes that match multiple GT nodes (called non-split), are not part of the induced graph. Therefore, all edges incident to such nodes are also not part of the induced graph, leaving us with the desired 1-to-1 mapping for all nodes that are incident to existing edges in the induced graph, which we iterate over in the loop for finding FP edges.
Conversely, each GT node is only matched to at most one computed node, directly yielding the 1-1 matching for finding FN edges.
In case the test cases do not yet cover this edge case of a 1-to-many match for CTC we should probably add it.