Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up CTC edge errors #59

Merged

Conversation

bentaculum
Copy link
Contributor

@bentaculum bentaculum commented Sep 22, 2023

When calculating edge errors, there is a 1-to-1 mapping between computed graph nodes and GT graph nodes, see details below.

It is faster to use a dictionary for doing the node mapping with O(n log(n)) runtime, compared to repeatedly iterating over a list with np.where (O(n^2)).

Details: Potential 1-to-many matches, namely computed nodes that match multiple GT nodes (called non-split), are not part of the induced graph. Therefore, all edges incident to such nodes are also not part of the induced graph, leaving us with the desired 1-to-1 mapping for all nodes that are incident to existing edges in the induced graph, which we iterate over in the loop for finding FP edges.

Conversely, each GT node is only matched to at most one computed node, directly yielding the 1-1 matching for finding FN edges.

In case the test cases do not yet cover this edge case of a 1-to-many match for CTC we should probably add it.

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
When calculating edge errors, there is a 1-to-1 mapping between computed
graph nodes and GT graph nodes, see details below.

It is faster to use a dictionary for doing the node mapping with O(n
log(n)) runtime, compared to repeatedly iterating over a list with
np.where (O(n^2)).

Details: Potential 1-to-many matches, namely computed nodes that match multiple
GT nodes (called non-split), are not part of the induced graph.
Therefore, all edges incident to such nodes are also not part of the induced
graph, leaving us with the desired 1-to-1 mapping for all nodes that are
incident to existing edges in the induced graph, which we iterate over
in the loop for finding FP edges.

Conversely, each GT node is only matched to at most one computed node,
directly yielding the 1-1 matching for finding FN edges.
@bentaculum
Copy link
Contributor Author

Little example: For the dataset in #57, getting FP/FN nodes now runs in less than a second for me, compared to ~10s in f8c982c and >1h in current main.

Copy link
Collaborator

@DragaDoncila DragaDoncila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bentaculum the logic for the edge errors being 1-to-1 is sound and I agree that iterating through the list is grim. Do you know which recent changes lead to the slowdown observed in main compared to after merging #37? I'm not getting the expected speed up in this branch - edge errors are being computed for me in similar times to on main - about 400x slower than with #37.

I noticed you create the dict mapping from the original matching - is this because we can only be certain of the 1-to-1ness after computing the induced graph? I think even if we do want 1-to-many and many-to-1 support, a dictionary mapping node_key: set[node-key] would still be much faster when vast majority of nodes are 1-to-1 anyway?

Would be nice to have some benchmarking in CI - we really want to be cautious of slowdowns.

@bentaculum
Copy link
Contributor Author

@DragaDoncila Thanks for the feedback.

In a fresh env I am not able to reproduce the speedup I reported either, sorry for the noise.

I noticed that the TrackingGraph currently converts the networkx edges stored as EdgeView to a dictionary each time we call TrackingGraph.edges() (and nodes similarly). This is very slow. Is there a reason why this is necessary? Maybe @msschwartz21 has some input on this as well.

Since the last sprint the track errors are stored as edge attributes on the TrackingGraph, which is the reason for the drastic slowdown.

I simply removed these two type casts, and tests still pass. Now I run traccuracy.track_errors._ctc.get_edge_errors in 3.3s for the example in #57.

For the limit_to attributes of TrackingGraph.edges/nodes() we might want to look into https://networkx.org/documentation/stable/reference/classes/generated/networkx.classes.graphviews.subgraph_view.html.

@bentaculum
Copy link
Contributor Author

I noticed you create the dict mapping from the original matching - is this because we can only be certain of the 1-to-1ness after computing the induced graph? I think even if we do want 1-to-many and many-to-1 support, a dictionary mapping node_key: set[node-key] would still be much faster when vast majority of nodes are 1-to-1 anyway?

The original Matched object contains the 1-to-many matches. The dictionary that is currently created will store one arbitrary match for each 1-to-many matching, but since we iterate only over the edges of the induced graph this dictionary entry will never be used. So the node_key: set[node-key] is actually not necessary.

To make this logic more explicit, I now only add an entry to node_mapping_first and node_mapping_second if the node is part of the induced graph.

Copy link
Collaborator

@DragaDoncila DragaDoncila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small nits but otherwise love it - it's buttery smooth 🎉 Thanks @bentaculum ❤️

src/traccuracy/track_errors/_ctc.py Outdated Show resolved Hide resolved
else:
limited_edges = {_id: data for _id, data in edges if _id in limit_to}
limited_edges = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be self.graph.edge_subgraph(limit_to).edges and then the typing would be the same for both return statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, put this in here 7e72be3. networkx.DiGraph.edge_subgraph and networkx.DiGraph.subgraph unfortunately don't throw an error when you ask for non-existing things. I do think that this is a reasonable check and added it. This way we also keep the API that you had before.

Copy link
Contributor Author

@bentaculum bentaculum Sep 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about a good way to make docstrings that include classes from third-party packages, like networkx OutEdgeView here. Writing networkx.classes.reportviews.OutEdgeView in the docstring seems excessive.

Might be a good segway into typing traccuracy ;), or at least the core parts of it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah would really love to have this package better typed especially once we've sorta settled on the API

Copy link
Collaborator

@DragaDoncila DragaDoncila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it! Thanks @bentaculum !🎉

@cmalinmayor
Copy link
Collaborator

@bentaculum @DragaDoncila I do object to returning networkx specific objects in our track graph (e.g. NodeView, OutEdgeView). We want to be able to switch away from networkx and not change any of the API. Turning it into a dictionary is slow, I know, but perhaps we can do something else like return an iterator. Would that help with the speedup as well?
FYI, we are also planning to add a way to access nodes/edges by attributes efficiently, essentially by storing a dictionary from attributes to node/edge ids, which should then negate the slowdown of storing attributes like FP only on the graph, isntead of in a separate counts object as well.

@cmalinmayor
Copy link
Collaborator

cmalinmayor commented Oct 5, 2023

@bentaculum @DragaDoncila I see your point way more after some benchmarking... In general I am totally happy to change the signature of the TrackingGraph (e.g. adding functions, changing the signature of functions), as I sort of guessed what functionality we would need when I wrote the original set of functions. Looking closer at the ctc code, it seems like we just need more flexible ways to access the nodes/edges, e.g. for membership checks, just getting an iterator over ids instead of also including data, etc. I've already added a get_node_attribute and get_edge_attribute function in the access_by_attr branch. Given a look at all places where TrackingGraph.nodes() and TrackingGraph.edges() are called, I don't think we actually use the data or the limit_to aspect, just the ids, so I'm open to changing those to return iterators over the ids only. With that change and the changes in the access_by_attr branch (#63), I think we can easily achieve the same or better performance without using networkx Views.

@DragaDoncila
Copy link
Collaborator

@cmalinmayor thanks for flagging! I know initially we wanted to provide a full API via the TrackingGraph, so that the user would never have to dig into the networkx graph, but I'm not actually sure we do want to do that. For one thing, converting to and from the native networkx attributes is going to add some overhead even if it is efficient, and I'd like to avoid overhead wherever possible.

As I've been using traccuracy, I find myself only ever using the networkx API, rather than going through TrackingGraph. I think other users are also more likely to be familiar with the networkx API, and if they're not, then I would still say learning that is more useful (and easier, given the level of documentation), than learning our bespoke API. We could consider making our API exactly equivalent, but I think that's a lot off effort for not much gain.

I guess for me if I was to make any decision, it would be to further expose the underlying networkx graph, rather than add more functionality to the TrackingGraph.

@cmalinmayor
Copy link
Collaborator

Re: the above comments and in the other thread here: #63 (comment) @bentaculum @DragaDoncila

Overall, I'm happy to go forward with this PR! However, I do think there are significant advantages to putting limitations on the networkx API in traccuracy. There are some things you can do in networkx that we explicitly want to ban (e.g. changing locations, adding/removing edges) - for power developers like you and Ben who fully understand networkx and traccuracy, you can easily avoid potential pitfalls, but as the user and/or developer base expands, it might be more important to enforce constraints. Additionally, if we retain control of the API we can make assumptions or augmentations to speed up access by attribute or location (linear on size of graph in native networkx, constant with the dictionaries I implemented in the other PR). Finally, if we ever have graphs too large to process in memory, we will want to implement or take advantage of an on-disk solution. I also just personally dislike the networkx API, but I know that's not a good reason to make everyone do a ton more work :).

The specific change in this PR is not introducing that big of a dependency, so I say we go forward with it but keep these potential issues with entangling the two libraries further in mind as we continue. If they never become issues, great, full steam ahead!

@cmalinmayor cmalinmayor self-requested a review October 25, 2023 19:16
@cmalinmayor cmalinmayor merged commit ed2b7b1 into Janelia-Trackathon-2023:main Oct 30, 2023
14 checks passed
@cmalinmayor
Copy link
Collaborator

Thanks @bentaculum!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants