-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Locate nearest clusters for given data #214
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #214 +/- ##
==========================================
+ Coverage 79.62% 79.82% +0.19%
==========================================
Files 11 11
Lines 854 892 +38
Branches 186 199 +13
==========================================
+ Hits 680 712 +32
- Misses 141 144 +3
- Partials 33 36 +3
Continue to review full report at Codecov.
|
This subsumes both, so you could just review this one. Or if you want to do it in parts, start with the other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a mini example to /examples
like a mini vignette?
Edit: example here: https://github.com/sphinx-gallery/sphinx-gallery/blob/master/examples/no_output/just_code.py
kmapper/kmapper.py
Outdated
---------- | ||
newdata : Numpy array | ||
New dataset. Accepts both 1-D and 2-D array. | ||
nodes : dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, this could be all of the nodes from a graph, right? It would be horribly inefficient because there's no way that a point would be close to a cluster that wasn't even in its open set, but still.
If I'm thinking correctly, if it's just for efficiency, then nearest_clusters
could itself receive the full graph, and itself call clusters_from_cover
before looping over the nodes to get cluster_members.
Am I thinking correctly?
Also, I lean towards standardizing on replacing clusters
with nodes
throughout.
And eventually, but not now, replacing cube
with openset
throughout;
if so, that would look something like:
def nearest_nodes(self, newdata, graph, cover, data, nn):
cube_ids = cover.find(newdata)
nodes = self.find_nodes(graph, cube_ids)
# then the rest unchanged...
for cluster_id, cluster_members in nodes.items():
cluster_data = data[cluster_members]
nn_data.append(cluster_data)
nn_cluster_ids.append([cluster_id]*len(cluster_data))
nn_data = np.vstack(nn_data)
nn_cluster_ids = np.concatenate(nn_cluster_ids)
nn.fit(nn_data)
nn_ids = nn.kneighbors(newdata, return_distance=False)
return np.unique(nn_cluster_ids[nn_ids])
@@ -827,6 +827,63 @@ def data_from_cluster_id(self, cluster_id, graph, data): | |||
else: | |||
return np.array([]) | |||
|
|||
def clusters_from_cover(self, cube_ids, graph): | |||
"""Returns the clusters and their members from the subset of the cover spanned by the given cube_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud. I'm trying to think of another name. Kmapper has a separate Cover
class, so calling this clusters_from_cover
suggests to me that a cover should be passed, but it isn't.
But a Cover doesn't have clusters, so I don't think this should go in the Cover
class.
If graph
were a class, this would go in there as graph.find_clusters_by_cube_ids(cube_ids)
or something.
Sort-of following the pattern from the last PR, maybe we rename this to find_clusters
find_nodes
In general in the above, I argue for converting |
458e094
to
4ae98ea
Compare
4ae98ea
to
3bd6116
Compare
3bd6116
to
0bd7b00
Compare
Hit a bit of a snag, but we should be good now. |
Fixes #208.