Filter out nodes in graph based on number of elements #185
For the record, I tried manipulating the condition, but I guess it is safer to manipulate the graph after both clustering and mapper? |
It looks like you were the one who authored the commit that added the fixed value (kepler-mapper/kmapper/kmapper.py, lines 396 to 401, in 2ccce2d). |
If I loosely understand your commit message 2ccce2d, then it's possible that the clusterer does not have that attribute. Is the intent to first try it? |
Yes, I issued the pull request you are mentioning. I seem to recall that I got an error. I cannot justify the fixed number 2; it probably comes from the standard. I must admit I did not notice the logic you are referring to; I was naively focused on fixing the specific error mentioned in the pull request. Sorry! |
Want to issue a new PR that still works for your use case, which was breaking, and which re-adopts that more flexible logic? |
I certainly can, if I am able to get the logic :) |
Which scikit-learn clusterer are you referring to? DBSCAN doesn't -- I have used DBSCAN thusly:
That would be a tad hairy to manipulate afterwards because the complex could include links to clusters which might be dropped during adjustment, but in my naive opinion it's fine to do either before or after. |
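The DBSCAN snippet referenced in the comment above did not survive extraction. A minimal sketch of the kind of call being described, assuming a precomputed distance matrix; the `eps` and `min_samples` values and the toy data are illustrative, not from the thread:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated groups of points (toy data).
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.0]])

# Precomputed pairwise Euclidean distance matrix, as in the issue.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# min_samples is the attribute kmapper inspects when deciding whether a
# hypercube has enough points to be worth clustering at all.
clusterer = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
labels = clusterer.fit_predict(D)
```

With these toy distances, the first three points form one cluster and the last two another, so no points are marked as noise.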
Here is my full clustering section:
EDIT: added clustering import statement |
I re-read your earlier comment:
It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it. The logic is that if the hypercube doesn't have min_samples, then the clusterer run on that hypercube would definitely not find any cluster of points within that hypercube with at least min_samples. Because it skips rather than expands the window, there shouldn't be any difference in mapper output whether clusters are filtered out before or after the graph is created.

... except, hmm, in your case, where min_samples isn't used, I see that AgglomerativeClustering breaks when not fed at least two samples: https://github.com/scikit-learn/scikit-learn/blob/e5698bde9a8b719514bf39e6e5d58f90cfe5bc01/sklearn/cluster/_agglomerative.py#L796

I forgot to state an "otherwise" condition for this logic:
... failing the first three attempts, a min_cluster_samples size of 1 is used. I wonder (1) whether some people would be interested in retaining one-element clusters (I bet the answer is "yes"), and if so, (2) how best to tell if a clusterer that requires at least two samples is being used. |
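A sketch of the kind of fallback chain being discussed. The attribute names tried and their order here are my guess at the logic, not the actual kmapper code; the two fake clusterer classes only stand in for scikit-learn's `get_params` interface:

```python
def guess_min_cluster_samples(clusterer, fallback=1):
    """Try a few common scikit-learn size parameters in turn; if none is
    present (or set), fall back to a fixed size -- the thread debates
    whether that final fallback should be 1 or 2."""
    params = clusterer.get_params() if hasattr(clusterer, "get_params") else {}
    for name in ("min_cluster_size", "min_samples", "n_clusters"):
        value = params.get(name)
        if value is not None:
            return value
    return fallback

class FakeDBSCAN:
    """Stand-in clusterer that does expose a min_samples parameter."""
    def get_params(self):
        return {"min_samples": 5, "eps": 0.5}

class FakeAgglomerative:
    """Stand-in for @torlarse's case: none of the size parameters is set."""
    def get_params(self):
        return {"distance_threshold": 15, "n_clusters": None}
```

For the fake DBSCAN this returns 5; for the fake agglomerative clusterer every lookup misses (`n_clusters` is explicitly `None`), so the fixed fallback is used.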
Perhaps the default value of |
Or maybe if cluster_params |
@sauln what do you think about the following: allow
I don't see an easier way to handle clusterers such as AgglomerativeClustering which, in @torlarse's case, must have at least two samples per clustering attempt, but don't necessarily have any of |
Yes thanks, I hope I have this understanding of hypercubes, pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken" non-circular topology. It makes sense, since by inspection I can see the nodes around the holes have between 10 and 15 elements. Anyways, I made things work by a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this will be a problem experienced by others. |
I think agglomerative clustering is common enough that ideally kmapper should accommodate it without users having to modify kmapper.py -- glad you found a workaround though.
…On Tue, Jan 7, 2020, 5:47 AM torlarse ***@***.***> wrote:
It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it.

Yes thanks, I hope I have this understanding of hypercubes, pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken" non-circular topology. It makes sense, since by inspection I can see the nodes around the holes have between 10 and 15 elements.

Anyways, I made things work by a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this will be a problem experienced by others.
|
@sauln I'm going to make the final fallback be a min of 2, not a min of 1, and take out the hard-coded fix of 2 for precomputed matrices |
Seems reasonable to me 👍 |
I just found this again because I have @torlarse's original problem -- I want to filter out nodes that are too small for my liking. @sauln, I'm going to add an argument to .map that does what was originally asked -- specify a min number of points that a cluster must have in order to be retained as a node. "min_node_samples"? I'll also add a filter to the vis, because hey, why not. |
Is your feature request related to a problem? Please describe.
I have a point cloud with approximately 2000 data points. The clustering is based on a precomputed distance matrix. The graph produced by Mapper has a lot of nodes with between 1 and 5 elements, cluttering the visualization of the graph/complex.
Describe the solution you'd like
I would like to specify a minimum number of elements in a node of the graph for visualization.
Describe alternatives you've considered
Filter out minor nodes from the dictionary produced by mapper.map with a dict comprehension. I have looked at the scikit-learn docs for some solution on filtering out clusters with a small number of elements. I have not found such parameters when using a precomputed metric. Increasing distance_threshold helps up to a certain extent.
Additional context
I would like to discuss how to remove nodes from the graph without affecting the topology of the dataset.
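One way the dict-comprehension alternative mentioned above might look. A kmapper graph is a plain dict whose "nodes" entry maps node ids to lists of member indices and whose "links" entry maps node ids to lists of linked node ids; the helper name, the threshold, and the toy graph below are illustrative:

```python
def filter_small_nodes(graph, min_node_samples=5):
    """Drop nodes with fewer than min_node_samples members, and prune any
    links that start at or point to a dropped node."""
    kept = {node_id: members
            for node_id, members in graph["nodes"].items()
            if len(members) >= min_node_samples}
    links = {node_id: [target for target in targets if target in kept]
             for node_id, targets in graph.get("links", {}).items()
             if node_id in kept}
    filtered = dict(graph)  # shallow copy; any other keys pass through
    filtered["nodes"] = kept
    filtered["links"] = links
    return filtered

# Toy graph in the same shape kmapper's .map returns.
toy = {
    "nodes": {"a": [0, 1, 2, 3, 4, 5], "b": [6, 7], "c": [8, 9, 10, 11, 12]},
    "links": {"a": ["b", "c"], "b": ["c"]},
}
small = filter_small_nodes(toy, min_node_samples=5)
```

Note that this only cleans up the output dict for visualization; as discussed earlier in the thread, filtering after the graph is built can drop links along with the nodes they touch, which is exactly the "affecting the topology" concern raised here.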