Filter out nodes in graph based on number of elements #185
For the record, I tried manipulating the condition, but I guess it is safer to manipulate the graph after both clustering and mapper? |
It looks like you were the one who authored the commit that added the fixed value (kepler-mapper/kmapper/kmapper.py, lines 396 to 401, in 2ccce2d). |
If I loosely understand your commit message 2ccce2d, then it's possible that the clusterer does not have that attribute. Is the intent to first try it? |
Yes, I issued the pull request you are mentioning. I seem to recall that I got an error. I cannot justify the fixed number 2; it probably comes from the standard. I must admit I did not notice the logic you are referring to; I was naively focused on fixing the specific error mentioned in the pull request. Sorry! |
Want to issue a new PR that still works for your use case, which was breaking, and which re-adopts that more flexible logic? |
I certainly can, if I am able to get the logic :) |
Which scikit-learn clusterer are you referring to? DBSCAN doesn't -- I have used DBSCAN thusly:
That would be a tad hairy to manipulate afterwards because the complex could include links to clusters which might be dropped during adjustment, but in my naive opinion it's fine to do either before or after. |
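The DBSCAN snippet referenced in the comment above did not survive extraction. A minimal sketch of the kind of call being described, assuming a precomputed distance matrix; the `eps` and `min_samples` values and the toy data are illustrative, not from the thread:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two well-separated groups of points (toy data).
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.0]])

# Precomputed pairwise Euclidean distance matrix, as in the issue.
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

# min_samples is the attribute kmapper inspects when deciding whether a
# hypercube has enough points to be worth clustering at all.
clusterer = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
labels = clusterer.fit_predict(D)
```

With these toy distances, the first three points form one cluster and the last two another, so no points are marked as noise.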
Here is my full clustering section:
EDIT: added clustering import statement |
I re-read your earlier comment:
It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it. The logic is that if the hypercube doesn't have min_samples, then the clusterer run on that hypercube would definitely not find any cluster of points within that hypercube with at least min_samples. Because it skips rather than expands the window, there shouldn't be any difference in mapper output whether clusters are filtered out before or after the graph is created.

... except, hmm, in your case, where min_samples isn't used, I see that AgglomerativeClustering breaks when not fed at least two samples: https://github.com/scikit-learn/scikit-learn/blob/e5698bde9a8b719514bf39e6e5d58f90cfe5bc01/sklearn/cluster/_agglomerative.py#L796

I forgot to state an "otherwise" condition for this logic:
... failing the first three attempts, a min_cluster_samples size of 1 is used. I wonder (1) whether some people would be interested in retaining one-element clusters (I bet the answer is "yes"), and if so, (2) how best to tell if a clusterer that requires at least two samples is being used. |
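A sketch of the kind of fallback chain being discussed. The attribute names tried and their order here are my guess at the logic, not the actual kmapper code; the two fake clusterer classes only stand in for scikit-learn's `get_params` interface:

```python
def guess_min_cluster_samples(clusterer, fallback=1):
    """Try a few common scikit-learn size parameters in turn; if none is
    present (or set), fall back to a fixed size -- the thread debates
    whether that final fallback should be 1 or 2."""
    params = clusterer.get_params() if hasattr(clusterer, "get_params") else {}
    for name in ("min_cluster_size", "min_samples", "n_clusters"):
        value = params.get(name)
        if value is not None:
            return value
    return fallback

class FakeDBSCAN:
    """Stand-in clusterer that does expose a min_samples parameter."""
    def get_params(self):
        return {"min_samples": 5, "eps": 0.5}

class FakeAgglomerative:
    """Stand-in for @torlarse's case: none of the size parameters is set."""
    def get_params(self):
        return {"distance_threshold": 15, "n_clusters": None}
```

For the fake DBSCAN this returns 5; for the fake agglomerative clusterer every lookup misses (`n_clusters` is explicitly `None`), so the fixed fallback is used.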
Perhaps the default value of |
Or maybe if cluster_params |
@sauln what do you think about the following: allow
I don't see an easier way to handle clusterers such as AgglomerativeClustering which, in @torlarse's case, must have at least two samples per clustering attempt, but don't necessarily have any of |
Yes thanks, I hope I have this understanding of hypercubes, pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken" non-circular topology. It makes sense, since by inspection I can see the nodes around the holes have between 10 and 15 elements. Anyways, I made things work by a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this will be a problem experienced by others. |
I think agglomerative clustering is common enough that ideally kmapper should accommodate it without users having to modify kmapper.py -- glad you found a workaround though.
…On Tue, Jan 7, 2020, 5:47 AM torlarse ***@***.***> wrote:
It is important to specify here, in case you think otherwise -- hypercubes are always of a fixed interval for a given mapping run. If a given hypercube doesn't have min_samples, it is skipped -- it is not extended until it has min_samples within it.

Yes thanks, I hope I have this understanding of hypercubes, pardon my imprecise English. Setting min_cluster_samples=20 on my data caused Mapper to return a "broken" non-circular topology. It makes sense, since by inspection I can see the nodes around the holes have between 10 and 15 elements.

Anyways, I made things work by a combination of setting min_cluster_samples=10 in kmapper.py and distance_threshold=15 in the clustering function call. I don't know if this will be a problem experienced by others.
|
@sauln I'm going to make the final fallback be a min of 2, not a min of 1, and take out the hard-coded fix of 2 for precomputed matrices |
Seems reasonable to me 👍 |
I just found this again because I have @torlarse's original problem -- I want to filter out nodes that are too small for my liking. @sauln, I'm going to add an argument to .map that does what was originally asked -- specify a min number of points that a cluster must have in order to be retained as a node. "min_node_samples"? I'll also add a filter to the vis, because hey, why not. |
Is your feature request related to a problem? Please describe.
I have a point cloud with approximately 2000 data points. The clustering is based on a precomputed distance matrix. The graph produced by Mapper has a lot of nodes with between 1 and 5 elements, cluttering the visualization of the graph/complex.
Describe the solution you'd like
I would like to specify a minimum number of elements in a node of the graph for visualization.
Describe alternatives you've considered
Filter out minor nodes from the dictionary produced by mapper.map with a dict comprehension. I have looked at the scikit-learn docs for some solution on filtering out clusters with a small number of elements. I have not found such parameters when using a precomputed metric. Increasing distance_threshold helps up to a certain extent.
Additional context
I would like to discuss how to remove nodes from the graph without affecting the topology of the dataset.
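One way the dict-comprehension alternative mentioned above might look. A kmapper graph is a plain dict whose "nodes" entry maps node ids to lists of member indices and whose "links" entry maps node ids to lists of linked node ids; the helper name, the threshold, and the toy graph below are illustrative:

```python
def filter_small_nodes(graph, min_node_samples=5):
    """Drop nodes with fewer than min_node_samples members, and prune any
    links that start at or point to a dropped node."""
    kept = {node_id: members
            for node_id, members in graph["nodes"].items()
            if len(members) >= min_node_samples}
    links = {node_id: [target for target in targets if target in kept]
             for node_id, targets in graph.get("links", {}).items()
             if node_id in kept}
    filtered = dict(graph)  # shallow copy; any other keys pass through
    filtered["nodes"] = kept
    filtered["links"] = links
    return filtered

# Toy graph in the same shape kmapper's .map returns.
toy = {
    "nodes": {"a": [0, 1, 2, 3, 4, 5], "b": [6, 7], "c": [8, 9, 10, 11, 12]},
    "links": {"a": ["b", "c"], "b": ["c"]},
}
small = filter_small_nodes(toy, min_node_samples=5)
```

Note that this only cleans up the output dict for visualization; as discussed earlier in the thread, filtering after the graph is built can drop links along with the nodes they touch, which is exactly the "affecting the topology" concern raised here.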