Add scatter plot of results to examples #8

hugo-pires · 2016-03-19T18:25:39Z

Do you suggest any easy way to plot input data and fitted clusters?

Thank you

nicodv · 2016-04-08T22:38:52Z

I generally would consider it outside the scope of this package.

hugo-pires · 2016-04-09T08:54:04Z

Well, just asking. I was thinking of a 2-D scatter plot of the data points, colored by cluster, with the centroids drawn at a different size. After some kind of dimensionality reduction, of course.
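
A minimal sketch of the kind of plot described here, assuming the data have already been reduced to two numeric dimensions by some method. The function name `plot_clusters`, the synthetic blobs, and the placeholder labels are all illustrative and not part of this package; the centroids are plain means of the synthetic numeric points.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(points_2d, labels, centroids_2d, ax=None):
    """Scatter 2-D points coloured by cluster, with larger centroid markers."""
    if ax is None:
        _, ax = plt.subplots()
    # One colour per cluster label
    ax.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap="tab10", s=20)
    # Centroids drawn larger, with a black edge so they stand out
    ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1],
               c=np.arange(len(centroids_2d)), cmap="tab10",
               s=200, marker="*", edgecolors="black")
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    return ax


# Synthetic stand-in data: two numeric blobs with placeholder labels
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
ax = plot_clusters(points, labels, centroids)
```

Call `plt.show()` or `ax.figure.savefig(...)` afterwards to view the result.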

nicodv · 2016-04-11T04:19:19Z

Something along these lines could be added to the examples. Not a priority for me, but feel free to make a pull request.

hugo-pires · 2016-04-18T11:11:46Z

I am looking for some Seaborn examples like:
Seaborn factor plot

jd155 · 2016-07-25T16:10:48Z

Really like this k-modes implementation, intend to use it a lot. Thanks @nicodv.

I agree that plotting functionality would be instructive, particularly in diagnosing model fit and determining how many clusters and centroids to use. I note the way Huang plots the results of his simulations - see page 9 onwards here: http://grid.cs.gsu.edu/~wkim/index_files/papers/kprototype.pdf - to determine the interactions and influences of numeric and categorical data, which seems advisable given the mixed data types. Scatterplots similar to these would be very useful IMO. Presumably it would be possible to use sklearn's PCA for dimensionality reduction.

To give you a steer, this is a snippet of code I use to visualise k-means model fits using PCA. (I haven't included all the variables and the model as I'm sure you'll get the idea.)

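import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the training data onto its first two principal components
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
# Colour each point by the cluster label assigned by the fitted model
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.show()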

nicodv · 2016-10-17T01:58:48Z

@jd155 Thanks for your comment and code sample.

First of all, note that Huang's examples are somewhat contrived: he starts with 2 numerical variables and shows the effect of a third categorical dimension by plotting in the 2 original numerical dimensions. The value of this visualization is evident, but there are many possible applications of k-modes or k-prototypes where this might not be so -- or where the visualization may be downright misleading!

Applying PCA to categorical variables is generally regarded as unwise, given their non-Gaussian nature. There are alternatives (e.g., correspondence analysis), but figuring out the best way of plotting is a research question in itself. Given the limitations of PCA (or any of the many other dimensionality reduction techniques), I'm doubtful I want to give users the illusion that what they are plotting is a faithful 2D representation of their data and clusters. This is especially so in the case of k-modes, and in cases where there are many categorical variables and not many numerical ones.

Given the above, I'd rather let the user come up with their own insights into proper visualization methods for their data.

More discussion is welcomed.

hugo-pires · 2016-10-17T10:01:17Z

Could a dendrogram be a better choice?

nicodv · 2016-10-17T18:01:27Z

@hugo-pires Since this package does not cluster hierarchically, I don't see how a dendrogram would help.

Jomonsugi · 2017-06-06T20:16:05Z

Would a silhouette plot make sense? Wouldn't we just need to produce a distance matrix to be on our way? If not, what metric should be used to evaluate the performance of the model?

nicodv · 2017-06-06T20:49:58Z

@Jomonsugi, yes, a silhouette plot would work well. Scikit-learn gives an example that could be adapted: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

Since the silhouette score can be computed based on a pre-computed distance matrix, that is all that would be needed to leverage scikit-learn's existing silhouette functions in combination with kmodes.
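
As a rough sketch of that idea, assuming a simple matching (Hamming) dissimilarity and a label vector such as a fitted model's `labels_`: the helper `hamming_matrix`, the toy data, and the placeholder labels below are illustrative only and are not provided by kmodes.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def hamming_matrix(X):
    """Pairwise fraction of mismatching attributes between rows of X."""
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)


# Toy categorical data (illustrative)
X = np.array([
    ["red", "small", "round"],
    ["red", "small", "oval"],
    ["blue", "large", "square"],
    ["blue", "large", "oval"],
])
# Placeholder labels; in practice these would come from the fitted model
labels = np.array([0, 0, 1, 1])

D = hamming_matrix(X)
# metric="precomputed" tells scikit-learn that D already holds pairwise distances
score = silhouette_score(D, labels, metric="precomputed")
```

The full pairwise matrix needs memory quadratic in the number of rows, so this is only practical for moderately sized data sets.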

mpikoula · 2017-06-16T15:43:10Z

I would argue it's not as easy as that to use the existing silhouette score in scikit-learn, as the algorithm calculates the distances between the cluster centres and each point using the mean rather than the mode. Passing pre-computed distances is not enough.

It is however an easy fix and that's how I've been using it.

In terms of calculating a mixed (numeric and categorical) silhouette score, would one use the same gamma as in k-prototypes? I've been using the average silhouette regardless of gamma.

bahung · 2017-06-29T09:54:38Z

@mpikoula Do you have to write a new function to import? Could you please share how to fix this function?

mpikoula · 2017-07-12T12:40:48Z

@bahung I've modified the silhouette_samples function by using the mode (from scipy.stats) rather than mean (there's two instances where this is needed). I pass the precomputed distances (based on the dissimilarity metric) matrix to the function.

I feel this is getting slightly off topic though!

delilio · 2018-02-05T20:13:37Z

@mpikoula Can you please post your solution?

Thanks.

royzawadzki · 2018-08-23T20:44:00Z

@mpikoula I'm trying to figure out how to obtain the dissimilarity metric so I can pass it into the modified silhouette_score function. It requires the "label values for each sample." Any pointers? Thanks.

loukach · 2018-10-30T22:09:11Z

@royzawadzki, have you found a solution? If so, any chance you could share it?
Thank you.

mpikoula · 2018-10-31T06:42:25Z

Hello and apologies for the late response. The dissimilarity metric I have been using is either a simple matching dissimilarity (obtained using the Hamming distance) or the Jaccard distance. Both are available through scipy.spatial.distance.
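
For readers looking for a starting point, here is a small sketch of computing both matrices with `scipy.spatial.distance`. It assumes the categories have already been encoded as integer codes, and it one-hot encodes them before applying the Jaccard distance; the toy array and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Categorical data encoded as integer codes (one column per attribute)
X_codes = np.array([
    [0, 1, 2],
    [0, 1, 0],
    [1, 0, 0],
])

# Hamming: fraction of attributes that differ between two rows
D_hamming = squareform(pdist(X_codes, metric="hamming"))

# Jaccard operates on boolean vectors, so one-hot encode each attribute first
X_onehot = np.hstack([
    X_codes[:, [j]] == np.unique(X_codes[:, j])
    for j in range(X_codes.shape[1])
])
D_jaccard = squareform(pdist(X_onehot, metric="jaccard"))
```

Either square matrix can then be passed to a silhouette function that accepts precomputed distances.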

LorenzoBottaccioli · 2019-04-18T15:18:18Z

Hi @mpikoula, can you please share a code example to compute the silhouette for kmodes?

avilacabs · 2019-05-15T11:20:44Z

@mpikoula so in silhouette_score function you use one of those distances (hamming or jaccard) and instead of the mean you use the mode on return, right?
Are you sure this works well for a mixed (numerical+categorical) dataset?

avilacabs · 2019-05-15T14:04:22Z

@nicodv how do I get a pre-computed distance matrix from kprototypes?

nicodv · 2019-05-15T20:42:48Z

@avilacabs , it's not currently available from the trained model object, but it's probably doable to set it as a post-training attribute (similar to cost_, for example).
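
Until something like that exists, a pairwise matrix can be built outside the model. The sketch below assumes the usual k-prototypes dissimilarity (squared Euclidean distance on the numerical columns plus gamma times the number of mismatching categorical columns); `mixed_distance_matrix` is a hypothetical helper, not a kmodes function, and the gamma value is arbitrary.

```python
import numpy as np


def mixed_distance_matrix(X_num, X_cat, gamma):
    """Pairwise dissimilarity for mixed data: numeric part + gamma * categorical part."""
    X_num = np.asarray(X_num, dtype=float)
    X_cat = np.asarray(X_cat)
    # Squared Euclidean distance over the numerical columns
    diff = X_num[:, None, :] - X_num[None, :, :]
    num_part = (diff ** 2).sum(axis=2)
    # Number of mismatching categorical attributes
    cat_part = (X_cat[:, None, :] != X_cat[None, :, :]).sum(axis=2)
    return num_part + gamma * cat_part


# Toy mixed data (illustrative)
X_num = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
X_cat = [["a", "x"], ["a", "y"], ["b", "y"]]
D_mixed = mixed_distance_matrix(X_num, X_cat, gamma=0.5)
```

Whether the same gamma should be used for evaluation as for training is an open question in this thread; the sketch simply takes it as a parameter.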

rosskempner · 2021-09-29T21:08:33Z

> @bahung I've modified the silhouette_samples function by using the mode (from scipy.stats) rather than mean (there's two instances where this is needed). I pass the precomputed distances (based on the dissimilarity metric) matrix to the function.
>
> I feel this is getting slightly off topic though!

Hi @mpikoula, could you point to the two instances where that is needed?

nicodv changed the title ~~Ploting~~ Add scatter plot of results to examples Apr 11, 2016

nicodv added the enhancement label Jul 27, 2016

nicodv mentioned this issue Jun 16, 2017

Determining the optimal number of clusters #46

Open
