Enhance random initialization for K-means part of K-prototypes #116

Open
regorsmitz opened this issue Apr 4, 2019 · 1 comment

regorsmitz commented Apr 4, 2019

Thanks @nicodv for your response to my previous question about failed KPrototypes initialization, and for building this library, which I have found very helpful!

I now see that your K-means implementation initializes with points sampled from a normal distribution; sorry for my earlier confusion. That said, I don’t think the current behavior suits all use cases. In my case, it is important that initialization always succeeds, because I’d ideally like to run this job as part of a production pipeline. I think random initialization of K-means is standard, and with n_init set high enough it should give reasonably good results, depending on the dataset.

I could select a random set of points from my dataset and pass them explicitly as the K-means initialization, but (correct me if I’m wrong) this approach does not seem to allow taking advantage of n_init > 1, which makes a single random initialization much more likely to be suboptimal.

Thanks for reading, and sorry to be filling this repo with issues. If you want me to put in a PR for this change, I can give it a shot (adding something like init='all-random' to KPrototypes only, which would randomly initialize the K-means component n_init times).

nicodv (Owner) commented Apr 4, 2019

I've followed the papers by Huang (https://github.com/nicodv/kmodes#huang98), which sample the initial numerical values from a normal distribution.
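
For readers following along, here is a rough sketch of what "sampling from a normal distribution" can look like for the numerical part of the centroids. It assumes each numerical feature is drawn independently from a normal distribution with that feature's mean and standard deviation; the function name `normal_init_centroids` is hypothetical, and this is an illustration, not the library's actual code.

```python
import random
import statistics


def normal_init_centroids(num_rows, k, rng):
    """Draw k numerical centroids, one normal sample per feature.

    Assumption: each feature uses its own mean and (population) standard
    deviation, computed over the numerical columns of the data.
    """
    n_features = len(num_rows[0])
    # Transpose rows into per-feature columns.
    columns = [[row[j] for row in num_rows] for j in range(n_features)]
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    # Sampled centroids need not coincide with any actual data point.
    return [[rng.gauss(m, s) for m, s in zip(means, stds)] for _ in range(k)]


if __name__ == "__main__":
    data = [[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]]
    print(normal_init_centroids(data, k=2, rng=random.Random(0)))
```

Because the centroids are drawn from a distribution rather than picked from the data, a draw can land far from every observation, which is one plausible way for an initialization to end up with an empty cluster.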

Feel free to make a PR for this. It makes sense to open up the initialization of the k-means part of k-prototypes to enhancements. We'd have init_num and init_cat arguments to k-prototypes, I'd imagine.

In the meantime, you can do the sampling yourself and re-run k-prototypes each time with the chosen points as the initialization points. You're right, it's not supported out of the box.
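
A minimal sketch of that workaround, under several assumptions: the helper names (`sample_initial_centroids`, `best_of_restarts`, `fit_kprototypes_once`) are hypothetical, the categorical columns are assumed to be integer-coded already, and `KPrototypes` is assumed to accept `init` as a two-element list of numerical and categorical centroid arrays. How categorical initial centroids must be encoded may differ between kmodes versions, so check that against the version you have installed.

```python
import random


def sample_initial_centroids(rows, categorical, k, rng):
    """Pick k distinct rows and split them into numerical and categorical parts."""
    cat_idx = sorted(categorical)
    num_idx = [j for j in range(len(rows[0])) if j not in set(cat_idx)]
    chosen = rng.sample(rows, k)  # k distinct row positions, no replacement
    num_centroids = [[float(r[j]) for j in num_idx] for r in chosen]
    cat_centroids = [[r[j] for j in cat_idx] for r in chosen]
    return num_centroids, cat_centroids


def best_of_restarts(rows, categorical, k, fit_once, n_restarts=10, seed=0):
    """Emulate n_init > 1: re-run with fresh sampled points, keep the lowest cost.

    fit_once(init) must return a (cost, model) pair.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        init = sample_initial_centroids(rows, categorical, k, rng)
        cost, model = fit_once(init)
        if best is None or cost < best[0]:
            best = (cost, model)
    return best


def fit_kprototypes_once(X, categorical, k, init):
    """One k-prototypes run from explicit initial points (requires kmodes, numpy).

    Assumption: init=[numerical_centroids, categorical_centroids] is accepted
    and categorical values are already integer codes.
    """
    import numpy as np
    from kmodes.kprototypes import KPrototypes

    model = KPrototypes(
        n_clusters=k,
        init=[np.asarray(init[0]), np.asarray(init[1])],
        n_init=1,
    )
    model.fit(X, categorical=categorical)
    return model.cost_, model
```

Usage would look something like `best_of_restarts(rows, [2, 3], 4, lambda init: fit_kprototypes_once(X, [2, 3], 4, init))`, where `rows` is the data as a list of lists and `X` is the same data in whatever form you normally pass to `fit`. Keeping the restart loop separate from the fitting call means the same loop still works if the library later gains a built-in option for this.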

nicodv changed the title from "Implement random initialization for Kmeans" to "Enhance random initialization for K-means part of K-prototypes" on Apr 4, 2019