Centroids output format #33

Closed
nathanlgrossman opened this issue Mar 29, 2017 · 1 comment

nathanlgrossman commented Mar 29, 2017

Based on the results of running kprototypes on the stocks.csv file included in the examples, I have concluded that kprototypes.cluster_centroids_ represents the centroids in the following format:
[array([[cluster 0 centroid coordinates in numerical space],
        [cluster 1 centroid coordinates in numerical space], ...]),
 array([[cluster 0 centroid coordinates in categorical space],
        [cluster 1 centroid coordinates in categorical space], ...])]
where the coordinates of the i-th cluster centroid, in either numerical or categorical space, are of the form
[x_i,0, x_i,1, ...]
where
x_i,0 is the coordinate for the first (i.e. left-most) column of (categorical or numerical) data
x_i,1 is the coordinate for the second (i.e. second left-most) column of (categorical or numerical) data
...
and where the j-th cluster centroid coordinate values in categorical space are elements of the set
{0, 1, 2, ...}
where
a value of 0 represents the category value whose name is first in alphabetical order
a value of 1 represents the category value whose name is second in alphabetical order
...
In other words, each value in categorical space is the code of the mode (i.e. the most frequently occurring category) of that column within the cluster, and the codes are assigned by sorting the category names alphabetically and numbering them 0, 1, 2, etc.
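
To make the layout described above concrete, here is a minimal sketch using plain Python lists as a stand-in for the arrays. The data, the helper names `centroid_of` and `column_mode`, and the two-column shapes are all made up for illustration; they are not part of the library and only mirror the structure as I understand it.

```python
from collections import Counter

# Hypothetical stand-in for the two-part centroid structure described above:
# element 0 holds the numerical coordinates, element 1 the categorical codes,
# with one row per cluster in each part.
centroids = [
    [[0.5, 1.2], [3.1, 0.4]],  # numerical part: 2 clusters x 2 numerical columns
    [[0, 2], [1, 0]],          # categorical part: 2 clusters x 2 categorical columns
]


def centroid_of(centroids, i):
    """Return (numerical coordinates, categorical codes) for cluster i."""
    numerical, categorical = centroids
    return list(numerical[i]), list(categorical[i])


def column_mode(values):
    """Return the most frequently occurring value in one column of a cluster."""
    return Counter(values).most_common(1)[0][0]


# Coordinates of cluster 1 in both spaces.
num_coords, cat_codes = centroid_of(centroids, 1)
print(num_coords, cat_codes)

# The categorical coordinate of a column is the mode of that column's codes
# over the points assigned to the cluster (made-up codes here).
print(column_mode([2, 0, 2, 1]))
```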

Can you please tell me if my conclusions are correct? If there is documentation that describes all this, I apologize for this long-winded question, and I would greatly appreciate a pointer to that documentation.

Thank you very much.

Owner

nicodv commented Mar 30, 2017

No documentation yet, sorry about that. (#28)

The mapping between the original categorical values and the {0, 1, 2, ...} values you see in the cluster centroids is not based on alphabetical order. Instead, you can inspect kprototypes.enc_map_ to see how the mapping is defined.
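
As a rough sketch of how such a mapping could be used to translate the centroid codes back to category names: the example below assumes, purely for illustration, that the map is a list with one dict per categorical column, each mapping an original value to its integer code. The exact structure of kprototypes.enc_map_ may differ between versions, and the helper `decode_centroids` and the sample values are hypothetical.

```python
def decode_centroids(encoded_rows, enc_map):
    """Translate integer category codes back to the original category values.

    Assumption: enc_map is a list with one dict per categorical column,
    each mapping original value -> integer code.
    """
    # Invert each per-column map so codes can be looked up directly.
    inverse = [
        {code: value for value, code in col_map.items()}
        for col_map in enc_map
    ]
    return [
        [inverse[j][code] for j, code in enumerate(row)]
        for row in encoded_rows
    ]


# Hypothetical mapping; note that the codes need not follow alphabetical order.
enc_map = [
    {"tech": 0, "energy": 1, "finance": 2},
    {"NYSE": 0, "NASDAQ": 1},
]

# Hypothetical categorical part of the centroids: one row per cluster.
encoded = [[0, 1], [2, 0]]

decoded = decode_centroids(encoded, enc_map)
print(decoded)
```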

This is how it works in version 0.6, but it has changed in version 0.7. Instead of presenting the encoded categorical values, the cluster centroids will simply show the original categorical values. That way, you don't have to concern yourself with the mapping at all.

nicodv added the question label Apr 1, 2017
nicodv closed this as completed Sep 19, 2017