
EM-for-GMM

Implementation of EM fitting of a mixture of Gaussians on a two-dimensional data set.

Author: Md Kamrul Hasan
Date: 31st March, 2017

===============================================================================================

This project implements EM fitting of a mixture of Gaussians on a two-dimensional data set. I tried different numbers of mixture components, as well as tied vs. separate covariance matrices for each Gaussian.

Run instruction: python gmm_em.py (make sure points.dat is in the same directory)
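For orientation, here is a minimal sketch of one EM iteration for a GMM with separate covariances, assuming the data sit in an (n, 2) NumPy array. The function and variable names are illustrative, not the exact code in gmm_em.py:

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, weights, means, covs):
        # E-step: responsibilities r[i, j] = p(component j | point i)
        n, k = X.shape[0], len(weights)
        r = np.zeros((n, k))
        for j in range(k):
            r[:, j] = weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
        log_lik = np.log(r.sum(axis=1)).sum()  # data log likelihood at current params
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate mixing weights, means, and covariances
        nk = r.sum(axis=0)                     # effective count per component
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j]
        return weights, means, covs, log_lik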

===============================================================================================

Init EM: I randomly choose k (the number of clusters) data points to initialize the k means, and I initialize the k covariance matrices so that their determinants are non-zero.
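A minimal sketch of that initialization, under the same assumptions as above (illustrative names, not the exact code in gmm_em.py):

    import numpy as np

    def init_params(X, k, seed=None):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Pick k distinct data points as the initial means
        means = X[rng.choice(n, size=k, replace=False)]
        # Start every covariance at the data covariance plus a small ridge,
        # which keeps the determinant safely non-zero
        base = np.cov(X.T) + 1e-6 * np.eye(d)
        covs = np.array([base.copy() for _ in range(k)])
        weights = np.full(k, 1.0 / k)  # uniform mixing weights
        return weights, means, covs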

Output:

Five files:

1. seperate_cov_training.png: log likelihood on the training set vs. iteration for different numbers of mixtures, using separate covariance matrices for each Gaussian.

2. seperate_cov_dev.png: log likelihood on the dev set vs. iteration for different numbers of mixtures, using separate covariance matrices for each Gaussian.

3. tied_cov_training.png: log likelihood on the training set vs. iteration for different numbers of mixtures, using a tied covariance matrix shared by all Gaussians.

4. tied_cov_dev.png: log likelihood on the dev set vs. iteration for different numbers of mixtures, using a tied covariance matrix shared by all Gaussians.

5. scatter.png: scatter plot of all the data.

===============================================================================================

Result Analysis: From the scatter plot, one can guess that the number of clusters lies somewhere in [4, 5, 6, 7]. The log-likelihood graphs help pick the appropriate number: on both the training and dev data, the k whose curve reaches the highest log likelihood with the least fluctuation is a good choice. Here the curves for k = 5, 6, and 7 are almost identical, so any of them is a reasonable choice of cluster count. Since EM is a randomized algorithm the outcome can vary between runs, but I think a good choice is 5, 6, or 7. This conclusion also makes sense when looking at the scatter plot.
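That comparison can be made concrete with a short loop, reusing init_params and em_step from the sketches above; X_train and X_dev stand in for the training and dev splits of points.dat (hypothetical names):

    def fit_gmm(X, k, iters=200):
        # Reuses init_params and em_step from the sketches above
        weights, means, covs = init_params(X, k, seed=0)
        ll = -np.inf
        for _ in range(iters):
            weights, means, covs, ll = em_step(X, weights, means, covs)
        return weights, means, covs, ll

    for k in (4, 5, 6, 7):
        weights, means, covs, ll_train = fit_gmm(X_train, k)
        # Score the held-out dev split with the fitted parameters
        p = sum(w * multivariate_normal.pdf(X_dev, m, c)
                for w, m, c in zip(weights, means, covs))
        print(k, "train:", ll_train, "dev:", np.log(p).sum())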

With a tied covariance, EM converges much more quickly than with separate covariances.
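In the tied variant, the per-component covariance update in the M-step is replaced by a single shared estimate, roughly as follows (a sketch under the same assumptions as above; tied_covariance is an illustrative name):

    import numpy as np

    def tied_covariance(X, r, means):
        # Pool the responsibility-weighted scatter of every component
        # into one covariance matrix shared by all Gaussians
        n, d = X.shape
        cov = np.zeros((d, d))
        for j in range(means.shape[0]):
            diff = X - means[j]
            cov += (r[:, j, None] * diff).T @ diff
        return cov / n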
