Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data
mixMPLNFA
is an R package for clustering count data using a parsimonious family of
mixtures of multivariate Poisson-log normal factor analyzers
(MPLNFA), fitted via variational Gaussian approximations. It was developed
with clustering of RNA sequencing data as the motivation, but the
clustering method may be applied to other types of count data. The model
imposes a factor analyzer structure, which reduces the number of free
covariance parameters to be estimated. With this structure, the number of
covariance parameters is linear in the data dimensionality, making the
family well suited to the analysis of high-dimensional discrete data.
The package provides functions for data simulation and for clustering,
with parameter estimation via a variational Gaussian approximation
combined with an expectation-maximization (EM) algorithm. Information
criteria (AIC, BIC, AIC3 and ICL) are offered for model selection.
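To make the "linear in data dimensionality" claim concrete, the sketch below compares the number of free covariance parameters per component under an unconstrained covariance matrix and under a factor analyzer structure. The formula used is the standard count for an unconstrained factor analyzer model with q latent factors; it is assumed here that the same count applies to the most general model in this family. The function names are hypothetical and are not part of mixMPLNFA.

```r
# Free covariance parameters per component for p-dimensional data.
# Unconstrained covariance matrix: p(p + 1)/2, which is quadratic in p.
fullCovParams <- function(p) {
  p * (p + 1) / 2
}

# Factor analyzer structure Sigma = Lambda %*% t(Lambda) + Psi with q factors
# (assumed standard count): p*q loadings, minus q(q - 1)/2 for rotational
# invariance, plus p diagonal entries of Psi. Linear in p for fixed q.
factorCovParams <- function(p, q) {
  p * q - q * (q - 1) / 2 + p
}

dims <- c(10, 100, 1000)
paramTable <- data.frame(
  p = dims,
  full = fullCovParams(dims),
  factorAnalyzer = factorCovParams(dims, q = 2)
)
print(paramTable)
```

With q = 2, the factor analyzer count grows by 3 for each added dimension, whereas the unconstrained count grows quadratically.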
To install the latest version of the package:
require("devtools")
devtools::install_github("anjalisilva/mixMPLNFA", build_vignettes = TRUE)
library("mixMPLNFA")
To run the Shiny app (under construction):
mixMPLNFA::runMixMPLNFA()
To list all functions available in the package:
ls("package:mixMPLNFA")
mixMPLNFA
contains 4 functions.
- mplnFADataGenerator for generating simulated data with a known number of latent factors, a known covariance structure model and a known number of clusters/components via mixtures of multivariate Poisson-log normal factor analyzers
- MPLNFAClust for carrying out clustering of count data using parsimonious mixtures of multivariate Poisson-log normal factor analyzers. The input can be a user-provided count dataset or a dataset generated via the mplnFADataGenerator() function
- mplnFAVisLine for visualizing clustering results as line plots
- runMixMPLNFA is the Shiny implementation of MPLNFAClust (under construction)
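As a rough illustration of the kind of data these functions work with, the base R sketch below simulates counts from a two-component mixture with a factor analyzer structure on the log scale. It is hand-written for illustration and does not call mplnFADataGenerator(); all object names and parameter values are assumptions, and any normalization or library-size offset is omitted.

```r
set.seed(1234)

n <- 200  # observations
p <- 6    # dimensions (e.g. samples)
q <- 2    # latent factors
G <- 2    # clusters/components
mixProps <- c(0.6, 0.4)

# True cluster label of each observation
labels <- sample(seq_len(G), size = n, replace = TRUE, prob = mixProps)

# Component-specific parameters (illustrative values only)
mu <- list(rep(4, p), rep(1, p))
Lambda <- lapply(seq_len(G), function(g) {
  matrix(rnorm(p * q, sd = 0.3), nrow = p, ncol = q)
})
Psi <- lapply(seq_len(G), function(g) runif(p, min = 0.05, max = 0.2))

counts <- matrix(0L, nrow = n, ncol = p)
for (i in seq_len(n)) {
  g <- labels[i]
  u <- rnorm(q)                               # latent factors
  eps <- rnorm(p, sd = sqrt(Psi[[g]]))        # diagonal noise term
  theta <- mu[[g]] + as.vector(Lambda[[g]] %*% u) + eps
  counts[i, ] <- rpois(p, lambda = exp(theta))  # Poisson layer
}

dim(counts)
table(labels)
```

The latent vector theta has covariance Lambda %*% t(Lambda) + diag(Psi) within each component, and the observed counts are Poisson with mean exp(theta).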
For more information, see details section below. An overview of the package is illustrated below:
Mixture model-based clustering methods can be over-parameterized in high-dimensional spaces, especially as the number of clusters increases. Subspace clustering allows data to be clustered in low-dimensional subspaces while keeping all the dimensions, by introducing restrictions on the mixture parameters (Bouveyron and Brunet, 2014). These restrictions aim to yield parsimonious models that remain sufficiently flexible for clustering purposes. Since the covariance matrices contribute the largest number of free parameters, they are a natural focus for the introduction of parsimony.
The factor analysis model was introduced by Spearman (1904) and is
useful in modeling the covariance structure of high-dimensional data
using a small number of latent variables. The mixture of factor
analyzers model was later introduced by Ghahramani et al. (1996); this
model is able to concurrently perform clustering and, within each
cluster, local dimensionality reduction. In 2008, a family of eight
parsimonious Gaussian mixture models (PGMMs; McNicholas and Murphy,
2008) with parsimonious covariance structures was introduced. In 2019,
a model-based clustering methodology using mixtures of multivariate
Poisson-log normal distributions (MPLN; Aitchison and Ho, 1989) was
developed by Silva et al. (2019) to analyze multivariate count
measurements. In the current work, a
family of mixtures of MPLN factor analyzers that is analogous to the
PGMM family is developed, by considering the general mixture of factor
analyzers model (
Subedi and Browne (2020) proposed a framework for parameter estimation that uses a variational Gaussian approximation (VGA) for multivariate Poisson-log normal distribution-based mixture models. Markov chain Monte Carlo expectation-maximization (MCMC-EM) has also been used for parameter estimation of MPLN-based mixture models, but VGA was shown to be computationally efficient (Silva et al., 2023) and alleviates the challenges of the MCMC-EM algorithm. In VGA, the posterior distribution is approximated by minimizing the Kullback-Leibler (KL) divergence between the true and the approximating densities. A variational-EM based framework is used for parameter estimation.
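The link between minimizing the KL divergence and parameter estimation can be written as the standard variational decomposition of the marginal log-likelihood. The notation is assumed for illustration: y denotes the observed counts, θ the latent variable, f the model densities and q the approximating (Gaussian) density.

```latex
\log f(\mathbf{y})
  = F(q, \mathbf{y})
  + D_{\mathrm{KL}}\bigl(q(\boldsymbol{\theta}) \,\|\, f(\boldsymbol{\theta} \mid \mathbf{y})\bigr),
\qquad
F(q, \mathbf{y})
  = \int q(\boldsymbol{\theta})
    \log \frac{f(\mathbf{y}, \boldsymbol{\theta})}{q(\boldsymbol{\theta})}
    \, d\boldsymbol{\theta},
\qquad
D_{\mathrm{KL}}\bigl(q \,\|\, f\bigr)
  = \int q(\boldsymbol{\theta})
    \log \frac{q(\boldsymbol{\theta})}{f(\boldsymbol{\theta} \mid \mathbf{y})}
    \, d\boldsymbol{\theta}.
```

Because the left-hand side does not depend on q and the KL divergence is non-negative, minimizing the KL divergence over q is equivalent to maximizing the lower bound F(q, y).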
Four model selection criteria are offered, which include the Akaike information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), a variation of the AIC used by Bozdogan (1994) called AIC3, and the integrated completed likelihood (ICL; Biernacki et al., 2000).
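As a minimal sketch of how such criteria are typically computed from a fitted model, the function below takes a maximized log-likelihood, the number of free parameters, the number of observations and a matrix of posterior membership probabilities. It assumes the "smaller is better" sign convention and a MAP-based form of the ICL penalty; the package may use a different convention, and the function name is hypothetical.

```r
# logLik:  maximized log-likelihood of the fitted model
# nParams: number of free parameters
# nObs:    number of observations
# z:       nObs x G matrix of posterior membership probabilities
computeCriteria <- function(logLik, nParams, nObs, z) {
  bic <- -2 * logLik + nParams * log(nObs)
  # Probability assigned to each observation's most likely (MAP) cluster
  mapLabels <- apply(z, 1, which.max)
  mapProbs <- z[cbind(seq_len(nrow(z)), mapLabels)]
  c(
    AIC  = -2 * logLik + 2 * nParams,
    BIC  = bic,
    AIC3 = -2 * logLik + 3 * nParams,
    # Assumed form: BIC plus a penalty for uncertain cluster assignments
    ICL  = bic - 2 * sum(log(mapProbs))
  )
}

# Toy inputs, for illustration only
z <- matrix(c(0.9, 0.1,
              0.2, 0.8,
              0.6, 0.4), ncol = 2, byrow = TRUE)
crit <- computeCriteria(logLik = -100, nParams = 5, nObs = 3, z = z)
print(crit)
```

Under this convention the model with the smallest value of a criterion is selected, and ICL is never smaller than BIC because the assignment penalty is non-negative.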
Starting values play an important role in the successful operation of this algorithm. There may be issues with singularity, in which case altering the initialization method, or changing the initialization values by setting a different seed, may help. See the function examples or the vignette for details.
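The idea of trying several seeds can be sketched in base R as below. This is a generic strategy, assuming k-means on log-transformed counts as the initialization; it does not reproduce the initialization options of MPLNFAClust, and all object names are hypothetical.

```r
# Toy count data with two well-separated groups (illustration only)
set.seed(1)
toyCounts <- rbind(
  matrix(rpois(50 * 4, lambda = 5), ncol = 4),
  matrix(rpois(50 * 4, lambda = 50), ncol = 4)
)
logCounts <- log(toyCounts + 1)  # +1 avoids log(0)

# Run k-means from several seeds and keep the best starting partition
seeds <- c(10, 20, 30)
fits <- lapply(seeds, function(s) {
  set.seed(s)
  kmeans(logCounts, centers = 2)
})
withinSS <- vapply(fits, function(fit) fit$tot.withinss, numeric(1))
bestInit <- fits[[which.min(withinSS)]]

table(bestInit$cluster)
```

The resulting labels in bestInit$cluster could serve as starting cluster memberships; if a run of the clustering algorithm fails, another seed from the list gives an alternative starting point.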
The Shiny app employing MPLNFAClust can be run and the results visualized:
mixMPLNFA::runMixMPLNFA()
For tutorials and plot interpretation, refer to the vignette (under construction):
browseVignettes("mixMPLNFA")
citation("mixMPLNFA")
Payne, A., A. Silva, S. J. Rothstein, P. D. McNicholas, and S. Subedi (2023) Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data. Unpublished.
A BibTeX entry for LaTeX users is
@unpublished{,
title = "Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data",
author = "A. Payne and A. Silva and S. J. Rothstein and P. D. McNicholas and S. Subedi",
note = "Unpublished",
year = "2023",
}
- Aitchison, J. and C. H. Ho (1989). The multivariate Poisson-log normal distribution. Biometrika.
- Ghahramani, Z., G. E. Hinton, et al. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6.
- Spearman, C. (1904). The proof and measurement of association between two things. The American Journal of Psychology, 15(1).
- Anjali Silva ([email protected]).
- Andrea Payne ([email protected]).
- Sanjeena Dang ([email protected]).
- Anjali Silva ([email protected]).
The mixMPLNFA
repository welcomes issues, enhancement requests, and other
contributions. To submit an issue, use the GitHub
issues page.
- Dr. Marcelo Ponce, SciNet HPC Consortium, University of Toronto, ON, Canada for all the computational support.
- Early work was funded by Natural Sciences and Engineering Research Council of Canada (Subedi) and Queen Elizabeth II Graduate Scholarship (Silva).
- Later work was supported by the Postdoctoral Fellowship award from the Canadian Institutes of Health Research (Silva) and the Canada Natural Sciences and Engineering Research Council grant 400920-2013 (Subedi).