Merge branch 'master' into package for v1.1.1 release

QMUL · Feb 9, 2024 · f53daff · f53daff
2 parents 80bc8a2 + 54e5e29
commit f53daff
Show file tree

Hide file tree

Showing 11 changed files with 662 additions and 109 deletions.
diff --git a/.gitignore b/.gitignore
@@ -7,3 +7,6 @@ src/*.dll
 R/RcppExports.R
 src/RcppExports.cpp
 man/*
+html/
+latex/
+*.pdf
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: poLCAParallel
 Type: Package
 Title: Polytomous variable Latent Class Analysis Parallel
-Version: 1.1.0
+Version: 1.1.1
 Author: Sherman Lo <[email protected]>,
  Drew Linzer <[email protected]>,
  Jeffrey Lewis <[email protected]>.

diff --git a/Doxyfile b/Doxyfile
@@ -0,0 +1,7 @@
+PROJECT_NAME = "poLCAParallel"
+PROJECT_NUMBER = "v1.1.1"
+PROJECT_BRIEF = "C++ Implementation of poLCA (R package)"
+INPUT = src/
+INPUT += README.md
+USE_MDFILE_AS_MAINPAGE = README.md
+EXTRACT_ALL = YES
diff --git a/README.md b/README.md
@@ -2,6 +2,8 @@
 
 ## Polytomous Variable Latent Class Analysis
 
+### With Bootstrap Likelihood Ratio Test
+
 Sherman E. Lo, Queen Mary, University of London
 
 A reimplementation of poLCA
@@ -12,7 +14,7 @@ especially with multiple repetitions by using multiple threads.
 
 ## About poLCAParallel
 
-The library poLCAParallel reimplements poLCA and its bootstrap log-likelihood
+The library poLCAParallel reimplements poLCA and the bootstrap log-likelihood
 ratio test in C++. This was done using
 [Rcpp](https://cran.r-project.org/web/packages/Rcpp) and
 [RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo) which
@@ -27,7 +29,7 @@ allows C++ code to interact with R. Additional notes include:
 * Use of [`std::map`](https://en.cppreference.com/w/cpp/container/map) for the
  chi-squared calculations
 
-Further reading available on a
+Further reading is available on a
 [QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html).
 
 ## About the Original Code
@@ -87,49 +89,62 @@ Run the following in R
 devtools::install_github("QMUL/poLCAParallel@package")
 ```
 
+## Installation from Releases
+
+Requires the R packages [Rcpp](https://cran.r-project.org/web/packages/Rcpp),
+[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo) and
+[scatterplot3d](https://cran.r-project.org/web/packages/scatterplot3d/index.html).
+
+Download the `.tar.gz` file from the releases. Install it using using
+`R CMD INSTALL`, for example
+
+```bash
+R CMD INSTALL --preclean --no-multiarch poLCAParallel-*.*.*.tar.gz
+```
+
 ## Installation using a git clone
 
-Requires the R package
-[devtools](https://www.r-project.org/nosvn/pandoc/devtools.html),
-[Rcpp](https://cran.r-project.org/web/packages/Rcpp),
-[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo).
-Optionally, creating the documentation requires the R package
+Requires the R packages [Rcpp](https://cran.r-project.org/web/packages/Rcpp) and
+[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo). The R
+package [devtools](https://www.r-project.org/nosvn/pandoc/devtools.html) is
+recommended. Optionally, creating the documentation requires the R package
 [roxygen2](https://cran.r-project.org/web/packages/roxygen2/index.html).
 
 Git clone this repository.
 
-```console
+```bash
 git clone https://github.com/QMUL/poLCAParallel.git
 ```
 
-In R, run the following so that the package can be compiled correctly
+Run the following so that the package can be compiled correctly
 
-```r
-Rcpp::compileAttributes("poLCAParallel")
+```bash
+R -e "Rcpp::compileAttributes('poLCAParallel')"
 ```
 
 and optionally for the documentation
 
-```r
-devtools::document("poLCAParallel")
+```bash
+R -e "devtools::document('poLCAParallel')"
 ```
 
 Finally
 
-```r
-devtools::install("poLCAParallel")
+```bash
+R -e "devtools::install('poLCAParallel')"
 ```
 
 to install the package. Alternatively, `R CMD INSTALL` can be used as shown
 below in a terminal
 
-```console
+```bash
 R CMD INSTALL --preclean --no-multiarch poLCAParallel
 ```
 
 ## Changes from the Orginal Code
 
 R scripts which compare poLCAParallel with poLCA are provided in `exec/`.
+Example use of a bootstrap likelihood ratio test is shown in `exec/3_blrt.R`.
 
 * The stopping condition of the EM algorithm, if the log-likelihood change after
  an iteration of EM is too small, is evaluated after the E step rather than the
@@ -140,7 +155,7 @@ R scripts which compare poLCAParallel with poLCA are provided in `exec/`.
 * The output `probs.start` are the initial probabilities used to achieve the
  maximum log-likelihood from *any* repetition rather than from the first
  repetition.
-* The output `eflag` is set to `TRUE` if *any* repetition had to be restarted,
+* The output `eflag` is set to `TRUE` if *any* repetition has to be restarted,
  rather than the repetition which achieves maximum log-likelihood.
 * An additional argument `n.thread` is provided to specify the number of threads
  to use.
@@ -152,61 +167,76 @@ R scripts which compare poLCAParallel with poLCA are provided in `exec/`.
 
 ## Further Development Notes
 
-* There are possible problems with implementing the calculations of the standard
- error. They are discussed [here](inst/note_standard_error.tex).
-* There is a possible underflow error if the number of categories is too large,
- more than ~300. This is because in the calculation of the log-likelihood, the
- probabilities from each category are multiplied by each other. If there are
- $J$ categories, then there are $J$ probabilities to multiply together. This
- is addressed in commit 85ee419 but reverted. Consider summing over log space
- instead.
+* The standard errors have not been implemented in C++. This is because there
+ are possible problems with implementing the calculations of the standard
+ error. They are discussed here [[.tex]](inst/note_standard_error.tex)
+ [[.md]](inst/note_standard_error.md). In-built GitHub tools may not render
+ equations correctly.
+* When calculating the likelihood, probabilities are iteratively multiplied,
+ this is much faster than taking the sum of log probabilities. However to avoid
+ underflow errors, the calculation of the likelihood, instead, uses the sum of
+ log probabilities when an underflow is detected. See `PosteriorUnnormalize()`
+ in `src/em_algorithm.*` for the implementation.
 * Multiple Newton steps can be taken instead of a single one.
 
 ## Code Style
 
-There was an attempt to use the
-[Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
+All generated documents and codes, eg from
 
-Armadillo objects are used sparingly, preferring the use of `double*` when
-handling vectors and matrices.
+```bash
+R -e "Rcpp::compileAttributes('poLCAParallel')"
+```
+
+and
+
+```bash
+R -e "devtools::document('poLCAParallel')"
+```
 
-All generated documents and codes shall not be included in the `master` branch.
-Instead, they shall be in the `package` branch so that this package can be
-installed using `devtools::install_github("QMUL/poLCAParallel@package")`, yet,
-avoiding duplicated code on the `master` branch.
+shall not be included in the `master` branch. Instead, they shall be in the
+`package` branch so that this package can be installed using
+`devtools::install_github("QMUL/poLCAParallel@package")`. This is to avoid
+having duplicate documentation and generated code on the `master` branch.
 
 Semantic versioning is used and tagged. Tags on the `master` branch shall have
 `v` prepended and `-master` appended, eg. `v1.1.0-master`. The corresponding
 tag on the `package` branch shall only have `v` prepended, eg. `v1.1.0`.
 
-## C++ Source Code Documentation
+There was an attempt to use the
+[Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
 
-The C++ code is compatible with [Doxygen](https://doxygen.nl/) by running
+Armadillo objects are used sparingly, preferring the use of `double*` when
+handling vectors and matrices.
 
-```console
-doxygen -g
-```
+## C++ Source Code Documentation
 
-to create a config file `Doxyfile`. Doxygen needs to know the source code is
-located in `src/` which can be done by modifying a line in the config file
-`Doxyfile`
+The C++ code documentation can be created with [Doxygen](https://doxygen.nl/)
+by running
 
 ```console
-INPUT = src/
+doxygen
 ```
 
-Afterwards, running `doxygen` will create documentation for the C++ code. The
-configure file can be further configured.
+and viewed at `html/index.html`.
 
 ## Citation
 
 Please consider citing the corresponding
-[QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html):
+[QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html)
 
-* Lo, S.E. (2022). Speeding up and Parallelising R packages (using Rcpp and C++
+* Lo, S.E. (2022). Speeding up and Parallelising R packages (using Rcpp and C++)
  | QMUL ITS Research Blog.
  [[link]](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html)
 
+and the publication below which this software was originally created for
+
+* Eto F, Samuel M, Henkin R, Mahesh M, Ahmad T, et al. (2023). Ethnic
+ differences in early onset multimorbidity and associations with health service
+ use, long-term prescribing, years of life lost, and mortality: A
+ cross-sectional study using clustering in the UK Clinical Practice Research
+ Datalink. *PLOS Medicine,* 20(10): e1004300.
+ <https://doi.org/10.1371/journal.pmed.1004300>
+
 ## References
 
 * Dziak, J. J., Lanza, S. T., & Tan, X. (2014). Effect size, statistical power,

diff --git a/exec/3_blrt.R b/exec/3_blrt.R
@@ -1,8 +1,20 @@
-# example script on bootstrap likelihood ratio test using poLCA and their sample
-# data
+# Example script on doing bootstrap likelihood ratio test using poLCAParallel
+#
+# For $R$ classes, bootstrap likelihood ratio test (BLRT) fits the null model
+# using $R-1$ classes and an alt model using $R$ classes, producing a likelihood
+# ratio. Using the fitted null and alt models, a parametric bootstrap can be
+# done to get an empirical distribution of the likelihood ratio.
+#
+# To do model selection, select the highest $R$ where the fitted log likelihood
+# ratio is on the 95% percentile or higher of the empirical (bootstrap)
+# distribution of log likelihood ratios, eg p-value less than 5%. Thus this
+# script should be run for a different number of classes, eg using an array job.
+#
+# The code uses poLCAParallel::blrt() to do BLRT. It records all bootstrap
+# samples log likelihood ratios, which are plotted. Figures are saved in the
+# current directory
 
-n_thread <- 32
-nrep <- 32 # number of different initial values (could be n_thread too)
+nrep <- 32 # number of different initial values
 n_class_max <- 10 # maximum number of classes to investigate
 n_bootstrap <- 100 # number of bootstrap samples
 set.seed(1746091653)
@@ -11,17 +23,15 @@ set.seed(1746091653)
 data(carcinoma, package = "poLCAParallel")
 data_og <- carcinoma
 data_column_names <- colnames(data_og)
-f <- cbind(A, B, C, D, E, F, G) ~ 1
+formula <- cbind(A, B, C, D, E, F, G) ~ 1
 
 # fit the model onto the data for different number of classes
 # save the fitted model into model_array
-# you have already have done this before
 model_array <- list()
 for (nclass in 1:n_class_max) {
  model <- poLCAParallel::poLCA(
- f, data_og,
- nclass = nclass, nrep = nrep, n.thread = n_thread,
- verbose = FALSE
+ formula, data_og,
+ nclass = nclass, nrep = nrep, verbose = FALSE
  )
  model_array[[nclass]] <- model
 }
@@ -46,8 +56,7 @@ for (nclass in 2:n_class_max) {
 
  # for each bootstrap sample, store the log likelihood ratio here
  bootstrap_results <- poLCAParallel::blrt(
- null_model, alt_model,
- n_bootstrap, n_thread, nrep
+ null_model, alt_model, n_bootstrap, nrep
  )
 
  # log likelihood ratio to compare the two models
@@ -59,9 +68,22 @@ for (nclass in 2:n_class_max) {
  p_value_array <- c(p_value_array, bootstrap_results[["p_value"]])
 }
 
-# plot the p value for each number of class
-# looking at the plot, I would select 3 classes as this is the biggest class
-# with a small p value.
+# plot the bootstrap distribution of the log likelihood ratios for each class
+# the red line shows the log likelihood ratio using the real data
+pdf("3_blrt_llik.pdf")
+boxplot(bootstrap_log_ratio_array,
+ xlab = "number of classses", ylab = "log likelihood ratio"
+)
+# also plot the log likelihood ratio when using the real data
+lines(1:n_class_max, fitted_log_ratio_array,
+ type = "b", col = "red", pch = 15
+)
+
+# Plot the p value for each number of class.
+# The p value is the proportion of bootstrap samples with log likelihood ratios
+# greater than using the real data.
+# Looking at the plot, I would select 3 or 4 classes as this is the biggest
+# class with a small p value.
 # Additional notes on how to interpret this graph:
 # - a low p value for a number of classes k suggest k number of classes is
 # better than k-1, so you would expect to see low p values for low number of
@@ -70,18 +92,9 @@ for (nclass in 2:n_class_max) {
 # uniform distribution, so for a class number too high, it should fluctuate
 # randomly between 0 and 1
 # the solid line is at 5%
+pdf("3_blrt_p_values.pdf")
 barplot(p_value_array,
  xlab = "number of classes", ylab = "p-value",
  names.arg = 1:n_class_max
 )
 abline(h = 0.05)
-
-
-# plot the bootstrap distribution of the log likelihood ratios for each class
-boxplot(bootstrap_log_ratio_array,
- xlab = "number of classses", ylab = "log likelihood ratio"
-)
-# also plot the log likelihood ratio when using the real data
-lines(1:n_class_max, fitted_log_ratio_array,
- type = "b", col = "red", pch = 15
-)