Skip to content

Commit

Permalink
Merge branch 'master' into package for v1.1.1 release
Browse files Browse the repository at this point in the history
  • Loading branch information
shermanlo77 committed Feb 9, 2024
2 parents 80bc8a2 + 54e5e29 commit f53daff
Show file tree
Hide file tree
Showing 11 changed files with 662 additions and 109 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -7,3 +7,6 @@ src/*.dll
R/RcppExports.R
src/RcppExports.cpp
man/*
html/
latex/
*.pdf
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
Package: poLCAParallel
Type: Package
Title: Polytomous variable Latent Class Analysis Parallel
Version: 1.1.0
Version: 1.1.1
Author: Sherman Lo <[email protected]>,
Drew Linzer <[email protected]>,
Jeffrey Lewis <[email protected]>.
Expand Down
7 changes: 7 additions & 0 deletions Doxyfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
PROJECT_NAME = "poLCAParallel"
PROJECT_NUMBER = "v1.1.1"
PROJECT_BRIEF = "C++ Implementation of poLCA (R package)"
INPUT = src/
INPUT += README.md
USE_MDFILE_AS_MAINPAGE = README.md
EXTRACT_ALL = YES
122 changes: 76 additions & 46 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@

## Polytomous Variable Latent Class Analysis

### With Bootstrap Likelihood Ratio Test

Sherman E. Lo, Queen Mary, University of London

A reimplementation of poLCA
Expand All @@ -12,7 +14,7 @@ especially with multiple repetitions by using multiple threads.

## About poLCAParallel

The library poLCAParallel reimplements poLCA and its bootstrap log-likelihood
The library poLCAParallel reimplements poLCA and the bootstrap log-likelihood
ratio test in C++. This was done using
[Rcpp](https://cran.r-project.org/web/packages/Rcpp) and
[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo) which
Expand All @@ -27,7 +29,7 @@ allows C++ code to interact with R. Additional notes include:
* Use of [`std::map`](https://en.cppreference.com/w/cpp/container/map) for the
chi-squared calculations

Further reading available on a
Further reading is available on a
[QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html).

## About the Original Code
Expand Down Expand Up @@ -87,49 +89,62 @@ Run the following in R
devtools::install_github("QMUL/poLCAParallel@package")
```

## Installation from Releases

Requires the R packages [Rcpp](https://cran.r-project.org/web/packages/Rcpp),
[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo) and
[scatterplot3d](https://cran.r-project.org/web/packages/scatterplot3d/index.html).

Download the `.tar.gz` file from the releases. Install it using using
`R CMD INSTALL`, for example

```bash
R CMD INSTALL --preclean --no-multiarch poLCAParallel-*.*.*.tar.gz
```

## Installation using a git clone

Requires the R package
[devtools](https://www.r-project.org/nosvn/pandoc/devtools.html),
[Rcpp](https://cran.r-project.org/web/packages/Rcpp),
[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo).
Optionally, creating the documentation requires the R package
Requires the R packages [Rcpp](https://cran.r-project.org/web/packages/Rcpp) and
[RcppArmadillo](https://cran.r-project.org/web/packages/RcppArmadillo). The R
package [devtools](https://www.r-project.org/nosvn/pandoc/devtools.html) is
recommended. Optionally, creating the documentation requires the R package
[roxygen2](https://cran.r-project.org/web/packages/roxygen2/index.html).

Git clone this repository.

```console
```bash
git clone https://github.com/QMUL/poLCAParallel.git
```

In R, run the following so that the package can be compiled correctly
Run the following so that the package can be compiled correctly

```r
Rcpp::compileAttributes("poLCAParallel")
```bash
R -e "Rcpp::compileAttributes('poLCAParallel')"
```

and optionally for the documentation

```r
devtools::document("poLCAParallel")
```bash
R -e "devtools::document('poLCAParallel')"
```

Finally

```r
devtools::install("poLCAParallel")
```bash
R -e "devtools::install('poLCAParallel')"
```

to install the package. Alternatively, `R CMD INSTALL` can be used as shown
below in a terminal

```console
```bash
R CMD INSTALL --preclean --no-multiarch poLCAParallel
```

## Changes from the Orginal Code

R scripts which compare poLCAParallel with poLCA are provided in `exec/`.
Example use of a bootstrap likelihood ratio test is shown in `exec/3_blrt.R`.

* The stopping condition of the EM algorithm, if the log-likelihood change after
an iteration of EM is too small, is evaluated after the E step rather than the
Expand All @@ -140,7 +155,7 @@ R scripts which compare poLCAParallel with poLCA are provided in `exec/`.
* The output `probs.start` are the initial probabilities used to achieve the
maximum log-likelihood from *any* repetition rather than from the first
repetition.
* The output `eflag` is set to `TRUE` if *any* repetition had to be restarted,
* The output `eflag` is set to `TRUE` if *any* repetition has to be restarted,
rather than the repetition which achieves maximum log-likelihood.
* An additional argument `n.thread` is provided to specify the number of threads
to use.
Expand All @@ -152,61 +167,76 @@ R scripts which compare poLCAParallel with poLCA are provided in `exec/`.

## Further Development Notes

* There are possible problems with implementing the calculations of the standard
error. They are discussed [here](inst/note_standard_error.tex).
* There is a possible underflow error if the number of categories is too large,
more than ~300. This is because in the calculation of the log-likelihood, the
probabilities from each category are multiplied by each other. If there are
$J$ categories, then there are $J$ probabilities to multiply together. This
is addressed in commit 85ee419 but reverted. Consider summing over log space
instead.
* The standard errors have not been implemented in C++. This is because there
are possible problems with implementing the calculations of the standard
error. They are discussed here [[.tex]](inst/note_standard_error.tex)
[[.md]](inst/note_standard_error.md). In-built GitHub tools may not render
equations correctly.
* When calculating the likelihood, probabilities are iteratively multiplied,
this is much faster than taking the sum of log probabilities. However to avoid
underflow errors, the calculation of the likelihood, instead, uses the sum of
log probabilities when an underflow is detected. See `PosteriorUnnormalize()`
in `src/em_algorithm.*` for the implementation.
* Multiple Newton steps can be taken instead of a single one.

## Code Style

There was an attempt to use the
[Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
All generated documents and codes, eg from

Armadillo objects are used sparingly, preferring the use of `double*` when
handling vectors and matrices.
```bash
R -e "Rcpp::compileAttributes('poLCAParallel')"
```

and

```bash
R -e "devtools::document('poLCAParallel')"
```

All generated documents and codes shall not be included in the `master` branch.
Instead, they shall be in the `package` branch so that this package can be
installed using `devtools::install_github("QMUL/poLCAParallel@package")`, yet,
avoiding duplicated code on the `master` branch.
shall not be included in the `master` branch. Instead, they shall be in the
`package` branch so that this package can be installed using
`devtools::install_github("QMUL/poLCAParallel@package")`. This is to avoid
having duplicate documentation and generated code on the `master` branch.

Semantic versioning is used and tagged. Tags on the `master` branch shall have
`v` prepended and `-master` appended, eg. `v1.1.0-master`. The corresponding
tag on the `package` branch shall only have `v` prepended, eg. `v1.1.0`.

## C++ Source Code Documentation
There was an attempt to use the
[Google C++ style guide](https://google.github.io/styleguide/cppguide.html).

The C++ code is compatible with [Doxygen](https://doxygen.nl/) by running
Armadillo objects are used sparingly, preferring the use of `double*` when
handling vectors and matrices.

```console
doxygen -g
```
## C++ Source Code Documentation

to create a config file `Doxyfile`. Doxygen needs to know the source code is
located in `src/` which can be done by modifying a line in the config file
`Doxyfile`
The C++ code documentation can be created with [Doxygen](https://doxygen.nl/)
by running

```console
INPUT = src/
doxygen
```

Afterwards, running `doxygen` will create documentation for the C++ code. The
configure file can be further configured.
and viewed at `html/index.html`.

## Citation

Please consider citing the corresponding
[QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html):
[QMUL ITS Research Blog](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html)

* Lo, S.E. (2022). Speeding up and Parallelising R packages (using Rcpp and C++
* Lo, S.E. (2022). Speeding up and Parallelising R packages (using Rcpp and C++)
| QMUL ITS Research Blog.
[[link]](https://blog.hpc.qmul.ac.uk/speeding_up_r_packages.html)

and the publication below which this software was originally created for

* Eto F, Samuel M, Henkin R, Mahesh M, Ahmad T, et al. (2023). Ethnic
differences in early onset multimorbidity and associations with health service
use, long-term prescribing, years of life lost, and mortality: A
cross-sectional study using clustering in the UK Clinical Practice Research
Datalink. *PLOS Medicine,* 20(10): e1004300.
<https://doi.org/10.1371/journal.pmed.1004300>

## References

* Dziak, J. J., Lanza, S. T., & Tan, X. (2014). Effect size, statistical power,
Expand Down
61 changes: 37 additions & 24 deletions exec/3_blrt.R
Original file line number Diff line number Diff line change
@@ -1,8 +1,20 @@
# example script on bootstrap likelihood ratio test using poLCA and their sample
# data
# Example script on doing bootstrap likelihood ratio test using poLCAParallel
#
# For $R$ classes, bootstrap likelihood ratio test (BLRT) fits the null model
# using $R-1$ classes and an alt model using $R$ classes, producing a likelihood
# ratio. Using the fitted null and alt models, a parametric bootstrap can be
# done to get an empirical distribution of the likelihood ratio.
#
# To do model selection, select the highest $R$ where the fitted log likelihood
# ratio is on the 95% percentile or higher of the empirical (bootstrap)
# distribution of log likelihood ratios, eg p-value less than 5%. Thus this
# script should be run for a different number of classes, eg using an array job.
#
# The code uses poLCAParallel::blrt() to do BLRT. It records all bootstrap
# samples log likelihood ratios, which are plotted. Figures are saved in the
# current directory

n_thread <- 32
nrep <- 32 # number of different initial values (could be n_thread too)
nrep <- 32 # number of different initial values
n_class_max <- 10 # maximum number of classes to investigate
n_bootstrap <- 100 # number of bootstrap samples
set.seed(1746091653)
Expand All @@ -11,17 +23,15 @@ set.seed(1746091653)
data(carcinoma, package = "poLCAParallel")
data_og <- carcinoma
data_column_names <- colnames(data_og)
f <- cbind(A, B, C, D, E, F, G) ~ 1
formula <- cbind(A, B, C, D, E, F, G) ~ 1

# fit the model onto the data for different number of classes
# save the fitted model into model_array
# you have already have done this before
model_array <- list()
for (nclass in 1:n_class_max) {
model <- poLCAParallel::poLCA(
f, data_og,
nclass = nclass, nrep = nrep, n.thread = n_thread,
verbose = FALSE
formula, data_og,
nclass = nclass, nrep = nrep, verbose = FALSE
)
model_array[[nclass]] <- model
}
Expand All @@ -46,8 +56,7 @@ for (nclass in 2:n_class_max) {

# for each bootstrap sample, store the log likelihood ratio here
bootstrap_results <- poLCAParallel::blrt(
null_model, alt_model,
n_bootstrap, n_thread, nrep
null_model, alt_model, n_bootstrap, nrep
)

# log likelihood ratio to compare the two models
Expand All @@ -59,9 +68,22 @@ for (nclass in 2:n_class_max) {
p_value_array <- c(p_value_array, bootstrap_results[["p_value"]])
}

# plot the p value for each number of class
# looking at the plot, I would select 3 classes as this is the biggest class
# with a small p value.
# plot the bootstrap distribution of the log likelihood ratios for each class
# the red line shows the log likelihood ratio using the real data
pdf("3_blrt_llik.pdf")
boxplot(bootstrap_log_ratio_array,
xlab = "number of classses", ylab = "log likelihood ratio"
)
# also plot the log likelihood ratio when using the real data
lines(1:n_class_max, fitted_log_ratio_array,
type = "b", col = "red", pch = 15
)

# Plot the p value for each number of class.
# The p value is the proportion of bootstrap samples with log likelihood ratios
# greater than using the real data.
# Looking at the plot, I would select 3 or 4 classes as this is the biggest
# class with a small p value.
# Additional notes on how to interpret this graph:
# - a low p value for a number of classes k suggest k number of classes is
# better than k-1, so you would expect to see low p values for low number of
Expand All @@ -70,18 +92,9 @@ for (nclass in 2:n_class_max) {
# uniform distribution, so for a class number too high, it should fluctuate
# randomly between 0 and 1
# the solid line is at 5%
pdf("3_blrt_p_values.pdf")
barplot(p_value_array,
xlab = "number of classes", ylab = "p-value",
names.arg = 1:n_class_max
)
abline(h = 0.05)


# plot the bootstrap distribution of the log likelihood ratios for each class
boxplot(bootstrap_log_ratio_array,
xlab = "number of classses", ylab = "log likelihood ratio"
)
# also plot the log likelihood ratio when using the real data
lines(1:n_class_max, fitted_log_ratio_array,
type = "b", col = "red", pch = 15
)
Loading

0 comments on commit f53daff

Please sign in to comment.