Add simulated data from Gostic, 2020 for benchmarking (#37)

* Add simulated data from Gostic, 2020 for benchmarking This commit re-uses the data processing and documentation from CDCgov/cfa-epinow2-pipeline#17. That repo is open source and public domain by nature of being USG property. I think it's convenient to re-use, but @athowes and @kgostic there's a little text in there that you both suggested, so I'd appreciate if you could give your permission for the re-use here! I've made sure to credit you both in the commit as it's partially your writing. It's probably best practice to make the data prep and processing here fully reproducible in `data-raw/` but it's quite a pain to do, with a mixture of shells cripting and R scripting needed. I skipped it out of convenience, but if you have a really strong feeling that it's required @seabbs, let me know and we can revisit. Closes #9 Co-authored-by: Adam Howes <[email protected]> Co-authored-by: Katie Gostic <[email protected]> * Bump NEWS * Coerce new `obs_incidence` col to integer from double * Generate GI PMF in `data-raw/` * Point to specific CFAEpiNow2Pipeline commit * Remove copied & pasted unrelated text * Re-render roxygen docs * Add primarycensoreddist to remotes * Typo * Rename synthetic dataset and GI dist * Pin primarycensorreddist to R-universe not github --------- Co-authored-by: Adam Howes <[email protected]> Co-authored-by: Katie Gostic <[email protected]>
CDCgov · Sep 12, 2024 · bd6d033 · bd6d033
1 parent f80eefe
commit bd6d033
Show file tree

Hide file tree

Showing 13 changed files with 205 additions and 3 deletions.
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -5,5 +5,6 @@
 ^\.pre-commit-config\.yaml$
 ^_pkgdown\.yml$
 ^codecov\.yml$
+^data-raw$
 ^docs$
 ^pkgdown$
diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml
@@ -37,6 +37,7 @@ jobs:
           r-version: ${{ matrix.config.r }}
           http-user-agent: ${{ matrix.config.http-user-agent }}
           use-public-rspm: true
+          extra-repositories: 'https://epinowcast.r-universe.dev'
 
       - uses: r-lib/actions/setup-r-dependencies@v2
         with:

diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml
@@ -29,6 +29,7 @@ jobs:
       - uses: r-lib/actions/setup-r@v2
         with:
           use-public-rspm: true
+          extra-repositories: 'https://epinowcast.r-universe.dev'
 
       - uses: r-lib/actions/setup-r-dependencies@v2
         with:

diff --git a/.github/workflows/test-coverage.yaml b/.github/workflows/test-coverage.yaml
@@ -20,6 +20,7 @@ jobs:
       - uses: r-lib/actions/setup-r@v2
         with:
           use-public-rspm: true
+          extra-repositories: 'https://epinowcast.r-universe.dev'
 
       - uses: r-lib/actions/setup-r-dependencies@v2
         with:

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -8,7 +8,6 @@ repos:
     -   id: style-files
         args: [--style_pkg=styler, --style_fun=tidyverse_style,
                --cache-root=styler-perm]
-    -   id: roxygenize
     -   id: use-tidy-description
     -   id: lintr
     -   id: readme-rmd-rendered

diff --git a/DESCRIPTION b/DESCRIPTION
@@ -25,13 +25,20 @@ BugReports: https://github.com/cdcgov/cfa-gam-rt/issues
 Suggests:
     testthat (>= 3.0.0),
     pkgdown,
-    withr
+    withr,
+    usethis,
+    primarycensoreddist
 Config/testthat/edition: 3
 Encoding: UTF-8
 Roxygen: list(markdown = TRUE)
-RoxygenNote: 7.3.1
+RoxygenNote: 7.3.2
 Imports:
     cli,
     glue,
     mgcv,
     rlang
+Depends:
+    R (>= 2.10)
+LazyData: true
+Additional_repositories:
+    https://epinowcast.r-universe.dev
diff --git a/NEWS.md b/NEWS.md
@@ -1,5 +1,6 @@
 # RtGam (development version)
 
+* Add data from Gostic, 2020 for testing and benchmarking (#37)
 * Fix math rendering in pkgdown (#36)
 * CI status in README badges (#27)
 * Update pre-commit hooks from template repo to newest version (#25)

diff --git a/R/data.R b/R/data.R
@@ -0,0 +1,76 @@
+#' Synthetic dataset of stochastic SIR system with known Rt
+#'
+#' A dataset from Gostic, Katelyn M., et al. "Practical considerations for
+#' measuring the effective reproductive number, Rt." PLoS Computational Biology
+#' 16.12 (2020): e1008409. The data are simulated from a stochastic SEIR
+#' compartmental model.
+#'
+#' This synthetic dataset has a number of desirable properties:
+#'
+#' 1. The force of infection changes depending on the Rt, allowing for sudden
+#' changes in the Rt. This allows for modeling of sudden changes in infection
+#' dynamics, which might otherwise be difficult to capture. Rt estimation
+#' framework
+#'
+#' 2. The realized Rt is known at each timepoint
+#'
+#' 3. The dataset incorporates a simple generation interval and a reporting
+#' delay.
+#'
+#' Gostic et al. benchmark the performance of a number of Rt estimation
+#' frameworks, providing practical guidance on how to use this dataset to
+#' evaluate Rt estimates.
+#'
+#' In practice, we've found that the amount of observation noise in the
+#' incidence and/or observed cases is often undesirably low for testing. Many
+#' empirical datasets are much noisier. As a result, models built with these
+#' settings in mind can perform poorly on this dataset or fail to converge. We
+#' manually add observation noise with `rnbinom(299, mu =
+#' stochastic_sir_rt[["obs_cases"]], size = 10)` and the random seed 123456 and
+#' store it in the `obs_incidence` column.
+#'
+#' @name stochastic_sir_rt
+#' @format `stochastic_sir_rt` A data frame with 301 rows and 12 columns:
+#' \describe{
+#'    \item{time}{Timestep of the discrete-time stochastic SEIR simulation}
+#'    \item{date}{Added from the original Gostic, 2020 dataset. A date
+#'    corresponding to the assigned `time`. Arbitrarily starts on January 1st,
+#'    2023.}
+#'    \item{S, E, I, R}{The realized state of the stochastic SEIR system}
+#'    \item{dS, dEI, DIR}{The stochastic transition between compartments}
+#'    \item{incidence}{The true incidence in the `I` compartment at time t}
+#'    \item{obs_cases}{The observed number of cases at time t from
+#'    forward-convolved incidence.}
+#'    \item{obs_incidence}{Added from the original Gostic, 2020 dataset. The
+#'     `incidence` column with added negative-binomial observation noise.
+#'     Created with `set.seed(123456)` and the call
+#'      `rnbinom(299, mu = [["incidence"]], size = 10)` Useful for
+#'       testing.}
+#'    \item{true_r0}{The initial R0 of the system (i.e., 2)}
+#'    \item{true_rt}{The known, true Rt of the epidemic system}
+#' }
+#' @source
+#' <https://github.com/cobeylab/Rt_estimation/tree/d9d8977ba8492ac1a3b8287d2f470b313bfb9f1d> # nolint
+#' <https://github.com/CDCgov/cfa-epinow2-pipeline/pull/17> # nolint
+"stochastic_sir_rt"
+
+#' Generation interval corresponding to the sample `stochastic_sir_rt` dataset
+#'
+#' Gostic et al., 2020 simulates data from a stochastic SEIR model. Residence
+#' time in both the E and the I compartments is exponentially distributed, with
+#' a mean of 4 days (or a rate/inverse-scale of 1/4). These residence times
+#' imply a gamma-distributed generation time distribution with a shape of 2 and
+#' a rate of 1/4. The distribution can be regenerated in
+#' `data-raw/sir_gt_pmf.R`.
+#'
+#' From this parametric specification, we produce a double-censored,
+#' left-truncated probability mass function of the generation interval
+#' distribution. We produce the PMF using
+#' [primarycensoreddist::dpcens()] with version 0.4.0. See
+#' https://doi.org/10.1101/2024.01.12.24301247 for more information on
+#' double-censoring biases and corrections.
+#'
+#' @name sir_gt_pmf
+#' @format `sir_gt_pmf` A numeric vector of length 26 that sums to one within
+#'   numerical tolerance
+"sir_gt_pmf"
diff --git a/data-raw/sir_gt_pmf.R b/data-raw/sir_gt_pmf.R
@@ -0,0 +1,21 @@
+# E and I compartments both with exponentially distributed residence times
+# with a mean of 4 days.
+shape <- 2
+rate <- 1 / 4
+
+sir_gt_pmf <- primarycensoreddist::dpcens(0:26,
+  pgamma,
+  shape = shape,
+  rate = rate,
+  D = 27
+) # v0.4.0
+
+# Drop first element because GI can't have same-day transmission
+sir_gt_pmf <- sir_gt_pmf[2:27]
+
+# Renormalize to a proper PMF
+while (abs(sum(sir_gt_pmf) - 1) > 1e-10) {
+  sir_gt_pmf <- sir_gt_pmf / sum(sir_gt_pmf)
+}
+
+usethis::use_data(sir_gt_pmf, overwrite = TRUE)
diff --git a/data/sir_gt_pmf.rda b/data/sir_gt_pmf.rda
diff --git a/data/stochastic_sir_rt.rda b/data/stochastic_sir_rt.rda
diff --git a/man/sir_gt_pmf.Rd b/man/sir_gt_pmf.Rd
diff --git a/man/stochastic_sir_rt.Rd b/man/stochastic_sir_rt.Rd