-
Notifications
You must be signed in to change notification settings - Fork 5
/
README.Rmd
340 lines (232 loc) · 10.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r readmesetup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Welcome to the *epitrix* package!
<!-- badges: start -->
[![R-CMD-check](https://github.com/reconhub/epitrix/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/reconhub/epitrix/actions/workflows/R-CMD-check.yaml)
[![codecov.io](https://codecov.io/github/reconhub/epitrix/coverage.svg?branch=master)](https://codecov.io/github/reconhub/epitrix?branch=master)
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/epitrix)](https://cran.r-project.org/package=epitrix)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/epitrix)](https://cran.r-project.org/package=epitrix)
[![Downloads from Rstudio mirror](https://cranlogs.r-pkg.org/badges/grand-total/epitrix)](http://www.r-pkg.org/pkg/epitrix)
[![Codecov test coverage](https://codecov.io/gh/reconhub/epitrix/branch/master/graph/badge.svg)](https://app.codecov.io/gh/reconhub/epitrix?branch=master)
<!-- badges: end -->
This package implements small helper functions usefull in infectious disease
modelling and epidemics analysis.
## Installing the package
To install the current stable, CRAN version of the package, type:
```{r install, eval = FALSE}
install.packages("epitrix")
```
To benefit from the latest features and bug fixes, install the development,
*github* version of the package using:
```{r install2, eval = FALSE}
devtools::install_github("reconhub/epitrix")
```
Note that this requires the package *devtools* installed.
# What does it do?
The main features of the package include:
- **`gamma_shapescale2mucv`**: convert shape and scale of a Gamma distribution
to mean and CV
- **`gamma_mucv2shapescale`**: convert mean and CV of a Gamma distribution to
shape and scale
- **`gamma_log_likelihood`**: Gamma log-likelihood using mean and CV
- **`r2R0`**: convert growth rate into a reproduction number
- **`lm2R0_sample`**: generates a distribution of R0 from a log-incidence linear
model
- **`fit_disc_gamma`**: fits a discretised Gamma distribution to data (typically
useful for describing delays)
- **`clean_labels`**: generate portable labels by removing non-standard characters or
replacing them with their closest alphanumeric matches, standardising
separators, etc.
- **`hash_names`**: generate unique, anonymised, reproducible labels from
various data fields (e.g. First name, Last name, Date of birth).
- **`emperical_incubation_dist()`** will estimate the empirical incubation
distribution if given a data frame with dates of onset and a range of
exposure dates.
- **`fit_gamma_incubation_dist()`** wraps `empirical_incubation_dist()` and
`fit_disc_gamma()` to fit a discretized gamma distribution to an empirical
incubation distribution
- **`AR2R0()`** calculates the R0 corresponding to a give attack rate
- **`R02AR()`** calculates the attack rate corresponding to a give R0
- **`R02herd_immunity_threshold()`** calculates the herd immunity threshold for
a given R0
- **`sim_linelist()`** simulates a simple linelist (with no epi model implied)
`data.frame` which can be used for illustrating other functions
# Resources
## Worked examples
### Fitting a gamma distribution to delay data
In this example, we simulate data which replicate the serial interval (SI),
i.e. the delays between primary and secondary symptom onsets, in Ebola Virus
Disease (EVD). We start by converting previously estimates of the mean and
standard deviation of the SI (WHO Ebola Response Team (2014) NEJM 371:1481–1495)
to the parameters of a Gamma distribution:
```{r generate_data}
library(epitrix)
mu <- 15.3 # mean in days days
sigma <- 9.3 # standard deviation in days
cv <- sigma / mu # coefficient of variation
cv
param <- gamma_mucv2shapescale(mu, cv) # convertion to Gamma parameters
param
```
The *shape* and *scale* are parameters of a Gamma distribution we can use to
generate delays. However, delays are typically reported per days, which implies
a discretisation (from continuous time to discrete numbers). We use the package
[*distcrete*](https://github.com/reconhub/distcrete) to achieve this
discretisation. It generates a list of functions, including one to simulate data
(`$r`), which we use to simulate 500 delays:
```{r si}
si <- distcrete::distcrete("gamma", interval = 1,
shape = param$shape,
scale = param$scale, w = 0)
si
set.seed(1)
x <- si$r(500)
head(x, 10)
hist(x, col = "grey", border = "white",
xlab = "Days between primary and secondary onset",
main = "Simulated serial intervals")
```
`x` contains simulated data, for illustrative purpose. In practice, one would
use real data from an ongoing outbreaks. Now we use `fit_disc_gamma` to estimate
the parameters of a dicretised Gamma distribution from the data:
```{r fit}
si_fit <- fit_disc_gamma(x)
si_fit
```
### Converting a growth rate (r) to a reproduction number (R0)
The package [*incidence*](https://github.com/reconhub/incidence) can fit a
log-linear model to incidence curves (function `fit`), which produces a growth
rate (r). This growth rate can in turn be translated into a basic reproduction
number (R0) using `r2R0`. We illustrate this using simulated Ebola data from the
[*outbreaks*](https://github.com/reconverse/outbreaks) package, and using the
serial interval from the previous example:
```{r fit_i}
library(outbreaks)
library(incidence)
i <- incidence(ebola_sim$linelist$date_of_onset)
i
f <- fit(i[1:150]) # fit on first 150 days
plot(i[1:200], fit = f, color = "#9fc2fc")
r2R0(f$info$r, si$d(1:100))
r2R0(f$info$r.conf, si$d(1:100))
```
In addition, we can also use the function `lm2R0_sample` to generate samples of
R0 values compatible with a model fit:
```{r sample_R0}
R0_val <- lm2R0_sample(f$model, si$d(1:100), n = 100)
head(R0_val)
hist(R0_val, col = "grey", border = "white")
```
### Standardising labels
If you want to use labels that will work across different computers, independent
of local encoding and operating systems, `clean_labels` will make your life
easier. The function transforms character strings by replacing diacritic symbols
with their closest alphanumeric matches, setting all characters to lower case,
and replacing various separators with a single, consistent one.
For instance:
```{r clean_labels}
x <- " Thîs- is A wêïrD LäBeL .."
x
clean_labels(x)
variables <- c("Date.of.ONSET ",
"/ date of hôspitalisation /",
"-DäTÈ--OF___DîSCHARGE-",
"GEndèr/",
" Location. ")
variables
clean_labels(variables)
```
### Anonymising data
`hash_names` can be used to generate hashed labels from linelist data. Based on
pre-defined fields, it will generate anonymous labels. This system has the
following desirable features:
- given the same input, the output will always be the same, so this encoding
system generates labels which can be used by different people and
organisations
- given different inputs, the output will always be different; even minor
differences in input will result in entirely different outputs
- given an output, it is very hard to infer the input (it requires hacking
skills); if security is challenged, the hashing algorithm can be 'salted' to
strengthen security
```{r}
first_name <- c("Jane", "Joe", "Raoul", "Raoul")
last_name <- c("Doe", "Smith", "Dupont", "Dupond")
age <- c(25, 69, 36, 36)
## detailed output by default
hash_names(first_name, last_name, age)
## short labels for practical use
hash_names(first_name, last_name, age,
size = 8, full = FALSE)
```
### Estimate incubation periods
The function `empirical_incubation_dist()` computes the discrete probability
distribution by giving equal weight to each patient. Thus, in the case of `N`
patients, the `n` possible exposure dates of a given patient get the overall
weight `1/(n*N)`. The function returns a data frame with column
`incubation_period` containing the different incubation periods with a time step
of one day and their `relative_frequency`.
Load environment:
```{r, echo = TRUE, message = FALSE}
library(magrittr)
library(tibble)
library(epitrix)
library(distcrete)
library(ggplot2)
```
Make a linelist object containing toy data with several possible exposure dates for each case:
```{r, echo = TRUE}
ll <- sim_linelist(30) %>%
tibble()
x <- 0:15
y <- distcrete("gamma", 1, shape = 12, rate = 3, w = 0)$d(x)
mkexposures <- function(i) i - sample(x, size = sample.int(5, size = 1), replace = FALSE, prob = y)
exposures <- sapply(ll$date_of_onset, mkexposures)
ll$dates_exposure <- exposures
print(ll)
```
Empirical distribution:
```{r}
incubation_period_dist <- empirical_incubation_dist(ll, date_of_onset, dates_exposure)
print(incubation_period_dist)
ggplot(incubation_period_dist, aes(incubation_period, relative_frequency)) +
geom_col()
```
Fit discrete gamma:
```{r}
fit <- fit_gamma_incubation_dist(ll, date_of_onset, dates_exposure)
print(fit)
x = c(0:10)
y = fit$distribution$d(x)
ggplot(data.frame(x = x, y = y), aes(x, y)) +
geom_col(data = incubation_period_dist, aes(incubation_period, relative_frequency)) +
geom_point(stat="identity", col = "red", size = 3) +
geom_line(stat="identity", col = "red")
```
**Note** that if the possible exposure dates are consecutive for all patients then `empirical_incubation_dist()` and `fit_gamma_incubation_dist()` can take date ranges as inputs instead of lists of individual exposure dates (see help for details).
## Vignettes
The [overview vignette](http://www.repidemicsconsortium.org/epitrix/articles/epitrix.html)
essentially replicates the content of this `README`. To request or contribute
other vignettes, see the section "*getting help, contributing*".
The [estimate incubation vignette](http://www.repidemicsconsortium.org/epitrix/articles/estimate_incubation.html) contains worked examples for the `emperical_incubation_dist()`
`fit_gamma_incubation_dist()`.
## Websites
Click [here](http://www.repidemicsconsortium.org/epitrix/) for the website
dedicated to *epitrix*.
## Getting help, contributing
Bug reports and feature requests should be posted on *github* using the
[*issue*](http://github.com/reconhub/epitrix/issues) system. All other questions
should be posted on the [**RECON forum**](http://www.repidemicsconsortium.org/forum/).
Contributions are welcome via **pull requests**.
Please note that this project is released with a
[Contributor Code of Conduct](CONDUCT.md).
By participating in this project you agree to abide by its terms.