The goal of {matchmaker} is to provide dictionary-based cleaning for R users in a simple and intuitive manner built on the {forcats} package. Some of the features of this package include:
- preservation of factor orders
- ability to specify explicit and implicit missing values
- option to replace by fuzzy matching (regular expressions, anchored by default)
- optional variable selection by fuzzy matching
You can install {matchmaker} from CRAN:
install.packages("matchmaker")
The matchmaker package has two user-facing functions that perform dictionary-based cleaning:
match_vec()
will translate the values in a single vectormatch_df()
will translate values in all specified columns of a data frame
Each of these functions have four manditory options:
x
: your data. This will be a vector or data frame depending on the function.dictionary
: This is a data frame with at least two columns specifying keys and values to modifyfrom
: a character or number specifying which column contains the keysto
: a character or number specifying which column contains the values
Mostly, users will be working with match_df()
to transform values
across specific columns. A typical workflow would be to:
- construct your dictionary in a spreadsheet program based on your data
- read in your data and dictionary to data frames in R
- match!
library("matchmaker")
# Read in data set
dat <- read.csv(matchmaker_example("coded-data.csv"),
stringsAsFactors = FALSE
)
dat$date <- as.Date(dat$date)
# Read in dictionary
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
stringsAsFactors = FALSE
)
This is the top of our data set, generated for example purposes
id | date | readmission | treated | facility | age_group | lab_result_01 | lab_result_02 | lab_result_03 | has_symptoms | followup |
---|---|---|---|---|---|---|---|---|---|---|
ef267c | 2019-07-08 | NA | 0 | C | 10 | unk | high | inc | NA | u |
e80a37 | 2019-07-07 | y | 0 | 3 | 10 | inc | unk | norm | y | oui |
b72883 | 2019-07-07 | y | 1 | 8 | 30 | inc | norm | inc | oui | |
c9ee86 | 2019-07-09 | n | 1 | 4 | 40 | inc | inc | unk | y | oui |
40bc7a | 2019-07-12 | n | 1 | 6 | 0 | norm | unk | norm | NA | n |
46566e | 2019-07-14 | y | NA | B | 50 | unk | unk | inc | NA | NA |
The dictionary looks like this:
options | values | grp | orders |
---|---|---|---|
y | Yes | readmission | 1 |
n | No | readmission | 2 |
u | Unknown | readmission | 3 |
.missing | Missing | readmission | 4 |
0 | Yes | treated | 1 |
1 | No | treated | 2 |
.missing | Missing | treated | 3 |
1 | Facility 1 | facility | 1 |
2 | Facility 2 | facility | 2 |
3 | Facility 3 | facility | 3 |
4 | Facility 4 | facility | 4 |
5 | Facility 5 | facility | 5 |
6 | Facility 6 | facility | 6 |
7 | Facility 7 | facility | 7 |
8 | Facility 8 | facility | 8 |
9 | Facility 9 | facility | 9 |
10 | Facility 10 | facility | 10 |
.default | Unknown | facility | 11 |
0 | 0-9 | age_group | 1 |
10 | 10-19 | age_group | 2 |
20 | 20-29 | age_group | 3 |
30 | 30-39 | age_group | 4 |
40 | 40-49 | age_group | 5 |
50 | 50+ | age_group | 6 |
high | High | .regex ^lab_result_ | 1 |
norm | Normal | .regex ^lab_result_ | 2 |
inc | Inconclusive | .regex ^lab_result_ | 3 |
y | yes | .global | Inf |
n | no | .global | Inf |
u | unknown | .global | Inf |
unk | unknown | .global | Inf |
oui | yes | .global | Inf |
.missing | missing | .global | Inf |
# Clean spelling based on dictionary -----------------------------
cleaned <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp"
)
head(cleaned)
#> id date readmission treated facility age_group
#> 1 ef267c 2019-07-08 Missing Yes Unknown 10-19
#> 2 e80a37 2019-07-07 Yes Yes Facility 3 10-19
#> 3 b72883 2019-07-07 Yes No Facility 8 30-39
#> 4 c9ee86 2019-07-09 No No Facility 4 40-49
#> 5 40bc7a 2019-07-12 No No Facility 6 0-9
#> 6 46566e 2019-07-14 Yes Missing Unknown 50+
#> lab_result_01 lab_result_02 lab_result_03 has_symptoms followup
#> 1 unknown High Inconclusive missing unknown
#> 2 Inconclusive unknown Normal yes yes
#> 3 Inconclusive Normal Inconclusive missing yes
#> 4 Inconclusive Inconclusive unknown yes yes
#> 5 Normal unknown Normal missing no
#> 6 unknown unknown Inconclusive missing missing