-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
180 lines (128 loc) · 6.43 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/"
)
options(digits = 4, width = 160)
```
# hmatch: Tools for cleaning and matching hierarchically-structured data
<!-- badges: start -->
[![Lifecycle: maturing](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![R-CMD-check](https://github.com/epicentre-msf/hmatch/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/epicentre-msf/hmatch/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/epicentre-msf/hmatch/branch/master/graph/badge.svg)](https://codecov.io/gh/epicentre-msf/hmatch?branch=master)
<!-- badges: end -->
An R package for cleaning and matching messy hierarchically-structured data
(e.g. country / region / district / municipality). The general goal is to match
sets of hierarchical values in a raw dataset to corresponding values within a
reference dataset, while accounting for potential discrepancies such as:
- variation in character case, punctuation, spacing, use of accents, or spelling
- variation in hierarchical resolution (e.g. some entries specified to municipality-level but others only to region)
- missing values at one or more hierarchical levels
- values entered at the wrong hierarchical level
### Installation
Install from GitHub with:
```{r, eval=FALSE}
# install.packages("remotes")
remotes::install_github("epicentre-msf/hmatch")
```
### Matching strategies
##### Low-level
- **`hmatch`**: match hierarchical sequences up to the highest-resolution level specified within a given row of raw data, optionally allowing for missing values below the match level, and fuzzy matches (using the [stringdist](https://github.com/markvanderloo/stringdist) package)
##### Higher-level
- **`hmatch_tokens`**: match tokens rather than entire strings to allow for variation in multi-term names
- **`hmatch_permute`**: sequentially permute hierarchical columns to allow for values entered at the wrong level
- **`hmatch_parents`**: match values at a given hierarchical level based on shared sets of 'offspring'
- **`hmatch_settle`**: try matching at every level and settle for the highest-resolution match possible
- **`hmatch_manual`**: match using a user-supplied dictionary
- **`hmatch_split`**: implement any other `hmatch_` function separately at each hierarchical level, only on unique sequences
- **`hmatch_composite`**: implement a variety of matching strategies in sequence, from most to least strict
### String standardization
Independent of optional fuzzy matching with
[stringdist](https://github.com/markvanderloo/stringdist), `hmatch` functions
use behind-the-scenes string standardization to help account for variation in
character case, punctuation, spacing, or use of accents between the raw and
reference data. E.g.
```
raw_value reference_value match
----------------------------------------------------
original: ILE DE FRANCE Île-de-France FALSE
standardized: ile_de_france ile_de_france TRUE
```
Users can choose default standardization (illustrated above), no
standardization, or supply their own preferred function to standardize strings
(e.g. `tolower`).
### Usage
#### Example dataset
The `hmatch` package contains example datasets `ne_raw` (messy geographical
data) and `ne_ref` (reference data derived from a shapefile), based on a small
subset of northeastern North America.
```{r}
library(hmatch)
head(ne_raw) # raw messy data
head(ne_ref) # reference data derived from shapefile
```
#### Example workflow
##### Basic hierarchical matching with `hmatch()`
We'll start with a simple call to `hmatch` to see which rows can be matched with
no extra magic.
```{r}
hmatch(ne_raw, ne_ref, pattern = "^adm")
```
There are still quite a few unmatched rows, and entry 'PID14' actually matches
two different rows within `ref`, so we'll press on. We can separate the matched
and unmatched rows using inner- and anti-joins respectively, specifically using
the "resolve_" join type here to only consider matches that are unique.
```{r}
(raw_match1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner"))
(raw_remain1 <- hmatch(ne_raw, ne_ref, pattern = "^adm", type = "resolve_anti"))
```
##### Fuzzy matching
Next we'll add in fuzzy-matching, using the default maximum string-distance of 1.
```{r}
hmatch(raw_remain1, ne_ref, pattern = "^adm", fuzzy = TRUE, type = "inner")
```
Only one additional *unique* match, so we'll again split and move on. Note that
we've been using the `pattern` argument above to specify the hierarchical
columns in `raw` and `ref`, but because the hierarchical columns have the same
names in `raw` and `ref` (and are the only matching column names), we can drop
the `pattern` argument for brevity.
```{r}
(raw_match2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_inner"))
(raw_remain2 <- hmatch(raw_remain1, ne_ref, fuzzy = TRUE, type = "resolve_anti"))
```
##### Tokenized matching
Next let's try `hmatch_tokens`, which matches based on components of strings
(i.e. tokens) rather than entire strings.
```{r}
(raw_match3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_inner"))
(raw_remain3 <- hmatch_tokens(raw_remain2, ne_ref, type = "resolve_anti"))
```
##### Permutation matching
If there are any values entered at the wrong hierarchical level, we can try
systematically permuting the hierarchical columns before matching.
```{r}
(raw_match4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_inner"))
(raw_remain4 <- hmatch_permute(raw_remain3, ne_ref, type = "resolve_anti"))
```
##### The toughest cases
For the remaining rows that we haven't yet matched, there a few options. We
could use `hmatch_settle()` to settle for matches below the highest-resolution
level specified within a given row of `raw`. We could also do some 'manual'
comparison of the raw and reference datasets and create a dictionary to recode
values within `raw` to match corresponding entries in `ref`. Here we'll do both.
```{r}
ne_dict <- data.frame(
value = "NJ",
replacement = "New Jersey",
variable = "adm1"
)
(raw_match5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
fuzzy = TRUE, type = "resolve_inner"))
(raw_remain5 <- hmatch_settle(raw_remain4, ne_ref, dict = ne_dict,
fuzzy = TRUE, type = "resolve_anti"))
```