---
output: github_document
bibliography: inst/REFERENCES.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# ppm
[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)
[![Travis build status](https://travis-ci.org/pmcharrison/ppm.svg?branch=master)](https://travis-ci.org/pmcharrison/ppm)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/pmcharrison/ppm?branch=master&svg=true)](https://ci.appveyor.com/project/pmcharrison/ppm)
[![Coverage status](https://coveralls.io/repos/github/pmcharrison/ppm/badge.svg)](https://coveralls.io/r/pmcharrison/ppm?branch=master)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2620414.svg)](https://doi.org/10.5281/zenodo.2620414)
Prediction by Partial Matching (PPM) is a flexible and robust algorithm
for generating incremental probabilistic predictions for symbolic sequences
[@Cleary1984].
Many variants of PPM exist in the literature.
This package, `ppm`, implements the PPM variant developed by @Bunton1996
and subsequently used by @Pearce2005 in the Information Dynamics Of Music
(IDyOM) model.
It also implements the PPM-Decay algorithm of @Harrison2020,
which applies a flexible memory-decay kernel to PPM such that
historic events are downweighted compared to recent events.
Please cite the `ppm` package by referring to this latter paper:
> Harrison, Peter M. C., Roberta Bianco, Maria Chait, and Marcus T.
> Pearce. 2020. “PPM-Decay: A Computational Model of Auditory Prediction
> with Memory Decay.” *bioRxiv*.
> <https://doi.org/10.1101/2020.01.09.900266>.
You might also want to mention the particular version of the package you used.
To check the installed version, you can run the following code in your R console:
``` r
asNamespace("ppm")$`.__NAMESPACE__.`$spec[["version"]]
```
In addition to the GitHub repository,
the source code for the `ppm` package is permanently archived on Zenodo,
and is available at the following DOI: https://doi.org/10.5281/zenodo.2620414.
This DOI will always point to the latest version of the `ppm` package,
but you can also find version-specific DOIs on the Zenodo page.
## Installation
To install from GitHub:
``` r
if (!require("devtools")) install.packages("devtools")
devtools::install_github("pmcharrison/ppm")
```
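If you want a specific release rather than the current development version,
`install_github` also accepts a tag or commit reference.
The tag below is a placeholder; check the repository's releases page
(or the Zenodo archive) for real version tags.

``` r
# Placeholder tag -- substitute a real tag from the releases page:
devtools::install_github("pmcharrison/ppm@v1.0.0")
```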
## PPM
The standard PPM model is a variable-order Markov model that
blends together the predictions of multiple *n*-gram models.
An *n*-gram model generates predictions by tabulating the
occurrences of subsequences of *n* tokens -- *n*-grams -- in the training dataset.
These *n*-gram models are combined using a technique called *smoothing*.
Many versions of PPM exist corresponding to different smoothing and order selection
techniques. The implementation in this package uses *interpolated smoothing*
as introduced by @Bunton1996, along with a selection of different *escape methods*
which determine the weighting parameters for the *n*-gram models.
*Update exclusion* is also supported.
The implementation also supports the order selection technique from PPM*,
which was designed to help PPM extend to arbitrarily long *n*-gram orders.
Note, however, that the present implementation is not optimized for large order bounds:
with an infinite order bound, model training has computational complexity quadratic
in the length of the input, whereas alternative implementations can achieve
linear complexity.
You can create a PPM model object using the `new_ppm_simple` function:
```{r}
library(ppm)
mod <- new_ppm_simple(alphabet_size = 5)
```
See `?new_ppm_simple` for information on configuring the model.
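As a sketch of how the smoothing options described above might be configured
(the argument names `order_bound`, `escape`, and `update_exclusion` are
assumptions here; verify them against `?new_ppm_simple` before use):

``` r
# A sketch only; argument names are assumptions -- see ?new_ppm_simple.
mod_configured <- new_ppm_simple(
  alphabet_size = 5,
  order_bound = 4,          # cap the n-gram order at 4
  escape = "c",             # use escape method C
  update_exclusion = FALSE  # disable update exclusion
)
```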
Note how we provided an `alphabet_size` parameter to the model.
This compulsory parameter refers to the size of the alphabet over which
the model will be trained and will generate predictions.
For example, if we were modelling the letters A-Z, we might set
this parameter to 26, corresponding to the 26 letters in the English alphabet.
Input sequences are coded as integers within this alphabet.
For example, we might define a sequence as follows:
```{r}
seq_1 <- c(1, 3, 3, 1, 2, 1, 3)
```
The `factor` function from base R can be useful for making this encoding
from a series of textual tokens.
Factor objects are represented as integer vectors under the hood,
but they come with an associated mapping between the integers and
their labels, which makes them easier to manipulate within R.
For example:
```{r}
print(letters)
seq_2 <- factor(c("a", "b", "r", "a", "c", "a", "d", "a", "b", "r", "a"),
                levels = c("a", "b", "c", "d", "r"))
print(seq_2)
print(as.integer(seq_2))
```
The *ppm* package treats factor objects like their underlying integer representations.
When modelling factor objects, it's recommended to pass the factor's levels
to the PPM model upon initialisation, using the `alphabet_levels` argument.
In this case, the `alphabet_size` parameter need not be provided.
```{r}
mod <- new_ppm_simple(alphabet_levels = c("a", "b", "c", "d", "r"))
```
You feed sequences to the PPM model using the `model_seq` function.
By default, the model processes these sequences incrementally,
one symbol at a time.
It simultaneously learns from these incoming symbols,
and generates predictions for future symbols based
on what came before.
For example:
```{r}
res <- model_seq(mod, seq_2)
print(res)
```
The output of `model_seq` is a tibble with several columns.
Each row corresponds to a different element of the input sequence,
organised in order of presentation.
The row describes what happened when the model tried to predict
this element, conditioned on the preceding elements in the sequence.
Each row has the following fields:
- `symbol` - The integer encoding of the corresponding symbol in the sequence.
- `model_order` - The highest-order *n*-gram model used for generating predictions.
- `information_content` - The negative log probability, or *information content*,
of the observed symbol according to the model.
- `entropy` - The entropy of the model's predictive distribution when predicting
that element of the sequence.
- `distribution` - The predictive probability distribution when predicting
that element of the sequence.
Often we are particularly interested in the `information_content` field,
which gives us an index of the model's surprisal at different
parts of the input sequence.
```{r, dpi = 300, out.width = "50%", fig.align = 'center', fig.width = 4, fig.height = 4}
plot(res$information_content,
     xlab = "Position",
     ylab = "Information content (bits)",
     type = "l",
     ylim = c(0, 5))
```
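If you want to verify these quantities by hand, the following sketch
recomputes them from the `distribution` column. It assumes that
`distribution` is a list-column of probability vectors over the alphabet,
and that the package works in bits (base-2 logarithms, as suggested by the
axis label above):

``` r
# A sketch, under the assumptions stated above:
i <- 3                                # position in the sequence
p <- res$distribution[[i]]            # predictive distribution at position i
-log2(p[as.integer(res$symbol[i])])   # should match res$information_content[i]
-sum(p * log2(p))                     # should match res$entropy[i]
```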
The `model_seq` function modifies the input PPM model object in place,
even in the absence of the assignment operator `<-`.
Typically, the PPM model will have been updated with the contents of
the training sequence.
If we present the same sequence again, it should prove to be much more predictable (red line):
```{r, dpi = 300, out.width = "50%", fig.align = 'center', fig.width = 4, fig.height = 4}
res_2 <- model_seq(mod, seq_2)
plot(res$information_content,
     xlab = "Position",
     ylab = "Information content (bits)",
     type = "l",
     ylim = c(0, 5))
points(res_2$information_content,
       type = "l", col = "red")
```
By setting `generate = TRUE`, we can also instruct the model to generate new samples
based on the statistics that it has learned so far. In this case the `seq` argument to
`model_seq` should be an integer corresponding to the desired length of the generated sequence.
```{r}
res_3 <- model_seq(mod, seq = 20, generate = TRUE)
res_3$symbol
```
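The generated sequence uses the same integer encoding as the input.
A sketch for mapping it back to labels (assuming `symbol` indexes into
the alphabet defined earlier):

``` r
# A sketch: recover labels from the integer encoding.
alphabet <- c("a", "b", "c", "d", "r")
paste(alphabet[as.integer(res_3$symbol)], collapse = "")
```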
## PPM-Decay
The original PPM algorithm has a perfect memory,
and weights all historic observations equally when generating predictions.
The PPM-Decay algorithm modifies this behaviour,
introducing a customisable memory decay kernel that determines the
weight of historic observations as a function of the
time and number of events observed since the original event.
For example, a decay kernel for modelling auditory prediction might
resemble the following:
```{r, out.width = "50%", fig.align = 'center', echo = FALSE}
knitr::include_graphics("man/figures/example-decay-kernel.png")
```
In its most general form (illustrated above),
the decay kernel comprises three phases:
- A buffer phase (yellow);
- A short-term memory phase (red);
- A long-term memory phase (blue).
The parameters for these different phases, in particular durations
and relative weights, are customisable.
Each phase can be disabled separately to produce
simpler families of decay kernels.
For example, the default parameters define a one-stage exponential decay kernel;
adding a buffer phase and retrieval noise produces the two-stage decay kernel
in @Harrison2020.
The `new_ppm_decay` function is used to create a new PPM-Decay model.
It works in a similar way to `new_ppm_simple`, described above for PPM models.
The function has many customisable parameters (see `?new_ppm_decay`)
that support the specification of various kinds of decay kernels,
including single-stage exponential decays,
two-stage exponential decays,
exponential decays with non-zero asymptotes,
and kernels combining a flat buffer period with subsequent exponential decays.
```{r}
mod_decay <- new_ppm_decay(alphabet_size = 5, ltm_half_life = 2)
```
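More elaborate kernels are specified the same way. The following sketch
configures a two-stage kernel with a buffer and retrieval noise; the argument
names beyond `ltm_half_life` are assumptions on this sketch's part, so check
them against `?new_ppm_decay` before use:

``` r
# A sketch only; argument names other than ltm_half_life are assumptions.
mod_decay_2 <- new_ppm_decay(
  alphabet_size = 5,
  buffer_length_time = 2,    # buffer phase duration (time units)
  buffer_length_items = 10,  # buffer phase capacity (items)
  stm_duration = 15,         # short-term memory phase duration
  ltm_half_life = 500,       # long-term memory half-life
  noise = 0.1                # retrieval noise
)
```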
When modelling a sequence with a PPM-Decay model,
you need to specify both the sequence itself and
a numeric vector corresponding to the timepoints of the
symbol observations.
```{r, dpi = 300, out.width = "50%", fig.align = 'center', fig.width = 4, fig.height = 4}
seq_2_time <- seq_along(seq_2)
print(seq_2_time)
res_4 <- model_seq(mod_decay,
                   seq_2,
                   time = seq_2_time)
plot(res$information_content,
     xlab = "Position",
     ylab = "Information content (bits)",
     type = "l",
     ylim = c(0, 5))
points(res_4$information_content,
       type = "l", col = "blue")
```
Here the original PPM model's output is plotted in black,
and the PPM-Decay model's output in blue.
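Because the decay kernel operates in continuous time, event spacing matters.
In the following sketch (with arbitrarily chosen timings), the same sequence
is presented twice with a long pause in between; with a half-life of 2 time
units, the first presentation should have decayed almost completely by the
time the repeat begins, so the repeat should remain roughly as surprising as
the first pass:

``` r
# A sketch: repeat the sequence after a long gap and compare surprisal.
mod_decay_b <- new_ppm_decay(alphabet_size = 5, ltm_half_life = 2)
t1 <- seq_along(seq_2)
res_a <- model_seq(mod_decay_b, seq_2, time = t1)
res_b <- model_seq(mod_decay_b, seq_2, time = t1 + 100)  # long gap
mean(res_a$information_content)
mean(res_b$information_content)  # little benefit from the first pass
```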
## References