This repository has been archived by the owner on Jun 27, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 1
/
largescaledesigned.qmd
668 lines (536 loc) · 25.2 KB
/
largescaledesigned.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
---
jupyter: julia-1.9
---
# A large-scale designed experiment {#sec-largescaledesigned}
Load the packages to be used.
```{julia}
#| code-fold: true
#| output: false
using AlgebraOfGraphics
using Arrow
using CairoMakie
using Chain
using DataFrameMacros
using DataFrames
using Effects
using MixedModels
using MixedModelsMakie
using SMLP2023: dataset
using StandardizedPredictors
using StatsBase
CairoMakie.activate!(; type="svg")
import ProgressMeter
ProgressMeter.ijulia_behavior(:clear)
```
The English Lexicon Project [@Balota_2007] was a large-scale multicenter study to examine properties of English words.
It incorporated both a lexical decision task and a word recognition task.
Different groups of subjects participated in the different tasks.
# Extracting data tables from the raw data
The raw data are available as an [OSF project](https://osf.io/n63s2) as Zip files for each of the tasks.
These Zip files contain one data file for each participant, which has a mixture of demographic data, responses on some pre-tests, and the actual trial runs.
Parsing these data files is not fun -- see [this repository](https://github.com/dmbates/EnglishLexicon.jl) for some of the code used to untangle the data.
(This repository is an unregistered Julia package.)
Some lessons from this:
- When an identifier is described as a "unique subject id", it probably isn't.
- In a multi-center trial, the coordinating center should assign the range of id's for each satellite site. Failure of a satellite site to stay within its range should result in banishment to a Siberian work camp.
- As with all data cleaning, the prevailing attitude should be "trust, but verify". Just because you are told that the file is in a certain format, doesn't mean it is. Just because you are told that the identifiers are unique doesn't mean they are, etc.
- It works best if each file has a well-defined, preferably simple, structure. These data files had two different formats mushed together.
- This is the idea of "tidy data" - each file contains only one type of record along with well-defined rules of how you relate one file to another.
- If one of the fields is a date, declare the **only** acceptable form of writing a date, preferably `yyyy-mm-dd`. Anyone getting creative about the format of the dates will be required to write the software to parse that form (and that is usually not an easy task).
- As you make changes in a file, document them. If you look at the `EnglishLexicon.jl` repository you will see that it is in the form of scripts that take the original Zip files and produce the Arrow files. That way, if necessary, the changes can be undone or modified.
- Remember, the data are only as useful as their provenance. If you invested a lot of time and money in gathering the data you should treat it as a valued resource and exercise great care with it.
- The `Arrow.jl` package allows you to add metadata as key/value pairs, called a `Dict` (or dictionary). Use this capability. The name of the file is **not** a suitable location for metadata.
## Trial-level data from the LDT
In the lexical decision task the study participant is shown a character string, under carefully controlled conditions, and responds according to whether they identify the string as a word or not.
Two responses are recorded: whether the choice of word/non-word is correct and the time that elapsed between exposure to the string and registering a decision.
Several covariates, some relating to the subject and some relating to the target, were recorded.
Initially we consider only the trial-level data.
```{julia}
ldttrial = dataset(:ELP_ldt_trial)
```
The two response variables are `acc` - the accuracy of the response - and `rt`, the response time in milliseconds.
There is one trial-level covariate, `seq`, the sequence number of the trial within subj.
Each subject participated in two sessions on different days, with 2000 trials recorded on the first day.
Notice the metadata with a citation and a URL for the OSF project.
We convert to a DataFrame and add a Boolean column `s2` which is `true` for trials in the second session.
```{julia}
ldttrial = @transform(DataFrame(ldttrial), :s2 = :seq > 2000)
describe(ldttrial)
```
## Initial data exploration {#sec-ldtinitialexplore}
From the basic summary of `ldttrial` we can see that there are some questionable response times --- negative values and values over 32 seconds.
Because of obvious outliers we will use the median response time, which is not strongly influenced by outliers, rather than the mean response time when summarizing by item or by subject.
Also, there are missing values of the accuracy.
We should check if these are associated with particular subjects or particular items.
### Summaries by item
To summarize by item we group the trials by item and use `combine` to produce the various summary statistics.
As we will create similar summaries by subject, we incorporate an 'i' in the names of these summaries (and an 's' in the name of the summaries by subject) to be able to identify the grouping used.
```{julia}
byitem = @chain ldttrial begin
groupby(:item)
@combine(
:ni = length(:acc), # no. of obs
:imiss = count(ismissing, :acc), # no. of missing acc
:iacc = count(skipmissing(:acc)), # no. of accurate
:imedianrt = median(:rt),
)
@transform!(
:wrdlen = Int8(length(:item)),
:ipropacc = :iacc / :ni
)
end
```
It can be seen that the items occur in word/nonword pairs and the pairs are sorted alphabetically by the word in the pair (ignoring case).
We can add the word/nonword status for the items as
```{julia}
byitem.isword = isodd.(eachindex(byitem.item))
describe(byitem)
```
This table shows that some of the items were never identified correctly.
These are
```{julia}
filter(:iacc => iszero, byitem)
```
Notice that these are all words but somewhat obscure words such that none of the subjects exposed to the word identified it correctly.
We can incorporate characteristics like `wrdlen` and `isword` back into the original trial table with a "left join".
This operation joins two tables by values in a common column.
It is called a *left* join because the left (or first) table takes precedence, in the sense that every row in the left table is present in the result.
If there is no matching row in the second table then missing values are inserted for the columns from the right table in the result.
```{julia}
describe(
leftjoin!(
ldttrial,
select(byitem, :item, :wrdlen, :isword);
on=:item,
),
)
```
Notice that the `wrdlen` and `isword` variables in this table allow for missing values, because they are derived from the second argument, but there are no missing values for these variables.
If there is no need to allow for missing values, there is a slight advantage in disallowing them in the element type, because the code to check for and handle missing values is not needed.
This could be done separately for each column or for the whole data frame, as in
```{julia}
describe(disallowmissing!(ldttrial; error=false))
```
::: {.callout-note collapse="true"}
### Named argument "error"
The named argument `error=false` is required because there is one column, `acc`, that does incorporate missing values.
If `error=false` were not given then the error thrown when trying to `disallowmissing` on the `acc` column would be propagated and the top-level call would fail.
:::
A barchart of the word length counts, @fig-ldtwrdlenhist, shows that the majority of the items are between 3 and 14 characters.
```{julia}
#| code-fold: true
#| fig-cap: "Barchart of word lengths in the items used in the lexical decision task."
#| label: fig-ldtwrdlenhist
let
wlen = 1:21
draw(
data((; wrdlen=wlen, count=counts(byitem.wrdlen, wlen))) *
mapping(:wrdlen => "Length of word", :count) *
visual(BarPlot),
)
end
```
To examine trends in accuracy by word length we plot the proportion accurate versus word-length separately for words and non-words with the area of each marker proportional to the number of observations for that combination (@fig-propvswrdlen).
```{julia}
#| code-fold: true
#| fig-cap: "Proportion of accurate trials in the LDT versus word length separately for words and non-words. The area of the marker is proportional to the number of observations represented."
#| label: fig-propvswrdlen
let
itemsummry = combine(
groupby(byitem, [:wrdlen, :isword]),
:ni => sum,
:imiss => sum,
:iacc => sum,
)
@transform!(
itemsummry,
:iacc_mean = :iacc_sum / (:ni_sum - :imiss_sum)
)
@transform!(itemsummry, :msz = sqrt((:ni_sum - :imiss_sum) / 800))
draw(
data(itemsummry) * mapping(
:wrdlen => "Word length",
:iacc_mean => "Proportion accurate";
color=:isword,
markersize=:msz,
);
figure=(; resolution=(800, 450)),
)
end
```
The pattern in the range of word lengths with non-negligible counts (there are points in the plot down to word lengths of 1 and up to word lengths of 21 but these points are very small) is that the accuracy for words is nearly constant at about 84% and the accuracy for nonwords is slightly higher until lengths of 13, at which point it falls off a bit.
### Summaries by subject {#sec-elpsumrysubj}
A summary of accuracy and median response time by subject
```{julia}
bysubj = @chain ldttrial begin
groupby(:subj)
@combine(
:ns = length(:acc), # no. of obs
:smiss = count(ismissing, :acc), # no. of missing acc
:sacc = count(skipmissing(:acc)), # no. of accurate
:smedianrt = median(:rt),
)
@transform!(:spropacc = :sacc / :ns)
end
```
shows some anomalies
```{julia}
describe(bysubj)
```
First, some subjects are accurate on only about half of their trials, which is the proportion that would be expected from random guessing.
A plot of the median response time versus proportion accurate, @fig-ldtmedianrtvspropacc, shows that the subjects with lower accuracy are some of the fastest responders, further indicating that these subjects are sacrificing accuracy for speed.
```{julia}
#| code-fold: true
#| fig-cap: "Median response time versus proportion accurate by subject in the LDT."
#| label: fig-ldtmedianrtvspropacc
draw(
data(bysubj) *
mapping(
:spropacc => "Proportion accurate",
:smedianrt => "Median response time (ms)",
) *
(visual(Scatter) + smooth())
)
```
As described in @Balota_2007, the participants performed the trials in blocks of 250 followed by a short break.
During the break they were given feedback concerning accuracy and response latency in the previous block of trials.
If the accuracy was less than 80% the participant was encouraged to improve their accuracy.
Similarly, if the mean response latency was greater than 1000 ms, the participant was encouraged to decrease their response time.
During the trials immediate feedback was given if the response was incorrect.
Nevertheless, approximately 15% of the subjects were unable to maintain 80% accuracy on their trials
```{julia}
count(<(0.8), bysubj.spropacc) / nrow(bysubj)
```
and there is some association of faster response times with low accuracy.
The majority of the subjects whose median response time is less than 500 ms. are accurate on less than 75% of their trials.
Another way of characterizing the relationship is that none of the subjects with 90% accuracy or greater had a median response time less than 500 ms.
```{julia}
minimum(@subset(bysubj, :spropacc > 0.9).smedianrt)
```
It is common in analyses of response latency in a lexical discrimination task to consider only the latencies on correct identifications and to trim outliers.
In @Balota_2007 a two-stage outlier removal strategy was used; first removing responses less than 200 ms or greater than 3000 ms then removing responses more than three standard deviations from the participant's mean response.
As described in @sec-ldtrtscale we will analyze these data on a speed scale (the inverse of response time) using only the first-stage outlier removal of response latencies less than 200 ms or greater than 3000 ms.
On the speed scale the limits are 0.333 per second up to 5 per second.
To examine the effects of the fast but inaccurate responders we will fit models to the data from all the participants and to the data from the 85% of participants who maintained an overall accuracy of 80% or greater.
```{julia}
pruned = @chain ldttrial begin
@subset(!ismissing(:acc), 200 ≤ :rt ≤ 3000,)
leftjoin!(select(bysubj, :subj, :spropacc); on=:subj)
dropmissing!
end
size(pruned)
```
```{julia}
describe(pruned)
```
### Choice of response scale {#sec-ldtrtscale}
As we have indicated, generally the response times are analyzed for the correct identifications only.
Furthermore, unrealistically large or small response times are eliminated.
For this example we only use the responses between 200 and 3000 ms.
A density plot of the pruned response times, @fig-elpldtrtdens, shows they are skewed to the right.
```{julia}
#| code-fold: true
#| fig-cap: Kernel density plot of the pruned response times (ms.) in the LDT.
#| label: fig-elpldtrtdens
draw(
data(pruned) *
mapping(:rt => "Response time (ms.) for correct responses") *
AlgebraOfGraphics.density();
figure=(; resolution=(800, 450)),
)
```
In such cases it is common to transform the response to a scale such as the logarithm of the response time or to the speed of the response, which is the inverse of the response time.
The density of the response speed, in responses per second, is shown in @fig-elpldtspeeddens.
```{julia}
#| code-fold: true
#| fig-cap: Kernel density plot of the pruned response speed in the LDT.
#| label: fig-elpldtspeeddens
draw(
data(pruned) *
mapping(
:rt => (x -> 1000 / x) => "Response speed (s⁻¹) for correct responses") *
AlgebraOfGraphics.density();
figure=(; resolution=(800, 450)),
)
```
@fig-elpldtrtdens and @fig-elpldtspeeddens indicate that it may be more reasonable to establish a lower bound of 1/3 second (333 ms) on the response latency, corresponding to an upper bound of 3 per second on the response speed.
However, only about one half of one percent of the correct responses have latencies in the range of 200 ms. to 333 ms.
```{julia}
count(
r -> !ismissing(r.acc) && 200 < r.rt < 333,
eachrow(ldttrial),
) / count(!ismissing, ldttrial.acc)
```
so the exact position of the lower cut-off point on the response latencies is unlikely to be very important.
:::{.callout-note collapse="true"}
### Using inline transformations vs defining new columns
If you examine the code for @fit-elpldtspeeddens, you will see that the conversion from `rt` to speed is done inline rather than creating and storing a new variable in the DataFrame.
I prefer to keep the DataFrame simple with the integer variables (e.g. `:rt`) if possible.
I recommend using the `StandardizedPredictors.jl` capabilities to center numeric variables or convert to zscores.
:::
### Transformation of response and the form of the model
As noted in @Box1964, a transformation of the response that produces a more Gaussian distribution often will also produce a simpler model structure.
For example, @fig-ldtrtvswrdlen shows the smoothed relationship between word length and response time for words and non-words separately,
```{julia}
#| code-fold: true
#| fig-cap: "Scatterplot smooths of response time versus word length in the LDT."
#| label: fig-ldtrtvswrdlen
draw(
data(pruned) *
mapping(
:wrdlen => "Word length",
:rt => "Response time (ms)";
:color => :isword,
) * smooth()
)
```
and @fig-ldtspeedvswrdlen shows the similar relationships for speed
```{julia}
#| code-fold: true
#| fig-cap: "Scatterplot smooths of response speed versus word length in the LDT."
#| label: fig-ldtspeedvswrdlen
draw(
data(pruned) *
mapping(
:wrdlen => "Word length",
:rt => (x -> 1000/x) => "Speed of response (s⁻¹)";
:color => :isword,
) * smooth()
)
```
For the most part the smoother lines in @fig-ldtspeedvswrdlen are reasonably straight.
The small amount of curvature is associated with short word lengths, say less than 4 characters, of which there are comparatively few in the study.
@fig-speedviolin shows a "violin plot" - the empirical density of the response speed by word length separately for words and nonwords. The lines on the plot are fit by linear regression.
```{julia}
#| code-fold: true
#| fig-cap: "Empirical density of response speed versus word length by word/non-word status."
#| label: fig-speedviolin
let
plt = data(@subset(pruned, :wrdlen > 3, :wrdlen < 14))
plt *= mapping(
:wrdlen => "Word length",
:rt => (x -> 1000/x) => "Speed of response (s⁻¹)",
color=:isword,
side=:isword,
)
plt *= visual(Violin)
draw(plt, axis=(; limits=(nothing, (0.0, 2.8))))
end
```
## Models with scalar random effects {#sec-ldtinitialmodel}
A major purpose of the English Lexicon Project is to characterize the items (words or nonwords) according to the observed accuracy of identification and to response latency, taking into account subject-to-subject variability, and to relate these to lexical characteristics of the items.
In @Balota_2007 the item response latency is characterized by the average response latency from the correct trials after outlier removal.
Mixed-effects models allow us greater flexibility and, we hope, precision in characterizing the items by controlling for subject-to-subject variability and for item characteristics such as word/nonword and item length.
We begin with a model that has scalar random effects for item and for subject and incorporates fixed-effects for word/nonword and for item length and for the interaction of these terms.
### Establish the contrasts
Because there are a large number of items in the data set it is important to assign a `Grouping()` contrast to `item` (and, less importantly, to `subj`).
For the `isword` factor we will use an `EffectsCoding` contrast with the base level as `false`.
The non-words are assigned -1 in this contrast and the words are assigned +1.
The `wrdlen` covariate is on its original scale but centered at 8 characters.
Thus the `(Intercept)` coefficient is the predicted speed of response for a typical subject and typical item (without regard to word/non-word status) of 8 characters.
Set these contrasts
```{julia}
contrasts = Dict(
:subj => Grouping(),
:item => Grouping(),
:isword => EffectsCoding(; base=false),
:wrdlen => Center(8),
)
```
and fit a first model with simple, scalar, random effects for `subj` and `item`.
```{julia}
elm01 = let
form = @formula(
1000 / rt ~ 1 + isword * wrdlen + (1 | item) + (1 | subj)
)
fit(MixedModel, form, pruned; contrasts)
end
```
The predicted response speed by word length and word/nonword status can be summarized as
```{julia}
effects(Dict(:isword => [false, true], :wrdlen => 4:2:12), elm01)
```
If we restrict to only those subjects with 80% accuracy or greater the model becomes
```{julia}
elm02 = let
form = @formula(
1000 / rt ~ 1 + isword * wrdlen + (1 | item) + (1 | subj)
)
dat = @subset(pruned, :spropacc > 0.8)
fit(MixedModel, form, dat; contrasts)
end
```
```{julia}
effects(Dict(:isword => [false, true], :wrdlen => 4:2:12), elm02)
```
The differences in the fixed-effects parameter estimates between a model fit to the full data set and one fit to the data from accurate responders only, are small.
However, the random effects for the item, while highly correlated, are not perfectly correlated.
```{julia}
#| code-fold: true
#| fig-cap: "Conditional means of scalar random effects for item in model elm01, fit to the pruned data, versus those for model elm02, fit to the pruned data with inaccurate subjects removed."
#| label: fig-itemreelm01vselm02
CairoMakie.activate!(; type="png")
disallowmissing!(
leftjoin!(
byitem,
leftjoin!(
rename!(DataFrame(raneftables(elm01)[:item]), [:item, :elm01]),
rename!(DataFrame(raneftables(elm02)[:item]), [:item, :elm02]);
on=:item,
),
on=:item,
),
)
disallowmissing!(
leftjoin!(
bysubj,
leftjoin!(
rename!(DataFrame(raneftables(elm01)[:subj]), [:subj, :elm01]),
rename!(DataFrame(raneftables(elm02)[:subj]), [:subj, :elm02]);
on=:subj,
),
on=:subj,
); error=false,
)
draw(
data(byitem) * mapping(
:elm01 => "Conditional means of item random effects for model elm01",
:elm02 => "Conditional means of item random effects for model elm02";
color=:isword,
) * visual(Scatter; alpha=0.2);
axis=(; width=600, height=600),
)
```
::: {.callout-note}
Adjust the alpha on @fig-itemreelm01vselm02.
:::
@fig-itemreelm01vselm02 is exactly of the form that would be expected in a sample from a correlated multivariate Gaussian distribution.
The correlation of the two sets of conditional means is about 96%.
```{julia}
cor(Matrix(select(byitem, :elm01, :elm02)))
```
```{julia}
#| echo: false
#| output: false
CairoMakie.activate!(; type="svg")
```
These models take only a few seconds to fit on a modern laptop computer, which is quite remarkable given the size of the data set and the number of random effects.
The amount of time to fit more complex models will be much greater so we may want to move those fits to more powerful server computers.
We can split the tasks of fitting and analyzing a model between computers by saving the optimization summary after the model fit and later creating the `MixedModel` object followed by restoring the `optsum` object.
```{julia}
saveoptsum("./fits/elm01.json", elm01);
```
```{julia}
elm01a = restoreoptsum!(
let
form = @formula(
1000 / rt ~ 1 + isword * wrdlen + (1 | item) + (1 | subj)
)
MixedModel(form, pruned; contrasts)
end,
"./fits/elm01.json",
)
```
Other covariates associated with the item are available as
```{julia}
elpldtitem = DataFrame(dataset("ELP_ldt_item"))
describe(elpldtitem)
```
and those associated with the subject are
```{julia}
elpldtsubj = DataFrame(dataset("ELP_ldt_subj"))
describe(elpldtsubj)
```
For the simple model `elm01` the estimated standard deviation of the random effects for subject is greater than that of the random effects for item, a common occurrence.
A caterpillar plot, @fig-elm01caterpillarsubj,
```{julia}
#| code-fold: true
#| fig-cap: Conditional means and 95% prediction intervals for subject random effects in elm01.
#| label: fig-elm01caterpillarsubj
qqcaterpillar!(
Figure(resolution=(800, 650)),
ranefinfo(elm01, :subj),
)
```
shows definite distinctions between subjects because the widths of the prediction intervals are small compared to the range of the conditional modes.
Also, there is at least one outlier with a conditional mode over 1.0.
@fig-elm02caterpillarsubj is the corresponding caterpillar plot for model `elm02` fit to the data with inaccurate responders eliminated.
```{julia}
#| code-fold: true
#| fig-cap: Conditional means and 95% prediction intervals for subject random effects in elm02.
#| label: fig-elm02caterpillarsubj
qqcaterpillar!(
Figure(resolution=(800, 650)),
ranefinfo(elm02, :subj),
)
```
## Random effects from the simple model related to covariates
The random effects "estimates" (technically they are "conditional means") from the simple model `elm01` provide a measure of how much the item or subject differs from the population. (We use `elm01` because the main difference between `elm01` and `elm02` are that some subjects were dropped before fitting `elm02`.)
For the item its length and word/non-word status have already been incorporated in the model.
At this point the subjects are just being treated as a homogeneous population.
The random effects conditional means have been extracted and incorporated in the `byitem` and `bysubj` tables.
Now add selected demographic and item-specific measures.
```{julia}
itemextended = leftjoin(
byitem,
select(elpldtitem, 1:5);
on = :item,
)
subjextended = leftjoin(
bysubj,
select(elpldtsubj, 1:3, :vocabAge);
on=:subj,
)
```
As shown in @fig-elm01vocabage, there does not seem to be a strong relationship between vocabulary age and speed of response by subject.
```{julia}
#| label: fig-elm01vocabage
#| fig-cap: "Random effect for subject in model elm01 versus vocabulary age"
#| code-fold: true
draw(
data(dropmissing(select(subjextended, :elm01, :vocabAge, :sex))) *
mapping(
:vocabAge => "Vocabulary age (yr) of subject",
:elm01 => "Random effect in model elm01";
color=:sex,
) * visual(Scatter; alpha=0.6)
)
```
```{julia}
#| code-fold: true
#| label: fig-elm01univ
#| fig-cap: "Estimated density of random effects for subject in model elm01 by university"
draw(
data(dropmissing(select(subjextended, :elm01, :univ))) *
mapping(
:elm01 => "Random effect in model elm01";
color=:univ => "University",
) * AlgebraOfGraphics.density()
)
```
```{julia}
#| code-fold: true
#| label: fig-elm02univ
#| fig-cap: "Estimated density of random effects for subject in model elm02, fit to accurate responders only, by university"
draw(
data(dropmissing(select(subjextended, :elm02, :univ))) *
mapping(
:elm02 => "Random effect in model elm02 (accurate responders only)";
color=:univ => "University",
) * AlgebraOfGraphics.density()
)
```
```{julia}
#| code-fold: true
#| label: fig-elm01BGMean
#| fig-cap: "Random effect in model elm01 versus mean bigram frequency, by word/nonword status"
CairoMakie.activate!(; type="png")
draw(
data(dropmissing(select(itemextended, :elm01, :BG_Mean, :isword))) *
mapping(
:BG_Mean => "Mean bigram frequency",
:elm01 => "Random effect in model elm01";
color=:isword,
) * visual(Scatter; alpha=0.2)
)
```