-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathch08frequencies.Rmd
367 lines (290 loc) · 17.9 KB
/
ch08frequencies.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
# (PART\*) Part II: Descriptive statistics {-}
# Frequencies {#ch-frequencies}
## Introduction
When analysing data, a distinction is often made between
qualitative and quantitative methods. With the first method,
observations (e.g. answers in interviews) are represented in
words, and with the second method, observations (e.g. speech
pauses in interviews) are represented in numbers. In our opinion,
the difference between qualitative and quantitative methods
lies in how observations are represented, and
how arguments are made on the basis of these observations.
Sometimes it is also possible to analyse the very same data (e.g.
interviews) both qualitatively and quantitatively. The major
advantages of quantitative methods are that the data can be summarised
relatively straightforwardly (this is the subject of this part of the
textbook), and that it is relatively simple to draw meaningful conclusions
on the basis of the observations.
## Frequencies {#sec:frequencies}
Quantitative data can be reported in various different ways.
The most straightforward way would be to report the raw data, preferably
sorted according to the observed variable's value.
The disadvantage of this is that a potential pattern in the observations
will not be easily visible.
---
> *Example 8.1*: Students $(N=50)$ in a first year
course reported the following values for their shoe size, a
variable of an interval level of measurement:\
36, 36, 37, 37, 37, 37, 37, 37, 38, 38, 38, 38, 38, 38, 39, 39, 39, 39,
39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 40, 40, 40,
40, 40, 40, 41, 41, 41, 41, 41, 42, 42, 43, 43, 44, ??.\
One of the students did not provide an answer; this missing answer is
shown here as ??.
---
It is usually more insightful and efficient to summarise observations
and report them in the form of a *frequency* for each value.
This frequency indicates the *number* of observations which have a certain value,
or which have a value in a certain interval or class. In order to get
the frequencies, we thus *count* the number of observations with a certain
value, or the number of observations in a certain interval. These
frequencies are reported in a table. We call such a table a
frequency distribution.
As a first example, Table \@ref(tab:klankfreq) provides a
frequency distribution of a discrete variable of *nominal* level of measurement,
namely the phonological class of sounds in Dutch [@LKCG07].
#52-56
```{r klankfreq, echo=FALSE}
klankfreq <- read.table( file="data/klankfreq.txt", header=T )
# 20201130 column names in English
dimnames(klankfreq)[[2]] <- c("main.class","sub.class","count")
# print(klankfreq)
knitr::kable( klankfreq,
booktabs=TRUE,
caption="Frequency distribution
of the phonological class of speech sounds
in the *Corpus of Spoken Dutch*
(C=consonant, V=vowel; lang=long vowel, kort=short vowel)." )
```
As a second example, Table \@ref(tab:shoesize) provides a
frequency distribution of a continuous variable of *interval*
level of measurement, namely the aforementioned shoe size
of first year students (Example 8.1).
Table: (#tab:shoesize) Frequency distribution of the self-reported shoe sizes of $N=50$ students in a first year course (see Example 8.1 above).
------------ ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
Shoe size 36 37 38 39 40 41 42 43 44 ??
Number 2 6 6 19 6 5 2 2 1 1
------------ ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
Nevertheless, when a numerical variable is able to assume a great many different
values, the frequency distribution thus consequently becomes large and confusing.
We then add together values in a certain interval,
and afterwards make a frequency distribution on the smaller number
of intervals or classes.
---
> *Example 8.2*: When Queen Beatrix of the Netherlands was giving her last Queen's Speech,
on 18th September 2012, she paused some $305\times$. The frequency distribution
of the pause length (measured in seconds) is shown in
Table \@ref(tab:queensspeech2012pauses).
---
Table: (#tab:queensspeech2012pauses) Frequency distribution of the length of speech pauses (seconds)
in the Queen's Speech of 18th September 2012, given by Queen Beatrix of the Netherlands
$(N=305)$.
Interval Number
------------ --------
4.50--4.99 1
4.00--4.49 0
3.50--3.99 2
3.00--3.49 7
2.50--2.99 4
2.00--2.49 25
1.50--1.99 32
1.00--1.49 16
0.50--0.99 67
0.00--0.49 151
### Intervals {#sec:intervals}
For a variable of nominal and ordinal level of measurement, we generally
use the original categories to make the frequency distribution
(see Table \@ref(tab:klankfreq)), although it is possible to add categories
together. For a variable of interval or ratio level of measurement, a
researcher can choose the number of intervals in the frequency distribution
themself. Sometimes that is not necessary, for instance because the variable has
a clear number of different discrete values (see Table \@ref(tab:shoesize)).
However, sometimes, as a researcher you have to decide for yourself how many
intervals you should distinguish, and how to determine the
interval boundaries (see Table \@ref(tab:queensspeech2012pauses)).
In this instance, the following are recommended [@Ferg89, Ch.2]:
- Ensure that all observations (i.e. the entire range) fall into
roughly 10 to 20 intervals.
- Ensure that all intervals are equally wide.
- Make the lower limit of the first or second interval the same as
the width of the intervals (see
Table \@ref(tab:queensspeech2012pauses): every interval is 0.50 s
wide, and the second interval's lower limit is also 0.50).
- Order the intervals in a frequency distribution from bottom
to top in increasing order (i.e. from top to bottom in
descending order), see
Table \@ref(tab:queensspeech2012pauses)).
The wider we make the intervals, the more information we lose
about the precise distribution within each interval.
### SPSS
```
Analyze > Descriptive Statistics > Frequencies...
```
Select variable (drag to the "Variable(s)" panel).\
Tick: `Display frequency tables`.\
Choose `Format`, choose: `Order by: Descending values`.\
Confirm with `OK`.\
### JASP
From the top bar, choose
```
Descriptives
```
Select the variable(s) to summarize and place them into the field "Variables". Check the option `Frequency tables (nominal and ordinal variables)` (below the field with Variables) to obtain frequency tables for all selected variables.
If a variable has higher measurement level ("Scale", i.e. interval or ratio, Chapter \@ref(ch-levelsofmeasurement)), and you nevertheless want a frequency table of that variable, then you need to adjust the measurement level of that variable to "Nominal" or "Ordinal". In JASP, you can do so by clicking on the icon for measurement level, preceding the variable name in the column header in the data window.
You may also want to divide or "cut up" the continuous variable into discrete intervals. This is achieved by first creating a new empty variable, and subsequently filling that variable with the discrete interval values. The resulting variable will be of ordinal measurement level.
To create a new variable, click on the **+** button to the right of the last column name in the data tab. A "Create Computed Column" panel appears, where you can enter a name for the new variable, e.g. `troon12_ordinal`. You can also choose between `R` and a pointer. These are the two options in JASP to define formulas with which the new (empty) variable is filled; using `R` code, or manually using JASP. The paragraphs below explain how to create an interval variable using these two options. Finally, you can check which measurement level the new variable should be (see Chapter \@ref(ch-levelsofmeasurement)), this should be `Ordinal`. Next, click on `Create Column` to create the new variable. The new variable (empty column) appears as the rightmost variable in the data set.
If the **R** option is chosen to define the new variable, a field with "#Enter your R code here :)" appears above the data. Here you can enter R code that generates random numbers using R functions. Enter the R code (see below) and click on `Compute column` at the bottom of the field to fill the empty variable with the numbers generated by the R code.\
This snippet of R code cuts up the continuous variable `troon2012` into the discrete intervals shown in Table \@ref(tab:queensspeech2012pauses):
```
cut( troon2012, breaks = seq(0, 5, 0.5) )
```
Parse this snippet from the innermost brackets outwards:
(i) `seq`: make a sequence from 0 to 5 (units, here: seconds) in increments of 0.5 seconds,
(ii) `cut`: cut up the variable (troon2012) in intervals based on this sequence.
If you do not know the width and range of the intervals, but instead you wish to obtain a particular number (say 10) of equal intervals, then use the following snippet:
```
cut( troon2012, 10 )
```
The first interval starts at the lowest score, and the last interval ends at the highest value of `troon12`, yielding `10` intervals in total.
If the **pointer** or **manual** option is chosen to define the new variable, a work sheet will appear above the data. To the left of the work sheet are the variables, above it are math symbols, and to the right of the work sheet are several functions. From those functions you can pick one to generate your random numbers. If something goes wrong, items on the work sheet can be erased by dragging them to the trash bin on the lower right bottom. After you have completed the specification on the work sheet, then click on the button `Compute column` under the work sheet, to fill the new variable with the ordinal values.\
In order to cut up a continuous variable into equal and discrete intervals, pick the function `cut(y)` from the right. Drag the input continuous variable to the "values" field and enter the number of resulting intervals as "numBreaks" value. If you want to specify the width and number of intervals yourself, then use the R code option (see above).
Now, at last, you can make the frequency table of the ordinal variable created above, as described in the first paragraph of this subsection. Select the newly created ordinal variable(s) to summarize and place them into the field "Variables". Check the option `Frequency tables (nominal and ordinal variables)` (below the field with Variables) to obtain frequency tables for all selected variables.
### R
```{r schoenmaat02, eval=FALSE}
enq2011 <- read.table(
file=url("http://www.hugoquene.nl/R/enq2011.txt"),
header=TRUE )
table( enq2011$shoe, useNA="ifany" )
```
The output of the above `table` command is shown in Table \@ref(tab:shoesize).
The code `NA` (Not Available) is used in R to indicate missing data.
```{r troonredes12-1, echo=FALSE}
load(file="data/pauses6.Rda")
troon2012 <- pauses6[ pauses6$jaar==2012, 12 ] # save col_12 as single vector
# troon2012 <- subset(pauses6[,c(12)], pauses6$jaar==2012)
```
```{r troonredes2012-2, eval=FALSE}
table( cut( troon2012, breaks=seq(from=0,to=5,by=0.5) ) )
```
Parse this task from the innermost brackets outwards:
(i) `seq`: make a sequence from 0 to 5 (units, here: seconds) in increments of 0.5 seconds,
(ii) `cut`: cut up the dependent variable `length` in intervals based on this sequence,
(iii) `table`: make a frequency distribution of these intervals.
This task's output is shown (in edited form) in Table \@ref(tab:queensspeech2012pauses).
## Bar charts {#sec:barcharts}
A bar chart is the graphical representation of the
frequency distribution of a discrete, categorical variable (of
nominal or ordinal level of measurement). A bar chart is constructed
of rectangles. All rectangles are equally wide, and the
rectangle's height corresponds with the frequency of that category. The
surface area of each rectangle thus also corresponds with that category's frequency.
In contrast to a histogram, the rectangles are *not*
joined up to each other along the horizontal axis, to
show that we are dealing with discrete categories.
```{r klankfreq-barplot, echo=FALSE, fig.cap="Bar chart of the frequency distribution of phonological class of speech sounds in the Corpus of Spoken Dutch (C=consonant, V=vowel)."}
klankfreq <- read.table(file="data/klankfreq.txt", header=T)
with(klankfreq, barplot( aantal, beside=T,
ylab="Number",
main=paste("Speech sound frequencies in Dutch (N=",
sum(klankfreq$aantal), " sounds)", sep=""),
col=ifelse(klankfreq[,1]=="V","grey40","grey20") ) ) -> klankfreq_barplot
axis(side=1, at=klankfreq_barplot, labels=klankfreq$hoofdklasse)
axis(side=1, at=klankfreq_barplot, tick=F, line=1, labels=klankfreq$onderklasse )
# simpler: with(klankfreq, barplot(aantal) ) # all defaults
```
A bar chart helps us to determine at a glance the most important
distributional characteristics of a discrete variable:
the most characteristic (most frequently occurring) value, and the
distribution across categories. For the sound frequencies
in Dutch (Figure \@ref(fig:klankfreq-barplot)), we
see that amongst the consonants the plosives
occur the most, that amongst the vowels the
short vowels occur the most, that diphthongs
are not used much (the sounds in Dutch *ei, ui, au*), and that
more consonants are used compared with vowels.
Tip: Avoid shading and other 3D-effects in a bar chart! These make
the width and height of a rectangle less readable, and
the visible surface area of a shaded rectangle or
of a bar no longer corresponds well with the frequency.
## Histograms {#sec:histograms}
A histogram is the graphical representation of a frequency distribution of
a continuous, numerical variable (of interval or ratio level of measurement).
A histogram is constructed of rectangles. The width of each
rectangle corresponds with the interval width (a rectangle can
also be one unit wide) and the height corresponds with the frequency
of that interval or value. The surface area of each rectangle
therefore corresponds with the frequency. In contrast to a
bar chart, the rectangles do join up to each other
along the horizontal axis.
```{r troonrede2012-hist, echo=FALSE, fig.cap="Histogram for the lengths of pauses (in seconds) in the Queen's Speech of 18 September 2012, read by Queen Beatrix (N=305)."}
hist( troon2012,
breaks=seq(0, 5, by=0.25),
col="grey80",
xlab="Pause length (s)", ylab="Number",
main="Queen's Speech 2012 (N=305 pauses)" ) -> troonrede2012pauzes_hist
# print(troonrede2012pauzes_hist$breaks)
# print(troonrede2012pauzes_hist$counts)
```
A histogram helps us to determine at a glance the most important distributional characteristics
of a continuous variable: the most characteristic
(most frequently occurring) value, the degree of dispersion, the
number of peaks in the frequency distribution, the position of the peaks,
and potential outliers.
(see §\@ref(sec:outliers)).
For the pauses in the Queen's Speech of 2012
(Figure \@ref(fig:troonrede2012-hist)), we see that the majority of pauses
last between 0.25 and 0.75 s (these are presumably pauses for breath),
that there are two peaks in the distribution (the second peak is at 2 s),
and that there is one extremely long pause (with a duration
of almost 5 s).
Tip: Avoid shading and other 3D-effects in a histogram! These
make the width and height of a rectangle less readable,
and the visible surface area of a shaded rectangle or of a bar
no longer correspond well with the frequency.
### SPSS
```
Analyze > Descriptive Statistics > Frequencies...
```
Select variable (drag to the "Variable(s)" panel).\
Choose `Charts`, then pick `Chart type: Bar chart` for a
bar chart or `Chart type: Histogram` for a histogram (see
the above text for the difference between these options).\
Confirm with `OK`.\
### JASP
From the top bar, choose
```
Descriptives
```
Select the variable(s) to summarize and place them into the field "Variables". Open the "Plots" field and check the option `Distribution plots` (under the header "Basic plots").
JASP will automatically make a bar chart if the variable is discrete (nominal or ordinal), and it will make a histogram if the variable is a scalar, i.e., numerical and continuous (interval or ratio). Ensure that the correct measurement level is specified for your variables. In JASP, you can set the measurement level by clicking on the icon for measurement level, which precedes the variable name in the column header in the data window.
### R
You can make a bar chart like Figure \@ref(fig:klankfreq-barplot) in R with the
following commands:
```{r klankfreq-barplot-voorbeeld, eval=FALSE}
# read data
klankfreq <- read.table( file="data/klankfreq.txt", header=T )
# 20201130 column names in English
dimnames(klankfreq)[[2]] <- c("main.class","sub.class","count")
# make barplot from column `count` in dataset `klankfreq`
with( klankfreq, barplot( count, beside=T,
ylab="Frequency",
main="Frequencies of speech sounds in Dutch (N=2968220)",
col=ifelse(klankfreq[,1]=="V","grey40","grey20") ) ) -> klankfreq_barplot
# make labels along the bottommost horizontal axis
axis(side=1, at=klankfreq_barplot, labels=klankfreq$main.class)
axis(side=1, at=klankfreq_barplot, tick=F, line=1, labels=klankfreq$sub.class )
# or simpler: with(klankfreq, barplot(count) ) # all defaults
```
You can make a histogram like in Figure \@ref(fig:troonrede2012-hist)
in R with the follow commands:
```{r troonrede2012-hist-voorbeeld, eval=FALSE}
# read dataset
load(file="data/pauses6.Rda")
# extract pause lengths (columns 12) for the year 2012, into a separate dataset `troon2012`
troon2012 <- pauses6[ pauses6$jaar==2012, 12 ] # save col_12 as single vector
# make histogram
hist( troon2012,
breaks=seq(0, 5, by=0.25),
col="grey80",
xlab="Length of pause (s)", ylab="Frequency",
main="Queen's Speech 2012 (N=305 pauses)" ) -> troonrede2012pauzes_hist
```