# Walkthrough 4: Longitudinal Analysis With Federal Students With Disabilities Data {#c10}
```{r load custom themes, echo = FALSE, message = FALSE, eval=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
```
## Topics Emphasized
- Importing data
- Tidying data
- Transforming data
- Visualizing data
- Modeling data
- Communicating results
## Functions Introduced
- `list.files()`
- `download.file()`
- `lubridate::ymd()`
- `identical()`
- `dplyr::top_n()`
- `ggplot2::geom_jitter()`
- `dplyr::arrange()`
## Vocabulary
- aggregate data
- file path
- list
- read in
- tidy format
- statistical model
- student-level data
- longitudinal analysis
- ratio
- subset
- vector
## Chapter Overview
Data scientists working in education don't always have access to student-level
data, so knowing how to model publicly available datasets, as in [the previous
chapter](#c9), is a useful skill. This walkthrough builds upon and extends the
focus on aggregate data in the last chapter to focus on a change over time in
students with disabilities in each state. We note that analyses that involve
time can go by a number of names, such as longitudinal analyses or time series
analyses, or - less formally - analyses or studies of change over time.
Here, we primarily use the term longitudinal analysis to refer to analyses of
data at multiple time points. While data from two time points would be included in
this definition, our emphasis is on data from a greater number of time points, which
can reveal more nuance in how change over time is happening.
### Background
In this chapter, we'll be learning some ways to explore data over time. In
addition, we'll be learning some techniques for exploring a publicly
available dataset. Like most public datasets (see [the previous chapter](#c9)),
this one contains aggregate data. This means that someone totaled up the student
counts so that the data don't reveal any private information.
You can download the datasets for this walkthrough on the United States
Department of Education website (see @usdoe2019)^[The documentation for the
dataset is available here:
https://www2.ed.gov/programs/osepidea/618-data/collection-documentation/data-documentation-files/part-b/child-count-and-educational-environment/idea-partb-childcountandedenvironment-2017-18.pdf],
though they are also available in the {dataedu} package that accompanies this
book, as we describe in the Importing the Data From the {dataedu} Package section below.
### Methods
In this walkthrough, we'll learn how to read multiple datasets in using the
`map()` function. Next, we'll prepare our data for analysis by cleaning the
variable names. Finally, we'll explore this dataset by visualizing student
counts and comparing male to female ratios over time.
## Load Packages
The function `here()` from the {here} package can cause conflicts with other functions called `here()`. We can prevent problems by loading that package last and including the package name for every call to `here()`, like this: `here::here()`. This is called "including the namespace."
If you have not installed any of these packages, then you will need to do so, first, using the `install.packages()` function; see the [Packages section](#c06p) of the [Foundational Skills](#c06) chapter for instructions (and an overview of what packages are and how they work).
You can load the packages needed in this walkthrough by running this code:
```{r load packages, message=FALSE, warning=FALSE}
library(tidyverse)
library(dataedu)
library(lubridate)
library(here)
```
## Import Data
In this analysis we'll be importing and combining six datasets that describe the
number of students with disabilities in a given year. Let's spend some time
carefully reviewing how to get the `.csv` files we'll need downloaded and stored
on your computer. If you want to run the code exactly as written here, you'll
need to store the same datasets in the right location. As an alternative, we make the data files used in this walkthrough - like those in other walkthroughs -
available through the {dataedu} package. Last, we note that while it's possible to use this
walkthrough on different datasets or to store them in different locations on
your computer, you'll need to make adjustments to your code based on the
datasets you used and where you stored them. We suggest only doing this if you
already have some experience using R.
### What to Download
In this walkthrough, we'll be using six separate datasets of child counts, one
for each year between 2012 and 2017. If you're copying and pasting the code in
this walkthrough, we recommend downloading the datasets from our GitHub
repository for the most reliable results. As we note above, you can also access these data after they have been merged together via the {dataedu} package; see the Importing the Data From the {dataedu} Package section of this chapter. Here's a link to each file; we also include a short URL via the URL-shortener website
bit.ly:
- [2012 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2012.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2012.csv) (https://bit.ly/3dCtVtf)
- [2013 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2013.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2013.csv) (https://bit.ly/33WXnFX)
- [2014 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2014.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2014.csv) (https://bit.ly/2UvSwbx)
- [2015 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2015.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2015.csv) (https://bit.ly/39wQAUg)
- [2016 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2016.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2016.csv) (https://bit.ly/2JubWHC)
- [2017 data](https://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2017-18.csv) (https[]()://github.com/data-edu/data-science-in-education/raw/master/data/longitudinal_data/bchildcountandedenvironments2017-18.csv) (https://bit.ly/2wPLu8w)
You can also find these files on the [United States Department of Education
website](https://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/index.html)
(https[]()://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/index.html)
### A Note on File Paths
When you download these files, be sure to store them in a folder in your working
directory. To get to the data in this walkthrough, we can use this file path
in our working directory: "data/longitudinal\_data". We'll be using the here *function* from the {here}
*package*, which conveniently fills in all the folders in the file path of your
working directory all the way up to the folders you specify in the arguments. So,
when referencing the file path "data/longitudinal\_data", we'll use code like
this:
```{r demo here, eval = FALSE}
here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv")
```
You can use a different file path if you like; just take note of where your
downloaded files are so you can use the correct file path when writing your code
to import the data.
### How to Download the Files
One way to download the files is to manually download them, saving them to a working directory.
Another way to download the files is to read them directly into R, using the `download.file()` function,
and the same file path described in the previous section. This functionality works for any CSV files that you can download from webpages; the key is that the URL must be to the CSV file itself (one way to check is to ensure that the URL ends in `.csv`).
Here is how we would do it for the first dataset (from the year 2012), using the shortened URLs included along with the full URLs above.
```{r, eval = FALSE}
download.file(
# the url argument takes a URL for a CSV file
url = 'https://bit.ly/3dCtVtf',
# destfile specifies where the file should be saved
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"),
mode = "wb")
```
We can do this for the remaining five datasets.
```{r, eval = FALSE}
download.file(
url = 'https://bit.ly/33WXnFX',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2013.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2UvSwbx',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2014.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/39wQAUg',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2015.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2JubWHC',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2016.csv"),
mode = "wb")
download.file(
url = 'https://bit.ly/2wPLu8w',
destfile = here::here("data",
"longitudinal_data",
"bchildcountandedenvironments2017-18.csv"),
mode = "wb")
```
Now that the files are downloaded (either through the above code or by downloading them from GitHub), we're
ready to proceed to reading the data into R. If you were unable to download these files for any reason, they
are also available through the {dataedu} package, as we describe after the "Reading in Many Datasets" section.
### Reading in One Dataset
We'll be learning how to read in more than one dataset using the `map()`
function. Let's try it first with one dataset, then we'll scale our solution up
to multiple datasets. When you are analyzing multiple datasets that all have the
same structure, you can read in each dataset using one code chunk. This code
chunk will store each dataset as an element of a list.
Before doing that, you should explore one of the datasets to see what you can
learn about its structure. Clues from this exploration inform how you read in
all the datasets at once later on. For example, we can see that the first
dataset has some lines at the top that contain no data:
```{r read 2012 data, echo = FALSE, message = FALSE, warning = FALSE}
read_csv(here::here(
"data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"
))
```
The rows containing "Extraction Date:", "Updated:" and "Revised:" aren't
actually rows. They're notes the authors left at the top of the dataset to show
when the dataset was changed.
`read_csv()` uses the first row as the variable names unless told otherwise, so
we need to tell `read_csv()` to skip those lines using the `skip` argument. If
we don't, `read_csv()` assumes the very first line--the one that says
"Extraction Date:"--is the correct row of variable names. That's why calling
`read_csv()` without the `skip` argument results in column names like `X4`. When
there's no obvious column name to read in, `read_csv()` names them `X[...]` and
lets you know in a warning message.
Try using `skip = 4` in your call to `read_csv()`:
```{r read 2012 data with skip, warning = FALSE, message = FALSE}
read_csv(here::here(
"data",
"longitudinal_data",
"bchildcountandedenvironments2012.csv"
),
skip = 4)
```
The `skip` argument told `read_csv()` to treat the line containing "Year", "State
Name", and so on as the first line. The result is a dataset that has "Year",
"State Name", and so on as variable names.
### Reading in Many Datasets
Will the `read_csv()` and `skip = 4` combination work on all our datasets? To
find out, we'll use this strategy:
- Store a vector of filenames and paths in a list. These paths point to our
datasets
- Pass the list of filenames as arguments to `read_csv()` using
`purrr::map()`, including `skip = 4`, in our `read_csv()` call
- Examine the new list of datasets to see if the variable names are correct
Imagine a widget-making machine that works by acting on raw materials it
receives on a conveyer belt. This machine executes one set of instructions on
each of the raw materials it receives. You are the operator of the machine and
you design instructions to get a widget out of the raw materials. Your plan
might look something like this:
- *Raw materials*: a list of filenames and their paths
- *Widget-making machine*: `purrr::map()`
- *Widget-making instructions*: `read_csv(path, skip = 4)`
- *Expected widgets*: a list of datasets
Let's create the raw materials first. Our raw materials will be file paths to
each of the CSVs we want to read. Use `list.files()` to make a vector of filename
paths and name that vector `filenames`. `list.files()` returns a vector of file
names in the folder specified in the `path` argument. When we set the
`full.names` argument to `TRUE`, we get the full path of these filenames. This
will be useful later when we need the file names and their paths to read our
data in.
```{r get filenames, results='hide'}
# Get filenames from the data folder
filenames <-
list.files(path = here::here("data", "longitudinal_data"),
full.names = TRUE)
# A list of filenames and paths
filenames
```
That made a vector of six filenames, one for each year of child count data
stored in the data folder. Now pass our raw materials, the vector called
`filenames`, to our widget-making machine called `map()` and give the machine
the instructions `read_csv(., skip = 4)`. Name the list of widgets it cranks out
`all_files`:
```{r read list of CSVs, message=FALSE, warning=FALSE}
# Pass filenames to map and read_csv
all_files <-
filenames %>%
# Apply the function read_csv to each element of filenames
map(., ~ read_csv(., skip = 4))
```
It is important to think ahead here. The goal is to combine the datasets in
`all_files` into one dataset using `bind_rows()`. But that will only work if all
the datasets in our list have the same number of columns and the same column
names. We can check our column names by using `map()` and `names()`:
```{r check column names, include=FALSE}
all_files %>%
# Apply the function names to each element of all_files
map(names)
```
We can use `identical()` to see if the variables from two datasets match. We see
that the variable names of the first and second datasets don't match, but the
variables from the second and third do.
```{r check names with identical}
# Variables of first and second dataset don't match
identical(names(all_files[[1]]), names(all_files[[2]]))
# Variables of third and second files match
identical(names(all_files[[2]]), names(all_files[[3]]))
```
And we can check the number of columns by using `map()` and `ncol()`:
```{r check cols}
all_files %>%
# apply the function ncol to each element of all_files
map(ncol)
```
We have just encountered an extremely common problem in education data!
Neither the number of columns nor the column names match.
This is a problem because - with different column names - we won't be able to combine the datasets in a later step.
As we can see, when we try, `bind_rows()` returns a dataset with 100 columns, instead of the
expected 50.
```{r combine datasets 1}
# combining the datasets at this stage results in the incorrect
# number of columns
bind_rows(all_files) %>%
# check the number of columns
ncol()
```
We'll correct this in the next section by selecting and renaming our variables,
but it's good to notice this problem early in the process so you know to work on
it later.
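If you want to pinpoint exactly which names differ between two of the datasets, one option (not part of the original walkthrough) is base R's `setdiff()`, which returns the elements of its first argument that are missing from the second:

```{r, eval = FALSE}
# Names in the first dataset but not the second
setdiff(names(all_files[[1]]), names(all_files[[2]]))

# Names in the second dataset but not the first
setdiff(names(all_files[[2]]), names(all_files[[1]]))
```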
### Loading the data from {dataedu}
After all of the hard work we've done above, it may seem painful to simply read in the final result! But,
if you were unable to download the files because you do not have Internet access (or for any other reason!), you can read in the `all_files` list of six data frames through the {dataedu} package with the following line of code:
```{r, eval = FALSE}
all_files <- dataedu::all_files
```
## Process Data
Transforming your dataset before visualizing it and fitting models is critical.
It's easier to write code when variable names are concise and informative. Many
functions in R, especially those in the {ggplot2} package, work best when
datasets are in a "tidy" format. It's easier to do an analysis when you have
just the variables you need. Any unused variables can confuse your thought
process.
Let's preview the steps we'll be taking:
1. Fix the variable names in the 2016 data
2. Combine the datasets
3. Pick variables
4. Filter for the desired categories
5. Rename the variables
6. Standardize the state names
7. Transform the column formats from wide to long using `pivot_longer`
8. Change the data types of variables
9. Explore NAs
In real life, data scientists don't always know the cleaning steps until they
dive into the work. Learning what cleaning steps are needed requires
exploration, trial and error, and clarity on the analytic questions you want to
answer.
After a lot of exploring, we settled on these steps for this analysis. When you
do your own, you will find different things to transform. As you do more and
more data analysis, your instincts for what to transform will improve.
### Fix the Variable Names in the 2016 Data
When we print the 2016 dataset, we notice that the variable names are incorrect.
Let's verify that by looking at the first ten variable names of the 2016
dataset, which is the fifth element of `all_files`:
```{r 2016 dataset}
# Look at the first 10 column names of 2016
names(all_files[[5]])[1:10]
```
We want the variable names to be `Year` and `State Name`, not `2016` and
`Alabama`. But first, let's go back and review how to get at the 2016 dataset
from `all_files`. We need to identify which element the 2016 dataset was in the
list. The order of the list elements was set all the way back when we fed
`map()` our list of filenames. If we look at `filenames` again, we see that its
fifth element is the 2016 dataset. Try looking at the first and fifth elements
of `filenames`:
```{r show filenames, eval=FALSE}
filenames[[1]]
filenames[[5]]
```
Once we know the 2016 dataset is the fifth element of our list, we can pluck it
out by using double brackets:
```{r pluck 2016 data}
all_files[[5]]
```
We used `skip = 4` when we read in the datasets in the list. That worked for all
datasets except the fifth one. In that one, skipping four lines left out the
variable name row. To fix it, we'll read the 2016 dataset again using
`read_csv()` and the fifth element of `filenames` but this time will use the
argument `skip = 3`. We'll assign the newly read dataset to the fifth element of
the `all_files` list:
```{r fix 2016 colnames, echo = FALSE, message=FALSE, warning=FALSE}
all_files[[5]] <-
# Skip the first 3 lines instead of the first 4
read_csv(filenames[[5]], skip = 3)
```
Try printing `all_files` now. You can confirm we fixed the problem by checking
that the variable names are correct.
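For example, rerunning the same check we used earlier should now show variable names like `Year` and `State Name` rather than values like `2016` and `Alabama`:

```{r, eval = FALSE}
# Look again at the first 10 column names of the 2016 dataset
names(all_files[[5]])[1:10]
```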
### Pick Variables
Now that we know all our datasets have the correct variable names, we simplify
our datasets by picking the variables we need. This is a good place to think
carefully about which variables to pick. This usually requires a fair amount of
trial and error, but here is what we found we needed:
- Our analytic questions are about gender, so let's pick the gender variable
- Later, we'll need to filter our dataset by disability category and program
location so we'll want `SEA Education Environment` and `SEA Disability
Category`
- We want to make comparisons by state and reporting year, so we'll also pick
`State Name` and `Year`
Combining `select()` and `contains()` is a convenient way to pick these
variables without writing a lot of code. Knowing that we want variables that
contain the acronym "SEA" and variables that contain "male" in their names, we
can pass those characters to `contains()`:
```{r pick vars}
all_files[[1]] %>%
select(
Year,
contains("State", ignore.case = FALSE),
contains("SEA", ignore.case = FALSE),
contains("male")
)
```
That code chunk verifies that we got the variables we want, so now we will turn
the code chunk into a function called `pick_vars()`. We will then use `map()` to
apply `pick_vars()` to each dataset in our list, `all_files`.
In this function, we'll use a special version of `select()` called
`select_at()`, which conveniently picks variables based on criteria we give it.
The argument `vars(Year, contains("State", ignore.case = FALSE), contains("SEA",
ignore.case = FALSE), contains("male"))` tells R we want to keep any column
whose name has "State" in upper or lower case letters, has "SEA" in the title,
and has "male" in the title. This will result in a newly transformed `all_files`
list that contains six datasets, all with the desired variables.
```{r pick vars function}
# build the function
pick_vars <-
function(df) {
df %>%
select_at(vars(
Year,
contains("State", ignore.case = FALSE),
contains("SEA", ignore.case = FALSE),
contains("male")
))
}
# use the function with `all_files`
all_files <-
all_files %>%
map(pick_vars)
```
### Combine Six Datasets into One
Now we'll turn our attention to combining the datasets in our list `all_files`
into one. We'll use `bind_rows()`, which combines datasets by adding each one to
the bottom of the one before it. The first step is to check and see if our
datasets have the same number of variables and the same variable names. When we
use `names()` on our list of newly changed datasets, we see that each dataset's
variable names are the same:
```{r verify var names}
# check variable names
all_files %>%
map(names)
```
That means that we can combine all six datasets into one using `bind_rows()`.
We'll call this newly combined dataset `child_counts`:
```{r combine datasets 2}
child_counts <-
all_files %>%
# combine all datasets in `all_files`
bind_rows()
```
Since we know the following, we can conclude that all our rows
combined together correctly:
1. Each of our six datasets had eight variables
2. Our combined dataset also has eight variables
But let's use `str()` to verify:
```{r check structure}
str(child_counts)
```
### Importing the data from the {dataedu} package
If you would like to load this processed dataset (`child_counts`), then
you can run the following code to load it directly from the {dataedu} package:
```{r, eval = FALSE}
longitudinal_data <- dataedu::longitudinal_data
```
### Filter for the Desired Disabilities and Age Groups
We want to explore gender related variables, but our dataset has additional
aggregate data for other subgroups. For example, we can use `count()` to explore
all the different disability groups in the dataset. Here's the number of times
an `SEA Disability Category` appears in the dataset:
```{r count disabilities}
child_counts %>%
# count number of times the category appears in the dataset
count(`SEA Disability Category`)
```
Since we will be visualizing and modeling gender variables for all students in
the dataset, we'll filter out all subgroups except "All Disabilities" and the
age totals:
```{r filter child_counts}
child_counts <-
child_counts %>%
filter(
# filter all but the All Disabilities category
`SEA Disability Category` == "All Disabilities",
# filter all but the age totals
`SEA Education Environment` %in% c("Total, Age 3-5", "Total, Age 6-21")
)
```
### Rename the Variables
In the next section we'll prepare the dataset for visualization and modeling by
"tidying" it. When we write code to transform datasets, we'll be typing the
column names a lot, so it's useful to change them to ones with more convenient
names.
```{r rename vars}
child_counts <-
child_counts %>%
rename(
# change these columns to more convenient names
year = Year,
state = "State Name",
age = "SEA Education Environment",
disability = "SEA Disability Category",
f_3_5 = "Female Age 3 to 5",
m_3_5 = "Male Age 3 to 5",
f_6_21 = "Female Age 6 to 21",
m_6_21 = "Male Age 6 to 21"
)
```
### Clean State Names
You might have noticed that some state names in our dataset are in upper case
letters and some are in lower case letters:
```{r count states}
child_counts %>%
count(state) %>%
head()
```
If we leave it like this, R will treat state values like "CALIFORNIA" and
"California" as two different states. We can use `mutate()` and `tolower()` to
transform all the state names to lowercase letters.
```{r fix caps}
child_counts <-
child_counts %>%
mutate(state = tolower(state))
```
### Tidy the Dataset
Visualizing and modeling our data will be much easier if our dataset is in a
"tidy" format. *Tidy Data* [@wickham2014] defines
tidy datasets as possessing the following characteristics:
> 1. Each variable forms a column.
> 2. Each observation forms a row.
> 3. Each type of observational unit forms a table.
*A note on the gender variable in this dataset*
This dataset uses a binary approach to data collection about gender. Students
are described as either male or female. The need for an inclusive approach to
documenting gender identity is discussed in a paper by @park2016 of The Williams
Institute at UCLA.
The gender variables in our dataset are spread across four columns, with each
one representing a combination of gender and age range. We can use
`pivot_longer()` to bring the gender variable into one column. In this
transformation, we create two new columns: a `gender` column and a `total`
column. The `total` column will contain the number of students in each row's
gender and age category.
```{r pivot_longer gender col}
child_counts <-
child_counts %>%
pivot_longer(cols = f_3_5:m_6_21,
names_to = "gender",
values_to = "total")
```
To make the values of the `gender` column more intuitive, we'll use
`case_when()` to transform the values to either "f" or "m":
```{r replace gender values}
child_counts <-
child_counts %>%
mutate(
gender = case_when(
gender == "f_3_5" ~ "f",
gender == "m_3_5" ~ "m",
gender == "f_6_21" ~ "f",
gender == "m_6_21" ~ "m",
TRUE ~ as.character(gender)
)
)
```
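As a quick sanity check (not part of the original walkthrough), we can count the values of the new `gender` column; after the recoding it should contain only "f" and "m":

```{r, eval = FALSE}
# Verify that only the recoded values remain
child_counts %>%
  count(gender)
```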
### Convert Data Types
The values in the `total` column represent the number of students from a
specific year, state, gender, and age group. We know from the `chr` under their
variable names that R is treating these values like characters instead of
numbers. While R does a decent job of treating numbers like numbers when needed,
it's much safer to prepare the dataset by changing these character columns to
numeric columns. We'll use `mutate()` to change the count columns.
```{r convert to numeric}
child_counts <-
child_counts %>%
mutate(total = as.numeric(total))
child_counts
```
Converting these count columns from character classes to number classes resulted
in two changes. First, the `chr` under these variable names has now changed to
`dbl`, short for "double-precision". This lets us know that R recognizes these
values as numbers with decimal points. Second, the blank values changed to `NA`.
When R sees a character class value like `"4"`, it knows to change it to numeric
class `4`. But there is no obvious number represented by a value like `""` or
`-`, so it changes it to `NA`:
```{r demo as.numeric}
# Convert a character to a number
as.numeric("4")
# Convert a blank character or a symbol to a number
as.numeric("")
as.numeric("-")
```
Similarly, the variable `year` needs to be changed from the character format to
the date format. Doing so will make sure R treats this variable like a point in
time when we plot our dataset. The package {lubridate} has a handy function
called `ymd()` that can help us. We just have to use the `truncated` argument to
let R know we don't have a month and day to convert.
```{r convert date}
child_counts <-
child_counts %>%
mutate(year = ymd(year, truncated = 2))
```
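To see what `truncated = 2` does, here's a small standalone demonstration (separate from our pipeline). When the month and day are missing, {lubridate} fills them in as January 1:

```{r demo ymd truncated}
library(lubridate)

# truncated = 2 tells ymd() that up to two components (month and day)
# may be missing; "2012" parses to the date 2012-01-01
ymd("2012", truncated = 2)
```

Without the `truncated` argument, `ymd()` can't parse a bare year and would return `NA`.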
### Explore and Address NAs
You'll notice that some rows in the `total` column contain an `NA`. When we used
`pivot_longer()` to create a `gender` column, R created unique rows for every
year, state, age, disability, and gender combination. Since the original dataset
had both gender and age range stored in a single column like `Female Age 3 to 5`, R
made rows where the `total` value is `NA`. For example, there is no student count
for the `age` value "Total, Age 3-5" that also has the `gender` value for female
students who were age 6-21. You can see this more clearly by sorting the dataset
by year, state, and gender.
In our Foundational Skills chapter, we introduced a {dplyr} function called
`arrange()` to sort the rows of a dataset by the values in a column. Let's use
`arrange()` here to sort the dataset by the `year`, `state`, and `gender`
columns. When you pass `arrange()` a variable, it will sort by the order of the
values in that variable. If you pass it multiple variables, `arrange()` will
sort by the first variable, then by the second, and so on. Let's see what it
does on `child_counts` when we pass it the `year`, `state`, and `gender`
variables:
```{r show NAs}
child_counts %>%
arrange(year, state, gender)
```
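Before moving on, here's a quick illustration of how multi-column sorting works, using a small made-up tibble rather than our dataset:

```{r demo arrange order}
library(dplyr)

toy <- tibble(
  state = c("wi", "al", "wi", "al"),
  gender = c("m", "m", "f", "f")
)

# Sorted by state first, then by gender within each state
toy %>%
  arrange(state, gender)
```

The rows for "al" come first, and within each state the "f" row comes before the "m" row.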
We can simplify our dataset by removing the rows with NAs, leaving us with one
row for each category:
- females age 3-5
- females age 6-21
- males age 3-5
- males age 6-21
Each of these categories will be associated with a state and reporting year:
```{r remove NAs}
child_counts <-
child_counts %>%
filter(!is.na(total))
```
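Under the hood, `is.na()` returns `TRUE` for missing values, and the `!` operator negates that, so `filter(!is.na(total))` keeps only the rows where `total` is present:

```{r demo is.na}
# TRUE marks the values filter() would keep; the NA in the middle is dropped
!is.na(c(4, NA, 7))
```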
We can verify we have the categories we want by sorting again:
```{r verify rows}
child_counts %>%
arrange(year, state, gender)
```
## Analysis
In the last section we focused on importing our dataset. In this section, we
will ask: how have child counts changed over time? First, we'll use
visualization to explore the number of students in special education over time.
In particular, we'll compare the count of male and female students. Next, we'll
use what we learn from our visualizations to quantify any differences that we
see.
### Visualize the Dataset
Showing this many states in a plot can be overwhelming, so to start we'll make a
subset of the dataset. We can use a function in the {dplyr} package called
`top_n()` to help us learn which states have the highest mean count of students
with disabilities:
```{r filter dataset with top_n}
child_counts %>%
group_by(state) %>%
summarize(mean_count = mean(total)) %>%
# which six states have the highest mean count of students with disabilities
top_n(6, mean_count)
```
These six states have the highest mean count of students in special education
over the six years we are examining. For reasons we will see in a later
visualization, we are going to exclude outlying areas and freely associated
states. That leaves us with five states: California, Florida, New York,
Pennsylvania, and Texas. We can remove all other states but these by using
`filter()`. We'll call this new dataset `high_count`:
```{r filter most populated}
high_count <-
child_counts %>%
filter(state %in% c("california", "florida", "new york", "pennsylvania", "texas"))
```
Now we can use `high_count` to do some initial exploration. Our analysis is
about comparing counts of male and female students in special education, but
visualization is also a great way to explore related curiosities. You may
surprise yourself with what you find when visualizing your datasets. You might
come up with more interesting hypotheses, find that your initial hypothesis
requires more data transformation, or find interesting subsets of the data--we
saw a little of that in the surprisingly high `mean_count` of freely associated
states in the `state` column. Let your curiosity and intuition drive this part
of the analysis. It's one of the activities that makes data analysis a creative
process.
In that spirit, we'll start by visualizing specific genders and age groups. Feel
free to try these, but also try the other student groups for practice and more
exploration.
Start by copying and running this code in your console to see what it does:
```{r fig10-1, results = 'hide', message = FALSE, warning = F, fig.cap = "Count of Female Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
filter(gender == "f", age == "Total, Age 6-21") %>%
ggplot(aes(x = year, y = total, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Count of Female Students in Special Education Over Time",
subtitle = "Ages 6-21") +
scale_color_dataedu() +
theme_dataedu()
```
That gives us a plot with the years on the *x*-axis and a count of female
students on the *y*-axis. Each line takes a different color based on the state it
represents.
Let's look at that more closely. We used `filter()` to subset our dataset for students
who are female and ages 6 to 21. We used `aes()` to connect visual elements of our
plot to our data. We connected the *x*-axis to `year`, the *y*-axis to `total`, and
the color of the line to `state`.
It's worth calling out one more thing, since it's a technique we'll be using as
we explore further. Note here that, instead of storing our new dataset in a new
variable, we filter the dataset then use the pipe operator `%>%` to feed it to
{ggplot2}. Since we're exploring freely, we don't need to create a lot of new
variables we probably won't need later.
We can also try the same plot, but subsetting for male students instead. We can
use the same code we used for the last plot, but filter for the value "m" in the
`gender` field:
```{r fig10-2, results = 'hide', message = FALSE, fig.cap="Count of Male Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
filter(gender == "m", age == "Total, Age 6-21") %>%
ggplot(aes(x = year, y = total, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Count of Male Students in Special Education Over Time",
subtitle = "Ages 6-21") +
scale_color_dataedu() +
theme_dataedu()
```
We've looked at each gender separately. What do these lines look like if we
visualize the total number of students each year in each state? To do that, we'll
need to add both gender values together and both age group values together.
We'll do this using a very common combination of functions: `group_by()` and
`summarize()`.
```{r fig10-3, results = 'hide', message = FALSE, fig.cap="Total Count of Students in Special Education Over Time", fig.showtext = TRUE}
high_count %>%
group_by(year, state) %>%
summarize(n = sum(total)) %>%
ggplot(aes(x = year, y = n, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
labs(title = "Total Count of Students in Special Education Over Time",
subtitle = "Ages 3-21") +
scale_color_dataedu() +
theme_dataedu()
```
So far we've looked at a few ways to count students over time. In each plot, we
see that while counts have grown overall for all states, each state has
different sized populations. Let's see if we can summarize that difference by
looking at the median student count for each state over the years:
```{r fig10-4, results = 'hide', message = FALSE, fig.cap="Median Students with Disabilities Count", fig.showtext = TRUE}
high_count %>%
group_by(year, state) %>%
summarize(n = sum(total)) %>%
ggplot(aes(x = state, y = n)) +
geom_boxplot(fill = dataedu_colors("yellow")) +
labs(title = "Median Students with Disabilities Count",
subtitle = "All ages and genders, 2012-2017") +
theme_dataedu()
```
The boxplots show us what we might have expected from our `freqpoly` plots
above: California has the highest median student count over time, and
Pennsylvania has the lowest.
What have we learned about our data so far? The five states in the US with the
highest total student counts (not including outlying areas and freely associated
states) do not have similar counts to each other. The student counts for each
state also appear to have grown over time.
But how can we start comparing the male student count to the female student
count? One way is to use a *ratio*: the number of times the first number contains
the second. For example, if Variable A is equal to 14 and Variable B is equal to 7,
the ratio of Variable A to Variable B is 2.00, indicating that the first number is
twice the second.
We can take the count of male students in each state and divide it by the count
of female students. The result is the number of times male students are in
special education more or less than female students in the same state and
year. Our coding strategy will be to:
- Use `pivot_wider()` to create separate columns for male and female students.
- Use `mutate()` to create a new variable called `ratio`. The values in this
column will be the result of dividing the count of male students by the
count of female students.
Note here that we can also accomplish this comparison by dividing the number of
female students by the number of male students. In this case, the result would
be the number of times female students are in special education more or less
than male students.
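To make the reshaping step concrete before applying it to our data, here's a sketch of the same strategy on a small made-up tibble:

```{r demo pivot_wider ratio}
library(dplyr)
library(tidyr)

# Made-up counts: one row per state and gender combination
toy <- tibble(
  state = c("al", "al", "wi", "wi"),
  gender = c("f", "m", "f", "m"),
  total = c(7, 14, 10, 18)
)

toy %>%
  # Spread the gender values into separate f and m columns
  pivot_wider(names_from = gender, values_from = total) %>%
  # Divide the male count by the female count
  mutate(ratio = m / f)
```

The first state has a ratio of 2, meaning its made-up male count is twice its female count.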
```{r fig10-5, results = 'hide', message = FALSE, fig.cap="Male Student to Female Student Ratio Over Time", fig.showtext = TRUE}
high_count %>%
group_by(year, state, gender) %>%
summarize(total = sum(total)) %>%
# Create new columns for male and female student counts
pivot_wider(names_from = gender,
values_from = total) %>%
# Create a new ratio column
mutate(ratio = m / f) %>%
ggplot(aes(x = year, y = ratio, color = state)) +
geom_freqpoly(stat = "identity", size = 1) +
scale_y_continuous(limits = c(1.5, 2.5)) +
labs(title = "Male Student to Female Student Ratio Over Time",
subtitle = "Ages 6-21") +
scale_color_dataedu() +
theme_dataedu()
```
By visually inspecting, we can hypothesize that there was no significant change
in the male to female ratio between the years 2012 and 2017. But very often we
want to understand the underlying properties of our education dataset. We can do
this by quantifying the relationship between two variables. In the next section,
we'll explore ways to quantify the relationship between male student counts and
female student counts.
### Model the Dataset
When you visualize your datasets, you are exploring possible relationships
between variables. But sometimes visualizations can be misleading because of the
way we perceive graphics. In his book *Data Visualization: A Practical
Introduction*, @healy2019 teaches us that
> Visualizations encode numbers in lines, shapes, and colors. That means that
> our interpretation of these encodings is partly conditional on how we perceive
> geometric shapes and relationships generally.
What are some ways we can combat these errors of perception and at the same time
draw substantive conclusions about our education dataset? When you spot a
possible relationship between variables, the relationship between female and
male counts for example, you'll want to quantify that relationship by fitting a
statistical model. Practically speaking, this means you are selecting a
distribution that represents your dataset reasonably well. This distribution
will help you quantify and predict relationships between variables. This is an
important step in the analytic process, because it acts as a check on what you
saw in your exploratory visualizations.
In this example, we'll follow our intuition about the relationship between male
and female student counts in our special education dataset. In particular, we'll
test the hypothesis that this ratio has decreased over the years. Fitting a
linear regression model that estimates the year as a predictor of the male to
female ratio will help us do just that.
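In base R, such a model is fit with `lm()`. As a quick sketch using made-up ratios (our real model will use the actual dataset), the `year` coefficient estimates the yearly change in the ratio:

```{r demo lm sketch}
# Made-up male-to-female ratios for six years
toy <- data.frame(
  year = 2012:2017,
  ratio = c(2.05, 2.03, 2.02, 2.00, 1.99, 1.97)
)

# Model the ratio as a linear function of year
model <- lm(ratio ~ year, data = toy)
coef(model)
```

A negative `year` coefficient would be consistent with the hypothesis that the ratio has decreased over time.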
In the context of modeling the dataset, we note that there are techniques available (other than a linear regression model)
for longitudinal analyses. These account for the way that individual
data points over time can be modeled as grouped within units (such as individual students).
Such approaches, including structural equation models [@grimm2016growth]
and multi-level models [@west2014linear], are especially helpful for analyzing patterns of change over time--and what predicts those patterns. Both of the references cited above include R code for carrying out such analyses.
*Do we have enough information for our model?*
At the start of this section, we chose to exclude outlying areas and freely
associated states. The next visualization suggests that some states have a
child count so high that it leaves a gap in the *x*-axis values. This can be
problematic when we try to interpret our model later. Here's a plot of female
students compared to male students. Note that the relationship appears linear,
but there is a large gap in the distribution of female student counts somewhere
between the values of 250,000 and 1,750,000:
```{r fig10-6, results='hide', message = FALSE, fig.cap="Comparison of Female Students to Male Students in Special Education", fig.showtext = TRUE}
child_counts %>%
filter(age == "Total, Age 6-21") %>%
pivot_wider(names_from = gender,
values_from = total) %>%
ggplot(aes(x = f, y = m)) +