forked from hadley/r4ds
# Factors
## Introduction
In pandas, categorical variables (factors in R) are variables that have a fixed and known set of possible values. They are also useful when you want to display strings in a non-alphabetical order. For more detail, see the [pandas documentation on categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html).
### Prerequisites
To work with categorical variables, we'll use the __category__ data type in pandas. It supports tools for dealing with categorical variables using a wide range of helper methods.
```{python setup, message = FALSE, cache=FALSE}
import pandas as pd
import altair as alt
import numpy as np
alt.data_transformers.enable('json')
```
## Creating categories
Imagine that you have a variable that records month:
```{python}
x1 = pd.Series(["Dec", "Apr", "Jan", "Mar"])
```
Using a string to record this variable has two problems:
1. There are only twelve possible months, and there's nothing saving you
from typos:
```{python}
x2 = pd.Series(["Dec", "Apr", "Jam", "Mar"])
```
1. It doesn't sort in a useful way:
```{python}
x1.sort_values()
```
You can fix both of these problems with a factor. To create a factor you must start by creating a list of the valid __levels__:
```{python}
month_levels = pd.Series([
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
])
```
Now you can create a factor:
```{python}
y1 = pd.Categorical(x1, categories=month_levels)
y1
y1.sort_values()
```
Any values not in the set of categories will be silently converted to `NaN`:
```{python}
y2 = pd.Categorical(x2, categories=month_levels)
y2
```
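Because the conversion is silent, it can help to check which original values failed to match a category. The following is a minimal sketch of one way to do that; the name `unmatched` is only illustrative and not part of the original example.

```python
import pandas as pd

month_levels = [
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]
x2 = pd.Series(["Dec", "Apr", "Jam", "Mar"])
y2 = pd.Categorical(x2, categories=month_levels)

# Values that were present in the input but became missing after conversion
# are the ones that did not match any category (typos, most likely).
unmatched = x2[x2.notna() & y2.isna()]
print(unmatched.tolist())
```

Checking for unmatched values right after the conversion keeps typos from quietly turning into missing data.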
Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting `categories` to `pd.unique(x)`:
```{python}
f1 = pd.Categorical(x1, categories=pd.unique(x1))
f1
```
If you ever need to access the set of valid levels directly, you can do so with the `categories` attribute (reached through the `.cat` accessor when the data is stored in a Series):
```{python}
pd.Series(f1).cat.categories
```
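As a small sketch of the same idea, the categories can be read either from the `Categorical` itself or through the `.cat` accessor of a Series. The `codes` attribute, shown here as an extra, holds the integer position of each value within the categories.

```python
import pandas as pd

x1 = pd.Series(["Dec", "Apr", "Jan", "Mar"])
f1 = pd.Categorical(x1, categories=pd.unique(x1))

# Directly on the Categorical object
print(f1.categories)

# Through the .cat accessor once it is wrapped in a Series
print(pd.Series(f1).cat.categories)

# Integer codes: the position of each value in the categories
print(f1.codes)
```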
## General Social Survey
For the rest of this chapter, we're going to focus on `gss_cat` data found in the `forcats` R package. It's a sample of data from the [General Social Survey](http://gss.norc.org), which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. The survey has thousands of questions, so in `gss_cat` I've selected a handful that will illustrate some common challenges you'll encounter when working with factors.
```{python, cache = FALSE}
gss_cat = pd.read_csv("https://github.com/byuidatascience/data4python4ds/raw/master/data-raw/gss_cat/gss_cat.csv")
marital_levels = ["No answer", "Never married", "Separated", "Divorced", "Widowed", "Married"]
race_levels = ["Other", "Black", "White", "Not applicable"]
income_levels = ["No answer", "Don't know", "Refused", "$25000 or more", "$20000 - 24999",
"$15000 - 19999", "$10000 - 14999", "$8000 to 9999", "$7000 to 7999", "$6000 to 6999",
"$5000 to 5999", "$4000 to 4999", "$3000 to 3999", "$1000 to 2999", "Lt $1000", "Not applicable"]
party_levels = ["No answer", "Don't know", "Other party", "Strong republican", "Not str republican",
"Ind,near rep", "Independent", "Ind,near dem", "Not str democrat", "Strong democrat"]
religion_levels = ["No answer", "Don't know", "Inter-nondenominational", "Native american", "Christian",
"Orthodox-christian", "Moslem/islam", "Other eastern", "Hinduism", "Buddhism", "Other", "None", "Jewish",
"Catholic", "Protestant", "Not applicable"]
denom_levels = ["No answer", "Don't know", "No denomination", "Other", "Episcopal", "Presbyterian-dk wh",
"Presbyterian, merged", "Other presbyterian", "United pres ch in us", "Presbyterian c in us",
"Lutheran-dk which", "Evangelical luth", "Other lutheran", "Wi evan luth synod", "Lutheran-mo synod",
"Luth ch in america", "Am lutheran", "Methodist-dk which", "Other methodist", "United methodist",
"Afr meth ep zion", "Afr meth episcopal", "Baptist-dk which", "Other baptists", "Southern baptist",
"Nat bapt conv usa", "Nat bapt conv of am", "Am bapt ch in usa", "Am baptist asso", "Not applicable"]
gss_cat = gss_cat.assign(
marital = lambda x: pd.Categorical(x.marital, categories=marital_levels),
race = lambda x: pd.Categorical(x.race, categories=race_levels),
rincome = lambda x: pd.Categorical(x.rincome, categories=income_levels),
partyid = lambda x: pd.Categorical(x.partyid, categories=party_levels),
relig = lambda x: pd.Categorical(x.relig, categories=religion_levels),
denom = lambda x: pd.Categorical(x.denom, categories=denom_levels),
)
```
(You can get more information about the variables using the [data description sheet in data4python4ds](https://github.com/byuidatascience/data4python4ds/blob/master/data.md#a-sample-of-categorical-variables-from-the-general-social-survey).)
When factors are stored in a DataFrame, you can't see their levels so easily. One way to see them is with `value_counts()`, or you can get a high-level summary with `describe()`:
```{python}
gss_cat.race.value_counts()
gss_cat.race.describe()
```
Or with a bar chart:
```{python, cache =FALSE}
chart = (alt.Chart(gss_cat).
encode(alt.X('race'), alt.Y('count()')).
mark_bar().
properties(width = 400))
chart.save("screenshots/altair_cat_1.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_cat_1.png")
```
By default, Altair will drop levels that don't have any values. You can force them to display by setting the scale domain explicitly:
```{python, cache =FALSE}
levels_use = gss_cat.race.cat.categories.to_list()
chart = (alt.Chart(gss_cat).
encode(
x = alt.X('race', scale = alt.Scale(domain = levels_use)),
y = alt.Y('count()')).
mark_bar().
properties(width = 400))
chart.save("screenshots/altair_cat_2.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_cat_2.png")
```
These levels represent valid values that simply did not occur in this dataset. When working with factors, the two most common operations are changing the order of the levels, and changing the values of the levels. Those operations are described in the sections below.
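Unlike the default chart, `value_counts()` on a categorical column keeps levels that never occur and reports them with a count of zero. The sketch below uses a small made-up sample rather than the real `gss_cat` data.

```python
import pandas as pd

race_levels = ["Other", "Black", "White", "Not applicable"]

# Made-up sample; "Not applicable" is a valid level that never occurs.
race = pd.Series(
    pd.Categorical(["White", "Black", "White", "Other"], categories=race_levels)
)

counts = race.value_counts()
print(counts)

# The unused level is still listed, with a count of zero.
print(counts["Not applicable"])
```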
### Exercises
1. Explore the distribution of `rincome` (reported income). What makes the
default bar chart hard to understand? How could you improve the plot?
1. What is the most common `relig` in this survey? What's the most
common `partyid`?
1. Which `relig` does `denom` (denomination) apply to? How can you find
out with a table? How can you find out with a visualisation?
## Modifying factor order
It's often useful to change the order of the factor levels in a visualisation. For example, imagine you want to explore the average number of hours spent watching TV per day across religions:
```{python, cache =FALSE}
relig_summary = gss_cat.groupby('relig').agg(
age = ('age', 'mean'),
tvhours = ('tvhours', 'mean'),
n = ('tvhours', 'size')
).reset_index()
chart = (alt.Chart(relig_summary).
encode(alt.X('tvhours'), alt.Y('relig')).
mark_circle())
chart.save("screenshots/altair_cat_3.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_cat_3.png")
```
It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using the `sort` argument in `alt.Y()`. The `sort` argument uses `-x` to sort with the largest at the top and `x` to sort with the largest at the bottom of the y-axis. For more intricate sorting, use `alt.EncodingSortField()` with the following arguments:
* `field`, the column to use for the sorting.
* `op`, the function you would like to use for the sort.
* Optionally, `order`, allows you to take the values from the `op` argument
function and sort them as `'descending'` or `'ascending'`.
Thus, if we were going to implement more detailed sorting we would use
`alt.EncodingSortField(field = 'tvhours', op = 'sum', order = 'ascending')`. Note that sorting within Altair does not work well for boxplots. There you would need to use `pd.Categorical()` to put the categories in your preferred order.
```{python, cache =FALSE}
chart = (alt.Chart(relig_summary).
encode(alt.X('tvhours'), alt.Y('relig', sort = '-x')).
mark_circle())
chart.save("screenshots/altair_cat_4.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_cat_4.png")
```
Reordering religion makes it much easier to see that people in the "Don't know" category watch much more TV, and people in the "Hinduism" and "Other eastern" categories watch much less.
As you start making more complicated transformations, I'd recommend moving them out of Altair and into a new variable using pandas. For a simple sort like this one, the `sort` argument in the encoding is enough:
```{python, eval = FALSE}
chart = (alt.Chart(relig_summary).
encode(alt.X('tvhours'), alt.Y('relig', sort = '-x')).
mark_circle())
```
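One way to follow the recommendation to move ordering into pandas is sketched below. It assumes a summary table shaped like `relig_summary`, but the numbers are made up for illustration and are not GSS results. The names `order` and `relig_sorted` are also illustrative.

```python
import pandas as pd

# Toy stand-in for relig_summary (made-up values, not survey results).
relig_summary = pd.DataFrame({
    "relig": ["Catholic", "Hinduism", "Protestant", "None"],
    "tvhours": [2.9, 1.9, 3.1, 2.7],
})

# Compute the desired level order once, largest tvhours first.
order = (
    relig_summary.sort_values("tvhours", ascending=False)["relig"].tolist()
)

# Store that order in the column itself as the categories.
relig_sorted = relig_summary.assign(
    relig=lambda d: pd.Categorical(d["relig"], categories=order)
).sort_values("relig")

print(order)
print(relig_sorted)
```

The resulting `order` list can also be passed to the chart, for example as `alt.Y('relig', sort = order)`, since Altair's `sort` argument accepts an explicit list of values.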
What if we create a similar plot looking at how average age varies across reported income level?
```{python, cache =FALSE}
rincome_summary = gss_cat.groupby('rincome').agg(
age = ('age', 'mean'),
tvhours = ('tvhours', 'mean'),
n = ('tvhours', 'size')
).reset_index()
chart = (alt.Chart(rincome_summary).
encode(alt.X('age'), alt.Y('rincome', sort = '-x')).
mark_circle())
chart.save("screenshots/altair_cat_5.png")
```
```{R, echo=FALSE, fig.align="left"}
knitr::include_graphics("screenshots/altair_cat_5.png")
```
Here, arbitrarily reordering the levels isn't a good idea! That's because `rincome` already has a principled order that we shouldn't mess with. Reserve sorting for factors whose levels are arbitrarily ordered.
Why do you think the average age for "Not applicable" is so high?
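When a factor has a principled order like `rincome`, pandas can record that order explicitly with `ordered=True`. The sketch below uses a hypothetical subset of three income levels, listed from lowest to highest, to show that sorting and comparisons then follow the declared order rather than the alphabet.

```python
import pandas as pd

# Hypothetical subset of income levels, lowest to highest.
income_levels = ["Lt $1000", "$1000 to 2999", "$3000 to 3999"]

income = pd.Series(
    pd.Categorical(
        ["$3000 to 3999", "Lt $1000", "$1000 to 2999", "Lt $1000"],
        categories=income_levels,
        ordered=True,
    )
)

# Sorting follows the declared level order.
print(income.sort_values().tolist())

# Ordered categoricals also support min/max and comparisons.
print(income.max())
print((income > "Lt $1000").tolist())
```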
### Exercises
1. There are some suspiciously high numbers in `tvhours`. Is the mean a good
summary?
1. For each factor in `gss_cat` identify whether the order of the levels is
arbitrary or principled.
## Modifying factor levels
In pandas, categories are edited using four primary methods, available through the `.cat` accessor:
- `rename_categories()`: pass a list of the new names, or a dictionary mapping old names to new ones.
- `add_categories()`: the new category names are appended to the end.
- `remove_categories()`: values in removed categories are replaced with `np.nan`.
- `remove_unused_categories()`: drops categories with no values.
You can read more about categories within pandas in the [categorical data documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-data).
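The methods listed above can be sketched on a small made-up Series. The values and new names here are only illustrative; each call returns a new Series and leaves `party` unchanged.

```python
import pandas as pd

party = pd.Series(
    pd.Categorical(
        ["Strong republican", "Independent", "Strong democrat", "Independent"],
        categories=["Strong republican", "Independent",
                    "Strong democrat", "Other party"],
    )
)

# Rename selected categories with a dictionary of old -> new names.
renamed = party.cat.rename_categories({
    "Strong republican": "Republican, strong",
    "Strong democrat": "Democrat, strong",
})

# Append a new, currently unused category.
added = party.cat.add_categories(["No answer"])

# Remove a category; its values become missing.
removed = party.cat.remove_categories(["Independent"])

# Drop categories that never occur ("Other party" here).
trimmed = party.cat.remove_unused_categories()

print(list(renamed.cat.categories))
print(list(added.cat.categories))
print(removed.isna().sum())
print(list(trimmed.cat.categories))
```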
### Exercises
1. How have the proportions of people identifying as Democrat, Republican, and
Independent changed over time?
1. How could you collapse `rincome` into a small set of categories?