-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathsol.qmd
261 lines (177 loc) · 6.91 KB
/
sol.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
---
title: "Homework 6 (Optional)"
author: "[Solutions]{style='background-color: yellow;'}"
toc: true
title-block-banner: true
title-block-style: default
execute:
freeze: true
cache: true
# format:
html: # comment this line to get pdf
pdf:
fig-width: 7
fig-height: 7
---
[Link to the Github repository](https://github.com/psu-stat380/hw-6)
---
::: {.callout-important style="font-size: 0.8em;"}
## Due: Wed, May 3, 2023 @ 11:59pm
Please read the instructions carefully before submitting your assignment.
1. This assignment requires you to only upload a `PDF` file on Canvas
1. Don't collapse any code cells before submitting.
1. Remember to make sure all your code output is rendered properly before uploading your submission.
⚠️ Please add your name to the author information in the frontmatter before submitting your assignment ⚠️
:::
In this assignment, we will perform various tasks involving principal component analysis (PCA), principal component regression, and dimensionality reduction.
We will need the following packages:
```{R, message=FALSE, warning=FALSE, results='hide'}
packages <- c(
"tibble",
"dplyr",
"readr",
"tidyr",
"purrr",
"broom",
"magrittr",
"corrplot",
"report",
"car"
)
# renv::install(packages)
sapply(packages, require, character.only=T)
```
<br><br><br><br>
---
## Question 1
::: {.callout-tip}
## 70 points
Principal component anlaysis and variable selection
:::
###### 1.1 (5 points)
The `data` folder contains a `spending.csv` dataset which is an illustrative sample of monthly spending data for a group of $5000$ people across a variety of categories. The response variable, `income`, is their monthly income, and objective is to predict the `income` for a an individual based on their spending patterns.
Read the data file as a tibble in R. Preprocess the data such that:
1. the variables are of the right data type, e.g., categorical variables are encoded as factors
2. all column names to lower case for consistency
3. Any observations with missing values are dropped
```{R}
path <- "data/spending.csv"
df <- read_csv(path) %>%
na.omit()
df %>% head() %>% knitr::kable()
```
---
###### 1.2 (5 points)
Visualize the correlation between the variables using the `corrplot()` function. What do you observe? What does this mean for the model?
```{R}
df %>%
keep(is.numeric) %>%
cor() %>%
corrplot()
```
---
###### 1.3 (5 points)
Run a linear regression model to predict the `income` variable using the remaining predictors. Interpret the coefficients and summarize your results.
```{R}
model1 <- lm(income ~ ., df)
model1 %>% summary()
```
> For a given predictor, the coefficient represents the change in the response variable for a unit change in the predictor, holding all other predictors constant. For example, the coefficient for `groceries` is $0.07$. This means that for a unit increase in `groceries` expenditure, the expected `income` increases by $0.07$, _holding all other predictors constant_.
---
###### 1.3 (5 points)
Diagnose the model using the `vif()` function. What do you observe? What does this mean for the model?
```{R}
model1 %>% vif()
```
> The coefficients and their significance in `model1` are not reliable because of the high multicollinearity between the predictors. This is evident from the high VIF values for the predictors.
---
###### 1.4 (5 points)
Perform PCA using the `princomp` function in R. Print the summary of the PCA object.
```{R}
pca <- df %>%
select(-income) %>%
princomp(cor = T)
summary(pca)
```
---
###### 1.5 (5 points)
Make a screeplot of the proportion of variance explained by each principal component. How many principal components would you choose to keep? Why?
```{R}
plot(pca, type="l")
```
> I would choose to keep the first 4 principal components because they explain $99.8\%$ of the variance and correspond to the "elbow" in the screeplot.
---
###### 1.6 (5 points)
By setting any factor loadings below $0.2$ to $0$, summarize the factor loadings for the principal components that you chose to keep.
```{R}
clean_loadings <- ifelse(
abs(pca$loadings[, 1:4]) < 0.1,
0,
round(pca$loadings[, 1:4], 2)
) %>%
as.data.frame()
```
Visualize the factor loadings.
```{R}
clean_loadings %>%
as.matrix() %>%
t() %>%
corrplot()
```
---
###### 1.7 (15 points)
Based on the factor loadings, what do you think the principal components represent?
Provide an interpreation for each principal component you chose to keep.
> By choosing 4 latent variables to explain $99.8\%$ of the variance, we can interpret them as follows:
> 1. **Apparel and Accessories Spending:**
>
> This latent variable represents spending on clothing, shoes, accessories, jewelry, and watches.
>
> 2. **Technology Spending:**
>
> This latent variable captures spending on electronics, smartphones, tablets, laptops, desktops, audio equipment, cameras, video games, and software.
>
> 3. **Food and Beverage Spending:**
>
> This latent variable represents spending on groceries, meat, seafood, dairy products, fruits, vegetables, snacks, beverages, alcohol, fast food, restaurant meals, coffee shops, and food delivery.
>
> 4. **Entertainment and Travel Spending:**
>
> This latent variable captures spending on books, magazines, movies, music, streaming services, sports equipment, gym memberships, outdoor activities, travel, accommodation, car rentals, and public transportation.
>
---
###### 1.8 (10 points)
Create a new data frame with the original response variable `income` and the principal components you chose to keep. Call this data frame `df_pca`.
```{R}
df_pca <- predict(pca, df) %>%
as_tibble() %>%
select(1:4) %>%
mutate(income = df[["income"]])
```
Fit a regression model to predict the `income` variable using the principal components you chose to keep. Interpret the coefficients and summarize your results.
```{R}
model2 <- lm(income ~ . , df_pca)
```
Compare the results of the regression model in 1.3 and 1.9. What do you observe? What does this mean for the model?
```{R}
model2 %>% summary()
model2 %>% vif()
```
---
###### 1.10 (10 points)
Based on your interpretation of the principal components from Question 1.7, provide an interpretation of the regression model in Question 1.9.
> The regression model in **Q1.9** is a linear combination of the 4 latent variables we identified in **Q1.7**. The coefficients represent the change in the response variable for a unit change in the latent variable, holding all other latent variables constant. For example, the coefficient for `Comp.1` is $13.33$. This means that for a unit increase in `Apparel and Accessories spending`, the expected `income` increases by $13.33$, _holding all other latent variables constant_.
---
:::{.hidden unless-format="pdf"}
\pagebreak
:::
<br><br><br><br>
<br><br><br><br>
---
::: {.callout-note collapse="true"}
## Session Information
Print your `R` session information using the following command
```{R}
sessionInfo()
```
:::