-
Notifications
You must be signed in to change notification settings - Fork 2
/
HomeZilla Team Project.Rmd
327 lines (198 loc) · 12.9 KB
/
HomeZilla Team Project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
---
title: "Homezilla Project"
author: "Salil Redkar, Shaoping Huang, Yumeng Zhu, DH Daniel Suh"
date: "10/5/2018"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Question 1
## a.
```{r,echo=FALSE,error==FALSE}
#Loading the libraries
library(tidyverse)
library(readxl)
library(ggplot2)
properties<- read_xlsx("W16604-XLS-ENG.xlsx")
browsing<- read_xlsx("W16604-XLS-ENG.xlsx",sheet = "Browsing Data")
ncol(properties)
ncol(browsing)
nrow(properties)
nrow(browsing)
```
This chunk of code loads the required libraries, and reads the excel file using the "sheet" argument to specify the specific sheet.
## b.
```{r}
# Cleaning Data
properties= properties%>%filter(type == "house")
```
The filter command is used to use the specific properties that are not listed as 'vacant' in the 'type' variable.
# Question 2
## a.
```{r}
unique_customers<-browsing%>%group_by(`Web ID`)%>% summarize(Num_customers=n_distinct(a=`Customer ID`),Num_of_photos=n_distinct(`Photo ID`))
unique_customers
```
## b.
```{r}
properties <- merge(properties, unique_customers)
```
## c.
```{r}
sub_table<-properties%>% group_by(subtype, bedrooms)%>% summarize(supply=n_distinct(`Web ID`),avg_price=mean(price),cust_subtype=sum(Num_customers))
sub_table
sub_supply<-sub_table%>% ggplot(aes(x=subtype, y=supply, fill = as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=supply), vjust=1.1, color = "black", position = position_dodge(0.9), size = 5.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Supply - Subtype",
x = "Subtype",
y = "Supply") +
theme(legend.position = "right")
sub_supply
```
It is evident that 'Single Family Detached' houses with 3 bedrooms have more listings on the website than any other subtype. Condos and townhouses usually have only one or two bedrooms, with 2 bedroom Condos being the next popular category. The detached houses with more than 5 bedrooms do not have as many listings as the rest.
```{r}
bed_price<-sub_table%>% ggplot(aes(x=subtype, y=avg_price, fill = as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=avg_price), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Average Price - Subtype",
x = "Subtype",
y = "Average Price") +
theme(legend.position = "right")
bed_price
```
As can be expected, the prices of the single family houses increase with the number of bedrooms. However, there are some houses with 5 bedrooms that are significantly more expensive than the houses with more bedrooms. This is probabaly due to other factors, such as location of the listing in a more expensive neighbourhood.
```{r}
sub_customers<-sub_table%>% ggplot(aes(x=subtype, y=cust_subtype, fill = as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=cust_subtype), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Total Exterior Photo",
x = "Subtype",
y = "Customer") +
theme(legend.position = "right")
sub_customers
```
The single family detached houses with 3 bedrooms are most attractive to customers. Townhouse as a subtype has the least demand.
As a result, we recommend that HomeZilla should focus more on single family detached houses, especially those that have 2 bedrooms, and condo apartments rather than townhouses.
# Question 3
```{r}
num_photos_subbed<-properties %>% group_by(subtype, bedrooms)%>% summarize(supply=n_distinct(`Web ID`), Numofph = sum(Num_of_photos), Avg_photo=mean(Num_of_photos))
num_photos_subbed
num_photos_subbed %>%ggplot(aes(x=subtype, y=Numofph,fill=as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=Numofph), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Photo - Subtype",
x = "Subtype",
y = "Number of Photo") +
theme(legend.position = "right")
```
As before, this result shows us that the Single family detached houses are popular, with their popularity reflected in the number of photos being higher than any other subtype. Homezilla should ensure more number of photos are posted for single detached houses with 3 bedrooms. We chose to use the total count of the number of photos over the average count, because, the total count allowed us to make a better judgement about the distribution of the photos.
# Question 4
## a.
```{r}
photo_cust_time <- group_by(browsing, `Web ID`, `Customer ID`) %>% summarise(Numofph = n_distinct(`Photo ID`),Totaltime = sum(`Time Viewed`))
photo_cust_time
```
## b.
```{r}
properties<-merge(properties,photo_cust_time)
```
## c.
```{r}
avg_pic_time<- properties%>%group_by(subtype,bedrooms)%>% summarize(Avg_pic_viewed=mean(Numofph),Avg_time_spent=mean(Totaltime),num_cus=n_distinct(`Customer ID`))
avg_pic_time
properties %>% ggplot(aes(x=subtype,y=Totaltime))+geom_point()
avg_pic_time%>% ggplot(aes(x=subtype, y=Avg_pic_viewed,fill=as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=Avg_pic_viewed), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Average picture Viewed - Subtype",
x = "Subtype",
y = "Average picture Viewed") +
theme(legend.position = "right")
avg_pic_time%>% ggplot(aes(x=as.factor(subtype), y=Avg_time_spent,fill=subtype))+
geom_bar(stat="identity",position="dodge") +
scale_x_discrete(breaks=seq(0,10,1)) +
labs(title = "Average time spent - Subtype",
x = "Subtype",
y = "Average time spent") +
theme(legend.position = "right") +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms")
avg_pic_time%>% ggplot(aes(x=subtype, y=num_cus,fill=as.factor(bedrooms)))+
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=num_cus), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Number of Customers - Subtype",
x = "Subtype",
y = "Number of Customers") +
theme(legend.position = "right")
```
The average number of photos seen for each subtype are more or else uniform across board. The higher values for the Single Family Detached houses with bedrooms greater than 5 are a result of the higher number of photos for each listing, and fewer listings of that size. Certain outliers in the townhouse subtype affect the analysis for the average time spent looking at the photos. More analysis in required before eliminating these form the dataset. It is likely that removing these outliers from the dataset will result in the detached houses as the subtype which customers spend more time looking at pictures.
# Question 5
```{r}
photos_perprop<- browsing %>% group_by(`Web ID`, `Photo Tag 1`) %>% summarise(`Distinct Photos`= n_distinct(`Photo ID`))
d5<- spread(photos_perprop, `Photo Tag 1`, `Distinct Photos`)
d5.1<- merge(properties, d5)
a6 <- d5.1 %>%
group_by(subtype, bedrooms) %>%
summarise(Tot_photo_ex = sum(exterior),
Tot_photo_in = sum(interior),
Tot_photo_floor = sum(floor))
a6 %>% ggplot(aes(x = subtype,
y = Tot_photo_ex,
fill = as.factor(bedrooms))) +
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=Tot_photo_ex), vjust=1.1, color = "white", position = position_dodge(0.9), size = 5.5) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Total Exterior Photo",
x = "Subtype",
y = "Total Exterior Photo") +
theme(legend.position = "right")
a6 %>% ggplot(aes(x = subtype,
y = Tot_photo_in,
fill = as.factor(bedrooms))) +
geom_bar(position = position_dodge(), stat = "identity") +
geom_text(aes(label=Tot_photo_in), vjust=1.1, color = "black", position = position_dodge(0.9), size = 3) +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Total Interior Photo",
x = "Subtype",
y = "Total Interior Photo") +
theme(legend.position = "right")
cust_ph_time <- group_by(browsing, `Web ID`, `Customer ID`, `Photo Tag 1`) %>%summarise(Numofph = n_distinct(`Photo ID`),Totaltime = sum(`Time Viewed`))
merged_phtime<-merge(properties, cust_ph_time)
sub_tag_avg <- merged_phtime %>% group_by(subtype,bedrooms, `Photo Tag 1`) %>%
summarise(Numofcus = n_distinct(`Customer ID`),
Avg_pic_viewed=mean(Numofph),
Avg_time_spent_in_sec=mean(Totaltime))
sub_tag_avg %>% ggplot(aes(x=subtype,y=Numofcus,fill=`Photo Tag 1`))+
geom_bar(position = position_dodge(), stat = "identity") +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Number of Customers - Subtype",
x = "Subtype",
y = "Number of Customers") +
theme(legend.position = "right")
sub_tag_avg %>% ggplot(aes(x=as.factor(bedrooms),y=Avg_pic_viewed,fill=`Photo Tag 1`))+
geom_bar(position = position_dodge(), stat = "identity") +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Average picture viewed - Subtype",
x = "Subtype",
y = "Average picture viewed") +
theme(legend.position = "right") +
facet_grid(.~subtype)+scale_x_discrete(breaks=seq(0,10,1))
sub_tag_avg %>% ggplot(aes(x=as.factor(bedrooms),y=Avg_time_spent_in_sec,fill=`Photo Tag 1`))+
geom_bar(position = position_dodge(), stat = "identity") +
scale_fill_brewer(palette = "Paired", name = "Number of Bedrooms") +
labs(title = "Average time spent - Subtype",
x = "Subtype",
y = "Average time spent") +
theme(legend.position = "right") +
facet_grid(.~subtype)+scale_x_discrete(breaks=seq(0,10,1))
```
Based on the total exterior photos that customers have clicked on, it is clear that this echoes the results from our previous analysis. The detached houses with 3 bedrooms have significantly higher numbers of the time that customers spent on exterior photos, compared to other subtype and categories. This is also seen in the time that customers spent on the interior photos of these houses. It is interesting to note that, for condos with 2 bedrooms also have a significant number of customers looking at indoor photos, since condos wwould have fewer exterior photos. HomeZilla should cnocentrate on ensuring that interior photos for condos, and interior and exterior photos for detached houses are more appealling since these are the photos most customers spend time looking at. Surprisingly, the number of customers is higher for the condo apartments, with many of these customers more interested in the interiors. It is evident that customers spend more time looking at interior photos rather than exterior ones, and it would be in Homezilla's best interest to focus on the interior photos. The outliers in the dataset affect the analysis for the the average time spent looking at the photo and they are concentrated in the detached houses with 6 bedrooms.
# Question 6
The most important takeaway from the analysis is that Homezilla's focus should be on their 'Single Family House' listings. The most popular house in terms of supply is of the detached type with 3 bedrooms. These are not just high in supply, but also, have many customers looking at the interiors. We also saw certain houses from this category whose price did not increase linearly with the number of bedrooms. As mentioned before, the location of the houses may play a role in their price. The high number of photos for the the single family detached houses may also play a role in generating interest for this category, and if it is so, then this may be replicated for the other categories as well. With regards to customer browsing data, customers are in geeneral more interested in the housing interiors, and more appealing interior photos will generate more interest in a house. It is intersting to note that many customers viewed the photos for condo apartments, but that did not translate into interest as measured by the time spent looking at the photos. In this category again, the single family detached houses dominated. In conclusion, HomeZilla should expend resources in making sure the single family detached houses and more specifically their interiors, are a focus in determining their future strategy.