## Which Neighbours Affected House Prices in the '90s?
*Authors: Hubert Baniecki, Mateusz Polakowski (Warsaw University of Technology)*
### Abstract
*The house price estimation task has a long history in economics and statistics. Nowadays, both worlds unite to exploit the machine learning approach and thus achieve the best predictive results. The literature offers myriad works discussing the performance-interpretability tradeoff apparent in the modelling of real estate values. In this paper, we propose a solution to this problem: a highly interpretable stacked model that outperforms black-box models. We use it to examine the neighbourhood parameters affecting the median house price in United States regions in 1990.*
### Introduction
Real estate value depends on numerous factors. These may be obvious, like location or interior design, but also less apparent, like the ethnicity and age of the neighbours. Therefore, property price estimation is a demanding job that often requires a lot of experience and market knowledge. Is, or rather was, because nowadays Artificial Intelligence (AI) surpasses humans in this task [@3-7-realestate-ai]. Interested parties increasingly use tools like supervised Machine Learning (ML) models to precisely evaluate the property value and gain a competitive advantage [@3-7-realestate-ml1; @3-7-realestate-ml2; @3-7-realestate-ml3].
The dilemma lies in blindly trusting the predictions given by so-called black-box models. These are ML algorithms that take loads of various real estate data as input and return a house price estimate without giving their reasoning. The complex nature of a black box is its biggest strength and its biggest weakness at the same time. This trait regularly entails high effectiveness but does not allow for interpretation of model outputs [@3-7-multiple-models]. Because of that, specialists interested in supporting their work with automated ML decision-making are more eager to use white-box models like linear regression or decision trees [@3-7-wb-vs-bb]. These usually do not achieve state-of-the-art performance but, instead, provide valuable information about the relationships present in the data through model interpretation.
For many years, houses have been popular properties; thus, they are of particular interest to ordinary people. What exact influence did the demographic characteristics of a house's neighbourhood have on its price in the '90s? Although it was hard to answer such a question years ago, in the absence of current technology [@3-7-history2001], now we can.
In this paper, we perform a case study on actual United States Census data from 1990 [@3-7-housepricesdata] and deliver an interpretable white-box model that estimates the median house price by region. We present multiple approaches to this problem and choose the best model, which achieves performance similar to complex black boxes. Finally, using its interpretable nature, we answer various questions that give new life to this historical data.
### Related Work
The use of ML in the real estate domain is well documented [@3-7-realestate-ai] and is not precisely the topic of this contribution. We relate to works that use Interpretable ML techniques [@3-7-molnar] to interpret model predictions in the house price estimation problem.
The state-of-the-art approach to house price estimation is to combine linear and semi-log regression models with the Hedonic Pricing Method [@3-7-hpm1; @3-7-hpm2], which aims to determine the extent to which environmental or ecosystem factors affect the price of a good. [@3-7-hpm-iml] deliberately seeks to interpret the outcome and provide information about the parameters that affect property value. There are also comparisons between linear white-box and black-box models [@3-7-wb-vs-bb; @3-7-multiple-models] which showcase the performance-interpretability tradeoff [@3-7-gosiewska-lifting].
The nature of the topic entails that the data is place-specific; therefore, some of the studies focus on a single location with the use of geospatial data. A case study on London [@3-7-localarea-network] links the street network community structure with house prices, taking into consideration the topology of the city. In contrast, [@3-7-relative-absolute] uses Oslo city data to explore the differences between relative and absolute location attributes. Applying data like the distance to the nearest shop and transportation is place-agnostic.
One of the newer ideas is to utilize location data from multiple sources in a Multi-Task Learning approach [@3-7-mlt]. It also studies the relationships between the tasks, which gives extensive insight into prediction attributions.
In this paper, we partially aim to enhance the use of regression decision tree models, which were utilized to estimate house prices based on their essential characteristics in [@3-7-houseprices-tree].
### Data
For this case study, we use the *house_8L* dataset crafted from the data collected in 1990 by the United States Census Bureau. Each record stands for a distinct United States region, while the target value is the median house price in that region. The variables are presented in Table \@ref(tab:3-7-dataset).
```{r 3-7-dataset, echo = FALSE}
library(kableExtra)
text_tbl <- data.frame(
  "Original name" = c("price", "P3", "H15.1", "H5.2", "H40.4", "P11.3", "P16.2", "P19.2", "P6.4"),
  "New name" = c("price", "house_n", "avg_room_n", "forsale_h_pct", "forsale_6mplus_h_pct",
                 "age_25_64_pct", "family_2plus_h_pct", "black_h_pct", "asian_p_pct"),
  "Description" = c(
    "median price of the house in the region",
    "total number of households",
    "average number of rooms in owner-occupied Housing Units",
    "percentage of vacant Housing Units which are for sale only",
    "percentage of vacant-for-sale Housing Units vacant for more than 6 months",
    "percentage of people between 25 and 64 years of age",
    "percentage of households with 2 or more persons which are family households",
    "percentage of households with a black Householder",
    "percentage of people who are of Asian or Pacific Islander race"
  ),
  "Median" = c("33100", "505", "5.957", "0.148", "0.500", "0.483", "0.714", "0.003", "0.002"),
  check.names = FALSE
)
kable(text_tbl, align=c('c', 'c', 'c'), caption="Description of variables present in the house_8L dataset.") %>%
kable_styling("bordered", full_width = F) %>%
column_spec(1, width = "10em", bold = F) %>%
column_spec(2, width = "10em", bold = T) %>%
column_spec(4, width = "10em", bold = F) %>%
column_spec(3, width = "30em")
```
Furthermore, we will apply our Methodology (Section \@ref(sec3-7-methodology)) to the corresponding *house_16H* dataset, which has the same target but a different set of variables.
Its more correlated variables of higher variance make it significantly harder to estimate the median house price in a given region.
Such validation will allow us to evaluate our model on a more demanding task.
A comprehensive description of the data used can be found in [@3-7-housepricesdata].
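As an illustration of the variable mapping, the following hypothetical snippet renames the original Census codes to the descriptive names from Table \@ref(tab:3-7-dataset); it assumes a raw data frame `house_8L` whose columns carry the original codes, which may differ from the files actually distributed with the dataset.
```{r, eval = FALSE}
# Hypothetical renaming of the original Census codes to the descriptive names
new_names <- c(price = "price", P3 = "house_n", H15.1 = "avg_room_n",
               H5.2 = "forsale_h_pct", H40.4 = "forsale_6mplus_h_pct",
               P11.3 = "age_25_64_pct", P16.2 = "family_2plus_h_pct",
               P19.2 = "black_h_pct", P6.4 = "asian_p_pct")
names(house_8L) <- new_names[names(house_8L)]
```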
### Methodology {#sec3-7-methodology}
In this section, we focus on developing the best white-box model, one that provides interpretability of its features. Throughout this case study, we use the Mean Absolute Error (MAE) measure to evaluate model performance, because we later focus on the residuals, and the mean of the absolute residuals is the easiest to interpret.
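For reference, this is a minimal sketch of the measure; the helper name `mae` is ours and does not come from any package.
```{r, eval = FALSE}
# Mean Absolute Error: the mean of the absolute residuals
mae <- function(observed, predicted) {
  mean(abs(observed - predicted))
}
```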
#### Exploratory Data Analysis
Performing the Exploratory Data Analysis highlighted multiple vital issues with the data. Firstly, Figure \@ref(fig:3-7-eda) presents the immense skewness of the target, which usually makes modelling harder. However, we decided not to transform the price variable because doing so might reduce interpretability in the end. Secondly, there are 46 data points with an unnatural-looking target value. We suspect that the target value of 500001 is artificial, so we removed these outliers. Finally, there are six percentage and two count variables, which indicates that there are not many possibilities for feature engineering.
```{r 3-7-eda, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="(L) A histogram of the target values shows that the distribution is very skewed. (R) An exemplary variable's correlation with the target shows that there are a few data points with the same, unnaturally high target value."}
knitr::include_graphics('images/3-7-eda.png')
```
It is also worth noting that there are no missing values and the dataset has over 22k data points, which allows us to reliably split the data into train and test subsets using a 2:1 ratio.
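A minimal sketch of this preprocessing, assuming the dataset is loaded as a data frame `house_8L` with the renamed variables; the exact split used in the original study may differ.
```{r, eval = FALSE}
# Drop the 46 records with the artificial-looking capped target value
house <- house_8L[house_8L$price != 500001, ]

# Reproducible 2:1 train/test split
set.seed(1990)
train_idx <- sample(nrow(house), size = round(2 / 3 * nrow(house)))
train <- house[train_idx, ]
test  <- house[-train_idx, ]
```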
#### SAFE
Our first approach was to use the SAFE [@gosiewska2019safe] technique to engineer new features and produce a linear regression model. We trained a well-performing black-box *ranger* [@ranger] model and extracted new interpretable features using its Partial Dependence Profiles [@pdp]. Then we used these features to craft a new linear model, which was indeed about 10% better than the baseline linear model. It is worth noting that both of these linear models struggled because of the target skewness.
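A hedged sketch of this workflow using the DALEX and rSAFE packages; the rSAFE function names and arguments are assumptions based on the package's documented interface and may differ between versions.
```{r, eval = FALSE}
library(ranger)
library(DALEX)
library(rSAFE)

# Black-box surrogate: a random forest predicting the median house price
rf <- ranger(price ~ ., data = train)

# Explainer used to compute the profiles that SAFE segments into new features
explainer <- explain(rf,
                     data = train[, setdiff(names(train), "price")],
                     y = train$price,
                     label = "ranger",
                     verbose = FALSE)

# Extract interpretable transformations and fit a linear model on them (assumed API)
safe_extractor <- safe_extraction(explainer, verbose = FALSE)
train_safe <- safely_transform_data(safe_extractor, train, verbose = FALSE)
lm_safe <- lm(price ~ ., data = train_safe)
```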
#### Divide-and-conquer
In this section, we present the main contribution of this paper.
The divide-and-conquer idea has many computer science applications, e.g. in sorting algorithms, natural language processing, or parallel computing.
We decided to make use of its core principles in constructing the method for fitting the enhanced white-box model.
The final result is a combination of multiple tree models whose decisions are easily interpretable.
The proposed algorithm, presented in Figure \@ref(fig:3-7-algorithm), is as follows:
1. Divide the target variable with `k` middle points into `k+1` groups.
2. Fit a black-box classifier on the train data which predicts membership in the `i-th` group.
3. Use this classifier to divide the train and test data into `k+1` train and test subsets.
4. For every `i-th` subset, fit a white-box estimator of the target variable on the `i-th` train data.
5. Use the `i-th` estimator to predict the outcome of the `i-th` test data.
```{r 3-7-algorithm, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="Following the steps 1-5 presented in the diagram, the divide-and-conquer algorithm is used to construct an enhanced white-box model. Such a stacked model consists of the black-box classifier and white-box estimators."}
knitr::include_graphics('images/3-7-algorithm.png')
```
The final product is a stacked model with one classifier and `k+1` estimators. The exact models are for the engineer to choose. It is worth noting that an unsupervised clustering method might be used instead of the classification model. A minimal sketch of the procedure is given below.
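The following sketch illustrates the procedure in R under simplifying assumptions: data frames `train` and `test` with the target column `price`, *ranger* as the classifier, and *rpart* as the estimators (the choices made in the next section). It is an illustration of the idea, not the exact implementation used in the study.
```{r, eval = FALSE}
library(ranger)
library(rpart)

fit_stacked_model <- function(train, test, split_points) {
  # Step 1: divide the target with k middle points into k+1 groups
  breaks <- c(-Inf, split_points, Inf)
  train$group <- factor(cut(train$price, breaks = breaks, labels = FALSE))

  # Step 2: fit a black-box classifier predicting group membership from features
  clf_data   <- train[, setdiff(names(train), "price")]
  classifier <- ranger(group ~ ., data = clf_data)

  # Step 3: use the classifier to assign train and test points to subsets
  train_group <- predict(classifier, train)$predictions
  test_group  <- predict(classifier, test)$predictions

  # Steps 4-5: fit a white-box estimator on every train subset and
  # predict on the matching test subset
  test$prediction <- NA_real_
  estimators <- list()
  for (g in levels(train$group)) {
    subset_train    <- train[train_group == g, setdiff(names(train), "group")]
    estimators[[g]] <- rpart(price ~ ., data = subset_train)
    in_g <- test_group == g
    if (any(in_g)) {
      test$prediction[in_g] <- predict(estimators[[g]], test[in_g, ])
    }
  }

  list(classifier = classifier, estimators = estimators, predictions = test$prediction)
}
```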
### Results
#### The stacked model
For the house price task, we chose `k = 1`, and the middle point was arbitrarily chosen as 100k, which divides the data into two groups in about a 10:1 ratio. We used the *ranger* random forest model as the black-box classifier and the *rpart* [@rpart] decision tree model as the white-box estimator.
The *ranger* model used default parameters with `mtry = 3`. The parameters of the *rpart* models, illustrated in the sketch after this list, were:
- `maxdepth = 4` - a low depth ensures the interpretability of the model
- `cp = 0.001` - a lower complexity penalty helps with the skewed target
- `minbucket = 1% of the training data` - fuller tree leaves add up to higher interpretability
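A hedged sketch of this configuration, reusing the `train` data frame from the Data section; treating `minbucket` as 1% of the whole training set (rather than of each subset) is our assumption.
```{r, eval = FALSE}
library(ranger)
library(rpart)

# Black-box classifier: does a region belong to the "expensive" group (> 100k)?
train$expensive <- factor(train$price > 100000)
clf_features <- setdiff(names(train), "price")
classifier <- ranger(expensive ~ ., data = train[, clf_features], mtry = 3)

# White-box estimators: shallow decision trees, one per classifier-assigned group
controls <- rpart.control(maxdepth  = 4,
                          cp        = 0.001,
                          minbucket = ceiling(0.01 * nrow(train)))  # 1% of the training data

group <- predict(classifier, train)$predictions
tree_cheap <- rpart(price ~ . - expensive,
                    data = train[group == "FALSE", ], control = controls)
tree_rich  <- rpart(price ~ . - expensive,
                    data = train[group == "TRUE", ], control = controls)
```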
Figure \@ref(fig:3-7-tree1) depicts the tree that estimates cheaper houses, while Figure \@ref(fig:3-7-tree2) presents the tree that estimates more expensive houses.
```{r 3-7-tree1, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="The tree that estimates cheaper houses. Part of the stacked model."}
knitr::include_graphics('images/3-7-tree-cheap.svg')
```
```{r 3-7-tree2, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="The tree that estimates more expensive houses. Part of the stacked model."}
knitr::include_graphics('images/3-7-tree-rich.svg')
```
Interpreting the stacked model presented in Figures \@ref(fig:3-7-tree1) & \@ref(fig:3-7-tree2) leads to multiple conclusions. Firstly, we can observe the noticeable impact of features like the total number of households or the average number of rooms on the median price of the house in the region, which is consistent with basic intuition. It is also evident that the higher the percentage of people between 25 and 64 years of age, the higher the prices.
Finally, we can observe the impact of critical features. The percentage of people who are of Asian or Pacific Islander race divides the prices in the opposite direction to the percentage of households with a black Householder. The corresponding tree splits showcase which neighbours, and in what manner, affected house prices in the '90s. Whether this is correlation or causation is a valid debate that could be investigated further.
#### Comparison of the residuals
In this section, we compare our stacked model with baseline *ranger* and *rpart* models, respectively referred to as black-box and white-box. Our solution achieves competitive performance with interpretable features.
The main idea behind the divide-and-conquer technique was to minimize the maximum value of the absolute residuals, which ensures that no significant errors will happen. Such an approach also tends to minimize the mean of the absolute residuals (MAE). In Figure \@ref(fig:3-7-boxplot), we can see that the goals mentioned above were indeed met. The stacked model not only has the lowest maximum error value but also the best performance on average, where the red dot highlights the MAE score.
```{r 3-7-boxplot, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="Boxplots of residuals for the stacked model compared to the black-box and white-box models. The plot is split into cheaper and more expensive houses. The red dot highlights the MAE score."}
knitr::include_graphics('images/3-7-boxplot.png')
```
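For illustration, a minimal sketch of how such a comparison could be produced with ggplot2, assuming vectors `pred_stacked`, `pred_blackbox`, and `pred_whitebox` of test-set predictions (hypothetical names).
```{r, eval = FALSE}
library(ggplot2)

# Collect residuals of the three models on the test subset
residuals_df <- data.frame(
  model    = rep(c("stacked", "black-box", "white-box"), each = nrow(test)),
  residual = c(test$price - pred_stacked,
               test$price - pred_blackbox,
               test$price - pred_whitebox)
)

# Boxplots of absolute residuals; the red dot marks the MAE of each model
ggplot(residuals_df, aes(x = model, y = abs(residual))) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "red") +
  labs(x = NULL, y = "|residual|")
```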
Figure \@ref(fig:3-7-density) presents a more in-depth analysis of the model residuals. In the top panel, we can observe that the black-box model has the lowest absolute residual mode (tip of the distribution), but the stacked model lies more in the centre (base of the distribution), which leads to a more even spread of residuals. In the bottom panel, we can observe that the black-box model tends to undervalue house prices, while our model overestimates them. Looking at the height of the tip of the distribution and its shape, we can conclude that the stacked model provides more reliable estimations.
```{r 3-7-density, cache=FALSE, out.width="800", fig.align="center", echo=FALSE, eval = is.html, fig.cap="Density of residuals for the stacked model compared to the black-box and white-box models. The plot is split into cheaper and more expensive houses."}
knitr::include_graphics('images/3-7-density.png')
```
#### Comparison of the scores
Finally, Table \@ref(tab:3-7-table-results) presents a comparison of the MAE scores for all models used in this case study. There are two tasks with different variables, complexity, and correlations. We calculate the scores on the test subsets.
We can see that the linear models performed the worst, although the SAFE approach noticeably lowered the MAE. The decision tree performed better, but not on the more demanding task. Both black-box models did a far better job at house price estimation than the interpretable models. Finally, our stacked model is the champion, with the best performance on both tasks.
```{r 3-7-table-results, echo = FALSE}
library(kableExtra)
text_tbl <- data.frame(
'Model' = c('ranger', "xgboost", 'linear model', "SAFE on ranger", 'rpart', "stacked model"),
'house_8L' = c(14829, 16035, 23057, 21408, 19195, 14605),
'house_16H' = c(15602, 16499, 24051, 22601, 22145, 15273)
)
text_tbl <- text_tbl[c(3,4,5,2,1,6),]
text_tbl$house_8L <- paste0(signif(text_tbl$house_8L, 3)/1000, "k")
text_tbl$house_16H <- paste0(signif(text_tbl$house_16H, 3)/1000, "k")
rownames(text_tbl) <- NULL
kable(text_tbl, align=c('c', 'c', 'c'), caption="Comparison of the MAE scores for all models used, calculated on the test datasets. The light colour highlights white-box models, while the dark colour marks black-box models.") %>%
kable_styling("bordered", full_width = F) %>%
column_spec(1, width = "20em", bold = F) %>%
column_spec(2, width = "10em", bold = F) %>%
column_spec(3, width = "10em", bold = F) %>%
row_spec(6, background = "#D3D3D3", color="black", bold=T) %>%
row_spec(c(4,5), background = "#B8B8B8", color = "black") %>%
row_spec(c(1,2,3 ), background = "#D3D3D3", color = "black") %>%
add_header_above(c(" " = 1, "Dataset (test)" = 2), color = "black", background = "#f8f8f8")
```
### Conclusions
This case study aimed to provide an interpretable machine learning model that achieves state-of-the-art performance on datasets crafted from the data collected in 1990 by the United States Census Bureau. We not only provided such a model but also examined its decision-making process to determine how it estimates the median house price in United States regions. The stacked model prominently showed the impact of neighbours' age and race in predicting the outcome.
We believe that the principles of the divide-and-conquer algorithm can be successfully applied in other domains to mitigate the performance-interpretability tradeoff apparent in machine learning models. In further research, we would like to generalize this approach into a well-defined framework and apply it to several different problems.