-For this post, I decided to hide the code chunks by default to improve legibility. You can click the Show code-button to expand the code.
-
+
+{{< standout bg="#acc8d4" >}}
+For this post, I decided to hide the code chunks by default to improve legibility. You can click on *Show code* to expand the code.
+{{< /standout >}}
### Introduction
I'm at the age where a lot of my peers start thinking about buying a home. It is a major conversation topic around the office, and these conversations are not often filled with optimism and excitement. Oslo is a notoriously expensive city to live in, and the housing market is very tough.
-Of all the worrying events in 2020, Millenials and Gen Z kids ranked financial instability as the second most common driver of stress, the most common being the welfare of their family ([link](https://www2.deloitte.com/global/en/pages/about-deloitte/articles/millennialsurvey.html)). Today I want to dissect one of the possible causes of this *economic anxiety*: **the housing market**. Now, I'm not an economist nor a financial expert. Nonetheless, I believe I can leverage some of my knowledge of data wrangling and analysis to contribute a small piece to this topic. My insight into the Oslo housing market is fairly limited to so far and there's many nuances I don't fully understand yet, but I do think I have some relevant knowledge of the Dutch housing market. So today I'll dive into some data on the Dutch market, and particularly the Amsterdam housing market. There's countless stories about the Amsterdam housing market and how terrible it is for new buyers, similar to the Oslo housing market. However, in my opinion the Amsterdam housing market is already a few steps ahead of the Oslo market, and hopefully the Amsterdam market can offer some warnings about what the Oslo housing market might look like in a few years without some policy corrections.
+{{< sidenote >}}
+This conclusion is based on a survey from Deloitte in 2020 where they surveyed millennials and Gen Z kids both before and after the onset of the pandemic
+{{< /sidenote >}}
+
+Of all the worrying events in 2020, millennials and Gen Z kids ranked financial instability as the second most common driver of stress, the most common being the welfare of their family ([source](https://www2.deloitte.com/content/dam/Deloitte/global/Documents/About-Deloitte/deloitte-2020-millennial-survey.pdf)). Today I want to dissect one of the possible causes of this *economic anxiety*: **the housing market**. Now, I'm not an economist nor a financial expert. Nonetheless, I believe I can leverage some of my knowledge of data wrangling and analysis to contribute a small piece to this topic. My insight into the Oslo housing market is fairly limited so far and there are many nuances I don't fully understand yet, but I do think I have some relevant knowledge of the Dutch housing market. So today I'll dive into some data on the Dutch market, and particularly the Amsterdam housing market. There are countless stories about the Amsterdam housing market and how terrible it is for new buyers, similar to the Oslo housing market. However, in my opinion the Amsterdam housing market is already a few steps ahead of the Oslo market, and hopefully the Amsterdam market can offer some warnings about what the Oslo housing market might look like in a few years without some policy corrections.
This will be a fairly basic data analysis and visualization post. I can't claim that this is an exhaustive list or that I didn't miss some nuances, but I'll give it a try. I collected some data from the Amsterdam Real Estate Association ([Makelaars Vereniging Amsterdam; MVA](https://www.mva.nl)) and statistics from the Central Bureau for Statistics ([Centraal Bureau voor de Statistiek; CBS](https://www.cbs.nl)). As usual, I'll be using `{tidyverse}` *a lot*. I've also recently started using the `{ggtext}` package to manage the text elements in my plots, inspired by Cédric Scherer ([@CedScherer](https://twitter.com/CedScherer)). I'll use the `{gt}` package to spice up some of the tables, and `{patchwork}` for arranging plots. I'll use the `{cbsodataR}` package to download data from the Central Bureau for Statistics ([CBS](https://www.cbs.nl)).
@@ -60,27 +67,46 @@ font_add_google(name = "Nunito Sans", family = "nunito-sans")
showtext_auto()
theme_set(ggthemes::theme_economist(base_family = "nunito-sans") +
- theme(rect = element_rect(fill = "#EBEBEB", color = "transparent"),
- plot.background = element_rect(fill = "#EBEBEB", color = "transparent"),
- panel.background = element_rect(fill = "#EBEBEB", color = "transparent"),
- plot.title = element_textbox(margin = margin(0,0,5,0,"pt")),
- plot.title.position = "plot",
- plot.subtitle = element_textbox(hjust = 0, margin = margin(0,0,15,0,"pt")),
- plot.caption = element_textbox(hjust = 1),
- plot.caption.position = "plot",
- axis.title.y = element_textbox(orientation = "left-rotated", face = "bold",
- margin = margin(0,0,5,0,"pt")),
- axis.text.y = element_text(hjust = 1),
- legend.position = "bottom",
- legend.box = "vertical",
- legend.text = element_text(size = 10)))
+ theme(
+ rect = element_rect(fill = "#EBEBEB", color = "transparent"),
+ plot.background = element_rect(
+ fill = "#EBEBEB",
+ color = "transparent"
+ ),
+ panel.background = element_rect(
+ fill = "#EBEBEB",
+ color = "transparent"
+ ),
+ plot.title = element_textbox(
+ margin = margin(0, 0, 5, 0, "pt")
+ ),
+ plot.title.position = "plot",
+ plot.subtitle = element_textbox(
+ hjust = 0,
+ margin = margin(0, 0, 15, 0, "pt")
+ ),
+ plot.caption = element_textbox(hjust = 1),
+ plot.caption.position = "plot",
+ axis.title.y = element_textbox(
+ orientation = "left-rotated", face = "bold",
+ margin = margin(0, 0, 5, 0, "pt")
+ ),
+ axis.text.y = element_text(hjust = 1),
+ legend.position = "bottom",
+ legend.box = "vertical",
+ legend.text = element_text(size = 10)
+ ))
```
### Getting the data
-The first piece of data I'll use comes from the Amsterdam Real Estate Association. They publish quarterly data on a number of variables about the Amsterdam housing market ([link](https://www.mva.nl/over-de-mva/mva/kwartaalcijfers)), inclusing asking price, final price paid, number of properties put on the market, number of properties sold and a few more back to the first quarter of 2012. *Obviously*, these numbers all come in pdf-format, because the people writing quarterly reports apparently have a *massive hatred* towards people that want to analyze this data. I downloaded the reports, used the online tool [PDFTables](https://pdftables.com) to convert them to Excel, and then stitched the tables together manually. Of course (remember the authors have a *massive hatred* towards us), the formatting of the numbers in the tables weren't consistent between different quarterly reports, so I had to do some cleaning in R. I put each table in a separate sheet and then used functionality from the `{readxl}` package to load each table into a different variable and then do the cleaning per table. This was a bit cumbersome. Since this will probably be a very long post, I'll hide the code, but you can see it in the [Rmarkdown file](https://github.com/danielroelfs/danielroelfs.com/blob/main/content/blog/2020-running-an-ica-on-questionnaires/index.Rmd) (at least until I figure out how to get code folding to work on my laptop).
+{{< sidenote br="10em" >}}
+There are [R packages](https://docs.ropensci.org/tabulizer/) that can parse PDF files, but in my experience they can be clunky. In this case the simplest solution seemed the best, despite requiring some manual work
+{{< /sidenote >}}
+
+The first piece of data I'll use comes from the Amsterdam Real Estate Association. They publish quarterly data on a number of variables about the Amsterdam housing market ([link](https://www.mva.nl/over-de-mva/mva/kwartaalcijfers)), including asking price, final price paid, number of properties put on the market, number of properties sold, and a few more, going back to the first quarter of 2012. *Obviously*, these numbers all come in PDF format, because the people writing quarterly reports apparently have a *massive hatred* towards people that want to analyze this data. I downloaded the reports, used the online tool [PDFTables](https://pdftables.com) to convert them to Excel, and then stitched the tables together manually. Of course (remember the authors have a *massive hatred* towards us), the formatting of the numbers in the tables wasn't consistent between different quarterly reports, so I had to do some cleaning in R. I put each table in a separate sheet and then used functionality from the `{readxl}` package to load each table into a different variable and then do the cleaning per table. This was a bit cumbersome.
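
As a point of comparison, here's a minimal sketch of what the package-based route mentioned in the sidenote might look like, assuming the `{tabulizer}` API; the file name and page number are purely illustrative, and I didn't actually use this approach since the table layouts were inconsistent enough that the manual route was simpler.

``` r
# Hypothetical sketch only: pull a table straight out of one of the
# quarterly PDF reports instead of going through PDFTables + Excel.
# Assumes {tabulizer}, which needs a working Java/rJava setup.
library(tabulizer)

raw_tables <- extract_tables(
  "MVA_kwartaalcijfers_2020_4e_kwartaal.pdf", # illustrative file name
  pages = 2, # page holding the asking-price table (made up)
  output = "data.frame"
)

# each list element is one table detected on the requested page(s)
asking_price_raw <- raw_tables[[1]]
```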
You can look at the code I used to load and merge the files. It's a bit of a mess:
@@ -88,61 +114,135 @@ You can look at the code I used to load and merge the files. It's a bit of a mes
Show code
``` r
-asking_price <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 2) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.double), ~ .x * 1e3),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "asking_price")
-
-transaction_price <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 3) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "transaction_price")
-
-price_per_m2 <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 4) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.double), ~ .x * 1e3),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "price_per_m2")
-
-n_offered <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 5) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "n_offered")
-
-n_sold <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 6) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "n_sold")
-
-mortgage_months <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 7) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "mortgage_months")
-
-tightness_index <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 8) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_replace_all(.x, ",", ".")))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "tightness_index")
-
-data_merged <- inner_join(asking_price, transaction_price) |>
- inner_join(price_per_m2) |>
- inner_join(n_offered) |>
- inner_join(n_sold) |>
- inner_join(mortgage_months) |>
- inner_join(tightness_index) |>
- mutate(asking_price = ifelse(asking_price < 1e5, asking_price * 1e3, asking_price),
- transaction_price = ifelse(transaction_price < 1e5, transaction_price * 1e3, transaction_price),
- price_per_m2 = ifelse(price_per_m2 > 1e4, price_per_m2 / 1e3, price_per_m2))
-
-write_rds(data_merged, "data_merged.rds")
+asking_price <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 2
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(where(is.double), ~ .x * 1e3),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "asking_price"
+ )
+
+transaction_price <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 3
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "transaction_price"
+ )
+
+price_per_m2 <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 4
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(where(is.double), ~ .x * 1e3),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "price_per_m2"
+ )
+
+n_offered <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 5
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "n_offered"
+ )
+
+n_sold <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 6
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "n_sold"
+ )
+
+mortgage_months <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 7
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "mortgage_months"
+ )
+
+tightness_index <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 8
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_replace_all(.x, ",", "."))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "tightness_index"
+ )
+
+data_merged <- inner_join(asking_price, transaction_price) |>
+ inner_join(price_per_m2) |>
+ inner_join(n_offered) |>
+ inner_join(n_sold) |>
+ inner_join(mortgage_months) |>
+ inner_join(tightness_index) |>
+ mutate(
+ asking_price = ifelse(asking_price < 1e5,
+ asking_price * 1e3,
+ asking_price
+ ),
+ transaction_price = ifelse(transaction_price < 1e5,
+ transaction_price * 1e3,
+ transaction_price
+ ),
+ price_per_m2 = ifelse(price_per_m2 > 1e4,
+ price_per_m2 / 1e3,
+ price_per_m2
+ )
+ )
+
+write_rds(data_merged, "./data/data_merged.rds")
```
@@ -153,7 +253,7 @@ Let's have a look at the dataset.
Show code
``` r
-data_merged <- read_rds("data_merged.rds")
+data_merged <- read_rds("./data/data_merged.rds")
glimpse(data_merged)
```
@@ -178,28 +278,34 @@ From this dataset, I want to create a few new variables. I want to create a date
Show code
``` r
-data <- data_merged |>
- rename(type = type_woning) |>
- mutate(quarter = str_extract(date, pattern = "x(.*?)e"),
- quarter = parse_number(quarter),
- year = str_extract(date, pattern = "kw_(.*)"),
- year = parse_number(year),
- date = as.Date(str_glue("{year}-{(quarter * 3)}-01")),
- diff_ask_paid = transaction_price - asking_price,
- diff_ask_paid_perc = diff_ask_paid / asking_price,
- diff_offered_sold = n_offered - n_sold,
- diff_offered_sold_perc = diff_offered_sold / n_offered,
- perc_sold = n_sold / n_offered,
- type = case_when(str_detect(type,"Totaal") ~ "Total",
- str_detect(type,"<= 1970") ~ "Apartments (pre-1970)",
- str_detect(type,"> 1970") ~ "Apartments (post-1970)",
- str_detect(type,"Tussenwoning") ~ "Terraced house",
- str_detect(type,"Hoekwoning") ~ "Corner house",
- str_detect(type,"Vrijstaand") ~ "Detached house",
- str_detect(type,"2-onder-1-kap") ~ "Semi-detached house"),
- type = factor(type, levels = c("Apartments (pre-1970)","Apartments (post-1970)",
- "Terraced house","Corner house","Detached house",
- "Semi-detached house","Total"))) |>
+data <- data_merged |>
+ rename(type = type_woning) |>
+ mutate(
+ quarter = str_extract(date, pattern = "x(.*?)e"),
+ quarter = parse_number(quarter),
+ year = str_extract(date, pattern = "kw_(.*)"),
+ year = parse_number(year),
+ date = as.Date(str_glue("{year}-{(quarter * 3)}-01")),
+ diff_ask_paid = transaction_price - asking_price,
+ diff_ask_paid_perc = diff_ask_paid / asking_price,
+ diff_offered_sold = n_offered - n_sold,
+ diff_offered_sold_perc = diff_offered_sold / n_offered,
+ perc_sold = n_sold / n_offered,
+ type = case_when(
+ str_detect(type, "Totaal") ~ "Total",
+ str_detect(type, "<= 1970") ~ "Apartments (pre-1970)",
+ str_detect(type, "> 1970") ~ "Apartments (post-1970)",
+ str_detect(type, "Tussenwoning") ~ "Terraced house",
+ str_detect(type, "Hoekwoning") ~ "Corner house",
+ str_detect(type, "Vrijstaand") ~ "Detached house",
+ str_detect(type, "2-onder-1-kap") ~ "Semi-detached house"
+ ),
+ type = factor(type, levels = c(
+ "Apartments (pre-1970)", "Apartments (post-1970)",
+ "Terraced house", "Corner house", "Detached house",
+ "Semi-detached house", "Total"
+ ))
+ ) |>
glimpse()
```
@@ -224,39 +330,58 @@ data <- data_merged |>
$ diff_offered_sold_perc 0.848854962, 0.837431694, 0.850767085, 0.786610…
$ perc_sold 0.1511450, 0.1625683, 0.1492329, 0.2133891, 0.1…
-The first thing that seems interesting to do is to plot the percentage difference between the asking price and the price paid. This will give us an indication of the trend in overpaying on different types of properties. I'll use a color palette and legend design I again shamelessly stole from Cédric Scherer ([@CedScherer](https://twitter.com/CedScherer)). I'll use a simple line graph to visualize the percentage overpay.
+{{< sidenote br="1em" >}}
+I'll use a color palette and legend design I shamelessly stole from [Cédric Scherer](https://twitter.com/CedScherer)
+{{< /sidenote >}}
+
+The first thing that seems interesting to do is to plot the percentage difference between the asking price and the price paid. This will give us an indication of the trend in overpaying on different types of properties. Let's use a simple line graph to visualize the percentage overpay.
Show code
``` r
-colors <- c("#019868","#9dd292","#ec0b88","#651eac","#e18a1e","#2b7de5")
-
-data |>
- filter(type != "Total") |>
- group_by(type) |>
- mutate(n_total = sum(n_sold),
- type_label = str_glue("{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"),
- type_label = str_replace(type_label,"\\) \\(", ", ")) |>
- arrange(type) |>
- mutate(type_label = factor(type_label)) |>
- ggplot(aes(x = date, y = diff_ask_paid_perc, color = type_label)) +
+colors <- c("#019868", "#9dd292", "#ec0b88", "#651eac", "#e18a1e", "#2b7de5")
+
+data |>
+ filter(type != "Total") |>
+ group_by(type) |>
+ mutate(
+ n_total = sum(n_sold),
+ type_label = str_glue(
+ "{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"
+ ),
+ type_label = str_replace(type_label, "\\) \\(", ", ")
+ ) |>
+ arrange(type) |>
+ mutate(type_label = factor(type_label)) |>
+ ggplot(aes(x = date, y = diff_ask_paid_perc, color = type_label)) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
- labs(title = "Overbidding has become the new normal",
- subtitle = "_Paying as much as 5% over asking price is common the past few years_",
- x = NULL,
- y = "Percentage difference between\nasking price and price paid",
- color = NULL,
- caption = "_**Data**: MVA_") +
- scale_y_continuous(labels = scales::label_percent()) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- theme(plot.title = element_textbox(size = 20),
- axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
- legend.key = element_rect(fill = "transparent", color = "transparent"))
+ geom_line(
+ size = 1.2, alpha = 0.8,
+ lineend = "round", key_glyph = "point"
+ ) +
+ labs(
+ title = "Overbidding has become the new normal",
+ subtitle = "_Paying as much as 5% over asking price is common the past few years_",
+ x = NULL,
+ y = "Percentage difference between\nasking price and price paid",
+ color = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_y_continuous(labels = scales::label_percent()) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(size = 20),
+ axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
+ legend.key = element_rect(fill = "transparent", color = "transparent")
+ )
```
@@ -269,27 +394,40 @@ Prior to 2014, most properties in Amsterdam were sold at about 6% below asking p
Show code
``` r
-data |>
- filter(type != "Total",
- date == max(date)) |>
- select(type, diff_ask_paid_perc, diff_ask_paid, transaction_price, n_sold) |>
- arrange(-diff_ask_paid_perc) |>
- gt() |>
- cols_align(columns = "type", align = "left") |>
- fmt_percent(columns = "diff_ask_paid_perc") |>
- fmt_currency(columns = "diff_ask_paid", currency = "EUR", decimals = 0, sep_mark = " ") |>
- fmt_currency(columns = "transaction_price", currency = "EUR", decimals = 0, sep_mark = " ") |>
- fmt_number(columns = "n_sold", sep_mark = " ", drop_trailing_zeros = TRUE) |>
+data |>
+ filter(
+ type != "Total",
+ date == max(date)
+ ) |>
+ select(type, diff_ask_paid_perc, diff_ask_paid, transaction_price, n_sold) |>
+ arrange(-diff_ask_paid_perc) |>
+ gt() |>
+ cols_align(columns = "type", align = "left") |>
+ fmt_percent(columns = "diff_ask_paid_perc") |>
+ fmt_currency(
+ columns = "diff_ask_paid", currency = "EUR",
+ decimals = 0, sep_mark = " "
+ ) |>
+ fmt_currency(
+ columns = "transaction_price", currency = "EUR",
+ decimals = 0, sep_mark = " "
+ ) |>
+ fmt_number(
+ columns = "n_sold", sep_mark = " ",
+ drop_trailing_zeros = TRUE
+ ) |>
cols_label(
type = html("
")
+ ) |>
+ tab_source_note(source_note = md("_**Data**: MVA_")) |>
tab_options(table.background.color = "#EBEBEB") |>
gtsave("overpay-table.png", expand = 0)
```
@@ -302,42 +440,69 @@ What contributed to this price increase? A simple supply-and-demand plays a part
Show code
``` r
-data |>
- filter(type != "Total") |>
- group_by(type) |>
- mutate(n_total = sum(n_sold),
- type_label = str_glue("{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"),
- type_label = str_replace(type_label,"\\) \\(", ", ")) |>
- arrange(type) |>
- mutate(type_label = factor(type_label)) |>
- ggplot(aes(x = date, y = tightness_index, color = type_label)) +
- geom_rect(data = tibble(),
- aes(xmin = as.Date(-Inf), xmax = as.Date(Inf), ymin = c(0,5,10), ymax = c(5,10,Inf),
- fill = c("Sellers market", "Balanced market", "Buyers market")),
- color = "transparent", alpha = 0.2, key_glyph = "point", inherit.aes = FALSE) +
+data |>
+ filter(type != "Total") |>
+ group_by(type) |>
+ mutate(
+ n_total = sum(n_sold),
+ type_label = str_glue(
+ "{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"
+ ),
+ type_label = str_replace(type_label, "\\) \\(", ", ")
+ ) |>
+ arrange(type) |>
+ mutate(type_label = factor(type_label)) |>
+ ggplot(aes(x = date, y = tightness_index, color = type_label)) +
+ geom_rect(
+ data = tibble(),
+ aes(
+ xmin = as.Date(-Inf), xmax = as.Date(Inf),
+ ymin = c(0, 5, 10), ymax = c(5, 10, Inf),
+ fill = c("Sellers market", "Balanced market", "Buyers market")
+ ),
+ color = "transparent", alpha = 0.2, key_glyph = "point", inherit.aes = FALSE
+ ) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
- labs(title = "Amsterdam has had a sellers market for nearly 5 years",
- x = NULL,
- y = "Indicator of _\"density\"_ on the housing market",
- color = NULL,
- fill = "Type of market:",
- caption = "_**Data**: MVA_") +
- scale_x_date(expand = c(0,0)) +
- scale_y_continuous(trans = "reverse", expand = c(0,0)) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right", order = 1,
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- scale_fill_manual(values = c("#F5E000","#00B300","#D1000E"),
- limits = c("Buyers market","Balanced market","Sellers market"),
- guide = guide_legend(order = 2, override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0))) +
+ geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
+ labs(
+ title = "Amsterdam has had a sellers market for nearly 5 years",
+ x = NULL,
+ y = "Indicator of _\"density\"_ on the housing market",
+ color = NULL,
+ fill = "Type of market:",
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_x_date(expand = c(0, 0)) +
+ scale_y_continuous(trans = "reverse", expand = c(0, 0)) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right", order = 1,
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ scale_fill_manual(
+ values = c("#F5E000", "#00B300", "#D1000E"),
+ limits = c("Buyers market", "Balanced market", "Sellers market"),
+ guide = guide_legend(
+ order = 2,
+ override.aes = list(
+ shape = 21, size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
coord_cartesian(clip = "off") +
- theme(plot.title = element_textbox(size = 16),
- plot.subtitle = element_textbox(size = 10),
- axis.title.y = element_textbox(orientation = "left-rotated",
- width = grid::unit(2, "in")),
- legend.key = element_rect(fill = "transparent"))
+ theme(
+ plot.title = element_textbox(size = 16),
+ plot.subtitle = element_textbox(size = 10),
+ axis.title.y = element_textbox(
+ orientation = "left-rotated",
+ width = grid::unit(2, "in")
+ ),
+ legend.key = element_rect(fill = "transparent")
+ )
```
@@ -346,36 +511,53 @@ data |>
So, there's a lot of competition among buyers, and people looking to sell their houses can expect to be paid more than they anticipated. Dozens of buyers compete for the same properties, driving up the price. The figure below shows the percentage of properties sold compared to the number of properties offered. It's clear that after the 2008 housing bubble crisis, the housing market was still recovering in 2012 and 2013. However, since 2016, more apartments were sold than were put on the market. This means that the number of properties available for the growing number of people wanting to move to Amsterdam is decreasing. This decreases supply at a time of increasing demand, thus pushing prices higher twice over.
-
-A new phenomenon that entered the scene a little while ago may indicate how skewed the market is. People wanting to buy a house in a certain neighborhood will put notes and letters in the mailboxes of people living in that area saying "Nice house! Are you interested in selling it to me?". This is now not an uncommon strategy to find a house. Some people living in popular neighborshoods are inundated with notes from agencies facilitating these practices. See also a news item by RTL.
-
+{{< standout bg="#acc8d4" >}}
+A new phenomenon that entered the scene a little while ago may indicate how skewed the market is. People wanting to buy a house in a certain neighborhood will put notes and letters in the mailboxes of people living in that area saying *"Nice house! Are you interested in selling it to me?"*. This is now not an uncommon strategy to find a house. Some people living in popular neighborhoods are inundated with notes from agencies facilitating these practices. See also a [news item by RTL](https://www.rtlnieuws.nl/geld-en-werk/artikel/19001/droomhuis-gezien-maar-niet-te-koop-stop-dan-een-briefje-de-bus)
+{{< /standout >}}
+
Show code
``` r
-data |>
- filter(type != "Total") |>
- ggplot(aes(x = date, y = perc_sold, fill = type)) +
- geom_col(key_glyph = "point") +
+data |>
+ filter(type != "Total") |>
+ ggplot(aes(x = date, y = perc_sold, fill = type)) +
+ geom_col(key_glyph = "point") +
geom_hline(yintercept = 1, linetype = "dashed", color = "grey20") +
- labs(title = "Some years, more houses are sold than are being put on the market",
- x = NULL,
- y = "Percentage of offered houses sold",
- fill = NULL,
- caption = "_**Data**: MVA_") +
+ labs(
+ title = "Some years, more houses are sold than are being put on the market",
+ x = NULL,
+ y = "Percentage of offered houses sold",
+ fill = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
coord_cartesian(clip = "off") +
- scale_y_continuous(labels = scales::label_percent(), expand = c(0,NA), n.breaks = 4) +
- scale_fill_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0))) +
- theme(plot.title = element_textbox(size = 20,
- width = grid::unit(6,"in")),
- plot.subtitle = element_textbox(size = 10),
- strip.text.x = element_markdown(face = "bold", padding = margin(10,0,5,0,"pt")),
- strip.background = element_rect(fill = "transparent"),
- legend.key = element_rect(fill = "transparent")) +
- facet_wrap(~ type, strip.position = "top", scales = "free_x")
+ scale_y_continuous(
+ labels = scales::label_percent(),
+ expand = c(0, NA), n.breaks = 4
+ ) +
+ scale_fill_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(
+ size = 20,
+ width = grid::unit(6, "in")
+ ),
+ plot.subtitle = element_textbox(size = 10),
+ strip.text.x = element_markdown(
+ face = "bold",
+ padding = margin(10, 0, 5, 0, "pt")
+ ),
+ strip.background = element_rect(fill = "transparent"),
+ legend.key = element_rect(fill = "transparent")
+ ) +
+ facet_wrap(~type, strip.position = "top", scales = "free_x")
```
@@ -389,55 +571,88 @@ This adds fuel to the fire. I guess I'm trying to show that there are a number o
``` r
decades <- tibble(
- x1 = seq(1920,2020,10),
+ x1 = seq(1920, 2020, 10),
x2 = x1 + 10
-) |>
- slice(seq(1,1e2,2)) |>
+) |>
+ slice(seq(1, 1e2, 2)) |>
filter(x2 < 2030)
-hist_data_homes <- cbsodataR::cbs_get_data("82235NED") |>
- janitor::clean_names() |>
- rename(stock_end = eindstand_voorraad_8) |>
- mutate(year = parse_number(perioden),
- stock_end = stock_end * 1e3,
- diff = stock_end - lag(stock_end))
+hist_data_homes <- cbsodataR::cbs_get_data("82235NED") |>
+ janitor::clean_names() |>
+ rename(stock_end = eindstand_voorraad_8) |>
+ mutate(
+ year = parse_number(perioden),
+ stock_end = stock_end * 1e3,
+ diff = stock_end - lag(stock_end)
+ )
-total_hist_plot <- hist_data_homes |>
+total_hist_plot <- hist_data_homes |>
ggplot(aes(x = year, y = stock_end)) +
- geom_rect(data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
- fill = "red", alpha = 0.2, inherit.aes = FALSE) +
- geom_text(data = tibble(), aes(x = 1942.5, y = max(hist_data_homes$stock_end, na.rm = TRUE), vjust = 0, label = "WWII"), size = 3,
- fontface = "bold", family = "nunito-sans", inherit.aes = FALSE) +
- geom_rect(data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
- fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE) +
+ geom_rect(
+ data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
+ fill = "red", alpha = 0.2, inherit.aes = FALSE
+ ) +
+ geom_text(
+ data = tibble(),
+ aes(x = 1942.5, y = max(hist_data_homes$stock_end, na.rm = TRUE)),
+ label = "WWII", fontface = "bold", family = "nunito-sans",
+ vjust = 0, size = 3, inherit.aes = FALSE
+ ) +
+ geom_rect(
+ data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
+ fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE
+ ) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- labs(x = NULL,
- y = "Total number of homes") +
- scale_x_continuous(expand = c(0,0), breaks = c(decades$x1,decades$x2,2020)) +
+ labs(
+ x = NULL,
+ y = "Total number of homes"
+ ) +
+ scale_x_continuous(
+ expand = c(0, 0),
+ breaks = c(decades$x1, decades$x2, 2020)
+ ) +
scale_y_continuous(labels = scales::label_number()) +
coord_cartesian(clip = "off")
-diff_hist_plot <- hist_data_homes |>
+diff_hist_plot <- hist_data_homes |>
ggplot(aes(x = year, y = diff)) +
geom_hline(yintercept = 0.5, color = "grey30", size = 1) +
- geom_rect(data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
- fill = "red", alpha = 0.2, inherit.aes = FALSE) +
- geom_text(data = tibble(), aes(x = 1942.5, y = max(hist_data_homes$diff, na.rm = TRUE), vjust = 0, label = "WWII"), size = 3,
- fontface = "bold", family = "nunito-sans", inherit.aes = FALSE) +
- geom_rect(data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
- fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE) +
+ geom_rect(
+ data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
+ fill = "red", alpha = 0.2, inherit.aes = FALSE
+ ) +
+ geom_text(
+ data = tibble(), aes(
+ x = 1942.5,
+ y = max(hist_data_homes$diff, na.rm = TRUE),
+ vjust = 0, label = "WWII"
+ ),
+ size = 3, fontface = "bold",
+ family = "nunito-sans", inherit.aes = FALSE
+ ) +
+ geom_rect(
+ data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
+ fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE
+ ) +
geom_line(color = "grey30", alpha = 0.5, size = 1.5, lineend = "round") +
geom_smooth(color = "darkred", size = 1.5, se = FALSE) +
- labs(x = NULL,
- y = "Net homes added per year") +
- scale_x_continuous(expand = c(0,0), breaks = c(decades$x1,decades$x2,2020)) +
+ labs(
+ x = NULL,
+ y = "Net homes added per year"
+ ) +
+ scale_x_continuous(
+ expand = c(0, 0),
+ breaks = c(decades$x1, decades$x2, 2020)
+ ) +
scale_y_continuous(labels = scales::label_number_auto()) +
coord_cartesian(clip = "off")
total_hist_plot / diff_hist_plot +
- plot_annotation(title = "Number of houses available",
- subtitle = "Data covers development in the Netherlands nationwide",
- caption = "**Data**: CBS") &
+ plot_annotation(
+ title = "Number of houses available",
+ subtitle = "Data covers development in the Netherlands nationwide",
+ caption = "**Data**: CBS"
+ ) &
theme(plot.title = element_textbox(size = 20))
```
@@ -449,6 +664,10 @@ The figure displays data from from the CBS through the `{cbsodataR}` package. It
It's also important to mention that between the 60s and 70s, the Netherlands started building entire cities from scratch on newly-reclaimed land from the sea. The province of [Flevoland](https://en.wikipedia.org/wiki/Flevoland) is currently home to about 415 000 people, yet until the early 50s this province was non-existent; the current land was at the bottom of a bay of the North Sea (called the Zuiderzee, or "Southern Sea" in English). Other than a general enthusiasm for building, I think this contributed considerably to the increase in the number of homes added in the 60s and 70s. This new province has good access to Amsterdam, and if it weren't for this new piece of land, the shortage might have peaked earlier.
+{{< sidenote >}}
+There's actually quite a few more, but I'll focus on the ones I can quantify at the moment
+{{< /sidenote >}}
+
But that's not all; there are a few other features that contribute to the gridlock. See, it's not only young people who are on the market to buy apartments in Amsterdam. There's a thriving market for investors looking to take advantage of the rising prices in the Amsterdam housing market (source: [Algemeen Dagblad](https://www.ad.nl/wonen/beleggers-verwerven-massaal-koophuizen-voor-verhuur-in-amsterdam~afedc50c6/)). According to the Dutch central bank, about 1 in 5 properties are sold to investors, who are mostly looking to convert them to rental properties or flip them for a profit. I couldn't find the data the Dutch central bank relied on, but I found something else. The Central Bureau for Statistics collects data on the number of "mutations" among properties. The "mutation" in this case refers to the change of purpose of a property, e.g. if a house meant for sale is bought and then transformed into a rental property by either private individuals or corporations, or vice versa. I collected this data from the governmental yearly report on the housing market of 2020 ("*Staat van de Woningmarkt - Jaarrapportage 2020*", [link](https://www.rijksoverheid.nl/documenten/rapporten/2020/06/15/staat-van-de-woningmarkt-jaarrapportage-2020)). Instead of per quarter, this data is reported per year. Unfortunately, the data on housing mutations in the report (from 2020 mind you) only goes until 2017. It's important to note that these numbers are country-wide, not specific to Amsterdam. That means there could be some other factors in play. Many of these trends are present across the country, but they're massively amplified in the larger cities. The data was contained in a PDF that wasn't easily machine-readable, so I had to manually copy the numbers into a tibble, which was great...
@@ -456,7 +675,8 @@ But that's not all, there's a few other features that contribute to the gridlock
``` r
house_mutations_in <- tribble(
- ~year, ~buy_to_rent_corp, ~buy_to_rent_other, ~rent_corp_to_buy, ~rent_corp_to_rent_other, ~rent_other_to_buy, ~rent_other_to_rent_corp,
+ ~year, ~buy_to_rent_corp, ~buy_to_rent_other, ~rent_corp_to_buy,
+ ~rent_corp_to_rent_other, ~rent_other_to_buy, ~rent_other_to_rent_corp,
2012, 900, 58000, 14600, 5500, 50600, 4900,
2013, 800, 62200, 15200, 11500, 50900, 6000,
2014, 1000, 62200, 15400, 9300, 59900, 39000,
@@ -465,19 +685,28 @@ house_mutations_in <- tribble(
2017, 1600, 98900, 6400, 11000, 7300, 9000
)
-house_mutations <- house_mutations_in |>
- pivot_longer(cols = -year, names_to = "mutation", values_to = "n_mutations") |>
- mutate(from = str_extract(mutation, "(.*?)_to"),
- from = str_remove_all(from, "_to"),
- to = str_extract(mutation, "to_(.*?)$"),
- to = str_remove_all(to, "to_")) |>
- group_by(year) |>
- mutate(total_mutations = sum(n_mutations),
- perc_mutations = n_mutations / total_mutations,
- across(c(from,to), ~ case_when(str_detect(.x,"buy") ~ "buy",
- str_detect(.x,"rent_corp") ~ "rent (corporation)",
- str_detect(.x,"rent_other") ~ "rent (other)")),
- mutation_label = str_glue("From {from} to {to}")) |>
+house_mutations <- house_mutations_in |>
+ pivot_longer(
+ cols = -year,
+ names_to = "mutation", values_to = "n_mutations"
+ ) |>
+ mutate(
+ from = str_extract(mutation, "(.*?)_to"),
+ from = str_remove_all(from, "_to"),
+ to = str_extract(mutation, "to_(.*?)$"),
+ to = str_remove_all(to, "to_")
+ ) |>
+ group_by(year) |>
+ mutate(
+ total_mutations = sum(n_mutations),
+ perc_mutations = n_mutations / total_mutations,
+ across(c(from, to), ~ case_when(
+ str_detect(.x, "buy") ~ "buy",
+ str_detect(.x, "rent_corp") ~ "rent (corporation)",
+ str_detect(.x, "rent_other") ~ "rent (other)"
+ )),
+ mutation_label = str_glue("From {from} to {to}")
+ ) |>
ungroup()
```
@@ -489,46 +718,77 @@ So not every year there's the same number of "mutations" (transformations of pur
Show code
``` r
-mutations_n_plot <- house_mutations |>
- arrange(from,to) |>
- mutate(mutation_label = fct_inorder(mutation_label)) |>
- ggplot(aes(x = year, y = n_mutations, alluvium = mutation_label, fill = mutation_label)) +
+mutations_n_plot <- house_mutations |>
+ arrange(from, to) |>
+ mutate(mutation_label = fct_inorder(mutation_label)) |>
+ ggplot(aes(
+ x = year, y = n_mutations,
+ alluvium = mutation_label, fill = mutation_label
+ )) +
geom_point(shape = 21, color = "transparent", size = NA) +
ggalluvial::geom_flow(width = 0, show.legend = FALSE) +
- labs(x = NULL,
- y = "Number of mutations",
- fill = NULL) +
- scale_x_continuous(expand = c(0,0)) +
- scale_y_continuous(labels = scales::label_number_auto(), expand = c(0,0)) +
- scale_fill_manual(values = rev(colors),
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(color = "transparent", size = 6, alpha = 1, stroke = 0))) +
+ labs(
+ x = NULL,
+ y = "Number of mutations",
+ fill = NULL
+ ) +
+ scale_x_continuous(expand = c(0, 0)) +
+ scale_y_continuous(
+ labels = scales::label_number_auto(),
+ expand = c(0, 0)
+ ) +
+ scale_fill_manual(
+ values = rev(colors),
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(
+ color = "transparent", size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
theme(legend.key = element_rect(fill = "transparent"))
-mutations_perc_plot <- house_mutations |>
- arrange(from,to) |>
- mutate(mutation_label = fct_inorder(mutation_label)) |>
- ggplot(aes(x = year, y = perc_mutations, alluvium = mutation_label, fill = mutation_label)) +
+mutations_perc_plot <- house_mutations |>
+ arrange(from, to) |>
+ mutate(mutation_label = fct_inorder(mutation_label)) |>
+ ggplot(aes(
+ x = year, y = perc_mutations,
+ alluvium = mutation_label, fill = mutation_label
+ )) +
geom_point(shape = 21, color = "transparent", size = NA) +
ggalluvial::geom_flow(width = 0, show.legend = FALSE) +
- labs(x = NULL,
- y = "Percentage of total mutations per quarter",
- fill = NULL) +
- scale_x_continuous(expand = c(0,0)) +
- scale_y_continuous(labels = scales::label_percent(), expand = c(0,0)) +
- scale_fill_manual(values = rev(colors),
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(color = "transparent", size = 6, alpha = 1, stroke = 0))) +
- theme(axis.title.y = element_textbox(width = grid::unit(2,"in")),
- legend.key = element_rect(fill = "transparent"))
+ labs(
+ x = NULL,
+ y = "Percentage of total mutations per quarter",
+ fill = NULL
+ ) +
+ scale_x_continuous(expand = c(0, 0)) +
+ scale_y_continuous(labels = scales::label_percent(), expand = c(0, 0)) +
+ scale_fill_manual(
+ values = rev(colors),
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(
+ color = "transparent", size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
+ theme(
+ axis.title.y = element_textbox(width = grid::unit(2, "in")),
+ legend.key = element_rect(fill = "transparent")
+ )
mutations_n_plot / mutations_perc_plot +
- plot_annotation(title = "Shift to a renters market",
- subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
- caption = "_**Data**: CBS_") +
- plot_layout(guides = 'collect') &
+ plot_annotation(
+ title = "Shift to a renters market",
+ subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
+ caption = "_**Data**: CBS_"
+ ) +
+ plot_layout(guides = "collect") &
theme(plot.title = element_textbox(size = 20))
```
@@ -544,51 +804,65 @@ We can look at the net number of houses added to the rental market by adding up
Show code
``` r
-net_house_mutations <- house_mutations |>
- mutate(across(c(from,to), ~ case_when(str_detect(.x, "rent") ~ "rent",
- str_detect(.x, "buy") ~ "buy"))) |>
- group_by(year) |>
- filter(from != to) |>
- mutate(total_mutations = sum(n_mutations)) |>
- group_by(year, from, to) |>
- summarise(n_mutations = sum(n_mutations),
- total_mutations = first(total_mutations)) |>
- pivot_wider(id_cols = c(year,total_mutations), names_from = c(from,to), values_from = n_mutations,
- names_sep = "_to_") |>
- mutate(net_buy_to_rent = buy_to_rent - rent_to_buy,
- perc_buy_to_rent = buy_to_rent / total_mutations)
-
-net_mutation_plot <- net_house_mutations |>
- ggplot(aes(x = year, y = net_buy_to_rent)) +
+net_house_mutations <- house_mutations |>
+ mutate(across(c(from, to), ~ case_when(
+ str_detect(.x, "rent") ~ "rent",
+ str_detect(.x, "buy") ~ "buy"
+ ))) |>
+ group_by(year) |>
+ filter(from != to) |>
+ mutate(total_mutations = sum(n_mutations)) |>
+ group_by(year, from, to) |>
+ summarise(
+ n_mutations = sum(n_mutations),
+ total_mutations = first(total_mutations)
+ ) |>
+ pivot_wider(
+ id_cols = c(year, total_mutations),
+ names_from = c(from, to),
+ values_from = n_mutations,
+ names_sep = "_to_"
+ ) |>
+ mutate(
+ net_buy_to_rent = buy_to_rent - rent_to_buy,
+ perc_buy_to_rent = buy_to_rent / total_mutations
+ )
+
+net_mutation_plot <- net_house_mutations |>
+ ggplot(aes(x = year, y = net_buy_to_rent)) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- #geom_ribbon(aes(ymin = 0, ymax = net_buy_to_rent), fill = "grey30", alpha = 0.2) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- #geom_textbox(data = tibble(), aes(x = 2014, y = 4e4,
- # label = "Net number of properties meant for sale withdrawn from the market"),
- # family = "nunito-sans", size = 4, fill = "transparent",
- # maxwidth = grid::unit(2,"in"), hjust = 0.5, vjust = 0) +
- #geom_curve(data = tibble(), aes(x = 2014, y = 4e4, xend = 2016.5, yend = 2.5e4),
- # curvature = 0.3, size = 0.75, arrow = arrow(length = unit(2,"mm")),
- # lineend = "round") +
- labs(x = NULL,
- y = "**Net number of properties changed from properties meant for sale to rental properties**") +
- scale_y_continuous(labels = scales::label_number_auto(), limits = c(-3e4, NA), n.breaks = 5) +
- theme(axis.title.y = element_textbox(width = grid::unit(3.5,"in")))
-
-perc_mutation_plot <- net_house_mutations |>
- ggplot(aes(x = year, y = perc_buy_to_rent)) +
+ labs(
+ x = NULL,
+ y = "**Net number of properties changed from properties meant for sale to rental properties**"
+ ) +
+ scale_y_continuous(
+ labels = scales::label_number_auto(),
+ limits = c(-3e4, NA), n.breaks = 5
+ ) +
+ theme(axis.title.y = element_textbox(width = grid::unit(3.5, "in")))
+
+perc_mutation_plot <- net_house_mutations |>
+ ggplot(aes(x = year, y = perc_buy_to_rent)) +
geom_hline(yintercept = 0.5, color = "grey30", size = 1) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- labs(x = NULL,
- y = "**Percentage of mutations that changed properties meant for sale to rental properties**") +
- scale_y_continuous(labels = scales::label_percent(), limits = c(0,1), expand = c(0,0)) +
+ labs(
+ x = NULL,
+ y = "**Percentage of mutations that changed properties meant for sale to rental properties**"
+ ) +
+ scale_y_continuous(
+ labels = scales::label_percent(),
+ limits = c(0, 1), expand = c(0, 0)
+ ) +
coord_cartesian(clip = "off") +
- theme(axis.title.y = element_textbox(width = grid::unit(3.5,"in")))
-
-net_mutation_plot + perc_mutation_plot +
- plot_annotation(title = "Major net shift towards rental market",
- subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
- caption = "_**Data**: CBS_") &
+ theme(axis.title.y = element_textbox(width = grid::unit(3.5, "in")))
+
+net_mutation_plot + perc_mutation_plot +
+ plot_annotation(
+ title = "Major net shift towards rental market",
+ subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
+ caption = "_**Data**: CBS_"
+ ) &
theme(plot.title = element_textbox(size = 20))
```
@@ -602,97 +876,123 @@ So in 2017 nearly 90 000 houses were mutated from sale to rental properties. In
Show code
``` r
-house_mutations |>
- filter(year == max(year)) |>
- select(from, to, n_mutations, perc_mutations) |>
- mutate(across(c(from,to), ~ str_to_sentence(.x))) |>
- arrange(-perc_mutations) |>
- gt() |>
- fmt_number(columns = "n_mutations", sep_mark = " ", drop_trailing_zeros = TRUE) |>
- fmt_percent(columns = "perc_mutations") |>
+house_mutations |>
+ filter(year == max(year)) |>
+ select(from, to, n_mutations, perc_mutations) |>
+ mutate(across(c(from, to), ~ str_to_sentence(.x))) |>
+ arrange(-perc_mutations) |>
+ gt() |>
+ fmt_number(
+ columns = "n_mutations", sep_mark = " ",
+ drop_trailing_zeros = TRUE
+ ) |>
+ fmt_percent(columns = "perc_mutations") |>
grand_summary_rows(
columns = "n_mutations",
fns = list(total = "sum"),
- formatter = fmt_number,
+ fmt = ~ fmt_number(.),
sep_mark = " ",
- drop_trailing_zeros = TRUE) |>
+ drop_trailing_zeros = TRUE
+ ) |>
grand_summary_rows(
columns = "perc_mutations",
fns = list(total = "sum"),
- formatter = fmt_percent,
- missing_text = "") |>
+ fmt = ~ fmt_percent(.),
+ missing_text = ""
+ ) |>
cols_label(
from = html("
"
+ )
+ ) |>
+ tab_source_note(source_note = md("_**Data**: CBS_")) |>
+ tab_options(
+ table.background.color = "#EBEBEB",
+ grand_summary_row.text_transform = "capitalize"
+ ) |>
gtsave("mutations-table.png", expand = 0)
```
- Warning: Since gt v0.9.0, the `formatter` argument (and associated `...`) has been
- deprecated.
- • Please use the `fmt` argument to provide formatting directives.
- This warning is displayed once every 8 hours.
-
So what's the result of all these phenomena? The figure below shows the housing price index for Amsterdam. The first quarter of 2012 was taken as the reference point (100%). In the past 8(!) years, properties have on average nearly doubled in price.
Show code
``` r
-asking_start <- data |>
- filter(type != "Total",
- date == min(date)) |>
- rename(asking_start = asking_price) |>
+asking_start <- data |>
+ filter(
+ type != "Total",
+ date == min(date)
+ ) |>
+ rename(asking_start = asking_price) |>
select(type, asking_start)
-data_asking_index <- data |>
- filter(type != "Total") |>
- left_join(asking_start) |>
+data_asking_index <- data |>
+ filter(type != "Total") |>
+ left_join(asking_start) |>
mutate(asking_index = asking_price / asking_start)
-data_asking_index |>
- ggplot(aes(x = date, y = asking_index, color = type, group = type)) +
- geom_line(size = 2, alpha = 0.15, lineend = "round", show.legend = FALSE) +
+data_asking_index |>
+ ggplot(aes(x = date, y = asking_index, color = type, group = type)) +
+ geom_line(size = 2, alpha = 0.15, lineend = "round", show.legend = FALSE) +
geom_smooth(se = FALSE, show.legend = FALSE) +
- geom_point(size = NA) +
- labs(title = "Price development over the past decade",
- subtitle = "_On average, properties doubled in value since 2012_",
- x = NULL,
- y = "Price development relative to early 2012",
- color = NULL,
- caption = "_**Data**: MVA_") +
- scale_y_continuous(labels = scales::label_percent()) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- theme(plot.title = element_textbox(size = 20),
- axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
- legend.key = element_rect(fill = "transparent", color = "transparent"))
+ geom_point(size = NA) +
+ labs(
+ title = "Price development over the past decade",
+ subtitle = "_On average, properties doubled in value since 2012_",
+ x = NULL,
+ y = "Price development relative to early 2012",
+ color = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_y_continuous(labels = scales::label_percent()) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(size = 20),
+ axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
+ legend.key = element_rect(fill = "transparent", color = "transparent")
+ )
```
-Adjusted for the number of properties offered, the average asking price in Amsterdam has increased by 107%. That means that prices since 2012 have (on average) doubled. For apartments, this increase was 110%. Inflation on the other hand was only about 8.5% since 2012. So young people wanting to buy a house in Amsterdam need to bring nearly twice as much money to the table as they did just a decade ago. The money they saved and the salaries they earned have by no means kept up with the increasing price though.
+{{< sidenote >}}
+Inflation since 2012 was about 8.5%, which should be subtracted from these values
+{{< /sidenote >}}
+
+Adjusted for the number of properties offered, the average asking price in Amsterdam has increased by 107%. That means that prices since 2012 have (on average) roughly doubled. For apartments, this increase was 110%. So young people wanting to buy a house in Amsterdam need to bring nearly twice as much money to the table as they did just a decade ago. The money they saved and the salaries they earned have by no means kept up with the increasing prices, though.
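
As a rough back-of-the-envelope check (using the ~107% nominal increase and the ~8.5% cumulative inflation mentioned in the sidenote), the inflation-adjusted increase still comes out to roughly 91%:

``` r
# Quick sanity check: nominal vs. inflation-adjusted price increase.
# The 107% increase and 8.5% cumulative inflation are the figures quoted
# in the text; dividing out inflation is slightly more precise than
# simply subtracting it.
nominal_increase <- 1.07
cumulative_inflation <- 0.085

real_increase <- (1 + nominal_increase) / (1 + cumulative_inflation) - 1
scales::percent(real_increase, accuracy = 1)
#> [1] "91%"
```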
I cannot know this for sure, because the data I presented here doesn't contain the answer, but perhaps these factors combined play a role in the financial uncertainty that millennials experience. How do we fix it? I'm not sure. This is a really complicated issue. But a few things might (in my opinion) help:
-1. **Build more (sustainable) durable houses**
-2. **Prohibit sale of properties to individuals who don't plan to live there themselves for at least 2 or 3 years**
-3. **Increase transparency when buying or selling a house through mandatory public offerings to prevent scooping by investors**
-4. **Rent control to make *"mutating"* houses less profitable**
-5. **Increased socialized housing for poor and low-income families**
+- **Build more (sustainable) durable houses**
+
+- **Prohibit sale of properties to individuals who don't plan to live there themselves for at least 2 or 3 years**
+
+- **Increase transparency when buying or selling a house through mandatory public offerings to prevent scooping by investors**
+
+- **Rent control to make *"mutating"* houses less profitable**
+
+- **Increased socialized housing for poor and low-income families**
Perhaps if we implement a few of these strategies, we can make the housing market a little fairer and ensure that future generations will have the same opportunities current home-owners have.
diff --git a/content/blog/2021-amsterdam-housing-market/index.qmd b/content/blog/2021-amsterdam-housing-market/index.qmd
index bb897de..41a3c2b 100644
--- a/content/blog/2021-amsterdam-housing-market/index.qmd
+++ b/content/blog/2021-amsterdam-housing-market/index.qmd
@@ -14,6 +14,8 @@ execute:
fig.show: hold
results: hold
dev.args: list(bg = "#EBEBEB")
+editor_options:
+ chunk_output_type: console
---
```{css}
@@ -21,7 +23,7 @@ execute:
#| echo: FALSE
body {
- background-color: #EBEBEB; /* was #EFEFEF */
+ background-color: #EBEBEB;
}
table {
@@ -37,13 +39,19 @@ img {
}
```
-
For this post, I decided to hide the code chunks by default to improve legibility. You can click the Show code-button to expand the code.
+{{{< standout bg="#acc8d4" >}}}
+For this post, I decided to hide the code chunks by default to improve legibility. You can click on _Show code_ to expand the code.
+{{{< /standout >}}}
### Introduction
-I'm at the age where a lot of my peers start thinking about buying a home. It is a major conversation topic around the office, and these conversations are not often filled with optimism and excitement. Oslo is a notoriously expensive city to live in, and the housing market is very tough.
+I'm at the age where a lot of my peers start thinking about buying a home. It is a major conversation topic around the office, and these conversations are not often filled with optimism and excitement. Oslo is a notoriously expensive city to live in, and the housing market is very tough.
-Of all the worrying events in 2020, Millenials and Gen Z kids ranked financial instability as the second most common driver of stress, the most common being the welfare of their family ([link](https://www2.deloitte.com/global/en/pages/about-deloitte/articles/millennialsurvey.html)). Today I want to dissect one of the possible causes of this _economic anxiety_: **the housing market**. Now, I'm not an economist nor a financial expert. Nonetheless, I believe I can leverage some of my knowledge of data wrangling and analysis to contribute a small piece to this topic. My insight into the Oslo housing market is fairly limited to so far and there's many nuances I don't fully understand yet, but I do think I have some relevant knowledge of the Dutch housing market. So today I'll dive into some data on the Dutch market, and particularly the Amsterdam housing market. There's countless stories about the Amsterdam housing market and how terrible it is for new buyers, similar to the Oslo housing market. However, in my opinion the Amsterdam housing market is already a few steps ahead of the Oslo market, and hopefully the Amsterdam market can offer some warnings about what the Oslo housing market might look like in a few years without some policy corrections.
+{{{< sidenote >}}}
+This conclusion is based on a survey from Deloitte in 2020 where they surveyed millennials and Gen Z kids both before and after the onset of the pandemic
+{{{< /sidenote >}}}
+
+Of all the worrying events in 2020, millennials and Gen Z kids ranked financial instability as the second most common driver of stress, the most common being the welfare of their family ([source](https://www2.deloitte.com/content/dam/Deloitte/global/Documents/About-Deloitte/deloitte-2020-millennial-survey.pdf)). Today I want to dissect one of the possible causes of this _economic anxiety_: **the housing market**. Now, I'm not an economist nor a financial expert. Nonetheless, I believe I can leverage some of my knowledge of data wrangling and analysis to contribute a small piece to this topic. My insight into the Oslo housing market is fairly limited to so far and there's many nuances I don't fully understand yet, but I do think I have some relevant knowledge of the Dutch housing market. So today I'll dive into some data on the Dutch market, and particularly the Amsterdam housing market. There's countless stories about the Amsterdam housing market and how terrible it is for new buyers, similar to the Oslo housing market. However, in my opinion the Amsterdam housing market is already a few steps ahead of the Oslo market, and hopefully the Amsterdam market can offer some warnings about what the Oslo housing market might look like in a few years without some policy corrections.
This will be a fairly basic data analysis and visualization post, I can't claim that this is an exhaustive list and that I didn't miss some nuances, but I'll give it . I collected some data from the Amsterdam Real Estate Association ([Makelaars Vereniging Amsterdam; MVA](https://www.mva.nl)) and statistics from the Central Bureau for Statistics ([Centraal Bureau for de Statistiek; CBS](https://www.cbs.nl)). As usual, I'll be using `{tidyverse}` _a lot_. I've also recently started using the `{ggtext}` package to manage the text elements in my plots, inspired by Cédric Scherer ([@CedScherer](https://twitter.com/CedScherer)). I'll use the `{gt}` package to spice up some of the tables, and `{patchwork}` for arranging plots. I'll use the `{cbsodataR}` to download data from the Central Bureau for Statistics ([CBS](https://www.cbs.nl)).
@@ -64,25 +72,44 @@ font_add_google(name = "Nunito Sans", family = "nunito-sans")
showtext_auto()
theme_set(ggthemes::theme_economist(base_family = "nunito-sans") +
- theme(rect = element_rect(fill = "#EBEBEB", color = "transparent"),
- plot.background = element_rect(fill = "#EBEBEB", color = "transparent"),
- panel.background = element_rect(fill = "#EBEBEB", color = "transparent"),
- plot.title = element_textbox(margin = margin(0,0,5,0,"pt")),
- plot.title.position = "plot",
- plot.subtitle = element_textbox(hjust = 0, margin = margin(0,0,15,0,"pt")),
- plot.caption = element_textbox(hjust = 1),
- plot.caption.position = "plot",
- axis.title.y = element_textbox(orientation = "left-rotated", face = "bold",
- margin = margin(0,0,5,0,"pt")),
- axis.text.y = element_text(hjust = 1),
- legend.position = "bottom",
- legend.box = "vertical",
- legend.text = element_text(size = 10)))
+ theme(
+ rect = element_rect(fill = "#EBEBEB", color = "transparent"),
+ plot.background = element_rect(
+ fill = "#EBEBEB",
+ color = "transparent"
+ ),
+ panel.background = element_rect(
+ fill = "#EBEBEB",
+ color = "transparent"
+ ),
+ plot.title = element_textbox(
+ margin = margin(0, 0, 5, 0, "pt")
+ ),
+ plot.title.position = "plot",
+ plot.subtitle = element_textbox(
+ hjust = 0,
+ margin = margin(0, 0, 15, 0, "pt")
+ ),
+ plot.caption = element_textbox(hjust = 1),
+ plot.caption.position = "plot",
+ axis.title.y = element_textbox(
+ orientation = "left-rotated", face = "bold",
+ margin = margin(0, 0, 5, 0, "pt")
+ ),
+ axis.text.y = element_text(hjust = 1),
+ legend.position = "bottom",
+ legend.box = "vertical",
+ legend.text = element_text(size = 10)
+ ))
```
### Getting the data
-The first piece of data I'll use comes from the Amsterdam Real Estate Association. They publish quarterly data on a number of variables about the Amsterdam housing market ([link](https://www.mva.nl/over-de-mva/mva/kwartaalcijfers)), inclusing asking price, final price paid, number of properties put on the market, number of properties sold and a few more back to the first quarter of 2012. _Obviously_, these numbers all come in pdf-format, because the people writing quarterly reports apparently have a _massive hatred_ towards people that want to analyze this data. I downloaded the reports, used the online tool [PDFTables](https://pdftables.com) to convert them to Excel, and then stitched the tables together manually. Of course (remember the authors have a _massive hatred_ towards us), the formatting of the numbers in the tables weren't consistent between different quarterly reports, so I had to do some cleaning in R. I put each table in a separate sheet and then used functionality from the `{readxl}` package to load each table into a different variable and then do the cleaning per table. This was a bit cumbersome. Since this will probably be a very long post, I'll hide the code, but you can see it in the [Rmarkdown file](https://github.com/danielroelfs/danielroelfs.com/blob/main/content/blog/2020-running-an-ica-on-questionnaires/index.Rmd) (at least until I figure out how to get code folding to work on my laptop).
+{{{< sidenote br="10em" >}}}
+There are [R packages](https://docs.ropensci.org/tabulizer/) that can parse PDF files, but in my experience they can be clunky. In this case the simplest solution seemed the best, despite requiring some manual work
+{{{< /sidenote >}}}
+
+The first piece of data I'll use comes from the Amsterdam Real Estate Association. They publish quarterly data on a number of variables about the Amsterdam housing market ([link](https://www.mva.nl/over-de-mva/mva/kwartaalcijfers)), including asking price, final price paid, number of properties put on the market, number of properties sold, and a few more, going back to the first quarter of 2012. _Obviously_, these numbers all come in PDF format, because the people writing quarterly reports apparently have a _massive hatred_ towards people who want to analyze this data. I downloaded the reports, used the online tool [PDFTables](https://pdftables.com) to convert them to Excel, and then stitched the tables together manually. Of course (remember, the authors have a _massive hatred_ towards us), the formatting of the numbers wasn't consistent between different quarterly reports, so I had to do some cleaning in R. I put each table in a separate sheet and then used functionality from the `{readxl}` package to load each table into a different variable and do the cleaning per table. This was a bit cumbersome.
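
For comparison, a minimal sketch of the `{tabulizer}` route mentioned in the sidenote could look something like the code below. The filename is hypothetical and the extracted tables would still need the same `parse_number()`-style cleaning, so take it as an illustration of the alternative rather than what I actually did.

```r
# Hypothetical sketch: reading one quarterly report straight from PDF with
# {tabulizer} (requires Java); the filename below is made up for illustration
library(tabulizer)

tables <- extract_tables("MVA_kwartaalcijfers_2021_1e_kwartaal.pdf")

# extract_tables() returns a list of character matrices, one per detected
# table, which would still need the same cleaning as the Excel sheets below
str(tables[[1]])
```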
You can look at the code I used to load and merge the files. It's a bit of a mess:
@@ -92,61 +119,135 @@ You can look at the code I used to load and merge the files. It's a bit of a mes
#| code-fold: true
#| code-summary: "Show code"
-asking_price <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 2) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.double), ~ .x * 1e3),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "asking_price")
-
-transaction_price <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 3) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "transaction_price")
-
-price_per_m2 <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 4) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.double), ~ .x * 1e3),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "price_per_m2")
-
-n_offered <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 5) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "n_offered")
-
-n_sold <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 6) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "n_sold")
-
-mortgage_months <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 7) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_remove_all(.x, fixed(" "))))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "mortgage_months")
-
-tightness_index <- readxl::read_xlsx("MVA_kwartaalcijfers.xlsx", sheet = 8) |>
- janitor::clean_names() |>
- mutate(type_woning = as_factor(type_woning),
- across(where(is.character), ~ parse_number(str_replace_all(.x, ",", ".")))) |>
- pivot_longer(starts_with("x"), names_to = "date", values_to = "tightness_index")
-
-data_merged <- inner_join(asking_price, transaction_price) |>
- inner_join(price_per_m2) |>
- inner_join(n_offered) |>
- inner_join(n_sold) |>
- inner_join(mortgage_months) |>
- inner_join(tightness_index) |>
- mutate(asking_price = ifelse(asking_price < 1e5, asking_price * 1e3, asking_price),
- transaction_price = ifelse(transaction_price < 1e5, transaction_price * 1e3, transaction_price),
- price_per_m2 = ifelse(price_per_m2 > 1e4, price_per_m2 / 1e3, price_per_m2))
-
-write_rds(data_merged, "data_merged.rds")
+asking_price <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 2
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(where(is.double), ~ .x * 1e3),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "asking_price"
+ )
+
+transaction_price <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 3
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "transaction_price"
+ )
+
+price_per_m2 <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 4
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(where(is.double), ~ .x * 1e3),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "price_per_m2"
+ )
+
+n_offered <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 5
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "n_offered"
+ )
+
+n_sold <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 6
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "n_sold"
+ )
+
+mortgage_months <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 7
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_remove_all(.x, fixed(" ")))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "mortgage_months"
+ )
+
+tightness_index <- readxl::read_xlsx("./data/MVA_kwartaalcijfers.xlsx",
+ sheet = 8
+) |>
+ janitor::clean_names() |>
+ mutate(
+ type_woning = as_factor(type_woning),
+ across(
+ where(is.character),
+ ~ parse_number(str_replace_all(.x, ",", "."))
+ )
+ ) |>
+ pivot_longer(starts_with("x"),
+ names_to = "date", values_to = "tightness_index"
+ )
+
+data_merged <- inner_join(asking_price, transaction_price) |>
+ inner_join(price_per_m2) |>
+ inner_join(n_offered) |>
+ inner_join(n_sold) |>
+ inner_join(mortgage_months) |>
+ inner_join(tightness_index) |>
+ mutate(
+ asking_price = ifelse(asking_price < 1e5,
+ asking_price * 1e3,
+ asking_price
+ ),
+ transaction_price = ifelse(transaction_price < 1e5,
+ transaction_price * 1e3,
+ transaction_price
+ ),
+ price_per_m2 = ifelse(price_per_m2 > 1e4,
+ price_per_m2 / 1e3,
+ price_per_m2
+ )
+ )
+
+write_rds(data_merged, "./data/data_merged.rds")
```
Let's have a look at the dataset.
@@ -156,7 +257,7 @@ Let's have a look at the dataset.
#| code-fold: true
#| code-summary: "Show code"
-data_merged <- read_rds("data_merged.rds")
+data_merged <- read_rds("./data/data_merged.rds")
glimpse(data_merged)
```
@@ -168,32 +269,42 @@ From this dataset, I want to create a few new variables. I want to create a date
#| code-fold: true
#| code-summary: "Show code"
-data <- data_merged |>
- rename(type = type_woning) |>
- mutate(quarter = str_extract(date, pattern = "x(.*?)e"),
- quarter = parse_number(quarter),
- year = str_extract(date, pattern = "kw_(.*)"),
- year = parse_number(year),
- date = as.Date(str_glue("{year}-{(quarter * 3)}-01")),
- diff_ask_paid = transaction_price - asking_price,
- diff_ask_paid_perc = diff_ask_paid / asking_price,
- diff_offered_sold = n_offered - n_sold,
- diff_offered_sold_perc = diff_offered_sold / n_offered,
- perc_sold = n_sold / n_offered,
- type = case_when(str_detect(type,"Totaal") ~ "Total",
- str_detect(type,"<= 1970") ~ "Apartments (pre-1970)",
- str_detect(type,"> 1970") ~ "Apartments (post-1970)",
- str_detect(type,"Tussenwoning") ~ "Terraced house",
- str_detect(type,"Hoekwoning") ~ "Corner house",
- str_detect(type,"Vrijstaand") ~ "Detached house",
- str_detect(type,"2-onder-1-kap") ~ "Semi-detached house"),
- type = factor(type, levels = c("Apartments (pre-1970)","Apartments (post-1970)",
- "Terraced house","Corner house","Detached house",
- "Semi-detached house","Total"))) |>
+data <- data_merged |>
+ rename(type = type_woning) |>
+ mutate(
+ quarter = str_extract(date, pattern = "x(.*?)e"),
+ quarter = parse_number(quarter),
+ year = str_extract(date, pattern = "kw_(.*)"),
+ year = parse_number(year),
+ date = as.Date(str_glue("{year}-{(quarter * 3)}-01")),
+ diff_ask_paid = transaction_price - asking_price,
+ diff_ask_paid_perc = diff_ask_paid / asking_price,
+ diff_offered_sold = n_offered - n_sold,
+ diff_offered_sold_perc = diff_offered_sold / n_offered,
+ perc_sold = n_sold / n_offered,
+ type = case_when(
+ str_detect(type, "Totaal") ~ "Total",
+ str_detect(type, "<= 1970") ~ "Apartments (pre-1970)",
+ str_detect(type, "> 1970") ~ "Apartments (post-1970)",
+ str_detect(type, "Tussenwoning") ~ "Terraced house",
+ str_detect(type, "Hoekwoning") ~ "Corner house",
+ str_detect(type, "Vrijstaand") ~ "Detached house",
+ str_detect(type, "2-onder-1-kap") ~ "Semi-detached house"
+ ),
+ type = factor(type, levels = c(
+ "Apartments (pre-1970)", "Apartments (post-1970)",
+ "Terraced house", "Corner house", "Detached house",
+ "Semi-detached house", "Total"
+ ))
+ ) |>
glimpse()
```
-The first thing that seems interesting to do is to plot the percentage difference between the asking price and the price paid. This will give us an indication of the trend in overpaying on different types of properties. I'll use a color palette and legend design I again shamelessly stole from Cédric Scherer ([@CedScherer](https://twitter.com/CedScherer)). I'll use a simple line graph to visualize the percentage overpay.
+{{{< sidenote br="1em" >}}}
+I'll use a color palette and legend design I shamelessly stole from [Cédric Scherer](https://twitter.com/CedScherer)
+{{{< /sidenote >}}}
+
+The first thing that seems interesting to do is to plot the percentage difference between the asking price and the price paid. This will give us an indication of the trend in overpaying on different types of properties. Let's use a simple line graph to visualize the percentage overpay.
```{r}
#| label: plot-overpay
@@ -201,33 +312,48 @@ The first thing that seems interesting to do is to plot the percentage differenc
#| code-fold: true
#| code-summary: "Show code"
-colors <- c("#019868","#9dd292","#ec0b88","#651eac","#e18a1e","#2b7de5")
-
-data |>
- filter(type != "Total") |>
- group_by(type) |>
- mutate(n_total = sum(n_sold),
- type_label = str_glue("{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"),
- type_label = str_replace(type_label,"\\) \\(", ", ")) |>
- arrange(type) |>
- mutate(type_label = factor(type_label)) |>
- ggplot(aes(x = date, y = diff_ask_paid_perc, color = type_label)) +
+colors <- c("#019868", "#9dd292", "#ec0b88", "#651eac", "#e18a1e", "#2b7de5")
+
+data |>
+ filter(type != "Total") |>
+ group_by(type) |>
+ mutate(
+ n_total = sum(n_sold),
+ type_label = str_glue(
+ "{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"
+ ),
+ type_label = str_replace(type_label, "\\) \\(", ", ")
+ ) |>
+ arrange(type) |>
+ mutate(type_label = factor(type_label)) |>
+ ggplot(aes(x = date, y = diff_ask_paid_perc, color = type_label)) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
- labs(title = "Overbidding has become the new normal",
- subtitle = "_Paying as much as 5% over asking price is common the past few years_",
- x = NULL,
- y = "Percentage difference between\nasking price and price paid",
- color = NULL,
- caption = "_**Data**: MVA_") +
- scale_y_continuous(labels = scales::label_percent()) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- theme(plot.title = element_textbox(size = 20),
- axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
- legend.key = element_rect(fill = "transparent", color = "transparent"))
+ geom_line(
+ size = 1.2, alpha = 0.8,
+ lineend = "round", key_glyph = "point"
+ ) +
+ labs(
+ title = "Overbidding has become the new normal",
+ subtitle = "_Paying as much as 5% over asking price is common the past few years_",
+ x = NULL,
+ y = "Percentage difference between\nasking price and price paid",
+ color = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_y_continuous(labels = scales::label_percent()) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(size = 20),
+ axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
+ legend.key = element_rect(fill = "transparent", color = "transparent")
+ )
```
Prior to 2014, most properties in Amsterdam were sold at about 6% below asking price; by the last quarter of 2020, that trend had changed to more than 3.5% above asking. The variance is obviously larger for detached and semi-detached houses, since those are both expensive and scarce in Amsterdam, and are thus also bought and sold less often. The table below shows the stats for the first quarter of 2021. Apart from semi-detached houses, most other properties were sold at 4% over asking or more. The types of properties most accessible to first-time buyers are obviously the apartments: either the older type in the inner city, or the newer apartments in the suburbs. People buying pre-1970 apartments spent more than €28 000 over the asking price, and people buying apartments built after 1970 spent on average more than €18 000 over the asking price.
@@ -238,27 +364,40 @@ Prior to 2014, most properties in Amsterdam were sold at about 6% below asking p
#| code-fold: true
#| code-summary: "Show code"
-data |>
- filter(type != "Total",
- date == max(date)) |>
- select(type, diff_ask_paid_perc, diff_ask_paid, transaction_price, n_sold) |>
- arrange(-diff_ask_paid_perc) |>
- gt() |>
- cols_align(columns = "type", align = "left") |>
- fmt_percent(columns = "diff_ask_paid_perc") |>
- fmt_currency(columns = "diff_ask_paid", currency = "EUR", decimals = 0, sep_mark = " ") |>
- fmt_currency(columns = "transaction_price", currency = "EUR", decimals = 0, sep_mark = " ") |>
- fmt_number(columns = "n_sold", sep_mark = " ", drop_trailing_zeros = TRUE) |>
+data |>
+ filter(
+ type != "Total",
+ date == max(date)
+ ) |>
+ select(type, diff_ask_paid_perc, diff_ask_paid, transaction_price, n_sold) |>
+ arrange(-diff_ask_paid_perc) |>
+ gt() |>
+ cols_align(columns = "type", align = "left") |>
+ fmt_percent(columns = "diff_ask_paid_perc") |>
+ fmt_currency(
+ columns = "diff_ask_paid", currency = "EUR",
+ decimals = 0, sep_mark = " "
+ ) |>
+ fmt_currency(
+ columns = "transaction_price", currency = "EUR",
+ decimals = 0, sep_mark = " "
+ ) |>
+ fmt_number(
+ columns = "n_sold", sep_mark = " ",
+ drop_trailing_zeros = TRUE
+ ) |>
cols_label(
type = html("
")
+ ) |>
+ tab_source_note(source_note = md("_**Data**: MVA_")) |>
tab_options(table.background.color = "#EBEBEB") |>
gtsave("overpay-table.png", expand = 0)
```
@@ -273,47 +412,76 @@ What contributed to this price increase? A simple supply-and-demand plays a part
#| code-fold: true
#| code-summary: "Show code"
-data |>
- filter(type != "Total") |>
- group_by(type) |>
- mutate(n_total = sum(n_sold),
- type_label = str_glue("{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"),
- type_label = str_replace(type_label,"\\) \\(", ", ")) |>
- arrange(type) |>
- mutate(type_label = factor(type_label)) |>
- ggplot(aes(x = date, y = tightness_index, color = type_label)) +
- geom_rect(data = tibble(),
- aes(xmin = as.Date(-Inf), xmax = as.Date(Inf), ymin = c(0,5,10), ymax = c(5,10,Inf),
- fill = c("Sellers market", "Balanced market", "Buyers market")),
- color = "transparent", alpha = 0.2, key_glyph = "point", inherit.aes = FALSE) +
+data |>
+ filter(type != "Total") |>
+ group_by(type) |>
+ mutate(
+ n_total = sum(n_sold),
+ type_label = str_glue(
+ "{type} (n={format(n_total, big.mark = \".\", decimal.mark = \",\")})"
+ ),
+ type_label = str_replace(type_label, "\\) \\(", ", ")
+ ) |>
+ arrange(type) |>
+ mutate(type_label = factor(type_label)) |>
+ ggplot(aes(x = date, y = tightness_index, color = type_label)) +
+ geom_rect(
+ data = tibble(),
+ aes(
+ xmin = as.Date(-Inf), xmax = as.Date(Inf),
+ ymin = c(0, 5, 10), ymax = c(5, 10, Inf),
+ fill = c("Sellers market", "Balanced market", "Buyers market")
+ ),
+ color = "transparent", alpha = 0.2, key_glyph = "point", inherit.aes = FALSE
+ ) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
- labs(title = "Amsterdam has had a sellers market for nearly 5 years",
- x = NULL,
- y = "Indicator of _\"density\"_ on the housing market",
- color = NULL,
- fill = "Type of market:",
- caption = "_**Data**: MVA_") +
- scale_x_date(expand = c(0,0)) +
- scale_y_continuous(trans = "reverse", expand = c(0,0)) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right", order = 1,
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- scale_fill_manual(values = c("#F5E000","#00B300","#D1000E"),
- limits = c("Buyers market","Balanced market","Sellers market"),
- guide = guide_legend(order = 2, override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0))) +
+ geom_line(size = 1.2, alpha = 0.8, lineend = "round", key_glyph = "point") +
+ labs(
+ title = "Amsterdam has had a sellers market for nearly 5 years",
+ x = NULL,
+ y = "Indicator of _\"density\"_ on the housing market",
+ color = NULL,
+ fill = "Type of market:",
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_x_date(expand = c(0, 0)) +
+ scale_y_continuous(trans = "reverse", expand = c(0, 0)) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right", order = 1,
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ scale_fill_manual(
+ values = c("#F5E000", "#00B300", "#D1000E"),
+ limits = c("Buyers market", "Balanced market", "Sellers market"),
+ guide = guide_legend(
+ order = 2,
+ override.aes = list(
+ shape = 21, size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
coord_cartesian(clip = "off") +
- theme(plot.title = element_textbox(size = 16),
- plot.subtitle = element_textbox(size = 10),
- axis.title.y = element_textbox(orientation = "left-rotated",
- width = grid::unit(2, "in")),
- legend.key = element_rect(fill = "transparent"))
+ theme(
+ plot.title = element_textbox(size = 16),
+ plot.subtitle = element_textbox(size = 10),
+ axis.title.y = element_textbox(
+ orientation = "left-rotated",
+ width = grid::unit(2, "in")
+ ),
+ legend.key = element_rect(fill = "transparent")
+ )
```
So, there's a lot of competition among buyers, and people looking to sell their houses can expect to be paid more than they anticipated. Dozens of buyers compete for the same properties, driving up the price. The figure below shows the percentage of properties sold compared to the number of properties offered. It's clear that after the 2008 housing bubble crisis, the housing market was still recovering in 2012 and 2013. However, since 2016, more apartments have been sold than were put on the market. This means that the number of properties available for the growing number of people wanting to move to Amsterdam is decreasing. This reduces supply at a time of increasing demand, pushing prices up twice over.
-
A new phenomenon that entered the scene a little while ago may indicate how skewed the market is. People wanting to buy a house in a certain neighborhood will put notes and letters in the mailboxes of people living in that area saying "Nice house! Are you interested in selling it to me?". This is now not an uncommon strategy to find a house. Some people living in popular neighborshoods are inundated with notes from agencies facilitating these practices. See also a news item by RTL.
+{{{< standout bg="#acc8d4" >}}}
+A new phenomenon that entered the scene a little while ago may indicate how skewed the market is. People wanting to buy a house in a certain neighborhood will put notes and letters in the mailboxes of people living in that area saying _"Nice house! Are you interested in selling it to me?"_. This is no longer an uncommon strategy for finding a house. Some people living in popular neighborhoods are inundated with notes from agencies facilitating these practices. See also a [news item by RTL](https://www.rtlnieuws.nl/geld-en-werk/artikel/19001/droomhuis-gezien-maar-niet-te-koop-stop-dan-een-briefje-de-bus)
+{{{< /standout >}}}
```{r}
#| label: plot-n-sales
@@ -323,29 +491,45 @@ So, there's a lot of competition among buyers, and people looking to sell their
#| code-fold: true
#| code-summary: "Show code"
-data |>
- filter(type != "Total") |>
- ggplot(aes(x = date, y = perc_sold, fill = type)) +
- geom_col(key_glyph = "point") +
+data |>
+ filter(type != "Total") |>
+ ggplot(aes(x = date, y = perc_sold, fill = type)) +
+ geom_col(key_glyph = "point") +
geom_hline(yintercept = 1, linetype = "dashed", color = "grey20") +
- labs(title = "Some years, more houses are sold than are being put on the market",
- x = NULL,
- y = "Percentage of offered houses sold",
- fill = NULL,
- caption = "_**Data**: MVA_") +
+ labs(
+ title = "Some years, more houses are sold than are being put on the market",
+ x = NULL,
+ y = "Percentage of offered houses sold",
+ fill = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
coord_cartesian(clip = "off") +
- scale_y_continuous(labels = scales::label_percent(), expand = c(0,NA), n.breaks = 4) +
- scale_fill_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0))) +
- theme(plot.title = element_textbox(size = 20,
- width = grid::unit(6,"in")),
- plot.subtitle = element_textbox(size = 10),
- strip.text.x = element_markdown(face = "bold", padding = margin(10,0,5,0,"pt")),
- strip.background = element_rect(fill = "transparent"),
- legend.key = element_rect(fill = "transparent")) +
- facet_wrap(~ type, strip.position = "top", scales = "free_x")
+ scale_y_continuous(
+ labels = scales::label_percent(),
+ expand = c(0, NA), n.breaks = 4
+ ) +
+ scale_fill_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(shape = 21, size = 6, alpha = 1, stroke = 0)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(
+ size = 20,
+ width = grid::unit(6, "in")
+ ),
+ plot.subtitle = element_textbox(size = 10),
+ strip.text.x = element_markdown(
+ face = "bold",
+ padding = margin(10, 0, 5, 0, "pt")
+ ),
+ strip.background = element_rect(fill = "transparent"),
+ legend.key = element_rect(fill = "transparent")
+ ) +
+ facet_wrap(~type, strip.position = "top", scales = "free_x")
```
This adds fuel to the fire. I guess I'm trying to show that there are a number of factors stacked against young buyers. Any money you have saved up, you need to spend to outbid the massive competition. That competition means buyers only have a handful of feasible options, and because of it, the chance they actually get to buy one of those options is low. The number of options is not steady either: the number of properties sold has overtaken the number of properties being put on the market, so there is a dwindling reserve of properties. Combine this with more and more young people wanting to move to the bigger cities and you have a perfect cocktail for a congested housing market. Building new properties could counteract this, but over the last few years the Netherlands has actually slowed the construction of new homes.
@@ -360,55 +544,88 @@ This adds fuel to the fire. I guess I'm trying to show that there are a number o
#| code-summary: "Show code"
decades <- tibble(
- x1 = seq(1920,2020,10),
+ x1 = seq(1920, 2020, 10),
x2 = x1 + 10
-) |>
- slice(seq(1,1e2,2)) |>
+) |>
+ slice(seq(1, 1e2, 2)) |>
filter(x2 < 2030)
-hist_data_homes <- cbsodataR::cbs_get_data("82235NED") |>
- janitor::clean_names() |>
- rename(stock_end = eindstand_voorraad_8) |>
- mutate(year = parse_number(perioden),
- stock_end = stock_end * 1e3,
- diff = stock_end - lag(stock_end))
+hist_data_homes <- cbsodataR::cbs_get_data("82235NED") |>
+ janitor::clean_names() |>
+ rename(stock_end = eindstand_voorraad_8) |>
+ mutate(
+ year = parse_number(perioden),
+ stock_end = stock_end * 1e3,
+ diff = stock_end - lag(stock_end)
+ )
-total_hist_plot <- hist_data_homes |>
+total_hist_plot <- hist_data_homes |>
ggplot(aes(x = year, y = stock_end)) +
- geom_rect(data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
- fill = "red", alpha = 0.2, inherit.aes = FALSE) +
- geom_text(data = tibble(), aes(x = 1942.5, y = max(hist_data_homes$stock_end, na.rm = TRUE), vjust = 0, label = "WWII"), size = 3,
- fontface = "bold", family = "nunito-sans", inherit.aes = FALSE) +
- geom_rect(data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
- fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE) +
+ geom_rect(
+ data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
+ fill = "red", alpha = 0.2, inherit.aes = FALSE
+ ) +
+ geom_text(
+ data = tibble(),
+ aes(x = 1942.5, y = max(hist_data_homes$stock_end, na.rm = TRUE)),
+ label = "WWII", fontface = "bold", family = "nunito-sans",
+ vjust = 0, size = 3, inherit.aes = FALSE
+ ) +
+ geom_rect(
+ data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
+ fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE
+ ) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- labs(x = NULL,
- y = "Total number of homes") +
- scale_x_continuous(expand = c(0,0), breaks = c(decades$x1,decades$x2,2020)) +
+ labs(
+ x = NULL,
+ y = "Total number of homes"
+ ) +
+ scale_x_continuous(
+ expand = c(0, 0),
+ breaks = c(decades$x1, decades$x2, 2020)
+ ) +
scale_y_continuous(labels = scales::label_number()) +
coord_cartesian(clip = "off")
-diff_hist_plot <- hist_data_homes |>
+diff_hist_plot <- hist_data_homes |>
ggplot(aes(x = year, y = diff)) +
geom_hline(yintercept = 0.5, color = "grey30", size = 1) +
- geom_rect(data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
- fill = "red", alpha = 0.2, inherit.aes = FALSE) +
- geom_text(data = tibble(), aes(x = 1942.5, y = max(hist_data_homes$diff, na.rm = TRUE), vjust = 0, label = "WWII"), size = 3,
- fontface = "bold", family = "nunito-sans", inherit.aes = FALSE) +
- geom_rect(data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
- fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE) +
+ geom_rect(
+ data = tibble(), aes(xmin = 1940.01, xmax = 1945, ymin = -Inf, ymax = Inf),
+ fill = "red", alpha = 0.2, inherit.aes = FALSE
+ ) +
+ geom_text(
+ data = tibble(), aes(
+ x = 1942.5,
+ y = max(hist_data_homes$diff, na.rm = TRUE),
+ vjust = 0, label = "WWII"
+ ),
+ size = 3, fontface = "bold",
+ family = "nunito-sans", inherit.aes = FALSE
+ ) +
+ geom_rect(
+ data = decades, aes(xmin = x1, xmax = x2, ymin = -Inf, ymax = Inf),
+ fill = "grey30", alpha = 0.2, color = "transparent", inherit.aes = FALSE
+ ) +
geom_line(color = "grey30", alpha = 0.5, size = 1.5, lineend = "round") +
geom_smooth(color = "darkred", size = 1.5, se = FALSE) +
- labs(x = NULL,
- y = "Net homes added per year") +
- scale_x_continuous(expand = c(0,0), breaks = c(decades$x1,decades$x2,2020)) +
+ labs(
+ x = NULL,
+ y = "Net homes added per year"
+ ) +
+ scale_x_continuous(
+ expand = c(0, 0),
+ breaks = c(decades$x1, decades$x2, 2020)
+ ) +
scale_y_continuous(labels = scales::label_number_auto()) +
coord_cartesian(clip = "off")
total_hist_plot / diff_hist_plot +
- plot_annotation(title = "Number of houses available",
- subtitle = "Data covers development in the Netherlands nationwide",
- caption = "**Data**: CBS") &
+ plot_annotation(
+ title = "Number of houses available",
+ subtitle = "Data covers development in the Netherlands nationwide",
+ caption = "**Data**: CBS"
+ ) &
theme(plot.title = element_textbox(size = 20))
```
@@ -417,6 +634,10 @@ The figure displays data from from the CBS through the `{cbsodataR}` package. It
It's also important to mention that between the 60s and 70s, the Netherlands started building entire cities from scratch on newly-reclaimed land from the sea. The province of [Flevoland](https://en.wikipedia.org/wiki/Flevoland) is currently home to about 415 000 people, and until the early 50s this province was non-existent; the land lay at the bottom of a bay of the North Sea (called the Zuiderzee, or "Southern Sea" in English). Other than a general enthusiasm for building, I think this contributed considerably to the increase in the number of homes added in the 60s and 70s. This new province has good access to Amsterdam, and if it weren't for this new piece of land, the shortage might have peaked earlier.
+{{{< sidenote >}}}
+There's actually quite a few more, but I'll focus on the ones I can quantify at the moment
+{{{< /sidenote >}}}
+
+But that's not all; there are a few other features that contribute to the gridlock. See, young people are not the only ones on the market to buy apartments in Amsterdam. There's a thriving market for investors looking to take advantage of the rising prices in the Amsterdam housing market (source: [Algemeen Dagblad](https://www.ad.nl/wonen/beleggers-verwerven-massaal-koophuizen-voor-verhuur-in-amsterdam~afedc50c6/)). According to the Dutch central bank, about 1 in 5 properties are sold to investors, who are mostly looking to convert them to rental properties or flip them for a profit. I couldn't find the data the Dutch central bank relied on, but I found something else. The Central Bureau for Statistics collects data on the number of "mutations" among properties. A "mutation" in this case refers to the change of purpose of a property, e.g. when a house meant for sale is bought and then transformed into a rental property by either private individuals or corporations, or vice versa. I collected this data from the governmental yearly report on the housing market of 2020 ("_Staat van de Woningmarkt - Jaarrapportage 2020_", [link](https://www.rijksoverheid.nl/documenten/rapporten/2020/06/15/staat-van-de-woningmarkt-jaarrapportage-2020)). Instead of per quarter, this data is reported per year. Unfortunately, the data on housing mutations in the report (from 2020, mind you) only goes until 2017. It's important to note that these numbers are country-wide, not specific to Amsterdam, so there could be some other factors in play. Many of these trends are present across the country, but they're massively amplified in the larger cities. The data was contained in a PDF that wasn't easily machine-readable, so I had to manually copy the numbers into a tibble, which was great...
```{r}
@@ -425,7 +646,8 @@ But that's not all, there's a few other features that contribute to the gridlock
#| code-summary: "Show code"
house_mutations_in <- tribble(
- ~year, ~buy_to_rent_corp, ~buy_to_rent_other, ~rent_corp_to_buy, ~rent_corp_to_rent_other, ~rent_other_to_buy, ~rent_other_to_rent_corp,
+ ~year, ~buy_to_rent_corp, ~buy_to_rent_other, ~rent_corp_to_buy,
+ ~rent_corp_to_rent_other, ~rent_other_to_buy, ~rent_other_to_rent_corp,
2012, 900, 58000, 14600, 5500, 50600, 4900,
2013, 800, 62200, 15200, 11500, 50900, 6000,
2014, 1000, 62200, 15400, 9300, 59900, 39000,
@@ -434,19 +656,28 @@ house_mutations_in <- tribble(
2017, 1600, 98900, 6400, 11000, 7300, 9000
)
-house_mutations <- house_mutations_in |>
- pivot_longer(cols = -year, names_to = "mutation", values_to = "n_mutations") |>
- mutate(from = str_extract(mutation, "(.*?)_to"),
- from = str_remove_all(from, "_to"),
- to = str_extract(mutation, "to_(.*?)$"),
- to = str_remove_all(to, "to_")) |>
- group_by(year) |>
- mutate(total_mutations = sum(n_mutations),
- perc_mutations = n_mutations / total_mutations,
- across(c(from,to), ~ case_when(str_detect(.x,"buy") ~ "buy",
- str_detect(.x,"rent_corp") ~ "rent (corporation)",
- str_detect(.x,"rent_other") ~ "rent (other)")),
- mutation_label = str_glue("From {from} to {to}")) |>
+house_mutations <- house_mutations_in |>
+ pivot_longer(
+ cols = -year,
+ names_to = "mutation", values_to = "n_mutations"
+ ) |>
+ mutate(
+ from = str_extract(mutation, "(.*?)_to"),
+ from = str_remove_all(from, "_to"),
+ to = str_extract(mutation, "to_(.*?)$"),
+ to = str_remove_all(to, "to_")
+ ) |>
+ group_by(year) |>
+ mutate(
+ total_mutations = sum(n_mutations),
+ perc_mutations = n_mutations / total_mutations,
+ across(c(from, to), ~ case_when(
+ str_detect(.x, "buy") ~ "buy",
+ str_detect(.x, "rent_corp") ~ "rent (corporation)",
+ str_detect(.x, "rent_other") ~ "rent (other)"
+ )),
+ mutation_label = str_glue("From {from} to {to}")
+ ) |>
ungroup()
```
@@ -460,46 +691,77 @@ So not every year there's the same number of "mutations" (transformations of pur
#| code-fold: true
#| code-summary: "Show code"
-mutations_n_plot <- house_mutations |>
- arrange(from,to) |>
- mutate(mutation_label = fct_inorder(mutation_label)) |>
- ggplot(aes(x = year, y = n_mutations, alluvium = mutation_label, fill = mutation_label)) +
+mutations_n_plot <- house_mutations |>
+ arrange(from, to) |>
+ mutate(mutation_label = fct_inorder(mutation_label)) |>
+ ggplot(aes(
+ x = year, y = n_mutations,
+ alluvium = mutation_label, fill = mutation_label
+ )) +
geom_point(shape = 21, color = "transparent", size = NA) +
ggalluvial::geom_flow(width = 0, show.legend = FALSE) +
- labs(x = NULL,
- y = "Number of mutations",
- fill = NULL) +
- scale_x_continuous(expand = c(0,0)) +
- scale_y_continuous(labels = scales::label_number_auto(), expand = c(0,0)) +
- scale_fill_manual(values = rev(colors),
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(color = "transparent", size = 6, alpha = 1, stroke = 0))) +
+ labs(
+ x = NULL,
+ y = "Number of mutations",
+ fill = NULL
+ ) +
+ scale_x_continuous(expand = c(0, 0)) +
+ scale_y_continuous(
+ labels = scales::label_number_auto(),
+ expand = c(0, 0)
+ ) +
+ scale_fill_manual(
+ values = rev(colors),
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(
+ color = "transparent", size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
theme(legend.key = element_rect(fill = "transparent"))
-mutations_perc_plot <- house_mutations |>
- arrange(from,to) |>
- mutate(mutation_label = fct_inorder(mutation_label)) |>
- ggplot(aes(x = year, y = perc_mutations, alluvium = mutation_label, fill = mutation_label)) +
+mutations_perc_plot <- house_mutations |>
+ arrange(from, to) |>
+ mutate(mutation_label = fct_inorder(mutation_label)) |>
+ ggplot(aes(
+ x = year, y = perc_mutations,
+ alluvium = mutation_label, fill = mutation_label
+ )) +
geom_point(shape = 21, color = "transparent", size = NA) +
ggalluvial::geom_flow(width = 0, show.legend = FALSE) +
- labs(x = NULL,
- y = "Percentage of total mutations per quarter",
- fill = NULL) +
- scale_x_continuous(expand = c(0,0)) +
- scale_y_continuous(labels = scales::label_percent(), expand = c(0,0)) +
- scale_fill_manual(values = rev(colors),
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(color = "transparent", size = 6, alpha = 1, stroke = 0))) +
- theme(axis.title.y = element_textbox(width = grid::unit(2,"in")),
- legend.key = element_rect(fill = "transparent"))
+ labs(
+ x = NULL,
+ y = "Percentage of total mutations per quarter",
+ fill = NULL
+ ) +
+ scale_x_continuous(expand = c(0, 0)) +
+ scale_y_continuous(labels = scales::label_percent(), expand = c(0, 0)) +
+ scale_fill_manual(
+ values = rev(colors),
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(
+ color = "transparent", size = 6,
+ alpha = 1, stroke = 0
+ )
+ )
+ ) +
+ theme(
+ axis.title.y = element_textbox(width = grid::unit(2, "in")),
+ legend.key = element_rect(fill = "transparent")
+ )
mutations_n_plot / mutations_perc_plot +
- plot_annotation(title = "Shift to a renters market",
- subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
- caption = "_**Data**: CBS_") +
- plot_layout(guides = 'collect') &
+ plot_annotation(
+ title = "Shift to a renters market",
+ subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
+ caption = "_**Data**: CBS_"
+ ) +
+ plot_layout(guides = "collect") &
theme(plot.title = element_textbox(size = 20))
```
@@ -515,53 +777,66 @@ We can look at the net number of houses added to the rental market by adding up
#| code-fold: true
#| code-summary: "Show code"
-net_house_mutations <- house_mutations |>
- mutate(across(c(from,to), ~ case_when(str_detect(.x, "rent") ~ "rent",
- str_detect(.x, "buy") ~ "buy"))) |>
- group_by(year) |>
- filter(from != to) |>
- mutate(total_mutations = sum(n_mutations)) |>
- group_by(year, from, to) |>
- summarise(n_mutations = sum(n_mutations),
- total_mutations = first(total_mutations)) |>
- pivot_wider(id_cols = c(year,total_mutations), names_from = c(from,to), values_from = n_mutations,
- names_sep = "_to_") |>
- mutate(net_buy_to_rent = buy_to_rent - rent_to_buy,
- perc_buy_to_rent = buy_to_rent / total_mutations)
-
-net_mutation_plot <- net_house_mutations |>
- ggplot(aes(x = year, y = net_buy_to_rent)) +
+net_house_mutations <- house_mutations |>
+ mutate(across(c(from, to), ~ case_when(
+ str_detect(.x, "rent") ~ "rent",
+ str_detect(.x, "buy") ~ "buy"
+ ))) |>
+ group_by(year) |>
+ filter(from != to) |>
+ mutate(total_mutations = sum(n_mutations)) |>
+ group_by(year, from, to) |>
+ summarise(
+ n_mutations = sum(n_mutations),
+ total_mutations = first(total_mutations)
+ ) |>
+ pivot_wider(
+ id_cols = c(year, total_mutations),
+ names_from = c(from, to),
+ values_from = n_mutations,
+ names_sep = "_to_"
+ ) |>
+ mutate(
+ net_buy_to_rent = buy_to_rent - rent_to_buy,
+ perc_buy_to_rent = buy_to_rent / total_mutations
+ )
+
+net_mutation_plot <- net_house_mutations |>
+ ggplot(aes(x = year, y = net_buy_to_rent)) +
geom_hline(yintercept = 0, color = "grey30", size = 1) +
- #geom_ribbon(aes(ymin = 0, ymax = net_buy_to_rent), fill = "grey30", alpha = 0.2) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- #geom_textbox(data = tibble(), aes(x = 2014, y = 4e4,
- # label = "Net number of properties meant for sale withdrawn from the market"),
- # family = "nunito-sans", size = 4, fill = "transparent",
- # maxwidth = grid::unit(2,"in"), hjust = 0.5, vjust = 0) +
- #geom_curve(data = tibble(), aes(x = 2014, y = 4e4, xend = 2016.5, yend = 2.5e4),
- # curvature = 0.3, size = 0.75, arrow = arrow(length = unit(2,"mm")),
- # lineend = "round") +
- labs(x = NULL,
- y = "**Net number of properties changed from properties meant for sale to rental properties**") +
- scale_y_continuous(labels = scales::label_number_auto(), limits = c(-3e4, NA), n.breaks = 5) +
- theme(axis.title.y = element_textbox(width = grid::unit(3.5,"in")))
-
-perc_mutation_plot <- net_house_mutations |>
- ggplot(aes(x = year, y = perc_buy_to_rent)) +
+ labs(
+ x = NULL,
+ y = "**Net number of properties changed from properties meant for sale to rental properties**"
+ ) +
+ scale_y_continuous(
+ labels = scales::label_number_auto(),
+ limits = c(-3e4, NA), n.breaks = 5
+ ) +
+ theme(axis.title.y = element_textbox(width = grid::unit(3.5, "in")))
+
+perc_mutation_plot <- net_house_mutations |>
+ ggplot(aes(x = year, y = perc_buy_to_rent)) +
geom_hline(yintercept = 0.5, color = "grey30", size = 1) +
geom_line(color = "darkred", size = 1.5, lineend = "round") +
- labs(x = NULL,
- y = "**Percentage of mutations that changed properties meant for sale to rental properties**") +
- scale_y_continuous(labels = scales::label_percent(), limits = c(0,1), expand = c(0,0)) +
+ labs(
+ x = NULL,
+ y = "**Percentage of mutations that changed properties meant for sale to rental properties**"
+ ) +
+ scale_y_continuous(
+ labels = scales::label_percent(),
+ limits = c(0, 1), expand = c(0, 0)
+ ) +
coord_cartesian(clip = "off") +
- theme(axis.title.y = element_textbox(width = grid::unit(3.5,"in")))
-
-net_mutation_plot + perc_mutation_plot +
- plot_annotation(title = "Major net shift towards rental market",
- subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
- caption = "_**Data**: CBS_") &
+ theme(axis.title.y = element_textbox(width = grid::unit(3.5, "in")))
+
+net_mutation_plot + perc_mutation_plot +
+ plot_annotation(
+ title = "Major net shift towards rental market",
+ subtitle = "_A lot more houses meant for sale are being transformed into rental properties than vice versa_",
+ caption = "_**Data**: CBS_"
+ ) &
theme(plot.title = element_textbox(size = 20))
-
```
So in 2017 nearly 90 000 houses were mutated from sale to rental properties. In that same year, about 62 000 new homes were built (source: [CBS](https://www.cbs.nl/nl-nl/nieuws/2018/04/hoogste-aantal-nieuwbouwwoningen-in-acht-jaar)). Not all of those 62 000 homes were meant for sale; a proportion were added to the rental market. But even if all those new homes had been meant for sale, the sale market still shrank due to the (net) ~90 000 properties that were transformed and thereby removed from the sale market.
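
As a rough back-of-the-envelope check (using the 2017 numbers from the mutation data above, and generously assuming every newly built home was meant for sale):

```r
# Back-of-the-envelope check using the 2017 mutation numbers quoted above
buy_to_rent <- 1600 + 98900 # sale -> rental (corporations + other)
rent_to_buy <- 6400 + 7300  # rental -> sale
net_withdrawn_from_sale <- buy_to_rent - rent_to_buy # ~87 000 properties

# even if all ~62 000 new homes had been put up for sale,
# the stock of properties for sale would still have shrunk
62000 - net_withdrawn_from_sale # roughly -25 000
```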
@@ -572,36 +847,49 @@ So in 2017 nearly 90 000 houses were mutated from sale to rental properties. In
#| code-fold: true
#| code-summary: "Show code"
-house_mutations |>
- filter(year == max(year)) |>
- select(from, to, n_mutations, perc_mutations) |>
- mutate(across(c(from,to), ~ str_to_sentence(.x))) |>
- arrange(-perc_mutations) |>
- gt() |>
- fmt_number(columns = "n_mutations", sep_mark = " ", drop_trailing_zeros = TRUE) |>
- fmt_percent(columns = "perc_mutations") |>
+house_mutations |>
+ filter(year == max(year)) |>
+ select(from, to, n_mutations, perc_mutations) |>
+ mutate(across(c(from, to), ~ str_to_sentence(.x))) |>
+ arrange(-perc_mutations) |>
+ gt() |>
+ fmt_number(
+ columns = "n_mutations", sep_mark = " ",
+ drop_trailing_zeros = TRUE
+ ) |>
+ fmt_percent(columns = "perc_mutations") |>
grand_summary_rows(
columns = "n_mutations",
fns = list(total = "sum"),
- formatter = fmt_number,
+ fmt = ~ fmt_number(.),
sep_mark = " ",
- drop_trailing_zeros = TRUE) |>
+ drop_trailing_zeros = TRUE
+ ) |>
grand_summary_rows(
columns = "perc_mutations",
fns = list(total = "sum"),
- formatter = fmt_percent,
- missing_text = "") |>
+ fmt = ~ fmt_percent(.),
+ missing_text = ""
+ ) |>
cols_label(
from = html("
"
+ )
+ ) |>
+ tab_source_note(source_note = md("_**Data**: CBS_")) |>
+ tab_options(
+ table.background.color = "#EBEBEB",
+ grand_summary_row.text_transform = "capitalize"
+ ) |>
gtsave("mutations-table.png", expand = 0)
```
@@ -616,48 +904,65 @@ So what's the result of all these phenomena? The figure below shows the housing
#| code-fold: true
#| code-summary: "Show code"
-asking_start <- data |>
- filter(type != "Total",
- date == min(date)) |>
- rename(asking_start = asking_price) |>
+asking_start <- data |>
+ filter(
+ type != "Total",
+ date == min(date)
+ ) |>
+ rename(asking_start = asking_price) |>
select(type, asking_start)
-data_asking_index <- data |>
- filter(type != "Total") |>
- left_join(asking_start) |>
+data_asking_index <- data |>
+ filter(type != "Total") |>
+ left_join(asking_start) |>
mutate(asking_index = asking_price / asking_start)
-data_asking_index |>
- ggplot(aes(x = date, y = asking_index, color = type, group = type)) +
- geom_line(size = 2, alpha = 0.15, lineend = "round", show.legend = FALSE) +
+data_asking_index |>
+ ggplot(aes(x = date, y = asking_index, color = type, group = type)) +
+ geom_line(size = 2, alpha = 0.15, lineend = "round", show.legend = FALSE) +
geom_smooth(se = FALSE, show.legend = FALSE) +
- geom_point(size = NA) +
- labs(title = "Price development over the past decade",
- subtitle = "_On average, properties doubled in value since 2012_",
- x = NULL,
- y = "Price development relative to early 2012",
- color = NULL,
- caption = "_**Data**: MVA_") +
- scale_y_continuous(labels = scales::label_percent()) +
- scale_color_manual(values = colors,
- guide = guide_legend(title.position = "top", title.hjust = 0.5, nrow = 2,
- label.position = "right",
- override.aes = list(fill = "transparent", size = 6, alpha = 1))) +
- theme(plot.title = element_textbox(size = 20),
- axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
- legend.key = element_rect(fill = "transparent", color = "transparent"))
-
+ geom_point(size = NA) +
+ labs(
+ title = "Price development over the past decade",
+ subtitle = "_On average, properties doubled in value since 2012_",
+ x = NULL,
+ y = "Price development relative to early 2012",
+ color = NULL,
+ caption = "_**Data**: MVA_"
+ ) +
+ scale_y_continuous(labels = scales::label_percent()) +
+ scale_color_manual(
+ values = colors,
+ guide = guide_legend(
+ title.position = "top", title.hjust = 0.5, nrow = 2,
+ label.position = "right",
+ override.aes = list(fill = "transparent", size = 6, alpha = 1)
+ )
+ ) +
+ theme(
+ plot.title = element_textbox(size = 20),
+ axis.title.y = element_textbox(width = grid::unit(2.5, "in")),
+ legend.key = element_rect(fill = "transparent", color = "transparent")
+ )
```
-Adjusted for the number of properties offered, the average asking price in Amsterdam has increased by `r data_asking_index |> filter(date == max(date)) %$% weighted.mean(asking_index, n_offered) %>% subtract(.,1) |> scales::percent()`. That means that prices since 2012 have (on average) doubled. For apartments, this increase was `r data_asking_index |> filter(str_detect(type,"Apartments"), date == max(date)) %$% weighted.mean(asking_index, n_offered) %>% subtract(.,1) |> scales::percent()`. Inflation on the other hand was only about 8.5% since 2012. So young people wanting to buy a house in Amsterdam need to bring nearly twice as much money to the table as they did just a decade ago. The money they saved and the salaries they earned have by no means kept up with the increasing price though.
+{{{< sidenote >}}}
+Inflation since 2012 was about 8.5% which should be subtracted from these values
+{{{< /sidenote >}}}
+
+Adjusted for the number of properties offered, the average asking price in Amsterdam has increased by `r data_asking_index |> filter(date == max(date)) %$% weighted.mean(asking_index, n_offered) %>% subtract(.,1) |> scales::percent()`. That means that prices since 2012 have (on average) roughly doubled. For apartments, this increase was `r data_asking_index |> filter(str_detect(type,"Apartments"), date == max(date)) %$% weighted.mean(asking_index, n_offered) %>% subtract(.,1) |> scales::percent()`. So young people wanting to buy a house in Amsterdam need to bring nearly twice as much money to the table as they did just a decade ago. The money they saved and the salaries they earned have by no means kept up with the increasing price though.
I cannot know this for sure, because the data I presented here doesn't contain the answer, but perhaps these factors combined play a role in the financial uncertainty that millennials experience. How do we fix it? I'm not sure. This is a really complicated issue. But a few things might (in my opinion) help:
-1. **Build more (sustainable) durable houses**
-1. **Prohibit sale of properties to individuals who don't plan to live there themselves for at least 2 or 3 years**
-1. **Increase transparency when buying or selling a house through mandatory public offerings to prevent scooping by investors**
-1. **Rent control to make _"mutating"_ houses less profitable**
-1. **Increased socialized housing for poor and low-income families**
+- **Build more (sustainable) durable houses**
+
+- **Prohibit sale of properties to individuals who don't plan to live there themselves for at least 2 or 3 years**
+
+- **Increase transparency when buying or selling a house through mandatory public offerings to prevent scooping by investors**
+
+- **Rent control to make _"mutating"_ houses less profitable**
+
+- **Increased socialized housing for poor and low-income families**
Perhaps if we implement a few of these strategies we can make the housing market a little fairer and ensure that future generations will have the same opportunities that current homeowners have.
diff --git a/content/blog/2021-amsterdam-housing-market/mutations-table.png b/content/blog/2021-amsterdam-housing-market/mutations-table.png
index 3488a5c..a489091 100644
Binary files a/content/blog/2021-amsterdam-housing-market/mutations-table.png and b/content/blog/2021-amsterdam-housing-market/mutations-table.png differ
diff --git a/content/blog/2021-comparison-fa-pca-ica/sobar-72.csv b/content/blog/2021-comparison-fa-pca-ica/data/sobar-72.csv
similarity index 100%
rename from content/blog/2021-comparison-fa-pca-ica/sobar-72.csv
rename to content/blog/2021-comparison-fa-pca-ica/data/sobar-72.csv
diff --git a/content/blog/2021-comparison-fa-pca-ica/index.md b/content/blog/2021-comparison-fa-pca-ica/index.md
index 00f9c7c..c2935b5 100644
--- a/content/blog/2021-comparison-fa-pca-ica/index.md
+++ b/content/blog/2021-comparison-fa-pca-ica/index.md
@@ -14,13 +14,19 @@ execute:
fig.show: hold
results: hold
out.width: 80%
+editor_options:
+ chunk_output_type: console
---
This is just a very quick blog post outlining some of the commonalities and differences between factor analysis (FA), principal component analysis (PCA), and independent component analysis (ICA). I was inspired to write some of this down by some confusion in the lab caused by SPSS' apparent dual usage of the terms "factor analysis" and "principal components". A few of my colleagues who use SPSS showed me the following screen:
-{{< image src="spss-screenshot.png" alt="spss screenshot" >}}
+{{< figure src="spss-screenshot.png" alt="spss screenshot" >}}
-This screen shows up when you click `Analyze` -\> `Dimension Reduction` -\> `Factor`, which then opens a window called "Factor Analysis: Extraction" which lets you pick "Principal components" as a method. To put the (apparent) PCA method as a sub-section of factor analysis is misleading at best, and straight-up erronious at worst. The other options for the method here is "Principal axis factoring" (which is closer to traditional factor analysis) and "Maximum likelihood" ([source for screenshot and SPSS interface](https://stats.idre.ucla.edu/spss/seminars/efa-spss/)). If you're wondering if you're the first one to be confused by SPSS' choice to present this in such a way, you're not ([link](https://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi), [link](https://stats.stackexchange.com/questions/24781/interpreting-discrepancies-between-r-and-spss-with-exploratory-factor-analysis)).
+{{< sidenote br="2em" >}}
+If you're wondering if you're the first one to be confused by SPSS' choice to present this in such a way, you're not ([link](https://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi), [link](https://stats.stackexchange.com/questions/24781/interpreting-discrepancies-between-r-and-spss-with-exploratory-factor-analysis)).
+{{< /sidenote >}}
+
+This screen shows up when you click `Analyze` -\> `Dimension Reduction` -\> `Factor`, which then opens a window called "Factor Analysis: Extraction" that lets you pick "Principal components" as a method. To put the (apparent) PCA method as a sub-section of factor analysis is misleading at best, and straight-up erroneous at worst. The other options for the method here are "Principal axis factoring" (which is closer to traditional factor analysis) and "Maximum likelihood" ([source for screenshot and SPSS interface](https://stats.idre.ucla.edu/spss/seminars/efa-spss/)).
Standard disclaimer: I'm not a statistician, and I'm definitely not confident enough to go in-depth into the mathematics of the different algorithms. Instead, I'll just run three common latent variable modeling/clustering methods, and show the difference in results when applied to the same data. Wherever I feel confident, I will also elaborate on the underlying mathematical principles and concepts. It's a short post, and I'm sure there are levels of nuance and complexity I've missed. Please let me know if you think I've committed a major oversight.
@@ -41,8 +47,8 @@ theme_set(theme_minimal())
We'll load the data into R, clean up the variable names, convert the outcome variable (`ca_cervix`) to a factor. We'll have a look at the dataset using some functions from `{skimr}`. This function will give us summary statistics and basic histograms of the different variables.
``` r
-data <- read_csv("sobar-72.csv") |>
- janitor::clean_names() |>
+data <- read_csv("./data/sobar-72.csv") |>
+ janitor::clean_names() |>
mutate(ca_cervix = as_factor(ca_cervix))
skim_summ <- skimr::skim_with(base = skimr::sfl())
@@ -96,8 +102,8 @@ Data summary
Now let's only select the variables containing the risk factors (called features from here on). We'll also scale all the features to have a mean of 0 and a standard deviation of 1 using the `scale()` function. We can check what the new variable looks like using the `summary()` function.
``` r
-data_features <- data |>
- select(-ca_cervix) |>
+data_features <- data |>
+ select(-ca_cervix) |>
mutate(across(everything(), ~ scale(.x)))
summary(data_features$attitude_consistency)
@@ -114,17 +120,23 @@ summary(data_features$attitude_consistency)
So now we have a data frame with 72 entries and 19 normalized columns, each representing a feature that may or may not predict cervical cancer. We can create a correlation matrix to visualize the degree of correlation between the different features. For this we simply run `cor()` on the data frame with the features, transform the output to a data frame in long format and then visualize it using `ggplot()` and `geom_tile()`.
``` r
-cor(data_features) |>
- as_tibble() %>%
- mutate(feature_y = names(.)) |>
- pivot_longer(cols = -feature_y, names_to = "feature_x", values_to = "correlation") |>
- mutate(feature_y = fct_rev(feature_y)) |>
- ggplot(aes(x = feature_x, y = feature_y, fill = correlation)) +
- geom_tile() +
- labs(x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
- coord_fixed() +
+cor(data_features) |>
+ as_tibble() |>
+ mutate(feature_y = names(data_features)) |>
+ pivot_longer(
+ cols = -feature_y,
+ names_to = "feature_x",
+ values_to = "correlation"
+ ) |>
+ mutate(feature_y = fct_rev(feature_y)) |>
+ ggplot(aes(x = feature_x, y = feature_y, fill = correlation)) +
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
+ coord_fixed() +
theme(axis.text.x = element_text(hjust = 1, angle = 30))
```
@@ -138,6 +150,10 @@ Let's now dive into the approaches we'll use. We'll discuss three here, and if t
## Selecting the number of components
+{{< sidenote br="6em" >}}
+Icasso is implemented in MATLAB, which I won't bore you with here, but it seems someone created a [Python port](https://github.com/Teekuningas/icasso) also
+{{< /sidenote >}}
+
For factor analysis and ICA we need to provide the number of factors we want to extract. We'll use the same number of factors/components throughout this tutorial (also for PCA). Selection of the optimal number of factors/components is a fairly arbitrary process which I won't go into now. In short, before writing this I ran PCA and a tool called [icasso](https://research.ics.aalto.fi/ica/icasso/). *Icasso* runs an ICA algorithm a number of times and provides several parameters on the stability of the clusters at different thresholds, which makes it a very data-driven approach. A more common and easier way to get some data-driven insight into the optimal number of clusters is the "elbow" method, based on PCA, the eigenvalues of the components, and the cumulative variance explained by the components (we'll show those later). However, in the end you should also look at the weight matrix at the different cluster thresholds, and it remains a fairly arbitrary process. In this case, *icasso* showed that 7 components was a good (though not the best) solution, the weight matrix looked okay, and 7 components explained more than 80% of the variance in the PCA. I went for 7 components in the end also because it served the purpose of this blogpost quite well, but I think different thresholds are valid and you could make a case for other ones based on the data, the interpretation of the weight matrix, etc. This process is a bit of an art and a science combined.
``` r
@@ -146,12 +162,18 @@ n_comps <- 7
## Factor Analysis
-So, let's run the factor analysis. Technically speaking, factor analysis isn't a clustering method but rather a latent variable modeling method [source](https://stats.stackexchange.com/questions/241726/understanding-exploratory-factor-analysis-some-points-for-clarification). The primary aim of a factor analysis is the reconstruction of correlations/covariances between variables. Maximizing explained variance of the factors is only a secondary aim and we'll get to why that is relevant in the PCA section.
+So, let's run the factor analysis. Technically speaking, factor analysis isn't a clustering method but rather a latent variable modeling method ([source](https://stats.stackexchange.com/questions/241726/understanding-exploratory-factor-analysis-some-points-for-clarification)). The primary aim of a factor analysis is the reconstruction of correlations/covariances between variables. Maximizing explained variance of the factors is only a secondary aim and we'll get to why that is relevant in the PCA section.
We've established we want 7 factors. The factor analysis method is implemented in R through the `factanal()` function (see [here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/factanal) for the documentation). This function applies a common factor model using the maximum likelihood method. We'll simply provide it with our data frame and the number of factors we want to extract, and we'll ask it to provide Bartlett's weighted least-squares scores as well. We can apply a "rotation" to the factors to reduce the complexity of the factor loadings and make them easier to interpret; here we'll use the `varimax` option. Varimax maximizes the sum of the variances of the squared loadings. Then we'll print the model to see the results (and sort so the loadings are ordered). The output may be a bit long.
``` r
-fa_model <- factanal(data_features, factors = n_comps, scores = "Bartlett", rotation = "varimax")
+fa_model <- factanal(
+ data_features,
+ factors = n_comps,
+ scores = "Bartlett",
+ rotation = "varimax"
+)
+
print(fa_model, sort = TRUE)
```
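
To make the "sum of the variances of the squared loadings" bit a little more concrete, here is a rough sketch of the quantity varimax tries to maximize. The helper names (`L_rot`, `fa_unrot`, `L_unrot`) are made up for illustration; the unrotated solution should score no higher than the varimax-rotated one.

``` r
# Sketch: up to a normalising constant, varimax maximizes the sum over factors
# of the variance of the squared loadings.
L_rot <- unclass(loadings(fa_model))   # variables-by-factors matrix of rotated loadings
sum(apply(L_rot^2, 2, var))

# Refitting the same model without rotation should give a smaller (or equal) value
fa_unrot <- factanal(data_features, factors = n_comps, rotation = "none")
L_unrot <- unclass(loadings(fa_unrot))
sum(apply(L_unrot^2, 2, var))
```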
@@ -234,19 +256,26 @@ print(fa_model, sort = TRUE)
We may be tempted to immediately look at the *p*-value at the end of the output. This *p*-value denotes whether the assumption of perfect fit can be rejected. If this *p*-value is below 0.05 or 0.01 we can reject the hypothesis of perfect fit, meaning that we could probably try a different method or try a different number of factors. In this case, the *p*-value is larger than 0.05 so we cannot reject the hypothesis of perfect fit.
-The "Loadings" section of the results show a make-shift weight matrix, but in order to further interpret these results, let's create a plot showing the weight matrix. We'll get the results from the factor analysis model we created earlier using the `tidy()` function from `{broom}` and convert it to long format. We'll then create a weight matrix much in the same way we did earlier.
+The "Loadings" section of the results shows a makeshift weight matrix, but in order to further interpret these results, let's create a plot showing the weight matrix. We'll get the results from the factor analysis model we created earlier using the `tidy()` function from the `{broom}` package and convert it to long format. We'll then create a weight matrix much in the same way we did earlier.
``` r
-fa_weight_matrix <- broom::tidy(fa_model) |>
- pivot_longer(starts_with("fl"), names_to = "factor", values_to = "loading")
-
-fa_loading_plot <- ggplot(fa_weight_matrix, aes(x = factor, y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "FA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+fa_weight_matrix <- broom::tidy(fa_model) |>
+ pivot_longer(starts_with("fl"),
+ names_to = "factor", values_to = "loading"
+ )
+
+fa_loading_plot <- ggplot(
+ fa_weight_matrix,
+ aes(x = factor, y = variable, fill = loading)
+) +
+ geom_tile() +
+ labs(
+ title = "FA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(fa_loading_plot)
```
@@ -258,24 +287,29 @@ Here we can more easily see that there's two strong clusters in Factor 1 and Fac
Lastly, I think it would be interesting to see how the different factors relate to each other. We'll take the Bartlett scores, correlate them with each other, and create a correlation matrix much like before.
``` r
-fa_model$scores |>
- cor() |>
- data.frame() |>
- rownames_to_column("factor_x") |>
- pivot_longer(cols = -factor_x, names_to = "factor_y", values_to = "correlation") |>
- ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) +
+fa_model$scores |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("factor_x") |>
+ pivot_longer(
+ cols = -factor_x,
+ names_to = "factor_y", values_to = "correlation"
+ ) |>
+ ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between FA scores",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between FA scores",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
-We can see little correlation but certainly a few non-zero correlations also. In particular Factor 5 and Factor 7 seem to correlate to some extent at least and there are a few others with some minor correlations, e.g. Factor 1 and Factor 3, as well as Factor 3 and Factor 6. We'll compare this later with the other algorithms.
+We can see little correlation overall, but certainly a few non-zero correlations. In particular, Factor 5 and Factor 7 seem to correlate to some extent, and there are a few others with minor correlations, e.g. Factor 1 and Factor 3, as well as Factor 5 and Factor 6. We'll compare this later with the other algorithms.
## Principal Component Analysis
@@ -285,7 +319,11 @@ Principal component analysis can reliably be classfied as a clustering method (a
I can recommend this paper as a great primer to PCA methods. It goes over a few concepts very relevant for PCA methods as well as clustering methods in general.
-Now, let's run the PCA. In R there's two functions built-in to run a principal component analysis, here we'll use the `prcomp()`function (the other being `princomp()`, but `prcomp()` is preferred for reasons beyond the scope of this post). The `prcomp()` function contains a few options we can play with, but in my experience it works fine out of the box if you've normalized the data beforehand ([link to documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp)). So we'll simply provide the data frame with the features. In addition, we'll also calculate the variance explained by each component. We do that by simply taking the standard deviation calculated by the PCA and squaring it, we'll save it in a new field called `variance`.
+{{< sidenote br="1em" >}}
+For some more info on the difference between `prcomp()` and `princomp()` see [this](https://stats.stackexchange.com/questions/20101/what-is-the-difference-between-r-functions-prcomp-and-princomp) thread on CrossValidated
+{{< /sidenote >}}
+
+Now, let's run the PCA. R has two built-in functions to run a principal component analysis; here we'll use the `prcomp()` function (the other being `princomp()`, but `prcomp()` is preferred for reasons beyond the scope of this post). The `prcomp()` function contains a few options we can play with, but in my experience it works fine out of the box if you've normalized the data beforehand ([link to documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp)). So we'll simply provide the data frame with the features. We'll also calculate the variance explained by each component, which we do by taking the standard deviation calculated by the PCA and squaring it; we'll save it in a new field called `variance`.
``` r
pc_model <- prcomp(data_features)
@@ -296,40 +334,46 @@ pc_model$variance <- pc_model$sdev^2
Next we can make a simple scree plot using the variance we calculated above. We'll create a plot with the principal components on the x-axis and the eigenvalues on the y-axis. The scree plot is a very popular way to visualize features of a PCA. In this plot the elbow is quite clearly at principal component 3, but as discussed, the scree plot is not the best nor the only way to determine the optimal number of components. In the code below I also added a calculation of the cumulative variance (`cum_variance`), which shows that a little more than 80% of the variance is captured in the first 7 components, while the first 3 components combined capture only 56%.
``` r
-pc_model$variance |>
+pc_model$variance |>
as_tibble() |>
- rename(eigenvalue = value) |>
- rownames_to_column("comp") |>
- mutate(comp = parse_number(comp),
- cum_variance = cumsum(eigenvalue)/sum(eigenvalue)) |>
- ggplot(aes(x = comp, y = eigenvalue)) +
+ rename(eigenvalue = value) |>
+ rownames_to_column("comp") |>
+ mutate(
+ comp = parse_number(comp),
+ cum_variance = cumsum(eigenvalue) / sum(eigenvalue)
+ ) |>
+ ggplot(aes(x = comp, y = eigenvalue)) +
geom_hline(yintercept = 1) +
- geom_line(size = 1) +
+ geom_line(linewidth = 1) +
geom_point(size = 3)
```
- Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
- ℹ Please use `linewidth` instead.
-
We'll also create a weight matrix again, based on the rotation from the PCA, much in the same way as before. A PCA by its very nature returns as many components as there are input variables; however, we're interested in just the first 7 components, so we'll select those using the `filter()` function.
``` r
-pc_weight_matrix <- pc_model$rotation |>
- data.frame() |>
- rownames_to_column("variable") |>
- pivot_longer(starts_with("PC"), names_to = "prin_comp", values_to = "loading")
-
-pca_loading_plot <- pc_weight_matrix |>
- filter(parse_number(prin_comp) <= n_comps) |>
- ggplot(aes(x = reorder(prin_comp, parse_number(prin_comp)), y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "PCA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+pc_weight_matrix <- pc_model$rotation |>
+ data.frame() |>
+ rownames_to_column("variable") |>
+ pivot_longer(starts_with("PC"),
+ names_to = "prin_comp", values_to = "loading"
+ )
+
+pca_loading_plot <- pc_weight_matrix |>
+ filter(parse_number(prin_comp) <= n_comps) |>
+ ggplot(aes(
+ x = reorder(prin_comp, parse_number(prin_comp)),
+ y = variable, fill = loading
+ )) +
+ geom_tile() +
+ labs(
+ title = "PCA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(pca_loading_plot)
```
@@ -341,20 +385,27 @@ One thing that immediately jumps out is that PC1 and PC2 are nearly identical to
We can also make a correlation matrix for the different principal components. We'll use the `x` field and otherwise create the correlation matrix the same way as before:
``` r
-pc_model$x |>
- cor() |>
- data.frame() |>
- rownames_to_column("comp_x") |>
- pivot_longer(cols = starts_with("PC"), names_to = "comp_y", values_to = "correlation") |>
- filter(parse_number(comp_x) <= n_comps,
- parse_number(comp_y) <= n_comps) |>
- ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
+pc_model$x |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("comp_x") |>
+ pivot_longer(
+ cols = starts_with("PC"),
+ names_to = "comp_y", values_to = "correlation"
+ ) |>
+ filter(
+ parse_number(comp_x) <= n_comps,
+ parse_number(comp_y) <= n_comps
+ ) |>
+ ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between PCs",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between PCs",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
@@ -387,10 +438,10 @@ pc_manual <- as.matrix(data_features) %*% eigenvector
Let's look at the scree plot:
``` r
-tibble(eigenvalues) |>
- ggplot(aes(x = seq_along(eigenvalues), y = eigenvalues)) +
+tibble(eigenvalues) |>
+ ggplot(aes(x = seq_along(eigenvalues), y = eigenvalues)) +
geom_hline(yintercept = 1) +
- geom_line(size = 1) +
+ geom_line(linewidth = 1) +
geom_point(size = 3)
```
@@ -399,9 +450,9 @@ tibble(eigenvalues) |>
Looks identical to the previous one. Let's also look at the correlation matrix between the principal components.
``` r
-data.frame(pc_manual) |>
- cor() %>%
- round(., 4)
+data.frame(pc_manual) |>
+ cor() |>
+ round(x = _, digits = 4)
```
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19
@@ -429,6 +480,10 @@ Yup, also still 0 across the board. Calculating a PCA by just using matrix manip
## Independent Component Analysis
+{{< sidenote br="14em" >}}
+If you're feeling particularly *mathematical*, you can find a paper [here](https://www.iiis.org/CDs2017/CD2017Spring/papers/ZA832BA.pdf) that compares different algorithms
+{{< /sidenote >}}
+
While PCA attempts to find components explaining the maximum degree of covariance or correlation, an ICA attempts to find components with maximum statistical independence. There's a subtle nuance here: PCA components are orthogonal and uncorrelated with each other, whereas ICA components are statistically independent. In practice, this means that ICA components will also have practically zero correlation. The main difference is in how the components are obtained. Like with factor analysis, most ICA algorithms require you to provide the number of components up front. The FastICA algorithm we'll use here is one implementation of ICA; InfoMax and JADE are two other implementations, and I couldn't tell you the difference between these algorithms. FastICA is implemented in R through the `fastICA()` function in the eponymous `{fastICA}` package ([link to documentation](https://www.rdocumentation.org/packages/fastICA/versions/1.2-2/topics/fastICA)).
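
To get a feel for why "uncorrelated" and "statistically independent" are not the same thing, here's a tiny, self-contained illustration (not specific to this dataset): for a symmetric variable, a nonlinear transformation can be essentially uncorrelated with the original while still being completely dependent on it.

``` r
# Zero correlation does not imply independence: x^2 is fully determined by x,
# yet for a symmetric x the linear correlation between the two is ~0.
set.seed(123)
x <- rnorm(1e4)
cor(x, x^2)       # close to 0
cor(abs(x), x^2)  # clearly non-zero: the dependence just isn't linear
```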
``` r
@@ -438,18 +493,26 @@ ica_model <- fastICA(data_features, n.comp = n_comps)
Let's create a weight matrix again. The output from the `fastICA()` doesn't provide useful names, but the [documentation](https://www.rdocumentation.org/packages/fastICA/versions/1.2-2/topics/fastICA) provides sufficient information. To create the weight matrix we take the `A` field, transpose it, get the right names to the right places and then create the plot like we've done several times now.
``` r
-ica_weight_matrix <- data.frame(t(ica_model$A)) |>
+ica_weight_matrix <- data.frame(t(ica_model$A)) |>
rename_with(~ str_glue("IC{seq(.)}")) |>
mutate(variable = names(data_features)) |>
- pivot_longer(cols = starts_with("IC"), names_to = "ic", values_to = "loading")
-
-ica_loading_plot <- ggplot(ica_weight_matrix, aes(x = ic, y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "ICA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+ pivot_longer(
+ cols = starts_with("IC"),
+ names_to = "ic", values_to = "loading"
+ )
+
+ica_loading_plot <- ggplot(
+ ica_weight_matrix,
+ aes(x = ic, y = variable, fill = loading)
+) +
+ geom_tile() +
+ labs(
+ title = "ICA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(ica_loading_plot)
```
@@ -461,18 +524,23 @@ The FastICA method doesn't rank the components based on variance like factor ana
Again, we can also visualize the correlation between the different components.
``` r
-ica_model$S |>
- cor() |>
- data.frame() |>
- rownames_to_column("comp_x") |>
- pivot_longer(cols = starts_with("X"), names_to = "comp_y", values_to = "correlation") |>
- ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
+ica_model$S |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("comp_x") |>
+ pivot_longer(
+ cols = starts_with("X"),
+ names_to = "comp_y", values_to = "correlation"
+ ) |>
+ ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between ICs",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between ICs",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
@@ -484,34 +552,38 @@ Okay, let's now compare the three approaches and put the loading matrices side b
``` r
all_weight_matrices <- bind_rows(
- fa_weight_matrix |>
- rename(comp = factor) |>
+ fa_weight_matrix |>
+ rename(comp = factor) |>
mutate(alg = "FA"),
- pc_weight_matrix |>
- rename(comp = prin_comp) |>
- mutate(alg = "PCA"),
- ica_weight_matrix |>
- rename(comp = ic) |>
+ pc_weight_matrix |>
+ rename(comp = prin_comp) |>
+ mutate(alg = "PCA"),
+ ica_weight_matrix |>
+ rename(comp = ic) |>
mutate(alg = "ICA")
)
-all_weight_matrices |>
- filter(parse_number(comp) <= n_comps) |>
- mutate(alg = str_glue("{alg} loadings"),
- alg = as_factor(alg)) |>
+all_weight_matrices |>
+ filter(parse_number(comp) <= n_comps) |>
+ mutate(
+ alg = str_glue("{alg} loadings"),
+ alg = as_factor(alg)
+ ) |>
ggplot(aes(x = comp, y = variable, fill = loading)) +
- geom_tile() +
- labs(x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- facet_wrap(~ alg, scales = "free_x")
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ facet_wrap(~alg, scales = "free_x")
```
Here we can most clearly see the overlap between the three methods: Factor 1, PC1, and IC6 capture essentially the same information. The same goes for Factor 2, PC2, and IC4. Beyond that, the other components vary quite markedly, and I wouldn't be comfortable calling any of them "fairly similar" to one another. You can see how some variables load together across multiple methods, but the components they end up in then also contain other information or miss some. We already discussed IC7 consisting mostly of a single variable, and a similar thing happens with Factor 4, but for a different variable.
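
If you want to verify that overlap numerically rather than by eye, one option is to correlate the per-person scores of the components that look alike. A quick sketch could look like the following; note that the IC column indices are specific to this particular run (since `fastICA()` starts from a random initialization) and that the signs of the correlations are arbitrary.

``` r
# If the loading plots are right, these should be close to +1 or -1
cor(fa_model$scores[, "Factor1"], pc_model$x[, "PC1"])
cor(fa_model$scores[, "Factor1"], ica_model$S[, 6])  # column 6 = "IC6" in this run
cor(fa_model$scores[, "Factor2"], pc_model$x[, "PC2"])
cor(fa_model$scores[, "Factor2"], ica_model$S[, 4])  # column 4 = "IC4" in this run
```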
-### BONUS: Hierarchical clustering
+## BONUS: Hierarchical clustering
I'll quickly go over hierarchical clustering too, since it's simple and easy to interpret. Hierarchical clustering works by taking your variables and splitting them first into two groups, then three, then four, and so on. It looks at similarity, using a concept called "Euclidean distance" (other distance measures are available) between the variables, to determine how to separate them. Essentially, it iteratively separates variables into groups until every variable is on its own. It does so rather aggressively: with the previous methods it's possible for a variable to contribute to two clusters, whereas with hierarchical clustering it belongs to a single cluster only. This approach makes it an easy way to see how variables cluster together at different thresholds.
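
To make those mechanics concrete, here's a minimal sketch of the two ingredients involved, a pairwise distance matrix and an agglomeration step, using base R defaults (Euclidean distance, complete linkage) and a made-up object name (`hc_sketch`); the dendrogram we actually work with below may be built with different settings.

``` r
# Cluster the 19 variables rather than the 72 respondents, so distances are
# computed between the *columns* of the feature matrix (hence the transpose).
hc_sketch <- hclust(dist(t(as.matrix(data_features))))
plot(hc_sketch)            # quick base-R dendrogram
cutree(hc_sketch, k = 3)   # e.g. assign each variable to one of 3 clusters
```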
@@ -540,7 +612,7 @@ hclust_weight_matrix <- cutree(hclust_model, k = n_comps)
The `cutree()` function assigns a cluster to each of the variables. It looks at the tree and determines where to cut the tree to get the desired number of branches and then tells you the composition of the branches. We can recreate this ourselves by simply adding a `geom_hline()` to the dendrogram. With some trial and error, it seems like cutting the tree at `y = 10.5` will give us 7 clusters.
``` r
-ggdendro::ggdendrogram(hclust_model) +
+ggdendro::ggdendrogram(hclust_model) +
geom_hline(yintercept = 10.5, color = "firebrick")
```
@@ -549,16 +621,20 @@ ggdendro::ggdendrogram(hclust_model) +
Let's look at how the clusters are made up according to the hierarchical clustering model. This isn't really a weight matrix, but rather a definition of the clusters. The "loadings" here are not numerical, but rather 1 or 0.
``` r
-hclust_weight_matrix %>%
- data.frame() |>
- janitor::clean_names() |>
- rename(cluster = x) |>
- rownames_to_column("variable") |>
- ggplot(aes(x = as_factor(cluster), y = variable, fill = as_factor(cluster))) +
- geom_tile() +
- labs(x = NULL,
- y = NULL,
- fill = "cluster") +
+hclust_weight_matrix |>
+ data.frame() |>
+ rename(cluster = hclust_weight_matrix) |>
+ rownames_to_column("variable") |>
+ ggplot(aes(
+ x = as_factor(cluster), y = variable,
+ fill = as_factor(cluster)
+ )) +
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = "cluster"
+ ) +
scico::scale_fill_scico_d(palette = "batlow")
```
diff --git a/content/blog/2021-comparison-fa-pca-ica/index.qmd b/content/blog/2021-comparison-fa-pca-ica/index.qmd
index 8cc0c05..07d13e8 100644
--- a/content/blog/2021-comparison-fa-pca-ica/index.qmd
+++ b/content/blog/2021-comparison-fa-pca-ica/index.qmd
@@ -14,13 +14,19 @@ execute:
fig.show: hold
results: hold
out.width: 80%
+editor_options:
+ chunk_output_type: console
---
This is just a very quick blog post outlining some of the commonalities and differences between factor analysis (FA), principal component analysis (PCA), and independent component analysis (ICA). I was inspired to write some of this down through some confusion caused in the lab by SPSS' apparent dual usage of the term "factor analysis" and "principal components". A few of my colleagues who use SPSS showed me the following screen:
-{{{< image src="spss-screenshot.png" alt="spss screenshot" >}}}
+{{{< figure src="spss-screenshot.png" alt="spss screenshot" >}}}
-This screen shows up when you click `Analyze` -> `Dimension Reduction` -> `Factor`, which then opens a window called "Factor Analysis: Extraction" which lets you pick "Principal components" as a method. To put the (apparent) PCA method as a sub-section of factor analysis is misleading at best, and straight-up erronious at worst. The other options for the method here is "Principal axis factoring" (which is closer to traditional factor analysis) and "Maximum likelihood" ([source for screenshot and SPSS interface](https://stats.idre.ucla.edu/spss/seminars/efa-spss/)). If you're wondering if you're the first one to be confused by SPSS' choice to present this in such a way, you're not ([link](https://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi), [link](https://stats.stackexchange.com/questions/24781/interpreting-discrepancies-between-r-and-spss-with-exploratory-factor-analysis)).
+{{{< sidenote br="2em" >}}}
+If you're wondering if you're the first one to be confused by SPSS' choice to present this in such a way, you're not ([link](https://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi), [link](https://stats.stackexchange.com/questions/24781/interpreting-discrepancies-between-r-and-spss-with-exploratory-factor-analysis)).
+{{{< /sidenote >}}}
+
+This screen shows up when you click `Analyze` -> `Dimension Reduction` -> `Factor`, which then opens a window called "Factor Analysis: Extraction" that lets you pick "Principal components" as a method. To put the (apparent) PCA method as a sub-section of factor analysis is misleading at best, and straight-up erroneous at worst. The other options for the method here are "Principal axis factoring" (which is closer to traditional factor analysis) and "Maximum likelihood" ([source for screenshot and SPSS interface](https://stats.idre.ucla.edu/spss/seminars/efa-spss/)).
Standard disclaimer: I'm not a statistician, and I'm definitely not confident enough to go in-depth into the mathematics of the different algorithms. Instead, I'll just run three common latent variable modeling/clustering methods, and show the difference in results when applied to the same data. Wherever I feel confident, I will also elaborate on the underlying mathematical principles and concepts. It's a short post, and I'm sure there are levels of nuance and complexity I've missed. Please let me know if you think I've committed a major oversight.
@@ -47,8 +53,8 @@ We'll load the data into R, clean up the variable names, convert the outcome var
#| label: load-data
#| message: false
-data <- read_csv("sobar-72.csv") |>
- janitor::clean_names() |>
+data <- read_csv("./data/sobar-72.csv") |>
+ janitor::clean_names() |>
mutate(ca_cervix = as_factor(ca_cervix))
skim_summ <- skimr::skim_with(base = skimr::sfl())
@@ -60,8 +66,8 @@ Now let's only select the variables containing the risk factors (called features
```{r}
#| label: scale-data
-data_features <- data |>
- select(-ca_cervix) |>
+data_features <- data |>
+ select(-ca_cervix) |>
mutate(across(everything(), ~ scale(.x)))
summary(data_features$attitude_consistency)
@@ -72,17 +78,23 @@ So now we have a data frame with 72 entries and 19 normalized columns, each repr
```{r}
#| label: corr-matrix-original
-cor(data_features) |>
- as_tibble() %>%
- mutate(feature_y = names(.)) |>
- pivot_longer(cols = -feature_y, names_to = "feature_x", values_to = "correlation") |>
- mutate(feature_y = fct_rev(feature_y)) |>
- ggplot(aes(x = feature_x, y = feature_y, fill = correlation)) +
- geom_tile() +
- labs(x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
- coord_fixed() +
+cor(data_features) |>
+ as_tibble() |>
+ mutate(feature_y = names(data_features)) |>
+ pivot_longer(
+ cols = -feature_y,
+ names_to = "feature_x",
+ values_to = "correlation"
+ ) |>
+ mutate(feature_y = fct_rev(feature_y)) |>
+ ggplot(aes(x = feature_x, y = feature_y, fill = correlation)) +
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
+ coord_fixed() +
theme(axis.text.x = element_text(hjust = 1, angle = 30))
```
@@ -94,6 +106,10 @@ Let's now dive into the approaches we'll use. We'll discuss three here, and if t
## Selecting the number of components
+{{{< sidenote br="6em" >}}}
+Icasso is implemented in MATLAB, which I won't bore you with here, but it seems someone created a [Python port](https://github.com/Teekuningas/icasso) also
+{{{< /sidenote >}}}
+
For factor analysis and ICA we need to provide the number of factors we want to extract. We'll use the same number of factors/components throughout this tutorial (also for PCA). Selection of the optimal number of factors/components is a fairly arbitrary process which I won't go into now. In short, before writing this I ran PCA and a tool called [icasso](https://research.ics.aalto.fi/ica/icasso/). _Icasso_ runs an ICA algorithm a number of times and provides several parameters on the stability of the clusters at different thresholds, which makes it a very data-driven approach. A more common and easier way to get some data-driven insight into the optimal number of clusters is the "elbow" method, based on PCA, the eigenvalues of the components, and the cumulative variance explained by the components (we'll show those later). However, in the end you should also look at the weight matrix at the different cluster thresholds, and it remains a fairly arbitrary process. In this case, _icasso_ showed that 7 components was a good (though not the best) solution, the weight matrix looked okay, and 7 components explained more than 80% of the variance in the PCA. I went for 7 components in the end also because it served the purpose of this blogpost quite well, but I think different thresholds are valid and you could make a case for other ones based on the data, the interpretation of the weight matrix, etc. This process is a bit of an art and a science combined.
```{r}
@@ -105,34 +121,47 @@ n_comps <- 7
## Factor Analysis
-So, let's run the factor analysis. Technically speaking, factor analysis isn't a clustering method but rather a latent variable modeling method [source](https://stats.stackexchange.com/questions/241726/understanding-exploratory-factor-analysis-some-points-for-clarification). The primary aim of a factor analysis is the reconstruction of correlations/covariances between variables. Maximizing explained variance of the factors is only a secondary aim and we'll get to why that is relevant in the PCA section.
+So, let's run the factor analysis. Technically speaking, factor analysis isn't a clustering method but rather a latent variable modeling method ([source](https://stats.stackexchange.com/questions/241726/understanding-exploratory-factor-analysis-some-points-for-clarification)). The primary aim of a factor analysis is the reconstruction of correlations/covariances between variables. Maximizing explained variance of the factors is only a secondary aim and we'll get to why that is relevant in the PCA section.
We've established we want 7 factors. The factor analysis method is implemented in R through the `factanal()` function (see [here](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/factanal) for the documentation). This function applies a common factor model using the maximum likelihood method. We'll simply provide it with our data frame and the number of factors we want to extract, and we'll ask it to provide Bartlett's weighted least-squares scores as well. We can apply a "rotation" to the factors to reduce the complexity of the factor loadings and make them easier to interpret; here we'll use the `varimax` option. Varimax maximizes the sum of the variances of the squared loadings. Then we'll print the model to see the results (and sort so the loadings are ordered). The output may be a bit long.
```{r}
#| label: fa
-fa_model <- factanal(data_features, factors = n_comps, scores = "Bartlett", rotation = "varimax")
+fa_model <- factanal(
+ data_features,
+ factors = n_comps,
+ scores = "Bartlett",
+ rotation = "varimax"
+)
+
print(fa_model, sort = TRUE)
```
We may be tempted to immediately look at the _p_-value at the end of the output. This _p_-value denotes whether the assumption of perfect fit can be rejected. If this _p_-value is below 0.05 or 0.01 we can reject the hypothesis of perfect fit, meaning that we could probably try a different method or try a different number of factors. In this case, the _p_-value is larger than 0.05 so we cannot reject the hypothesis of perfect fit.
-The "Loadings" section of the results show a make-shift weight matrix, but in order to further interpret these results, let's create a plot showing the weight matrix. We'll get the results from the factor analysis model we created earlier using the `tidy()` function from `{broom}` and convert it to long format. We'll then create a weight matrix much in the same way we did earlier.
+The "Loadings" section of the results shows a makeshift weight matrix, but in order to further interpret these results, let's create a plot showing the weight matrix. We'll get the results from the factor analysis model we created earlier using the `tidy()` function from the `{broom}` package and convert it to long format. We'll then create a weight matrix much in the same way we did earlier.
```{r}
#| label: fa-weight-matrix
-fa_weight_matrix <- broom::tidy(fa_model) |>
- pivot_longer(starts_with("fl"), names_to = "factor", values_to = "loading")
+fa_weight_matrix <- broom::tidy(fa_model) |>
+ pivot_longer(starts_with("fl"),
+ names_to = "factor", values_to = "loading"
+ )
-fa_loading_plot <- ggplot(fa_weight_matrix, aes(x = factor, y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "FA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+fa_loading_plot <- ggplot(
+ fa_weight_matrix,
+ aes(x = factor, y = variable, fill = loading)
+) +
+ geom_tile() +
+ labs(
+ title = "FA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(fa_loading_plot)
```
@@ -144,22 +173,27 @@ Lastly, I think it would be interesting to see how the different factors relate
```{r}
#| label: fa-corr-matrix
-fa_model$scores |>
- cor() |>
- data.frame() |>
- rownames_to_column("factor_x") |>
- pivot_longer(cols = -factor_x, names_to = "factor_y", values_to = "correlation") |>
- ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) +
+fa_model$scores |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("factor_x") |>
+ pivot_longer(
+ cols = -factor_x,
+ names_to = "factor_y", values_to = "correlation"
+ ) |>
+ ggplot(aes(x = factor_x, y = factor_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between FA scores",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between FA scores",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
-We can see little correlation but certainly a few non-zero correlations also. In particular Factor 5 and Factor 7 seem to correlate to some extent at least and there are a few others with some minor correlations, e.g. Factor 1 and Factor 3, as well as Factor 3 and Factor 6. We'll compare this later with the other algorithms.
+We can see little correlation overall, but certainly a few non-zero correlations. In particular, Factor 5 and Factor 7 seem to correlate to some extent, and there are a few others with minor correlations, e.g. Factor 1 and Factor 3, as well as Factor 5 and Factor 6. We'll compare this later with the other algorithms.
## Principal Component Analysis
@@ -170,7 +204,11 @@ Principal component analysis can reliably be classfied as a clustering method (a
I can recommend this paper as a great primer to PCA methods. It goes over a few concepts very relevant for PCA methods as well as clustering methods in general.
-Now, let's run the PCA. In R there's two functions built-in to run a principal component analysis, here we'll use the `prcomp()`function (the other being `princomp()`, but `prcomp()` is preferred for reasons beyond the scope of this post). The `prcomp()` function contains a few options we can play with, but in my experience it works fine out of the box if you've normalized the data beforehand ([link to documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp)). So we'll simply provide the data frame with the features. In addition, we'll also calculate the variance explained by each component. We do that by simply taking the standard deviation calculated by the PCA and squaring it, we'll save it in a new field called `variance`.
+{{{< sidenote br="1em" >}}}
+For some more info on the difference between `prcomp()` and `princomp()` see [this](https://stats.stackexchange.com/questions/20101/what-is-the-difference-between-r-functions-prcomp-and-princomp) thread on CrossValidated
+{{{< /sidenote >}}}
+
+Now, let's run the PCA. R has two built-in functions to run a principal component analysis; here we'll use the `prcomp()` function (the other being `princomp()`, but `prcomp()` is preferred for reasons beyond the scope of this post). The `prcomp()` function contains a few options we can play with, but in my experience it works fine out of the box if you've normalized the data beforehand ([link to documentation](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/prcomp)). So we'll simply provide the data frame with the features. We'll also calculate the variance explained by each component, which we do by taking the standard deviation calculated by the PCA and squaring it; we'll save it in a new field called `variance`.
```{r}
#| label: pca
@@ -185,15 +223,17 @@ Next we can make a simple scree plot using the variance we calculate above. We'l
```{r}
#| label: pca-scree
-pc_model$variance |>
+pc_model$variance |>
as_tibble() |>
- rename(eigenvalue = value) |>
- rownames_to_column("comp") |>
- mutate(comp = parse_number(comp),
- cum_variance = cumsum(eigenvalue)/sum(eigenvalue)) |>
- ggplot(aes(x = comp, y = eigenvalue)) +
+ rename(eigenvalue = value) |>
+ rownames_to_column("comp") |>
+ mutate(
+ comp = parse_number(comp),
+ cum_variance = cumsum(eigenvalue) / sum(eigenvalue)
+ ) |>
+ ggplot(aes(x = comp, y = eigenvalue)) +
geom_hline(yintercept = 1) +
- geom_line(size = 1) +
+ geom_line(linewidth = 1) +
geom_point(size = 3)
```
@@ -202,20 +242,27 @@ We'll also create a weight matrix again, based on the rotation from the PCA. We'
```{r}
#| label: pca-weight-matrix
-pc_weight_matrix <- pc_model$rotation |>
- data.frame() |>
- rownames_to_column("variable") |>
- pivot_longer(starts_with("PC"), names_to = "prin_comp", values_to = "loading")
-
-pca_loading_plot <- pc_weight_matrix |>
- filter(parse_number(prin_comp) <= n_comps) |>
- ggplot(aes(x = reorder(prin_comp, parse_number(prin_comp)), y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "PCA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+pc_weight_matrix <- pc_model$rotation |>
+ data.frame() |>
+ rownames_to_column("variable") |>
+ pivot_longer(starts_with("PC"),
+ names_to = "prin_comp", values_to = "loading"
+ )
+
+pca_loading_plot <- pc_weight_matrix |>
+ filter(parse_number(prin_comp) <= n_comps) |>
+ ggplot(aes(
+ x = reorder(prin_comp, parse_number(prin_comp)),
+ y = variable, fill = loading
+ )) +
+ geom_tile() +
+ labs(
+ title = "PCA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(pca_loading_plot)
```
@@ -227,20 +274,27 @@ We can also make a correlation matrix for the different principal components. We
```{r}
#| label: pca-corr-matrix
-pc_model$x |>
- cor() |>
- data.frame() |>
- rownames_to_column("comp_x") |>
- pivot_longer(cols = starts_with("PC"), names_to = "comp_y", values_to = "correlation") |>
- filter(parse_number(comp_x) <= n_comps,
- parse_number(comp_y) <= n_comps) |>
- ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
+pc_model$x |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("comp_x") |>
+ pivot_longer(
+ cols = starts_with("PC"),
+ names_to = "comp_y", values_to = "correlation"
+ ) |>
+ filter(
+ parse_number(comp_x) <= n_comps,
+ parse_number(comp_y) <= n_comps
+ ) |>
+ ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between PCs",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between PCs",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
@@ -275,10 +329,10 @@ Let's look at the scree plot:
```{r}
#| label: pca-manual-scree
-tibble(eigenvalues) |>
- ggplot(aes(x = seq_along(eigenvalues), y = eigenvalues)) +
+tibble(eigenvalues) |>
+ ggplot(aes(x = seq_along(eigenvalues), y = eigenvalues)) +
geom_hline(yintercept = 1) +
- geom_line(size = 1) +
+ geom_line(linewidth = 1) +
geom_point(size = 3)
```
@@ -290,33 +344,42 @@ Looks identical to the previous one. Let's also look at the correlation matrix b
#| eval: false
data.frame(pc_manual) |>
- rename_with(~ str_glue("PC{parse_number(.x)}")) |>
- rownames_to_column("participant") |>
- pivot_longer(starts_with("PC"), names_to = "prin_comp", values_to = "loading") |>
- filter(parse_number(prin_comp) <= n_comps) |>
- ggplot(aes(x = reorder(prin_comp, parse_number(prin_comp)),
- y = parse_number(participant), fill = loading)) +
- geom_tile() +
- labs(title = "Individual loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin") +
- coord_fixed(1/5)
+ rename_with(~ str_glue("PC{parse_number(.x)}")) |>
+ rownames_to_column("participant") |>
+ pivot_longer(starts_with("PC"),
+ names_to = "prin_comp", values_to = "loading"
+ ) |>
+ filter(parse_number(prin_comp) <= n_comps) |>
+ ggplot(aes(
+ x = reorder(prin_comp, parse_number(prin_comp)),
+ y = parse_number(participant), fill = loading
+ )) +
+ geom_tile() +
+ labs(
+ title = "Individual loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin") +
+ coord_fixed(1 / 5)
```
```{r}
#| label: pca-manual-corr-matrix
-data.frame(pc_manual) |>
- cor() %>%
- round(., 4)
+data.frame(pc_manual) |>
+ cor() |>
+ round(x = _, digits = 4)
```
Yup, also still 0 across the board. Calculating a PCA by just using matrix manipulations isn't too complicated, and it's a fun exercise, but I'd recommend just sticking to the `prcomp()` function, since it's a lot simpler and offer better functionality. Now, let's move on to the independent component analysis.
-
## Independent Component Analysis
+{{{< sidenote br="14em" >}}}
+If you're feeling particularly _mathematical_, you can find a paper [here](https://www.iiis.org/CDs2017/CD2017Spring/papers/ZA832BA.pdf) that compares different algorithms
+{{{< /sidenote >}}}
+
While PCA attempts to find components explaining the maximum degree of covariance or correlation, an ICA attempts to find components with maximum statistical independence. There's a subtle nuance here: PCA components are orthogonal and uncorrelated with each other, whereas ICA components are statistically independent. In practice, this means that ICA components will also have practically zero correlation. The main difference is in how the components are obtained. Like with factor analysis, most ICA algorithms require you to provide the number of components up front. The FastICA algorithm we'll use here is one implementation of ICA; InfoMax and JADE are two other implementations, and I couldn't tell you the difference between these algorithms. FastICA is implemented in R through the `fastICA()` function in the eponymous `{fastICA}` package ([link to documentation](https://www.rdocumentation.org/packages/fastICA/versions/1.2-2/topics/fastICA)).
```{r}
@@ -330,18 +393,26 @@ Let's create a weight matrix again. The output from the `fastICA()` doesn't prov
```{r}
#| label: ica-weight-matrix
-ica_weight_matrix <- data.frame(t(ica_model$A)) |>
+ica_weight_matrix <- data.frame(t(ica_model$A)) |>
rename_with(~ str_glue("IC{seq(.)}")) |>
mutate(variable = names(data_features)) |>
- pivot_longer(cols = starts_with("IC"), names_to = "ic", values_to = "loading")
-
-ica_loading_plot <- ggplot(ica_weight_matrix, aes(x = ic, y = variable, fill = loading)) +
- geom_tile() +
- labs(title = "ICA loadings",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- coord_fixed(ratio = 1/2)
+ pivot_longer(
+ cols = starts_with("IC"),
+ names_to = "ic", values_to = "loading"
+ )
+
+ica_loading_plot <- ggplot(
+ ica_weight_matrix,
+ aes(x = ic, y = variable, fill = loading)
+) +
+ geom_tile() +
+ labs(
+ title = "ICA loadings",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ coord_fixed(ratio = 1 / 2)
print(ica_loading_plot)
```
@@ -353,18 +424,23 @@ Again, we can also visualize the correlation between the different components.
```{r}
#| label: ica-corr-matrix
-ica_model$S |>
- cor() |>
- data.frame() |>
- rownames_to_column("comp_x") |>
- pivot_longer(cols = starts_with("X"), names_to = "comp_y", values_to = "correlation") |>
- ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
+ica_model$S |>
+ cor() |>
+ data.frame() |>
+ rownames_to_column("comp_x") |>
+ pivot_longer(
+ cols = starts_with("X"),
+ names_to = "comp_y", values_to = "correlation"
+ ) |>
+ ggplot(aes(x = comp_x, y = comp_y, fill = correlation)) +
geom_tile() +
- geom_text(aes(label = round(correlation,4)), color = "white") +
- labs(title = "Correlation between ICs",
- x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "berlin", limits = c(-1,1)) +
+ geom_text(aes(label = round(correlation, 4)), color = "white") +
+ labs(
+ title = "Correlation between ICs",
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "berlin", limits = c(-1, 1)) +
coord_equal()
```
@@ -377,32 +453,36 @@ Okay, let's now compare the three approaches and put the loading matrices side b
#| out.width: 100%
all_weight_matrices <- bind_rows(
- fa_weight_matrix |>
- rename(comp = factor) |>
+ fa_weight_matrix |>
+ rename(comp = factor) |>
mutate(alg = "FA"),
- pc_weight_matrix |>
- rename(comp = prin_comp) |>
- mutate(alg = "PCA"),
- ica_weight_matrix |>
- rename(comp = ic) |>
+ pc_weight_matrix |>
+ rename(comp = prin_comp) |>
+ mutate(alg = "PCA"),
+ ica_weight_matrix |>
+ rename(comp = ic) |>
mutate(alg = "ICA")
)
-all_weight_matrices |>
- filter(parse_number(comp) <= n_comps) |>
- mutate(alg = str_glue("{alg} loadings"),
- alg = as_factor(alg)) |>
+all_weight_matrices |>
+ filter(parse_number(comp) <= n_comps) |>
+ mutate(
+ alg = str_glue("{alg} loadings"),
+ alg = as_factor(alg)
+ ) |>
ggplot(aes(x = comp, y = variable, fill = loading)) +
- geom_tile() +
- labs(x = NULL,
- y = NULL) +
- scico::scale_fill_scico(palette = "cork", limits = c(-1,1)) +
- facet_wrap(~ alg, scales = "free_x")
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL
+ ) +
+ scico::scale_fill_scico(palette = "cork", limits = c(-1, 1)) +
+ facet_wrap(~alg, scales = "free_x")
```
Here we can most clearly see the overlap between the three methods: Factor 1, PC1, and IC6 capture essentially the same information. The same goes for Factor 2, PC2, and IC4. Beyond that, the other components vary quite markedly, and I wouldn't be comfortable calling any of them "fairly similar" to one another. You can see how some variables load together across multiple methods, but the components they end up in then also contain other information or miss some. We already discussed IC7 consisting mostly of a single variable, and a similar thing happens with Factor 4, but for a different variable.
-### BONUS: Hierarchical clustering
+## BONUS: Hierarchical clustering
I'll quickly go over hierarchical clustering too, since it's simple and easy to interpret. Hierarchical clustering works by taking your variables and splitting them first into two groups, then three, then four, and so on. It looks at similarity, using a concept called "Euclidean distance" (other distance measures are available) between the variables, to determine how to separate them. Essentially, it iteratively separates variables into groups until every variable is on its own. It does so rather aggressively: with the previous methods it's possible for a variable to contribute to two clusters, whereas with hierarchical clustering it belongs to a single cluster only. This approach makes it an easy way to see how variables cluster together at different thresholds.
@@ -437,7 +517,7 @@ The `cutree()` function assigns a cluster to each of the variables. It looks at
```{r}
#| label: dendrogram-w-line
-ggdendro::ggdendrogram(hclust_model) +
+ggdendro::ggdendrogram(hclust_model) +
geom_hline(yintercept = 10.5, color = "firebrick")
```
@@ -446,21 +526,23 @@ Let's look at how the clusters are made up according to the hierarchical cluster
```{r}
#| label: hclust-weight-matrix
-hclust_weight_matrix %>%
- data.frame() |>
- janitor::clean_names() |>
- rename(cluster = x) |>
- rownames_to_column("variable") |>
- ggplot(aes(x = as_factor(cluster), y = variable, fill = as_factor(cluster))) +
- geom_tile() +
- labs(x = NULL,
- y = NULL,
- fill = "cluster") +
+hclust_weight_matrix |>
+ data.frame() |>
+ rename(cluster = hclust_weight_matrix) |>
+ rownames_to_column("variable") |>
+ ggplot(aes(
+ x = as_factor(cluster), y = variable,
+ fill = as_factor(cluster)
+ )) +
+ geom_tile() +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = "cluster"
+ ) +
scico::scale_fill_scico_d(palette = "batlow")
```
We can again see that the same two clusters show up. Clusters 6 and 7 resemble those of Factor 2 and Factor 1, PC2 and PC1, and IC4 and IC6 respectively. If you want to learn more about what you can do with `hclust()` and associated functions, you can check out [this webpage](https://uc-r.github.io/hc_clustering).
Okay, I'd like to leave it at that. This blogpost is long enough. Again, I had to simplify and take some shortcuts; if you think I made mistakes in that effort, please let me know and I'll fix it as well as I can!
-
-
diff --git a/content/blog/2021-easy-map-norway/areal.csv b/content/blog/2021-easy-map-norway/data/areal.csv
similarity index 100%
rename from content/blog/2021-easy-map-norway/areal.csv
rename to content/blog/2021-easy-map-norway/data/areal.csv
diff --git a/content/blog/2021-easy-map-norway/data/norway_hospitals.rds b/content/blog/2021-easy-map-norway/data/norway_hospitals.rds
new file mode 100644
index 0000000..d13a32c
--- /dev/null
+++ b/content/blog/2021-easy-map-norway/data/norway_hospitals.rds
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:10db380d65beafdbc632a6a0ac27d7c714e2a426f5693635726a4f1a929e50ef
+size 601
diff --git a/content/blog/2021-easy-map-norway/splmaps_data.RData b/content/blog/2021-easy-map-norway/data/splmaps_data.RData
similarity index 100%
rename from content/blog/2021-easy-map-norway/splmaps_data.RData
rename to content/blog/2021-easy-map-norway/data/splmaps_data.RData
diff --git a/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/age-plot-1.png b/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/age-plot-1.png
index 03b39ea..6514024 100644
Binary files a/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/age-plot-1.png and b/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/age-plot-1.png differ
diff --git a/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/plot-kommune-faceted-1.png b/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/plot-kommune-faceted-1.png
index f8ce1ce..3e39cf3 100644
Binary files a/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/plot-kommune-faceted-1.png and b/content/blog/2021-easy-map-norway/index.markdown_strict_files/figure-markdown_strict/plot-kommune-faceted-1.png differ
diff --git a/content/blog/2021-easy-map-norway/index.md b/content/blog/2021-easy-map-norway/index.md
index 0249baf..16b83d6 100644
--- a/content/blog/2021-easy-map-norway/index.md
+++ b/content/blog/2021-easy-map-norway/index.md
@@ -9,40 +9,35 @@ tags:
- ggplot
- map
- norway
-editor_options:
- chunk_output_type: console
execute:
fig.retina: 2
fig.align: center
fig.show: hold
results: hold
out.width: 80%
+editor_options:
+ chunk_output_type: console
---
-
-As of March 2023 {fhimaps} and {splmaps} are no longer supported due to budget cuts at the institution supporting them. The amazing developers (led by Richard Aubrey White and Chi Zhang) have moved the data and functionality to a new package called {csmaps}. The code below is updated to reflect the changes.
-
-
-
-
+{{< standout bg="#acc8d4" >}}
+As of March 2023 {fhimaps} and {splmaps} are no longer supported due to budget cuts at the institution supporting them. The amazing developers (led by [Richard Aubrey White](https://www.rwhite.no) and [Chi Zhang](https://andreaczhang.github.io)) have moved the data and functionality to a new package called `{csmaps}`. The code below is updated to reflect the changes.
+{{< /standout >}}
Every now and then you discover a much simpler solution to a problem you spent a lot of time solving. This recently happened to me on the topic of creating a map of Norway in R. In this post, I want to go through what I learned.
Previously, I used a JSON file and the `{geojsonio}` package to create a map of Norway and its fylker (counties) in particular. This was a very flexible and robust way of going about it, but also quite cumbersome. This method relies on a high-quality JSON file, meaning a file that is detailed enough to display everything nicely, but not so detailed that it takes a ton of time and computing power to create a single plot. While I'll still use this method if I need to create a map for places other than Norway, I think I've found a better and easier solution for plotting Norway and the fylker and kommuner in the [`{csmaps}`](https://www.csids.no/csmaps/) package.
-The `{csmaps}` package is created by the *Consortium for Statistics in Disease Surveillance* team. It's part of a series of packages (which they refer to as the "csverse"), which includes a package containing basic health surveillance data ([`{csdata}`](https://www.csids.no/csdata/)), one for real-time analysis in disease surveillance ([`{sc9}`](https://www.csids.no/sc9/)) and a few more. Here I'll dive into the `{csmaps}` package with some help from the `{csdata}` package. I'll also use the `{ggmap}` package to help with some other data and plotting. It's perhaps important to note that `{ggmap}` does contain a map of Norway as a whole, but not of the fylker and kommuner (municipalities), hence the usefulness of the `{csmaps}` package, which contains both. I'll also use `{tidyverse}` and `{ggtext}` as I always do. I won't load `{csmaps}` with the `library()` function, but will use the `::` operator instead since it'll make it easier to navigate the different datasets included.
+{{< sidenote >}}
+The group has [a bunch of packages](https://www.csids.no) for data science within public health
+{{< /sidenote >}}
+
+The `{csmaps}` package is created by the [*Consortium for Statistics in Disease Surveillance*](https://www.csids.no) team. It's part of a series of packages (which they refer to as the "csverse"), which includes a package containing basic health surveillance data ([`{csdata}`](https://www.csids.no/csdata/)), one for real-time analysis in disease surveillance ([`{sc9}`](https://www.csids.no/sc9/)) and a few more. Here I'll dive into the `{csmaps}` package with some help from the `{csdata}` package. I'll also use the `{ggmap}` package to help with some other data and plotting. It's perhaps important to note that `{ggmap}` does contain a map of Norway as a whole, but not of the fylker and kommuner (municipalities), hence the usefulness of the `{csmaps}` package, which contains both. I'll also use `{tidyverse}` and `{ggtext}` as I always do. I won't load `{csmaps}` with the `library()` function, but will use the `::` operator instead since it'll make it easier to navigate the different datasets included.
``` r
library(tidyverse)
@@ -54,10 +49,10 @@ library(ggmap)
So let's have a look at what's included. You'll see that nearly all maps come in either a `data.table` format or an `sf` format. Here I'll use only the data frames, since they're a lot easier to work with. The maps in `sf` format can be useful elsewhere, but I think for most purposes it's easier and more intuitive to work with data frames.
``` r
-data(package = "csmaps") |>
- pluck("results") |>
- as_tibble() |>
- select(Item, Title) |>
+data(package = "csmaps") |>
+ pluck("results") |>
+ as_tibble() |>
+ select(Item, Title) |>
print(n = 18)
```
@@ -66,22 +61,25 @@ A comprehensive version of this list is also included in the [reference](https:/
So let's have a look at one of those maps. For instance, the one with the new fylker from 2020 with an inset of Oslo.
``` r
-map_df <- nor_county_map_b2020_insert_oslo_dt |>
+map_df <- nor_county_map_b2020_insert_oslo_dt |>
glimpse()
```
Rows: 4,493
Columns: 5
- $ long 5.823860, 5.969413, 6.183042, 6.302433, 6.538059, 6.693511, 6.657289, 6.767480, 7.096287, 7.110080, 7.214667, 7.098724, 7.048585, 6.991842, 6.962960, 7.022880, 6.882833, 6.932169, 6.865142, 6.824533, 6.847579, 6.939172, 6.881039, 6.903689, 6.560838, 6.608553, 6.518096, 6.572750, 6.622347, 6.436069, 6.497499, 6.614721, 6.587698, 6.631556, 6.560330, 6.382453, 6.324024, 6.350390, 6.239081, 6.039468, 5.996479, 5.875390, 5.666733, 5.489157, 5.601538, 5.576495, 5.633912, 5.686111, 5.565115, 5.581353, 5.769811, 5.741457, 5.786016, 5.851127, 5.893869, 5.979000, 5.965561, 6.175315, 6.281425, 6.116682, 6.086432, 6.108257, 6.072709, 5.995039, 6.036212, 6.090791, 5.949121, 5.889756, 5.977865, 6.174015, 6.022552, 6.192079, 6.562320, 6.360672, 6.157856, 6.113738, 6.273343, 6.289766, 6.217630, 6.226502, 6.186815, 6.201218, 6.176712, 6.102012, 6.118100, 6.085544, 6.020944, 6.212991, 6.210265, 6.270777, 6.598666, 6.270651, 6.368525, 6.312591, 6.218008, 6.119585, 6.014455, 5.977…
- $ lat 59.64576, 59.57897, 59.58607, 59.76743, 59.84285, 59.82295, 59.70174, 59.70049, 59.78278, 59.70060, 59.67269, 59.65129, 59.58683, 59.59703, 59.51645, 59.48120, 59.42930, 59.40027, 59.36840, 59.26988, 59.18267, 59.18192, 59.14459, 59.11340, 58.90995, 58.88007, 58.76741, 58.76528, 58.69668, 58.64933, 58.60992, 58.62509, 58.49277, 58.44801, 58.32740, 58.27794, 58.29827, 58.32613, 58.32347, 58.37897, 58.47691, 58.46442, 58.55017, 58.75369, 58.87953, 58.95905, 58.89528, 58.92810, 58.98060, 59.03644, 58.96494, 58.85185, 58.93869, 58.92934, 58.96797, 58.93705, 58.90389, 58.83516, 58.84096, 58.87579, 58.90089, 58.92352, 58.90352, 58.97113, 58.99728, 58.98579, 59.02345, 59.07371, 59.12151, 59.13993, 59.14356, 59.25835, 59.32107, 59.31982, 59.26062, 59.30281, 59.29761, 59.31837, 59.39100, 59.32360, 59.31884, 59.34543, 59.31699, 59.32179, 59.37375, 59.32646, 59.34218, 59.44488, 59.40925, 59.50584, 59.55886, 59.51752, 59.64394, 59.64117, 59.47384, 59.40548, 59.41009, 59.35…
- $ order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217…
- $ group 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.2, 11.3, 11.3, 11.3, 11.3, 11.3, 11.3, 11.4, 11.4, 11.…
- $ location_code "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11", "county_nor11",…
+ $ long 5.823860, 5.969413, 6.183042, 6.302433, 6.538059, 6.6935…
+ $ lat 59.64576, 59.57897, 59.58607, 59.76743, 59.84285, 59.822…
+ $ order 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
+ $ group 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11.1, 11…
+ $ location_code "county_nor11", "county_nor11", "county_nor11", "county_…
Immediately you can see that there are a lot of rows, each representing a point on the map. A data frame with a larger number of rows would be more detailed (i.e. fewer straight lines, more detail in the curvature of borders, etc.). Let's create a very simple map. We'll use `geom_polygon()` to turn our data frame into a map. The locations of the points are given in longitudes and latitudes (like x- and y-coordinates), and the group serves to make sure lines are drawn correctly (try running the code below without `group = group` and see what happens). The `location_code` denotes the county number (which isn't from 1 to 11, but instead uses some other standard format matching numbers in other government datasets). Let's see the simplest map:
``` r
-ggplot(map_df, aes(x = long, y = lat, group = group, fill = location_code)) +
+ggplot(map_df, aes(
+ x = long, y = lat,
+ group = group, fill = location_code
+)) +
geom_polygon()
```
@@ -90,7 +88,7 @@ ggplot(map_df, aes(x = long, y = lat, group = group, fill = location_code)) +
Now let's convert the awkward county numbers to the actual names of the fylker. The `{csdata}` package has a data frame with codes and names for all the kommuner, fylker, and even regions (Øst, Vest, Nord-Norge etc.). We're only interested in the fylker here, so we'll select the unique location codes and the corresponding location names.
``` r
-county_names <- csdata::nor_locations_names(border = 2020) |>
+county_names <- csdata::nor_locations_names(border = 2020) |>
filter(str_detect(location_code, "^county")) |>
distinct(location_code, location_name)
@@ -110,33 +108,58 @@ print(county_names)
10: county_nor46 Vestland
11: county_nor30 Viken
-Now let's also create a nice color palette to give each fylke a nicer color than the default ggplot colors. We'll create a named vector to match each fylke with a color from the *batlow* palette by [Fabio Crameri](https://www.fabiocrameri.ch/colourmaps/).
+{{< sidenote >}}
+The [`{scico}`](https://github.com/thomasp85/scico) package contains a collection of color palettes that are great for accessibility
+{{< /sidenote >}}
+
+Now let's also create a nice color palette to give each fylke a nicer color than the default ggplot colors. We'll create a named vector to match each fylke with a color from the *batlow* palette by [Fabio Crameri](https://www.fabiocrameri.ch/colourmaps/) implemented in the `{scico}` package.
``` r
-county_colors <- setNames(scico::scico(n = nrow(county_names), palette = "batlow"),
- nm = county_names$location_name)
+county_colors <- setNames(
+ scico::scico(
+ n = nrow(county_names),
+ palette = "batlow"
+ ),
+ nm = county_names$location_name
+)
```
Let's see what we can make now. We'll add the county names to the large data frame containing the longitudes and latitudes and then create a plot again. I'll also add some other style elements, such as labels for the x- and y-axes, circles instead of squares in the legend, and a map projection. For Norway especially I think a conic map projection works well: since the northern fylker are so massive and the southern fylker are more dense, adding a conic projection with a cone tangent of 40 degrees makes it a bit more perceptually balanced (`lat0` refers to the cone tangent; the details are complicated, but a higher cone tangent results in a greater distortion in favor of southern points).
``` r
-map_df |>
- left_join(county_names, by = "location_code") |>
- ggplot(aes(x = long, y = lat, fill = location_name, group = group)) +
- geom_polygon(key_glyph = "point") +
- labs(x = NULL,
- y = NULL,
- fill = NULL) +
- scale_x_continuous(labels = scales::label_number(suffix = "\u00b0W")) +
- scale_y_continuous(labels = scales::label_number(suffix = "\u00b0N")) +
- scale_fill_manual(values = county_colors,
- guide = guide_legend(override.aes = list(shape = 21, size = 4))) +
- coord_map(projection = "conic", lat0 = 40) +
+map_df |>
+ left_join(county_names, by = "location_code") |>
+ ggplot(aes(
+ x = long, y = lat,
+ fill = location_name, group = group
+ )) +
+ geom_polygon(key_glyph = "point") +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = NULL
+ ) +
+ scale_x_continuous(
+ labels = scales::label_number(suffix = "\u00b0W")
+ ) +
+ scale_y_continuous(
+ labels = scales::label_number(suffix = "\u00b0N")
+ ) +
+ scale_fill_manual(
+ values = county_colors,
+ guide = guide_legend(override.aes = list(shape = 21, size = 4))
+ ) +
+ coord_map(projection = "conic", lat0 = 40) +
theme_minimal() +
- theme(legend.position = c(0.9,0.2),
- legend.text = element_text(size = 5),
- legend.key.height = unit(10,"pt"),
- legend.background = element_rect(fill = "white", color = "transparent"))
+ theme(
+ legend.position = c(0.9, 0.2),
+ legend.text = element_text(size = 5),
+ legend.key.height = unit(10, "pt"),
+ legend.background = element_rect(
+ fill = "white",
+ color = "transparent"
+ )
+ )
```
@@ -156,12 +179,20 @@ str_glue("Range across latitude: {str_c(range(map_df$lat), collapse = ', ')}")
Let's also combine the map with some actual data. The `{csdata}` package contains some simple data on age distribution in kommuner and fylker. Let's take the data from the different fylker in the last available year and see what proportion of the total population is younger than 18.
``` r
-age_data <- csdata::nor_population_by_age_cats(border = 2020, cats = list(c(0:18))) |>
- filter(str_detect(location_code, "^county"),
- calyear == max(calyear)) |>
- pivot_wider(id_cols = location_code, names_from = age, values_from = pop_jan1_n) |>
+age_data <- csdata::nor_population_by_age_cats(
+ border = 2020,
+ cats = list(seq(18))
+) |>
+ filter(
+ str_detect(location_code, "^county"),
+ calyear == max(calyear)
+ ) |>
+ pivot_wider(
+ id_cols = location_code,
+ names_from = age, values_from = pop_jan1_n
+ ) |>
janitor::clean_names() |>
- rename(age_0_18 = x000_018) |>
+ rename(age_0_18 = x001_018) |>
mutate(proportion = age_0_18 / total)
```
@@ -169,36 +200,60 @@ Let's create a map without the Oslo inset, combine it with the age distribution
``` r
nor_county_map_b2020_default_dt |>
- left_join(age_data, by = "location_code") |>
+ left_join(age_data, by = "location_code") |>
ggplot(aes(x = long, y = lat, group = group)) +
- geom_polygon(data = map_data("world") |> filter(region != "Norway"),
- fill = "grey80") +
- geom_polygon(aes(fill = proportion), key_glyph = "point") +
+ geom_polygon(
+ data = map_data("world") |> filter(region != "Norway"),
+ fill = "grey80"
+ ) +
+ geom_polygon(aes(fill = proportion), key_glyph = "point") +
labs(fill = "Proportion of the population younger than 18") +
- scico::scale_fill_scico(palette = "devon", limits = c(0.15, 0.31),
- labels = scales::percent_format(accuracy = 1),
- guide = guide_colorbar(title.position = "top", title.hjust = 0.5,
- barwidth = 10, barheight = 0.5, ticks = FALSE)) +
- coord_map(projection = "conic", lat0 = 60,
- xlim = c(-8,40), ylim = c(57, 70)) +
+ scico::scale_fill_scico(
+ palette = "devon", limits = c(0.15, 0.31),
+ labels = scales::percent_format(accuracy = 1),
+ guide = guide_colorbar(
+ title.position = "top", title.hjust = 0.5,
+ barwidth = 10, barheight = 0.5, ticks = FALSE
+ )
+ ) +
+ coord_map(
+ projection = "conic", lat0 = 60,
+ xlim = c(-8, 40), ylim = c(57, 70)
+ ) +
theme_void() +
- theme(plot.background = element_rect(fill = "#A2C0F4", color = "transparent"),
- legend.direction = "horizontal",
- legend.position = c(0.8, 0.1),
- legend.title = element_text(size = 8),
- legend.text = element_text(size = 6))
+ theme(
+ plot.background = element_rect(
+ fill = "#A2C0F4",
+ color = "transparent"
+ ),
+ legend.direction = "horizontal",
+ legend.position = c(0.8, 0.1),
+ legend.title = element_text(size = 8),
+ legend.text = element_text(size = 6)
+ )
```
## Geocoding
+{{< sidenote br="3em" >}}
+The Google Maps API requires a personal API key; you can read how to obtain and register an API key [here](https://github.com/dkahle/ggmap)
+{{< /sidenote >}}
+
The `{ggmap}` package also has an incredibly useful function called `mutate_geocode()`, which transforms a string with an address or description into a longitude and latitude. Since `{ggmap}` uses the Google Maps API, it works similarly to typing a description into Google Maps. So an approximation of the location will (most likely) get you the right result (e.g. with "Hospital Lillehammer"). Note that `mutate_geocode()` uses `lon` instead of `long` as the column name for longitude. Just to avoid confusion, I'll rename the column to `long`.
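One thing the code below quietly assumes is that the Google Maps API key from the sidenote has already been registered with `{ggmap}`. A minimal sketch of that setup, with an obvious placeholder instead of a real key:

``` r
library(ggmap)

# Register a personal Google Maps API key (placeholder value);
# this needs to be done once per R session
register_google(key = "YOUR-API-KEY")

# Check that a key is registered before geocoding
has_google_key()
```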
``` r
-hospitals_df <- tibble(location = c("Ullevål Sykehus, Oslo","Haukeland universitetssjukehus, Bergen","St. Olav, Trondheim",
- "Universitetssykehuset Nord-Norge, Tromsø","Stavanger Universitetssjukehus","Sørlandet Hospital Kristiansand", "Hospital Lillehammer")) |>
- mutate_geocode(location) |>
+hospitals_df <- tibble(location = c(
+ "Ullevål Sykehus, Oslo",
+ "Haukeland universitetssjukehus, Bergen",
+ "St. Olav, Trondheim",
+ "Universitetssykehuset Nord-Norge, Tromsø",
+ "Stavanger Universitetssjukehus",
+ "Sørlandet Hospital Kristiansand",
+ "Hospital Lillehammer"
+)) |>
+ mutate_geocode(location) |>
rename(long = lon)
```
@@ -224,27 +279,47 @@ Now let's put these on top of the map. We'll use the same map we used earlier. W
``` r
set.seed(21)
-map_df |>
- left_join(county_names, by = "location_code") |>
- ggplot(aes(x = long, y = lat, fill = location_name, group = group)) +
- geom_polygon(key_glyph = "point") +
- geom_segment(data = hospitals_df |> filter(str_detect(location, "Oslo")),
- aes(x = long, y = lat, xend = 19.5, yend = 62), inherit.aes = FALSE) +
- geom_point(data = hospitals_df, aes(x = long, y = lat), inherit.aes = FALSE,
- shape = 18, color = "firebrick", size = 4, show.legend = FALSE) +
- ggrepel::geom_label_repel(data = hospitals_df, aes(x = long, y = lat, label = location),
- size = 2, alpha = 0.75, label.size = 0, inherit.aes = FALSE) +
- labs(x = NULL,
- y = NULL,
- fill = NULL) +
- scale_fill_manual(values = county_colors,
- guide = guide_legend(override.aes = list(size = 4, shape = 21,
- color = "transparent"))) +
- coord_map(projection = "conic", lat0 = 60) +
+map_df |>
+ left_join(county_names, by = "location_code") |>
+ ggplot(aes(
+ x = long, y = lat,
+ fill = location_name, group = group
+ )) +
+ geom_polygon(key_glyph = "point") +
+ geom_segment(
+ data = hospitals_df |> filter(str_detect(location, "Oslo")),
+ aes(x = long, y = lat, xend = 19.5, yend = 62),
+ inherit.aes = FALSE
+ ) +
+ geom_point(
+ data = hospitals_df, aes(x = long, y = lat),
+ inherit.aes = FALSE,
+ shape = 18, color = "firebrick",
+ size = 4, show.legend = FALSE
+ ) +
+ ggrepel::geom_label_repel(
+ data = hospitals_df, aes(x = long, y = lat, label = location),
+ size = 2, alpha = 0.75, label.size = 0, inherit.aes = FALSE
+ ) +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = NULL
+ ) +
+ scale_fill_manual(
+ values = county_colors,
+ guide = guide_legend(override.aes = list(
+ size = 4, shape = 21,
+ color = "transparent"
+ ))
+ ) +
+ coord_map(projection = "conic", lat0 = 60) +
theme_void() +
- theme(legend.position = c(0.2,0.7),
- legend.text = element_text(size = 5),
- legend.key.height = unit(10,"pt"))
+ theme(
+ legend.position = c(0.2, 0.7),
+ legend.text = element_text(size = 5),
+ legend.key.height = unit(10, "pt")
+ )
```
@@ -254,97 +329,125 @@ map_df |>
Let's take it a step further and now look at how we can combine our map with data that didn't come directly from the FHI. Instead, I downloaded some data from the Norwegian Statistics Bureau ([Statistisk sentralbyrå, SSB](https://www.ssb.no)) on land use in the kommuner ([link](https://www.ssb.no/en/natur-og-miljo/areal/statistikk/arealbruk-og-arealressurser)). This came in the form of a semicolon-separated .csv file.
``` r
-area_use <- read_delim("areal.csv", delim = ";", skip = 2) |>
+area_use <- read_delim("./data/areal.csv",
+ delim = ";", skip = 2
+) |>
janitor::clean_names()
print(area_use)
```
# A tibble: 356 × 20
- region area_2021_residential_areas area_2021_recreational_facilities area_2021_built_up_areas_for_agriculture_and_fishing area_2021_industrial_commercial_and_service_areas area_2021_education_and_day_care_facilities area_2021_health_and_social_welfare_institutions area_2021_cultural_and_religious_activities area_2021_transport_telecommunications_and_technical_infrastructure area_2021_emergency_and_defence_services area_2021_green_areas_and_sports_facilities area_2021_unclassified_built_up_areas_and_related_land area_2021_agricultural_land area_2021_forest area_2021_open_firm_ground area_2021_wetland area_2021_bare_rock_gravel_and_blockfields area_2021_permanent_snow_and_glaciers area_2021_inland_waters area_2021_unclassified_undeveloped_areas
-
- 1 3001 Halden 9.61 1.85 2.51 2.89 0.29 0.21 0.22 11.4 0.03 0.84 1.51 61.6 480. 2.88 16.2 1.3 0 48.7 0
- 2 3002 Moss 10.4 1.86 1.59 3.92 0.32 0.21 0.14 7.7 0.26 1.57 1.42 33.6 60.0 4.09 0.21 0.48 0 10.1 0
- 3 3003 Sarpsborg 14.0 2.95 3.49 4.39 0.48 0.28 0.19 12.2 0.01 1.87 1.52 78.1 241. 8.22 1.7 1.55 0 33.5 0
- 4 3004 Fredrikstad 19.9 5.14 3.05 6.24 0.62 0.36 0.26 11.8 0.27 2.84 2.49 67.4 139. 8.91 0.23 15.4 0 8.59 0
- 5 3005 Drammen 18.4 1.17 1.26 4.59 0.63 0.35 0.23 11.6 0.01 3.41 2.54 26.0 226. 2.59 5 1.17 0 13.2 0
- 6 3006 Kongsberg 7.68 2.23 2.38 2.23 0.36 0.11 0.15 11.8 0.35 1.86 1.4 40.2 616. 15.4 32.1 18.8 0 39.6 0
- 7 3007 Ringerike 11 3.99 4.18 5.19 0.37 0.29 0.11 21.4 0.36 1.27 2.12 77.7 1191. 47.9 50.8 3.21 0 134. 0
- 8 3011 Hvaler 2.31 5.79 0.65 0.34 0.04 0.02 0.03 1.96 0 0.29 0.19 4.76 34.0 5.65 0.16 33.5 0 0.21 0
- 9 3012 Aremark 0.64 0.73 0.8 0.31 0.02 0.01 0.03 3.35 0 0.16 0.14 20.9 245. 0.94 7.24 0.02 0 38.6 0
- 10 3013 Marker 1.38 0.58 1.68 0.52 0.03 0.02 0.05 5.38 0.01 0.14 0.34 40.0 305. 1.71 10.5 0.04 0 46.0 0
+ region area_2021_residentia…¹ area_2021_recreation…² area_2021_built_up_a…³
+
+ 1 3001 Ha… 9.61 1.85 2.51
+ 2 3002 Mo… 10.4 1.86 1.59
+ 3 3003 Sa… 14.0 2.95 3.49
+ 4 3004 Fr… 19.9 5.14 3.05
+ 5 3005 Dr… 18.4 1.17 1.26
+ 6 3006 Ko… 7.68 2.23 2.38
+ 7 3007 Ri… 11 3.99 4.18
+ 8 3011 Hv… 2.31 5.79 0.65
+ 9 3012 Ar… 0.64 0.73 0.8
+ 10 3013 Ma… 1.38 0.58 1.68
# ℹ 346 more rows
+ # ℹ abbreviated names: ¹area_2021_residential_areas,
+ # ²area_2021_recreational_facilities,
+ # ³area_2021_built_up_areas_for_agriculture_and_fishing
+ # ℹ 16 more variables: area_2021_industrial_commercial_and_service_areas ,
+ # area_2021_education_and_day_care_facilities ,
+ # area_2021_health_and_social_welfare_institutions , …
You can see there are 356 rows, each representing a different kommune in Norway. The columns here represent the surface area (in km2) with different designations (e.g. forest, health services, agriculture, etc.). All data here is from 2021. Now, kommuner have different sizes, so I want to get the designations of interest as percentages of the total area in the kommune. Here I assumed that the sum of all designations is equal to the total size of each kommune. I also want to extract the kommune number, since we'll use that to merge this data frame with the map later. The kommune number needs to be 4 digits, so we need to add a leading 0 in some instances. Then we'll create the `location_code` column, which will match the `location_code` column in the data frame from `{csmaps}`. Then we'll calculate the percentage of land use for the different designations. Here I'm just interested in *"bare rock, gravel, and blockfields"*, *"wetland"*, *"forest"*, and *"open firm ground"*.
``` r
-area_use <- area_use |>
- mutate(total_area = rowSums(across(where(is.numeric))),
- kommune_code = parse_number(region),
- kommune_code = format(kommune_code, digits = 4),
- kommune_code = str_replace_all(kommune_code, " ", "0"),
- location_code = str_glue("municip_nor{kommune_code}"),
- perc_rocks = area_2021_bare_rock_gravel_and_blockfields / total_area,
- perc_wetland = area_2021_wetland / total_area,
- perc_forest = area_2021_forest / total_area,
- perc_open_ground = area_2021_open_firm_ground / total_area) |>
- arrange(kommune_code) |>
+area_use <- area_use |>
+ mutate(
+ total_area = rowSums(across(where(is.numeric))),
+ kommune_code = parse_number(region),
+ kommune_code = format(kommune_code, digits = 4),
+ kommune_code = str_replace_all(kommune_code, " ", "0"),
+ location_code = str_glue("municip_nor{kommune_code}"),
+ perc_rocks = area_2021_bare_rock_gravel_and_blockfields / total_area,
+ perc_wetland = area_2021_wetland / total_area,
+ perc_forest = area_2021_forest / total_area,
+ perc_open_ground = area_2021_open_firm_ground / total_area
+ ) |>
+ arrange(kommune_code) |>
glimpse()
```
Rows: 356
Columns: 27
- $ region "0301 Oslo municipality", "1101 Eigersund", "1103 Stavanger", "1106 Haugesund", "1108 Sandnes", "1111 Sokndal", "1112 Lund", "1114 Bjerkreim", "1119 H\xe5", "1120 Klepp", "1121 Time", "1122 Gjesdal", "1124 Sola", "1127 Randaberg", "1130 Strand", "1133 Hjelmeland", "1134 Suldal", "1135 Sauda", "1144 Kvits\xf8y", "1145 Bokn", "1146 Tysv\xe6r", "1149 Karm\xf8y", "1151 Utsira", "1160 Vindafjord", "1505 Kristiansund", "1506 Molde", "1507 \xc5lesund", "1511 Vanylven", "1514 Sande (M\xf8re og Romsdal)", "1515 Her\xf8y (M\xf8re og Romsdal)", "1516 Ulstein", "1517 Hareid", "1520 \xd8rsta", "1525 Stranda", "1528 Sykkylven", "1531 Sula", "1532 Giske", "1535 Vestnes", "1539 Rauma", "1547 Aukra", "1554 Aver\xf8y", "1557 Gjemnes", "1560 Tingvoll", "1563 Sunndal", "1566 Surnadal", "1573 Sm\xf8la", "1576 Aure", "1577 Volda", "1578 Fjord", "1579 Hustadvika", "1804 Bod\xf8", "1806 Narvik", "1811 Bindal", "1812 S\xf8mna", "181…
- $ area_2021_residential_areas 51.53, 3.58, 19.28, 7.30, 11.90, 1.15, 1.03, 0.70, 3.66, 3.45, 3.49, 2.23, 4.70, 1.71, 3.75, 0.81, 1.25, 1.54, 0.12, 0.34, 3.48, 12.53, 0.07, 2.98, 4.78, 7.88, 15.66, 1.22, 0.82, 3.00, 2.17, 1.50, 2.92, 1.30, 2.15, 2.69, 2.29, 2.86, 1.96, 1.86, 2.26, 0.83, 1.37, 2.07, 2.11, 0.88, 1.28, 2.53, 0.75, 4.79, 8.31, 5.79, 0.56, 0.65, 2.56, 0.63, 0.20, 0.86, 2.10, 0.85, 3.50, 0.56, 0.57, 0.71, 0.76, 1.69, 7.11, 0.88, 0.19, 0.63, 2.97, 1.04, 0.47, 1.82, 3.82, 0.75, 1.02, 0.77, 0.58, 0.25, 0.29, 0.64, 5.09, 2.41, 2.97, 1.29, 1.85, 3.32, 2.06, 0.45, 1.38, 9.61, 10.41, 13.98, 19.91, 18.35, 7.68, 11.00, 2.31, 0.64, 1.38, 13.26, 1.31, 2.95, 2.67, 1.93, 4.30, 11.79, 4.68, 3.60, 5.70, 22.83, 25.10, 7.36, 3.31, 3.23, 5.76, 15.38, 5.24, 1.92, 7.53, 7.49, 8.37, 4.31, 1.38, 2.35, 0.56, 1.50, 1.81, 0.83, 1.66, 1.98, 1.42, 1.09, 5.35, 6.29, 6.06, 1.11, 0.50, 0.95, 2.39, 3.30, 6.35, 7.93, 7.19, 10.47, 13.54, 3.07, 8.24, 2.52, 3.89…
- $ area_2021_recreational_facilities 1.12, 1.20, 1.87, 0.21, 2.85, 0.26, 0.50, 0.67, 0.43, 0.13, 0.04, 0.66, 0.30, 0.14, 1.61, 1.02, 1.64, 0.92, 0.07, 0.18, 1.74, 0.80, 0.05, 1.31, 0.47, 1.62, 1.83, 0.31, 0.17, 0.35, 0.22, 0.13, 0.64, 0.72, 0.68, 0.15, 0.10, 0.63, 1.09, 0.11, 0.52, 0.67, 1.16, 0.67, 1.43, 0.50, 1.39, 0.63, 0.76, 0.80, 2.53, 3.45, 0.42, 0.28, 0.79, 0.44, 0.18, 0.58, 0.50, 0.50, 0.81, 0.71, 0.64, 0.38, 0.39, 0.93, 1.60, 0.85, 0.05, 0.55, 0.83, 1.76, 0.59, 1.06, 1.03, 0.61, 0.90, 0.59, 0.61, 0.05, 0.06, 0.15, 0.81, 0.67, 0.68, 0.75, 0.42, 0.66, 0.50, 0.10, 1.37, 1.85, 1.86, 2.95, 5.14, 1.17, 2.23, 3.99, 5.79, 0.73, 0.58, 1.42, 0.41, 0.29, 1.19, 0.42, 2.02, 0.52, 0.57, 4.12, 1.68, 0.83, 4.51, 1.47, 0.13, 1.02, 0.02, 0.69, 0.53, 0.06, 0.10, 0.66, 0.67, 0.30, 0.69, 1.00, 2.64, 3.65, 3.06, 2.30, 3.08, 7.11, 4.51, 1.56, 0.57, 0.53, 0.32, 3.26, 2.13, 3.73, 0.31, 1.26, 1.46, 0.74, 1.62, 0.86, 7.62, 1.36, 0.90, 0.86, 0.65, 0.82, 0.84, 0.7…
- $ area_2021_built_up_areas_for_agriculture_and_fishing 0.64, 1.26, 3.51, 0.33, 3.58, 0.54, 0.89, 1.22, 4.21, 4.36, 2.28, 1.03, 1.91, 1.07, 1.29, 1.41, 1.48, 0.39, 0.09, 0.27, 2.25, 2.12, 0.03, 3.02, 0.30, 3.22, 3.33, 1.26, 0.57, 0.78, 0.39, 0.60, 2.15, 0.93, 1.04, 0.29, 0.82, 1.51, 1.78, 0.33, 1.32, 1.27, 1.14, 1.28, 1.96, 0.81, 1.37, 2.14, 1.02, 3.73, 1.65, 1.21, 0.71, 0.89, 1.13, 0.42, 0.26, 0.18, 0.70, 1.00, 1.68, 0.49, 0.74, 0.62, 0.37, 1.27, 1.53, 0.26, 0.03, 0.25, 0.76, 0.33, 0.55, 1.26, 0.75, 0.62, 1.04, 0.29, 0.49, 0.07, 0.15, 0.24, 0.78, 0.67, 1.14, 0.32, 0.36, 1.35, 1.03, 0.12, 0.75, 2.51, 1.59, 3.49, 3.05, 1.26, 2.38, 4.18, 0.65, 0.80, 1.68, 8.58, 1.33, 3.52, 1.62, 1.41, 1.35, 1.49, 1.38, 0.73, 0.49, 1.09, 2.42, 3.81, 0.16, 1.08, 0.30, 4.10, 0.81, 0.60, 2.66, 4.79, 2.72, 1.97, 0.63, 1.01, 0.63, 1.34, 1.75, 0.84, 2.15, 1.15, 2.14, 0.81, 2.82, 2.39, 2.67, 1.53, 1.12, 1.44, 0.81, 1.53, 3.41, 1.95, 2.40, 3.56, 7.60, 2.30, 3.70, 2.11, 2.62, 2.44, 2.84, 4.2…
- $ area_2021_industrial_commercial_and_service_areas 10.37, 2.09, 4.86, 1.71, 5.89, 4.44, 0.24, 0.49, 1.75, 1.79, 1.39, 1.71, 2.97, 0.49, 1.93, 1.22, 1.78, 0.79, 0.02, 0.10, 3.02, 3.96, 0.01, 1.51, 2.06, 3.37, 4.50, 1.97, 0.33, 0.77, 0.85, 0.50, 1.38, 0.75, 0.90, 0.48, 0.51, 1.34, 1.58, 1.58, 1.12, 0.28, 0.46, 1.81, 1.15, 0.31, 1.23, 0.95, 0.85, 2.95, 2.88, 3.71, 0.24, 0.21, 1.91, 0.19, 0.08, 0.21, 1.57, 0.35, 2.27, 0.43, 0.46, 0.11, 0.45, 0.85, 9.87, 0.28, 0.07, 0.16, 1.17, 0.76, 0.16, 0.97, 1.24, 0.79, 0.29, 0.43, 0.23, 0.11, 0.15, 0.26, 1.29, 1.08, 0.76, 0.50, 0.52, 1.41, 1.21, 0.12, 0.51, 2.89, 3.92, 4.39, 6.24, 4.59, 2.23, 5.19, 0.34, 0.31, 0.52, 4.45, 0.33, 1.23, 1.28, 0.29, 2.61, 2.58, 1.38, 0.42, 0.53, 3.41, 4.44, 1.91, 0.19, 0.97, 1.51, 6.03, 1.57, 0.16, 5.18, 1.46, 1.78, 0.75, 0.27, 0.51, 0.39, 0.77, 1.03, 0.99, 1.26, 1.26, 0.94, 0.58, 1.77, 2.48, 3.10, 0.97, 0.43, 0.79, 0.69, 0.60, 2.23, 2.57, 2.28, 3.19, 3.99, 0.54, 1.60, 0.57, 1.20, 0.94, 1.08, 1.…
- $ area_2021_education_and_day_care_facilities 3.51, 0.17, 1.27, 0.37, 0.69, 0.02, 0.06, 0.04, 0.22, 0.23, 0.22, 0.14, 0.30, 0.14, 0.16, 0.06, 0.07, 0.06, 0.00, 0.02, 0.13, 0.45, 0.00, 0.12, 0.21, 0.40, 0.80, 0.05, 0.03, 0.11, 0.07, 0.03, 0.12, 0.06, 0.08, 0.10, 0.11, 0.12, 0.07, 0.09, 0.06, 0.06, 0.04, 0.10, 0.09, 0.04, 0.04, 0.17, 0.03, 0.19, 0.49, 0.26, 0.04, 0.02, 0.08, 0.02, 0.01, 0.01, 0.09, 0.03, 0.15, 0.02, 0.03, 0.02, 0.05, 0.06, 0.24, 0.04, 0.01, 0.04, 0.11, 0.05, 0.02, 0.06, 0.15, 0.03, 0.04, 0.03, 0.02, 0.01, 0.01, 0.04, 0.18, 0.11, 0.10, 0.03, 0.04, 0.12, 0.08, 0.01, 0.05, 0.29, 0.32, 0.48, 0.62, 0.63, 0.36, 0.37, 0.04, 0.02, 0.03, 0.45, 0.04, 0.09, 0.09, 0.06, 0.19, 0.47, 0.54, 0.11, 0.17, 1.08, 0.90, 0.18, 0.14, 0.10, 0.33, 0.72, 0.25, 0.06, 0.41, 0.28, 0.26, 0.15, 0.06, 0.11, 0.01, 0.03, 0.08, 0.02, 0.13, 0.07, 0.04, 0.03, 0.19, 0.16, 0.21, 0.05, 0.02, 0.03, 0.06, 0.12, 0.21, 0.33, 0.29, 0.33, 0.41, 0.09, 0.28, 0.05, 0.14, 0.08, 0.05, 0.1…
- $ area_2021_health_and_social_welfare_institutions 1.17, 0.04, 0.47, 0.14, 0.19, 0.01, 0.02, 0.00, 0.16, 0.07, 0.05, 0.02, 0.07, 0.02, 0.03, 0.02, 0.04, 0.03, 0.00, 0.00, 0.05, 0.16, 0.00, 0.06, 0.08, 0.15, 0.31, 0.03, 0.01, 0.05, 0.04, 0.02, 0.04, 0.02, 0.02, 0.03, 0.02, 0.07, 0.03, 0.05, 0.03, 0.02, 0.02, 0.04, 0.04, 0.02, 0.04, 0.07, 0.04, 0.08, 0.18, 0.10, 0.02, 0.02, 0.03, 0.01, 0.01, 0.01, 0.03, 0.02, 0.07, 0.00, 0.02, 0.01, 0.01, 0.02, 0.14, 0.03, 0.01, 0.01, 0.05, 0.02, 0.01, 0.05, 0.07, 0.03, 0.02, 0.03, 0.01, 0.00, 0.01, 0.01, 0.05, 0.04, 0.06, 0.02, 0.03, 0.06, 0.03, 0.01, 0.02, 0.21, 0.21, 0.28, 0.36, 0.35, 0.11, 0.29, 0.02, 0.01, 0.02, 0.23, 0.03, 0.04, 0.03, 0.03, 0.06, 0.16, 0.09, 0.04, 0.08, 0.38, 0.43, 0.08, 0.03, 0.03, 0.20, 0.25, 0.05, 0.01, 0.21, 0.07, 0.12, 0.04, 0.04, 0.06, 0.02, 0.03, 0.02, 0.01, 0.04, 0.03, 0.01, 0.02, 0.14, 0.09, 0.07, 0.01, 0.02, 0.02, 0.02, 0.07, 0.16, 0.15, 0.10, 0.17, 0.23, 0.04, 0.23, 0.07, 0.03, 0.04, 0.06, 0.0…
- $ area_2021_cultural_and_religious_activities 0.90, 0.06, 0.36, 0.11, 0.22, 0.03, 0.04, 0.02, 0.10, 0.07, 0.06, 0.10, 0.09, 0.04, 0.07, 0.03, 0.05, 0.02, 0.00, 0.01, 0.08, 0.25, 0.00, 0.07, 0.07, 0.18, 0.26, 0.04, 0.03, 0.07, 0.02, 0.02, 0.06, 0.03, 0.03, 0.03, 0.04, 0.05, 0.05, 0.03, 0.06, 0.04, 0.04, 0.05, 0.05, 0.05, 0.03, 0.05, 0.04, 0.12, 0.17, 0.15, 0.02, 0.03, 0.07, 0.03, 0.01, 0.02, 0.05, 0.02, 0.07, 0.01, 0.03, 0.03, 0.03, 0.05, 0.14, 0.04, 0.01, 0.03, 0.07, 0.05, 0.03, 0.04, 0.03, 0.02, 0.04, 0.02, 0.03, 0.00, 0.01, 0.01, 0.11, 0.08, 0.07, 0.03, 0.04, 0.05, 0.08, 0.02, 0.07, 0.22, 0.14, 0.19, 0.26, 0.23, 0.15, 0.11, 0.03, 0.03, 0.05, 0.22, 0.02, 0.07, 0.03, 0.02, 0.03, 0.10, 0.06, 0.05, 0.06, 0.18, 0.20, 0.08, 0.03, 0.03, 0.05, 0.19, 0.05, 0.02, 0.09, 0.07, 0.10, 0.04, 0.04, 0.02, 0.02, 0.05, 0.03, 0.02, 0.06, 0.07, 0.03, 0.02, 0.07, 0.10, 0.06, 0.03, 0.03, 0.03, 0.05, 0.03, 0.10, 0.17, 0.20, 0.14, 0.14, 0.04, 0.11, 0.04, 0.06, 0.07, 0.04, 0.0…
- $ area_2021_transport_telecommunications_and_technical_infrastructure 36.75, 6.15, 14.50, 4.02, 12.64, 2.54, 3.20, 4.10, 6.42, 4.41, 3.95, 4.35, 8.82, 1.46, 4.51, 3.95, 6.63, 2.51, 0.15, 0.94, 7.28, 9.44, 0.12, 7.87, 4.19, 12.01, 13.70, 3.74, 1.53, 2.59, 1.83, 1.37, 5.54, 3.47, 2.81, 1.41, 2.66, 4.08, 6.33, 1.69, 3.39, 3.49, 4.13, 4.86, 5.55, 2.96, 5.99, 6.41, 3.14, 7.54, 14.10, 12.01, 2.70, 2.12, 4.95, 1.57, 0.76, 1.05, 3.15, 2.77, 9.05, 5.61, 5.24, 2.20, 1.72, 5.14, 14.53, 1.72, 0.18, 2.20, 4.31, 3.10, 2.17, 5.43, 4.62, 3.18, 4.21, 2.12, 3.65, 0.25, 0.34, 1.33, 5.77, 4.23, 4.79, 2.75, 2.37, 5.37, 9.81, 0.49, 5.11, 11.45, 7.70, 12.20, 11.85, 11.60, 11.76, 21.43, 1.96, 3.35, 5.38, 17.05, 1.92, 6.68, 4.01, 3.35, 5.02, 8.09, 4.36, 3.14, 2.57, 10.12, 13.79, 13.67, 1.66, 3.19, 3.11, 13.83, 4.23, 1.43, 18.48, 9.76, 9.49, 6.29, 3.57, 3.06, 5.72, 6.94, 6.32, 4.33, 7.75, 8.58, 8.10, 4.61, 7.66, 7.79, 7.17, 7.62, 4.54, 9.82, 3.21, 5.32, 16.13, 7.22, 9.64, 13.64, 23.25, 6.24, 13.03, 5.9…
- $ area_2021_emergency_and_defence_services 0.66, 0.01, 0.41, 0.02, 0.14, 0.01, 0.00, 0.07, 0.01, 0.00, 0.00, 0.06, 0.09, 0.00, 0.01, 0.00, 0.01, 0.00, 0.00, 0.00, 0.00, 0.02, 0.00, 0.02, 0.03, 0.04, 0.07, 0.02, 0.00, 0.00, 0.01, 0.01, 0.00, 0.00, 0.00, 0.02, 0.01, 0.00, 0.31, 0.01, 0.00, 0.00, 0.03, 0.01, 0.02, 0.02, 0.01, 0.03, 0.00, 0.04, 0.29, 0.34, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.10, 0.00, 0.00, 0.00, 0.00, 0.01, 0.02, 0.00, 0.00, 0.00, 0.01, 0.01, 0.00, 0.09, 0.04, 0.00, 0.00, 0.00, 0.64, 0.00, 0.01, 0.00, 0.01, 0.02, 0.03, 0.00, 0.00, 0.04, 0.26, 0.00, 0.00, 0.03, 0.26, 0.01, 0.27, 0.01, 0.35, 0.36, 0.00, 0.00, 0.01, 0.03, 0.00, 0.00, 0.24, 0.01, 0.00, 0.22, 0.01, 0.11, 0.01, 0.21, 0.04, 0.01, 0.05, 0.01, 0.03, 0.31, 0.00, 0.00, 0.85, 0.00, 0.03, 0.01, 0.00, 0.05, 0.00, 0.01, 0.00, 0.00, 0.03, 0.01, 0.00, 0.00, 0.01, 0.01, 0.03, 0.00, 0.01, 0.00, 0.00, 0.00, 0.12, 0.03, 0.37, 0.01, 0.01, 0.01, 0.01, 0.00, 0.01, 0.00, 0.00, 0.0…
- $ area_2021_green_areas_and_sports_facilities 9.78, 0.49, 2.81, 0.71, 2.14, 0.21, 0.10, 0.57, 0.81, 0.46, 0.66, 0.44, 1.53, 0.39, 0.37, 0.11, 0.27, 0.45, 0.03, 0.03, 0.57, 1.40, 0.02, 0.37, 0.53, 1.44, 2.03, 0.26, 0.18, 0.19, 0.20, 0.34, 0.59, 1.71, 0.38, 0.19, 0.35, 0.46, 0.60, 0.16, 0.21, 0.14, 0.25, 0.62, 0.61, 0.13, 0.17, 0.91, 0.42, 0.79, 2.12, 2.43, 0.14, 0.07, 0.29, 0.06, 0.03, 0.09, 0.30, 0.11, 0.83, 0.13, 0.12, 0.09, 0.13, 0.17, 1.75, 0.10, 0.02, 0.06, 0.38, 0.12, 0.09, 0.25, 0.61, 0.14, 0.16, 0.10, 0.11, 0.03, 0.02, 0.04, 0.53, 0.84, 0.25, 0.16, 0.17, 0.73, 0.66, 0.04, 0.14, 0.84, 1.57, 1.87, 2.84, 3.41, 1.86, 1.27, 0.29, 0.16, 0.14, 2.44, 0.09, 0.85, 0.21, 0.29, 0.68, 1.81, 0.56, 0.70, 0.25, 4.47, 2.71, 1.15, 0.35, 0.65, 1.44, 2.11, 1.96, 0.47, 1.75, 1.62, 0.58, 0.53, 0.31, 0.70, 0.12, 1.53, 1.19, 1.86, 0.85, 2.64, 0.32, 0.82, 0.73, 0.62, 1.49, 0.41, 0.16, 0.85, 0.29, 0.30, 1.49, 0.95, 1.03, 1.24, 1.98, 0.42, 1.25, 0.25, 0.36, 0.70, 0.35, 0.6…
- $ area_2021_unclassified_built_up_areas_and_related_land 8.32, 0.81, 3.21, 1.19, 2.55, 0.21, 0.18, 0.25, 0.84, 0.79, 0.97, 0.53, 1.23, 0.38, 0.82, 0.29, 0.37, 0.25, 0.03, 0.11, 1.28, 2.36, 0.03, 0.95, 0.86, 1.58, 3.43, 0.37, 0.30, 0.81, 0.39, 0.41, 0.80, 0.36, 0.52, 0.54, 0.71, 0.65, 0.63, 0.30, 0.54, 0.31, 0.43, 0.63, 0.69, 0.29, 0.39, 0.76, 0.38, 1.23, 1.81, 1.95, 0.19, 0.19, 0.52, 0.15, 0.07, 0.19, 0.46, 0.37, 0.79, 0.18, 0.12, 0.20, 0.09, 0.26, 0.91, 0.04, 0.01, 0.05, 0.31, 0.09, 0.02, 0.26, 0.92, 0.27, 0.27, 0.20, 0.23, 0.05, 0.13, 0.18, 0.87, 0.66, 0.61, 0.25, 0.40, 0.42, 0.21, 0.15, 0.36, 1.51, 1.42, 1.52, 2.49, 2.54, 1.40, 2.12, 0.19, 0.14, 0.34, 2.13, 0.23, 0.64, 0.34, 0.41, 0.91, 1.33, 0.74, 0.52, 0.64, 2.63, 4.14, 1.16, 0.46, 0.51, 0.78, 1.73, 0.70, 0.23, 1.48, 1.47, 0.97, 0.47, 0.30, 0.40, 0.18, 0.50, 0.24, 0.29, 0.25, 0.18, 0.59, 0.37, 0.89, 1.00, 1.38, 0.30, 0.21, 0.22, 0.42, 0.64, 1.33, 1.08, 1.70, 1.32, 2.81, 0.42, 1.23, 0.42, 0.67, 0.62, 0.42, 0.6…
- $ area_2021_agricultural_land 9.60, 52.14, 92.91, 9.14, 89.80, 16.75, 24.57, 60.74, 118.94, 74.25, 83.45, 55.25, 35.43, 14.72, 27.35, 40.64, 35.16, 8.31, 3.44, 9.54, 56.49, 58.49, 1.48, 80.91, 4.85, 46.89, 42.05, 26.10, 10.25, 7.50, 5.69, 4.94, 36.20, 17.96, 14.47, 2.78, 11.52, 22.83, 35.41, 10.75, 22.44, 26.19, 20.60, 24.71, 38.44, 18.89, 24.96, 33.79, 18.92, 79.39, 35.19, 24.16, 14.02, 31.15, 36.28, 17.06, 9.07, 11.67, 29.45, 24.87, 37.48, 9.65, 36.32, 20.95, 15.44, 24.72, 32.93, 11.08, 0.67, 10.22, 18.51, 14.81, 12.53, 14.04, 19.93, 8.49, 33.68, 6.69, 8.91, 1.40, 1.16, 5.15, 41.33, 12.06, 25.22, 18.79, 10.46, 28.30, 24.19, 0.58, 15.93, 61.55, 33.60, 78.12, 67.39, 26.01, 40.17, 77.73, 4.76, 20.86, 40.01, 235.68, 32.57, 109.71, 35.31, 34.96, 34.88, 38.93, 37.71, 15.30, 5.03, 14.39, 40.64, 104.14, 6.26, 30.36, 5.86, 117.74, 17.06, 26.21, 88.50, 141.56, 53.51, 51.83, 7.90, 22.27, 7.35, 17.59, 25.22, 22.12, 40.10, 20.54, 35.57, 10.35, 52.2…
- $ area_2021_forest 279.30, 83.02, 80.63, 20.73, 215.91, 60.98, 121.74, 113.52, 16.42, 5.08, 17.12, 97.32, 3.75, 1.49, 123.23, 296.76, 595.47, 118.94, 0.00, 5.40, 165.70, 39.31, 0.18, 275.45, 46.98, 489.57, 270.36, 68.09, 11.13, 13.03, 18.31, 21.31, 188.95, 223.66, 128.83, 23.32, 2.39, 189.39, 379.15, 11.97, 66.69, 199.19, 224.07, 332.08, 455.33, 8.69, 320.03, 270.09, 269.47, 196.36, 545.74, 835.81, 387.48, 69.51, 292.09, 13.65, 97.72, 4.14, 50.08, 124.01, 674.33, 705.00, 808.09, 37.38, 57.54, 531.95, 1007.77, 73.80, 0.00, 142.32, 198.73, 189.96, 308.67, 519.92, 300.21, 435.66, 385.29, 184.36, 149.49, 0.00, 0.02, 8.62, 60.81, 133.56, 213.64, 68.57, 93.44, 298.75, 162.99, 1.39, 637.75, 480.39, 60.03, 241.20, 138.60, 225.58, 616.43, 1190.86, 34.03, 245.44, 304.58, 451.14, 53.75, 282.96, 53.54, 192.11, 77.37, 124.24, 46.05, 54.27, 42.22, 120.58, 254.82, 857.40, 42.49, 150.51, 46.40, 235.65, 141.42, 47.59, 117.12, 418.69, 293.99, 2…
- $ area_2021_open_firm_ground 7.55, 216.78, 23.07, 18.13, 521.59, 165.91, 188.31, 368.19, 78.49, 5.89, 46.87, 364.82, 6.49, 1.86, 62.20, 528.31, 818.84, 321.92, 1.60, 25.25, 143.15, 73.82, 3.35, 202.35, 14.90, 592.96, 200.22, 234.44, 61.52, 80.55, 57.77, 38.48, 258.42, 377.52, 125.74, 23.78, 14.61, 138.71, 567.72, 20.77, 60.66, 107.65, 50.78, 822.37, 659.92, 147.25, 223.17, 391.69, 461.82, 146.07, 542.35, 1400.98, 626.84, 64.46, 459.76, 106.97, 277.31, 38.91, 57.29, 207.80, 847.93, 856.49, 1252.48, 106.65, 83.16, 638.39, 2239.28, 131.28, 12.50, 334.86, 249.82, 330.56, 632.46, 1227.25, 508.36, 592.27, 364.43, 245.33, 45.55, 6.17, 11.80, 120.10, 233.13, 216.77, 211.34, 98.67, 132.98, 217.51, 180.95, 85.45, 750.60, 2.88, 4.09, 8.22, 8.91, 2.59, 15.40, 47.93, 5.65, 0.94, 1.71, 8.81, 0.75, 3.23, 3.57, 2.06, 3.52, 3.21, 2.28, 1.21, 0.83, 4.96, 5.08, 4.81, 0.97, 1.46, 1.05, 8.44, 1.41, 0.61, 4.15, 3.54, 4.24, 1.85, 0.81, 1.55, 159.30, 133.12, 8…
- $ area_2021_wetland 4.09, 13.50, 5.72, 3.30, 9.87, 9.99, 6.70, 15.98, 12.97, 0.55, 9.36, 8.41, 0.36, 0.09, 2.87, 22.54, 17.53, 8.40, 0.03, 2.08, 12.94, 11.33, 0.04, 18.28, 3.85, 50.85, 36.33, 14.66, 2.39, 5.29, 4.71, 6.02, 5.69, 12.75, 12.20, 0.51, 2.44, 17.23, 21.35, 10.32, 11.46, 27.59, 15.15, 18.43, 80.50, 72.38, 34.71, 13.88, 12.06, 54.52, 57.92, 54.50, 22.24, 6.47, 21.96, 9.90, 2.56, 1.59, 4.67, 20.62, 70.09, 111.61, 144.50, 9.09, 3.19, 63.69, 100.35, 5.10, 0.27, 10.66, 13.42, 11.42, 14.95, 45.82, 36.49, 30.26, 36.43, 13.63, 18.58, 0.09, 0.26, 4.30, 37.72, 28.76, 29.41, 30.02, 42.63, 75.73, 204.36, 0.21, 42.56, 16.21, 0.21, 1.70, 0.23, 5.00, 32.09, 50.78, 0.16, 7.24, 10.49, 7.46, 0.20, 7.96, 0.41, 1.58, 0.12, 0.95, 0.21, 0.11, 0.21, 0.67, 1.58, 53.65, 0.44, 1.81, 0.23, 2.98, 3.53, 2.68, 1.11, 20.15, 7.61, 9.28, 8.75, 2.17, 45.17, 74.40, 59.38, 32.05, 102.76, 84.30, 41.63, 11.29, 10.51, 7.72, 3.98, 22.37, 27.03, 175.26, 9.1…
- $ area_2021_bare_rock_gravel_and_blockfields 0.52, 4.57, 1.19, 0.74, 63.49, 3.90, 4.46, 7.11, 1.56, 0.69, 0.22, 19.28, 0.76, 0.09, 4.97, 51.53, 98.29, 41.34, 0.65, 0.28, 1.25, 2.42, 0.88, 1.66, 1.92, 213.87, 11.43, 11.10, 0.81, 2.56, 2.08, 1.00, 140.75, 184.30, 34.40, 0.56, 1.25, 18.40, 352.71, 0.37, 1.79, 2.28, 1.24, 426.30, 64.25, 6.12, 4.17, 105.12, 344.71, 7.66, 93.69, 733.08, 134.70, 15.27, 174.25, 9.95, 127.57, 3.82, 35.94, 66.14, 180.55, 176.39, 146.79, 7.58, 17.76, 104.75, 553.53, 31.44, 2.37, 134.21, 134.17, 62.76, 145.37, 254.45, 121.03, 357.95, 129.93, 51.39, 11.61, 1.04, 4.02, 27.89, 17.30, 56.54, 57.10, 12.56, 24.17, 62.47, 23.28, 20.99, 368.66, 1.30, 0.48, 1.55, 15.41, 1.17, 18.85, 3.21, 33.54, 0.02, 0.04, 0.11, 0.01, 0.01, 1.02, 0.02, 0.23, 0.07, 0.02, 0.13, 0.20, 0.25, 2.49, 0.04, 0.01, 0.08, 0.02, 0.14, 0.11, 0.01, 0.00, 0.01, 0.36, 0.06, 0.12, 0.52, 6.60, 3.13, 1.45, 128.67, 153.00, 154.92, 1.67, 0.55, 1.03, 0.26, 0.83, 1.55, 2.53, 5.…
- $ area_2021_permanent_snow_and_glaciers 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.53, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 7.08, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 3.46, 17.42, 3.51, 0.00, 0.00, 0.00, 16.49, 0.00, 0.00, 0.00, 0.00, 9.38, 0.88, 0.00, 0.00, 1.93, 27.92, 0.00, 0.35, 111.75, 0.05, 0.00, 1.06, 0.00, 0.34, 0.00, 0.00, 0.00, 5.48, 13.66, 14.71, 0.00, 0.00, 55.69, 226.52, 0.00, 0.00, 48.25, 170.79, 4.37, 60.29, 8.74, 106.05, 34.59, 4.43, 0.97, 0.43, 0.00, 0.00, 0.00, 0.00, 0.35, 2.22, 0.00, 0.00, 0.08, 0.00, 0.00, 9.47, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.04, 0.21, 9.20, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
- $ area_2021_inland_waters 28.32, 46.55, 6.35, 4.45, 97.18, 28.00, 56.36, 76.87, 10.97, 11.17, 13.05, 61.60, 0.30, 0.62, 26.54, 119.38, 155.83, 40.67, 0.02, 2.65, 25.96, 11.01, 0.04, 23.52, 1.33, 70.25, 26.31, 21.47, 3.21, 1.79, 2.48, 5.64, 13.90, 22.89, 10.02, 1.65, 0.76, 6.31, 60.80, 0.27, 2.70, 11.64, 15.80, 67.95, 52.31, 12.28, 21.24, 45.74, 48.27, 15.55, 85.40, 240.41, 73.64, 4.04, 48.34, 2.29, 22.78, 0.85, 1.25, 15.87, 93.75, 123.20, 273.44, 6.49, 2.24, 159.13, 261.82, 7.73, 0.12, 26.57, 76.80, 43.39, 43.25, 134.66, 104.45, 171.62, 47.64, 20.43, 11.55, 0.59, 0.08, 9.71, 18.37, 20.13, 16.19, 12.03, 9.72, 25.93, 44.89, 8.59, 185.91, 48.71, 10.11, 33.47, 8.59, 13.19, 39.65, 134.27, 0.21, 38.57, 45.95, 38.33, 8.19, 14.64, 13.27, 18.02, 0.68, 7.02, 2.06, 1.17, 0.81, 4.20, 13.43, 93.88, 15.01, 37.54, 3.45, 46.32, 7.31, 1.15, 2.83, 29.16, 71.69, 17.73, 24.65, 57.91, 36.18, 40.90, 19.47, 43.18, 93.92, 201.74, 32.68, 35.43, 53.99, 40.18,…
- $ area_2021_unclassified_undeveloped_areas 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.03, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.03, 0.00, 0.00, 0.02, 0.03, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, 0.02, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
- $ total_area 454.13, 432.42, 262.42, 72.60, 1040.63, 294.95, 408.40, 650.54, 257.96, 113.39, 183.18, 617.95, 69.10, 24.71, 261.71, 1068.08, 1736.24, 546.54, 6.25, 47.20, 425.37, 229.87, 6.30, 620.45, 87.41, 1503.36, 632.62, 385.13, 93.28, 119.44, 97.23, 82.32, 661.61, 865.85, 337.78, 58.53, 40.59, 404.64, 1448.06, 60.66, 175.25, 381.65, 336.71, 1713.36, 1365.33, 271.62, 640.22, 876.89, 1190.60, 521.81, 1395.17, 3432.12, 1264.01, 195.38, 1046.07, 163.34, 538.96, 64.18, 187.63, 465.33, 1928.93, 2004.14, 2684.33, 192.51, 183.33, 1588.79, 4460.07, 264.67, 16.51, 711.07, 873.21, 664.60, 1221.63, 2216.19, 1209.82, 1637.30, 1009.82, 527.38, 252.72, 10.11, 18.52, 178.67, 424.15, 478.98, 566.58, 246.74, 319.60, 722.30, 656.59, 118.72, 2020.66, 642.45, 137.92, 405.62, 292.16, 317.68, 793.10, 1555.09, 89.97, 319.26, 412.91, 791.79, 101.18, 434.87, 118.83, 256.97, 133.97, 202.98, 102.70, 85.73, 61.48, 192.28, 376.72, 1144.80, 71.69,…
- $ kommune_code "0301", "1101", "1103", "1106", "1108", "1111", "1112", "1114", "1119", "1120", "1121", "1122", "1124", "1127", "1130", "1133", "1134", "1135", "1144", "1145", "1146", "1149", "1151", "1160", "1505", "1506", "1507", "1511", "1514", "1515", "1516", "1517", "1520", "1525", "1528", "1531", "1532", "1535", "1539", "1547", "1554", "1557", "1560", "1563", "1566", "1573", "1576", "1577", "1578", "1579", "1804", "1806", "1811", "1812", "1813", "1815", "1816", "1818", "1820", "1822", "1824", "1825", "1826", "1827", "1828", "1832", "1833", "1834", "1835", "1836", "1837", "1838", "1839", "1840", "1841", "1845", "1848", "1851", "1853", "1856", "1857", "1859", "1860", "1865", "1866", "1867", "1868", "1870", "1871", "1874", "1875", "3001", "3002", "3003", "3004", "3005", "3006", "3007", "3011", "3012", "3013", "3014", "3015", "3016", "3017", "3018", "3019", "3020", "3021", "3022", "3023", "3024", "3025", "3026", "3027", "…
- $ location_code "municip_nor0301", "municip_nor1101", "municip_nor1103", "municip_nor1106", "municip_nor1108", "municip_nor1111", "municip_nor1112", "municip_nor1114", "municip_nor1119", "municip_nor1120", "municip_nor1121", "municip_nor1122", "municip_nor1124", "municip_nor1127", "municip_nor1130", "municip_nor1133", "municip_nor1134", "municip_nor1135", "municip_nor1144", "municip_nor1145", "municip_nor1146", "municip_nor1149", "municip_nor1151", "municip_nor1160", "municip_nor1505", "municip_nor1506", "municip_nor1507", "municip_nor1511", "municip_nor1514", "municip_nor1515", "municip_nor1516", "municip_nor1517", "municip_nor1520", "municip_nor1525", "municip_nor1528", "municip_nor1531", "municip_nor1532", "municip_nor1535", "municip_nor1539", "municip_nor1547", "municip_nor1554", "municip_nor1557", "municip_nor1560", "municip_nor1563", "municip_nor1566", "municip_nor1573", "municip_nor1576", "municip_nor1577", "municip…
- $ perc_rocks 1.145047e-03, 1.056843e-02, 4.534715e-03, 1.019284e-02, 6.101112e-02, 1.322258e-02, 1.092067e-02, 1.092938e-02, 6.047449e-03, 6.085193e-03, 1.201004e-03, 3.119994e-02, 1.099855e-02, 3.642250e-03, 1.899049e-02, 4.824545e-02, 5.661084e-02, 7.563948e-02, 1.040000e-01, 5.932203e-03, 2.938618e-03, 1.052769e-02, 1.396825e-01, 2.675477e-03, 2.196545e-02, 1.422613e-01, 1.806772e-02, 2.882144e-02, 8.683533e-03, 2.143336e-02, 2.139257e-02, 1.214772e-02, 2.127386e-01, 2.128544e-01, 1.018414e-01, 9.567743e-03, 3.079576e-02, 4.547252e-02, 2.435742e-01, 6.099571e-03, 1.021398e-02, 5.974060e-03, 3.682694e-03, 2.488094e-01, 4.705822e-02, 2.253148e-02, 6.513386e-03, 1.198782e-01, 2.895263e-01, 1.467967e-02, 6.715311e-02, 2.135939e-01, 1.065656e-01, 7.815539e-02, 1.665759e-01, 6.091588e-02, 2.366966e-01, 5.952010e-02, 1.915472e-01, 1.421357e-01, 9.360112e-02, 8.801281e-02, 5.468404e-02, 3.937458e-02, 9.687449e-02, 6.593068e-0…
- $ perc_wetland 0.0090062317, 0.0312196476, 0.0217971191, 0.0454545455, 0.0094846391, 0.0338701475, 0.0164054848, 0.0245642082, 0.0502791130, 0.0048505159, 0.0510972814, 0.0136095153, 0.0052098408, 0.0036422501, 0.0109663368, 0.0211032881, 0.0100965304, 0.0153694149, 0.0048000000, 0.0440677966, 0.0304205750, 0.0492887284, 0.0063492063, 0.0294624869, 0.0440453037, 0.0338242337, 0.0574278398, 0.0380650689, 0.0256217839, 0.0442900201, 0.0484418389, 0.0731292517, 0.0086002328, 0.0147254143, 0.0361181834, 0.0087134803, 0.0601133284, 0.0425810597, 0.0147438642, 0.1701285856, 0.0653922967, 0.0722913664, 0.0449942087, 0.0107566419, 0.0589601049, 0.2664752227, 0.0542157383, 0.0158286672, 0.0101293465, 0.1044824745, 0.0415146541, 0.0158793982, 0.0175947975, 0.0331149555, 0.0209928590, 0.0606097710, 0.0047498887, 0.0247740729, 0.0248894100, 0.0443126383, 0.0363362071, 0.0556897223, 0.0538309373, 0.0472183263, 0.0174003164, 0.040087110…
- $ perc_forest 0.615022130, 0.191989270, 0.307255545, 0.285537190, 0.207480084, 0.206746906, 0.298090108, 0.174501184, 0.063653280, 0.044801129, 0.093459985, 0.157488470, 0.054269175, 0.060299474, 0.470864698, 0.277844356, 0.342965258, 0.217623596, 0.000000000, 0.114406780, 0.389543221, 0.171009701, 0.028571429, 0.443951970, 0.537467109, 0.325650543, 0.427365559, 0.176797445, 0.119318182, 0.109092431, 0.188316363, 0.258867833, 0.285591209, 0.258312641, 0.381402096, 0.398428157, 0.058881498, 0.468045670, 0.261833073, 0.197329377, 0.380542083, 0.521917988, 0.665468801, 0.193817995, 0.333494467, 0.031993226, 0.499875043, 0.308008986, 0.226331262, 0.376305552, 0.391163801, 0.243525867, 0.306548208, 0.355768246, 0.279226056, 0.083568018, 0.181312157, 0.064506077, 0.266908277, 0.266499044, 0.349587595, 0.351771832, 0.301039738, 0.194171731, 0.313860252, 0.334814544, 0.225953853, 0.278837798, 0.000000000, 0.200149071, 0.227585575…
- $ perc_open_ground 0.016625195, 0.501318163, 0.087912507, 0.249724518, 0.501225219, 0.562502119, 0.461092067, 0.565975958, 0.304271980, 0.051944616, 0.255868545, 0.590371389, 0.093921852, 0.075273169, 0.237667647, 0.494635233, 0.471616827, 0.589014528, 0.256000000, 0.534957627, 0.336530550, 0.321138035, 0.531746032, 0.326134257, 0.170461046, 0.394423159, 0.316493314, 0.608729520, 0.659519726, 0.674397187, 0.594158182, 0.467444121, 0.390592645, 0.436010856, 0.372254130, 0.406287374, 0.359940872, 0.342798537, 0.392055578, 0.342400264, 0.346134094, 0.282064719, 0.150812272, 0.479975020, 0.483341024, 0.542117664, 0.348583299, 0.446680884, 0.387888460, 0.279929476, 0.388733989, 0.408196683, 0.495913798, 0.329921179, 0.439511696, 0.654891637, 0.514527980, 0.606263634, 0.305334968, 0.446564803, 0.439585677, 0.427360364, 0.466589428, 0.553997195, 0.453608247, 0.401808924, 0.502072837, 0.496013904, 0.757116899, 0.470924100, 0.286093838…
+ $ region "0…
+ $ area_2021_residential_areas 51…
+ $ area_2021_recreational_facilities 1.…
+ $ area_2021_built_up_areas_for_agriculture_and_fishing 0.…
+ $ area_2021_industrial_commercial_and_service_areas 10…
+ $ area_2021_education_and_day_care_facilities 3.…
+ $ area_2021_health_and_social_welfare_institutions 1.…
+ $ area_2021_cultural_and_religious_activities 0.…
+ $ area_2021_transport_telecommunications_and_technical_infrastructure 36…
+ $ area_2021_emergency_and_defence_services 0.…
+ $ area_2021_green_areas_and_sports_facilities 9.…
+ $ area_2021_unclassified_built_up_areas_and_related_land 8.…
+ $ area_2021_agricultural_land 9.…
+ $ area_2021_forest 27…
+ $ area_2021_open_firm_ground 7.…
+ $ area_2021_wetland 4.…
+ $ area_2021_bare_rock_gravel_and_blockfields 0.…
+ $ area_2021_permanent_snow_and_glaciers 0.…
+ $ area_2021_inland_waters 28…
+ $ area_2021_unclassified_undeveloped_areas 0,…
+ $ total_area 45…
+ $ kommune_code "0…
+ $ location_code "…
+ $ perc_rocks 0.…
+ $ perc_wetland 0.…
+ $ perc_forest 0.…
+ $ perc_open_ground 0.…
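As a small aside: the zero-padding of the kommune number with `format()` and `str_replace_all()` above could also be done in a single step with `str_pad()` from `{stringr}`. A minimal, equivalent sketch (just an alternative, not the code I ran):

``` r
area_use |>
  mutate(
    # pad the kommune number to 4 digits with leading zeros
    kommune_code = str_pad(parse_number(region), width = 4, pad = "0"),
    location_code = str_glue("municip_nor{kommune_code}")
  ) |>
  select(region, kommune_code, location_code) |>
  head()
```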
Then the next step is very similar to what we've done before. We'll use `left_join()` to merge the data frame containing the land use variables with the data frame containing the map with the kommune borders. I want to plot the four designations of interest in one figure, so I'll transform the data to a long format using `pivot_longer()`. Then I'll create a new label with nicer descriptions of the designations, and then the rest is similar to before. We'll facet the plot based on the designation:
``` r
-nor_municip_map_b2020_split_dt |>
- left_join(area_use, by = "location_code") |>
- pivot_longer(cols = starts_with("perc"), names_to = "land_type", values_to = "percentage") |>
- mutate(land_type_label = case_when(str_detect(land_type, "rocks") ~ "Bare rock, gravel and rockfields",
- str_detect(land_type, "wetland") ~ "Wetland",
- str_detect(land_type, "forest") ~ "Forest",
- str_detect(land_type, "open_ground") ~ "Open firm ground")) |>
- ggplot(aes(x = long, y = lat, group = group, fill = percentage)) +
- geom_polygon() +
+nor_municip_map_b2020_split_dt |>
+ left_join(area_use, by = "location_code") |>
+ pivot_longer(
+ cols = starts_with("perc"),
+ names_to = "land_type", values_to = "percentage"
+ ) |>
+ mutate(land_type_label = case_when(
+    str_detect(land_type, "rocks") ~ "Bare rock, gravel and blockfields",
+ str_detect(land_type, "wetland") ~ "Wetland",
+ str_detect(land_type, "forest") ~ "Forest",
+ str_detect(land_type, "open_ground") ~ "Open firm ground"
+ )) |>
+ ggplot(aes(
+ x = long, y = lat,
+ group = group, fill = percentage
+ )) +
+ geom_polygon() +
labs(fill = "Percentage") +
- scico::scale_fill_scico(palette = "acton", labels = scales::label_percent(), limits = c(0,1),
- guide = guide_colorbar(barheight = 0.5, barwidth = 12,
- ticks = FALSE, direction = "horizontal",
- title.position = "top", title.hjust = 0.5)) +
- facet_wrap(~ land_type_label) +
- coord_map(projection = "conic", lat0 = 60) +
+ scico::scale_fill_scico(
+ palette = "acton",
+ labels = scales::label_percent(),
+ limits = c(0, 1),
+ guide = guide_colorbar(
+ barheight = 0.5, barwidth = 12,
+ ticks = FALSE, direction = "horizontal",
+ title.position = "top", title.hjust = 0.5
+ )
+ ) +
+ facet_wrap(~land_type_label) +
+ coord_map(projection = "conic", lat0 = 60) +
theme_void() +
- theme(legend.position = "bottom",
- strip.text.x = element_textbox_simple(size = rel(1.25), halign = 0.5,
- margin = margin(10,0,10,0, "pt")))
+ theme(
+ legend.position = "bottom",
+ strip.text.x = element_textbox_simple(
+ size = rel(1.25), halign = 0.5,
+ margin = margin(10, 0, 10, 0, "pt")
+ )
+ )
```
@@ -353,45 +456,53 @@ nor_municip_map_b2020_split_dt |>
The last thing I want to show is a map of Oslo! `{csmaps}` contains a detailed map of the bydeler that we will use. Now, these bydeler are again coded, and `{csdata}` (since its update) now contains a [data frame](https://www.csids.no/csmaps/articles/customization.html#add-location-name-for-ward-and-population-for-oslo-map) with the corresponding names for all geography levels (fylker, municipalities, bydeler, etc.). We could get our data from there, but we also need something to visualize, so we'll scrape a Wikipedia article on Oslo's bydeler which contains a table with the bydel numbers, the names, and some data we can use for visualization. We'll extract the table from the website using `{rvest}`, do some data wrangling, and prepare it for merging into the data frame with the map. I won't go into the wrangling much here, since we're interested mainly in plotting the data right now.
+{{< sidenote >}}
+I won't go into much detail on web scraping here, you can see one of the [other posts](https://danielroelfs.com/blog/) for more details
+{{< /sidenote >}}
+
``` r
-bydel_data <- "https://en.wikipedia.org/wiki/List_of_boroughs_of_Oslo" |>
- rvest::read_html() |>
- rvest::html_table() |>
- pluck(1) |>
- janitor::clean_names() |>
- mutate(inhabitants = str_remove_all(residents, "[[:blank:]]"),
- inhabitants = as.numeric(inhabitants),
- area = str_remove_all(area, "km2"),
- area = str_replace_all(area, ",", "."),
- area = str_squish(area),
- area = as.numeric(area),
- pop_density = inhabitants / area) |>
- arrange(number) |>
- mutate(bydel_nr = format(number, digits = 2),
- bydel_nr = str_replace_all(bydel_nr, " ", "0"),
- location_code = str_glue("wardoslo_nor0301{bydel_nr}"))
+bydel_data <- "https://en.wikipedia.org/wiki/List_of_boroughs_of_Oslo" |>
+ rvest::read_html() |>
+ rvest::html_table() |>
+ pluck(1) |>
+ janitor::clean_names() |>
+ mutate(
+ inhabitants = str_remove_all(residents, "[[:blank:]]"),
+ inhabitants = as.numeric(inhabitants),
+ area = str_remove_all(area, "km2"),
+ area = str_replace_all(area, ",", "."),
+ area = str_squish(area),
+ area = as.numeric(area),
+ pop_density = inhabitants / area
+ ) |>
+ arrange(number) |>
+ mutate(
+ bydel_nr = format(number, digits = 2),
+ bydel_nr = str_replace_all(bydel_nr, " ", "0"),
+ location_code = str_glue("wardoslo_nor0301{bydel_nr}")
+ )
print(bydel_data)
```
# A tibble: 15 × 8
- borough residents area number inhabitants pop_density bydel_nr location_code
-
- 1 Gamle Oslo 58 671 7.5 1 58671 7823. 01 wardoslo_nor030101
- 2 Grünerløkka 62 423 4.8 2 62423 13005. 02 wardoslo_nor030102
- 3 Sagene 45 089 3.1 3 45089 14545. 03 wardoslo_nor030103
- 4 St. Hanshaugen 38 945 3.6 4 38945 10818. 04 wardoslo_nor030104
- 5 Frogner 59 269 8.3 5 59269 7141. 05 wardoslo_nor030105
- 6 Ullern 34 596 9.4 6 34596 3680. 06 wardoslo_nor030106
- 7 Vestre Aker 50 157 16.6 7 50157 3022. 07 wardoslo_nor030107
- 8 Nordre Aker 52 327 13.6 8 52327 3848. 08 wardoslo_nor030108
- 9 Bjerke 33 422 7.7 9 33422 4341. 09 wardoslo_nor030109
- 10 Grorud 27 707 8.2 10 27707 3379. 10 wardoslo_nor030110
- 11 Stovner 33 316 8.2 11 33316 4063. 11 wardoslo_nor030111
- 12 Alna 49 801 13.7 12 49801 3635. 12 wardoslo_nor030112
- 13 Østensjø 50 806 12.2 13 50806 4164. 13 wardoslo_nor030113
- 14 Nordstrand 52 459 16.9 14 52459 3104. 14 wardoslo_nor030114
- 15 Søndre Nordstrand 39 066 18.4 15 39066 2123. 15 wardoslo_nor030115
+ borough residents area number inhabitants pop_density bydel_nr location_code
+
+ 1 Gamle … 58 671 7.5 1 58671 7823. 01 wardoslo_nor…
+ 2 Grüner… 62 423 4.8 2 62423 13005. 02 wardoslo_nor…
+ 3 Sagene 45 089 3.1 3 45089 14545. 03 wardoslo_nor…
+ 4 St. Ha… 38 945 3.6 4 38945 10818. 04 wardoslo_nor…
+ 5 Frogner 59 269 8.3 5 59269 7141. 05 wardoslo_nor…
+ 6 Ullern 34 596 9.4 6 34596 3680. 06 wardoslo_nor…
+ 7 Vestre… 50 157 16.6 7 50157 3022. 07 wardoslo_nor…
+ 8 Nordre… 52 327 13.6 8 52327 3848. 08 wardoslo_nor…
+ 9 Bjerke 33 422 7.7 9 33422 4341. 09 wardoslo_nor…
+ 10 Grorud 27 707 8.2 10 27707 3379. 10 wardoslo_nor…
+ 11 Stovner 33 316 8.2 11 33316 4063. 11 wardoslo_nor…
+ 12 Alna 49 801 13.7 12 49801 3635. 12 wardoslo_nor…
+ 13 Østens… 50 806 12.2 13 50806 4164. 13 wardoslo_nor…
+ 14 Nordst… 52 459 16.9 14 52459 3104. 14 wardoslo_nor…
+ 15 Søndre… 39 066 18.4 15 39066 2123. 15 wardoslo_nor…
`{csmaps}` also provides a very useful data frame containing, for each bydel, the geographical center or the best position for a label, so that labels don't overlap and it's as clear as possible which label corresponds to which bydel. We'll merge those two together.
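A rough sketch of that merge (the exact name of the geolabel data frame is an assumption on my part; check `data(package = "csmaps")` for the object in your version):

``` r
# Sketch: join the assumed label-position data frame with the borough names
# (dataset name not verified; see data(package = "csmaps"))
bydel_centres <- csmaps::oslo_ward_position_geolabels_b2020_default_dt |>
  left_join(
    bydel_data |> select(location_code, borough),
    by = "location_code"
  )
```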
@@ -404,19 +515,34 @@ Then we'll create the final plot. This will be more-or-less identical to what we
``` r
oslo_ward_map_b2020_default_dt |>
- left_join(bydel_data, by = "location_code") |>
+ left_join(bydel_data, by = "location_code") |>
ggplot(aes(x = long, y = lat, group = group)) +
- geom_polygon(aes(color = pop_density, fill = pop_density)) +
- geom_label(data = bydel_centres, aes(label = borough, group = 1), alpha = 0.5, label.size = 0) +
- labs(fill = "No of inhabitants per km2") +
- scico::scale_color_scico(palette = "turku", limits = c(0,1.5e4), guide = "none") +
- scico::scale_fill_scico(palette = "turku", limits = c(0,1.5e4), labels = scales::number_format(),
- guide = guide_colorbar(title.position = "top", title.hjust = 0.5,
- barwidth = 15, barheight = 0.75, ticks = FALSE)) +
- theme_void() +
- theme(legend.position = "bottom",
- legend.title = element_markdown(),
- legend.direction = "horizontal")
+ geom_polygon(aes(color = pop_density, fill = pop_density)) +
+ geom_label(
+ data = bydel_centres, aes(label = borough, group = 1),
+ alpha = 0.5, label.size = 0
+ ) +
+ labs(fill = "No of inhabitants per km2") +
+ scico::scale_color_scico(
+ palette = "turku",
+ limits = c(0, 1.5e4),
+ guide = "none"
+ ) +
+ scico::scale_fill_scico(
+ palette = "turku",
+ limits = c(0, 1.5e4),
+ labels = scales::number_format(),
+ guide = guide_colorbar(
+ title.position = "top", title.hjust = 0.5,
+ barwidth = 15, barheight = 0.75, ticks = FALSE
+ )
+ ) +
+ theme_void() +
+ theme(
+ legend.position = "bottom",
+ legend.title = element_markdown(),
+ legend.direction = "horizontal"
+ )
```
@@ -429,11 +555,11 @@ An example of when you might use the `sf` data format is in interactive maps. He
library(leaflet)
map_sf <- csmaps::nor_county_map_b2020_default_sf |>
- sf::st_as_sf() |>
+ sf::st_as_sf() |>
left_join(county_names, by = "location_code")
-map_sf |>
- leaflet() |>
+map_sf |>
+ leaflet() |>
addProviderTiles(providers$Esri.WorldStreetMap) |>
addPolygons(
fillColor = unname(county_colors),
@@ -442,7 +568,8 @@ map_sf |>
fillOpacity = 0.75,
highlightOptions = highlightOptions(
color = "#333", bringToFront = TRUE,
- weight = 2, opacity = 1)
+ weight = 2, opacity = 1
+ )
)
```
@@ -453,3 +580,7 @@ map_sf |>
In addition, I also replaced all usage of the NORMENT internal `{normentR}` package with the `{scico}` package which is available on CRAN.
**EDIT (2023-06-18)**: Updated the blogpost to reflect the deprecation of the `{splmaps}` and `{spldata}` packages. Replaced them with the replacement package `{csmaps}` available on CRAN.
+
+
+
+
diff --git a/content/blog/2021-easy-map-norway/index.qmd b/content/blog/2021-easy-map-norway/index.qmd
index c52efc6..efd2443 100644
--- a/content/blog/2021-easy-map-norway/index.qmd
+++ b/content/blog/2021-easy-map-norway/index.qmd
@@ -9,52 +9,38 @@ tags:
- ggplot
- map
- norway
-editor_options:
- chunk_output_type: console
execute:
fig.retina: 2
fig.align: center
fig.show: hold
results: hold
out.width: 80%
+editor_options:
+ chunk_output_type: console
---
-
```{css}
#| label: style
#| echo: FALSE
-p.announcement {
- border-radius: 5px;
- background-color: #acc8d4;
- padding: 1em;
-}
-
-p.announcement code {
+p.stand-out-paragraph code {
background-color: #93b8c8;
}
```
-
-As of March 2023 {fhimaps} and {splmaps} are no longer supported due to budget cuts at the institution supporting them. The amazing developers (led by Richard Aubrey White and Chi Zhang) have moved the data and functionality to a new package called {csmaps}. The code below is updated to reflect the changes.
-
-
-```{r}
-#| label: knitr-opts
-#| include: false
-
-options(width = 999)
-```
-
-
-
-
+{{{< standout bg="#acc8d4" >}}}
+As of March 2023 {fhimaps} and {splmaps} are no longer supported due to budget cuts at the institution supporting them. The amazing developers (led by [Richard Aubrey White](https://www.rwhite.no) and [Chi Zhang](https://andreaczhang.github.io)) have moved the data and functionality to a new package called `{csmaps}`. The code below is updated to reflect the changes.
+{{{< /standout >}}}
Every now and then you discover a much simpler solution to a problem you spent a lot of time solving. This recently happened to me on the topic of creating a map of Norway in R. In this post, I want to go through what I learned.
Previously, I used a JSON file and the `{geojsonio}` package to create a map of Norway and its fylker (counties) in particular. This was a very flexible and robust approach, but also quite cumbersome. It relies on a high-quality JSON file, meaning a file that is detailed enough to display everything nicely, but not so detailed that it takes a ton of time and computing power to create a single plot. While I'll still use this method if I need to create a map for places other than Norway, I think I've found a better and easier solution for plotting Norway and its fylker and kommuner in the [`{csmaps}`](https://www.csids.no/csmaps/) package.
-The `{csmaps}` package is created by the _Consortium for Statistics in Disease Surveillance_ team. It's part of a series of packages (which they refer to as the "csverse"), which includes a package containing basic health surveillance data ([`{csdata}`](https://www.csids.no/csdata/)), one for real-time analysis in disease surveillance ([`{sc9}`](https://www.csids.no/sc9/)) and a few more. Here I'll dive into the `{csmaps}` package with some help from the `{csdata}` package. I'll also use the `{ggmap}` package to help with some other data and plotting. It's perhaps important to note that `{ggmap}` does contain a map of Norway as a whole, but not of the fylker and kommuner (municipalities), hence the usefulness of the `{csmaps}` package, which contains both. I'll also use `{tidyverse}` and `{ggtext}` as I always do. I won't load `{csmaps}` with the `library()` function, but will use the `::` operator instead since it'll make it easier to navigate the different datasets included.
+{{{< sidenote >}}}
+The group has [a bunch of packages](https://www.csids.no) for data science within public health
+{{{< /sidenote >}}}
+
+The `{csmaps}` package is created by the [_Consortium for Statistics in Disease Surveillance_](https://www.csids.no) team. It's part of a series of packages (which they refer to as the "csverse"), which includes a package containing basic health surveillance data ([`{csdata}`](https://www.csids.no/csdata/)), one for real-time analysis in disease surveillance ([`{sc9}`](https://www.csids.no/sc9/)) and a few more. Here I'll dive into the `{csmaps}` package with some help from the `{csdata}` package. I'll also use the `{ggmap}` package to help with some other data and plotting. It's perhaps important to note that `{ggmap}` does contain a map of Norway as a whole, but not of the fylker and kommuner (municipalities), hence the usefulness of the `{csmaps}` package, which contains both. I'll also use `{tidyverse}` and `{ggtext}` as I always do. I won't load `{csmaps}` with the `library()` function, but will use the `::` operator instead since it'll make it easier to navigate the different datasets included.
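A minimal sketch of what that namespacing looks like in practice (a hypothetical chunk, not part of the original setup):

```{r}
#| label: namespace-sketch
#| eval: false

# Sketch: access a {csmaps} dataset through :: without attaching the package
csmaps::nor_county_map_b2020_insert_oslo_dt |>
  dplyr::glimpse()
```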
```{r}
#| label: pkgs
@@ -72,10 +58,10 @@ So let's have a look at what's included. You'll see that nearly all maps come in
#| label: print-map-list
#| eval: false
-data(package = "csmaps") |>
- pluck("results") |>
- as_tibble() |>
- select(Item, Title) |>
+data(package = "csmaps") |>
+ pluck("results") |>
+ as_tibble() |>
+ select(Item, Title) |>
print(n = 18)
```
@@ -86,7 +72,7 @@ So let's have a look at one of those maps. For instance the one with the new fyl
```{r}
#| label: load-map
-map_df <- nor_county_map_b2020_insert_oslo_dt |>
+map_df <- nor_county_map_b2020_insert_oslo_dt |>
glimpse()
```
@@ -95,7 +81,10 @@ Immediately you can see that there's a lot of rows, each representing a point on
```{r}
#| label: minimal-plot
-ggplot(map_df, aes(x = long, y = lat, group = group, fill = location_code)) +
+ggplot(map_df, aes(
+ x = long, y = lat,
+ group = group, fill = location_code
+)) +
geom_polygon()
```
@@ -104,21 +93,30 @@ Now let's convert the awkward county numbers to the actual names of the fylker.
```{r}
#| label: fylke-names
-county_names <- csdata::nor_locations_names(border = 2020) |>
+county_names <- csdata::nor_locations_names(border = 2020) |>
filter(str_detect(location_code, "^county")) |>
distinct(location_code, location_name)
print(county_names)
```
-Now let's also create a nice color palette to give each fylke a nicer color than the default ggplot colors. We'll create a named vector to match each fylke with a color from the _batlow_ palette by [Fabio Crameri](https://www.fabiocrameri.ch/colourmaps/).
+{{{< sidenote >}}}
+The [`{scico}`](https://github.com/thomasp85/scico) package contains a collection of color palettes that are great for accessibility
+{{{< /sidenote >}}}
+
+Now let's also create a nice color palette to give each fylke a nicer color than the default ggplot colors. We'll create a named vector to match each fylke with a color from the _batlow_ palette by [Fabio Crameri](https://www.fabiocrameri.ch/colourmaps/) implemented in the `{scico}` package.
```{r}
#| label: fylke-colors
#| message: false
-county_colors <- setNames(scico::scico(n = nrow(county_names), palette = "batlow"),
- nm = county_names$location_name)
+county_colors <- setNames(
+ scico::scico(
+ n = nrow(county_names),
+ palette = "batlow"
+ ),
+ nm = county_names$location_name
+)
```
Let's see what we can make now. We'll add the county names to the large data frame containing the longitudes and latitudes and then create a plot again. I'll also add some other style elements, such as labels for the x- and y-axes, circles instead of squares for the legend, and a map projection. For Norway especially I think a conic map projection works well: the northern fylker are massive while the southern fylker are more dense, so a conic projection with a cone tangent of 40 degrees makes the map a bit more perceptually balanced (`lat0` refers to the cone tangent; the details are complicated, but a higher cone tangent results in greater distortion in favor of southern points).
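To get a feel for what `lat0` does, a quick illustrative comparison (a sketch only, not part of the original post) could look like this:

```{r}
#| label: compare-lat0
#| eval: false

# Sketch: draw the same map with two different cone tangents to see how
# the projection redistributes the distortion
p <- ggplot(map_df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey70") +
  theme_void()

p + coord_map(projection = "conic", lat0 = 20)
p + coord_map(projection = "conic", lat0 = 60)
```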
@@ -126,23 +124,39 @@ Let's see what we can make now. We'll add the county names to the large data fra
```{r}
#| label: simple-plot
-map_df |>
- left_join(county_names, by = "location_code") |>
- ggplot(aes(x = long, y = lat, fill = location_name, group = group)) +
- geom_polygon(key_glyph = "point") +
- labs(x = NULL,
- y = NULL,
- fill = NULL) +
- scale_x_continuous(labels = scales::label_number(suffix = "\u00b0W")) +
- scale_y_continuous(labels = scales::label_number(suffix = "\u00b0N")) +
- scale_fill_manual(values = county_colors,
- guide = guide_legend(override.aes = list(shape = 21, size = 4))) +
- coord_map(projection = "conic", lat0 = 40) +
+map_df |>
+ left_join(county_names, by = "location_code") |>
+ ggplot(aes(
+ x = long, y = lat,
+ fill = location_name, group = group
+ )) +
+ geom_polygon(key_glyph = "point") +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = NULL
+ ) +
+ scale_x_continuous(
+ labels = scales::label_number(suffix = "\u00b0W")
+ ) +
+ scale_y_continuous(
+ labels = scales::label_number(suffix = "\u00b0N")
+ ) +
+ scale_fill_manual(
+ values = county_colors,
+ guide = guide_legend(override.aes = list(shape = 21, size = 4))
+ ) +
+ coord_map(projection = "conic", lat0 = 40) +
theme_minimal() +
- theme(legend.position = c(0.9,0.2),
- legend.text = element_text(size = 5),
- legend.key.height = unit(10,"pt"),
- legend.background = element_rect(fill = "white", color = "transparent"))
+ theme(
+ legend.position = c(0.9, 0.2),
+ legend.text = element_text(size = 5),
+ legend.key.height = unit(10, "pt"),
+ legend.background = element_rect(
+ fill = "white",
+ color = "transparent"
+ )
+ )
```
## Norway with Scandinavia
@@ -162,12 +176,20 @@ Let's also combine the map with some actual data. The `{csdata}` package contain
```{r}
#| label: get-age-data
-age_data <- csdata::nor_population_by_age_cats(border = 2020, cats = list(c(0:18))) |>
- filter(str_detect(location_code, "^county"),
- calyear == max(calyear)) |>
- pivot_wider(id_cols = location_code, names_from = age, values_from = pop_jan1_n) |>
+age_data <- csdata::nor_population_by_age_cats(
+ border = 2020,
+ cats = list(seq(18))
+) |>
+ filter(
+ str_detect(location_code, "^county"),
+ calyear == max(calyear)
+ ) |>
+ pivot_wider(
+ id_cols = location_code,
+ names_from = age, values_from = pop_jan1_n
+ ) |>
janitor::clean_names() |>
- rename(age_0_18 = x000_018) |>
+ rename(age_0_18 = x001_018) |>
mutate(proportion = age_0_18 / total)
```
@@ -177,37 +199,61 @@ Let's create a map without the Oslo inset, combine it with the age distribution
#| label: age-plot
nor_county_map_b2020_default_dt |>
- left_join(age_data, by = "location_code") |>
+ left_join(age_data, by = "location_code") |>
ggplot(aes(x = long, y = lat, group = group)) +
- geom_polygon(data = map_data("world") |> filter(region != "Norway"),
- fill = "grey80") +
- geom_polygon(aes(fill = proportion), key_glyph = "point") +
+ geom_polygon(
+ data = map_data("world") |> filter(region != "Norway"),
+ fill = "grey80"
+ ) +
+ geom_polygon(aes(fill = proportion), key_glyph = "point") +
labs(fill = "Proportion of the population younger than 18") +
- scico::scale_fill_scico(palette = "devon", limits = c(0.15, 0.31),
- labels = scales::percent_format(accuracy = 1),
- guide = guide_colorbar(title.position = "top", title.hjust = 0.5,
- barwidth = 10, barheight = 0.5, ticks = FALSE)) +
- coord_map(projection = "conic", lat0 = 60,
- xlim = c(-8,40), ylim = c(57, 70)) +
+ scico::scale_fill_scico(
+ palette = "devon", limits = c(0.15, 0.31),
+ labels = scales::percent_format(accuracy = 1),
+ guide = guide_colorbar(
+ title.position = "top", title.hjust = 0.5,
+ barwidth = 10, barheight = 0.5, ticks = FALSE
+ )
+ ) +
+ coord_map(
+ projection = "conic", lat0 = 60,
+ xlim = c(-8, 40), ylim = c(57, 70)
+ ) +
theme_void() +
- theme(plot.background = element_rect(fill = "#A2C0F4", color = "transparent"),
- legend.direction = "horizontal",
- legend.position = c(0.8, 0.1),
- legend.title = element_text(size = 8),
- legend.text = element_text(size = 6))
+ theme(
+ plot.background = element_rect(
+ fill = "#A2C0F4",
+ color = "transparent"
+ ),
+ legend.direction = "horizontal",
+ legend.position = c(0.8, 0.1),
+ legend.title = element_text(size = 8),
+ legend.text = element_text(size = 6)
+ )
```
## Geocoding
+{{{< sidenote br="3em" >}}}
+The Google Maps API requires a personal API key, you can read how to obtain and register an API key [here](https://github.com/dkahle/ggmap)
+{{{< /sidenote >}}}
+
The `{ggmap}` package also has an incredibly useful function called `mutate_geocode()`, which transforms an address or free-text description into a longitude and latitude. Since `{ggmap}` uses the Google Maps API, it works similarly to typing a description into Google Maps, so an approximation of the location will (most likely) get you the right result (e.g. "Hospital Lillehammer"). Note that `mutate_geocode()` uses `lon` instead of `long` as the column name for longitude. Just to avoid confusion, I'll rename the column to `long`.
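Before geocoding anything you need to register a personal API key with `{ggmap}`; a minimal sketch, assuming the key is stored in an environment variable:

```{r}
#| label: register-api-key
#| eval: false

# Assumed setup: register a personal Google Maps API key once per session
# (the environment variable name here is just an example)
ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))
```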
```{r}
#| label: get-cities-locations
#| eval: false
-hospitals_df <- tibble(location = c("Ullevål Sykehus, Oslo","Haukeland universitetssjukehus, Bergen","St. Olav, Trondheim",
- "Universitetssykehuset Nord-Norge, Tromsø","Stavanger Universitetssjukehus","Sørlandet Hospital Kristiansand", "Hospital Lillehammer")) |>
- mutate_geocode(location) |>
+hospitals_df <- tibble(location = c(
+ "Ullevål Sykehus, Oslo",
+ "Haukeland universitetssjukehus, Bergen",
+ "St. Olav, Trondheim",
+ "Universitetssykehuset Nord-Norge, Tromsø",
+ "Stavanger Universitetssjukehus",
+ "Sørlandet Hospital Kristiansand",
+ "Hospital Lillehammer"
+)) |>
+ mutate_geocode(location) |>
rename(long = lon)
```
@@ -216,14 +262,14 @@ hospitals_df <- tibble(location = c("Ullevål Sykehus, Oslo","Haukeland universi
#| echo: false
#| eval: false
-write_rds(hospitals_df, "norway_hospitals.rds")
+write_rds(hospitals_df, "./data/norway_hospitals.rds")
```
```{r}
#| label: load-cities-geo
#| echo: false
-hospitals_df <- read_rds("norway_hospitals.rds")
+hospitals_df <- read_rds("./data/norway_hospitals.rds")
```
This is the list of coordinates it gave us:
@@ -241,27 +287,47 @@ Now let's put these on top of the map. We'll use the same map we used earlier. W
set.seed(21)
-map_df |>
- left_join(county_names, by = "location_code") |>
- ggplot(aes(x = long, y = lat, fill = location_name, group = group)) +
- geom_polygon(key_glyph = "point") +
- geom_segment(data = hospitals_df |> filter(str_detect(location, "Oslo")),
- aes(x = long, y = lat, xend = 19.5, yend = 62), inherit.aes = FALSE) +
- geom_point(data = hospitals_df, aes(x = long, y = lat), inherit.aes = FALSE,
- shape = 18, color = "firebrick", size = 4, show.legend = FALSE) +
- ggrepel::geom_label_repel(data = hospitals_df, aes(x = long, y = lat, label = location),
- size = 2, alpha = 0.75, label.size = 0, inherit.aes = FALSE) +
- labs(x = NULL,
- y = NULL,
- fill = NULL) +
- scale_fill_manual(values = county_colors,
- guide = guide_legend(override.aes = list(size = 4, shape = 21,
- color = "transparent"))) +
- coord_map(projection = "conic", lat0 = 60) +
+map_df |>
+ left_join(county_names, by = "location_code") |>
+ ggplot(aes(
+ x = long, y = lat,
+ fill = location_name, group = group
+ )) +
+ geom_polygon(key_glyph = "point") +
+ geom_segment(
+ data = hospitals_df |> filter(str_detect(location, "Oslo")),
+ aes(x = long, y = lat, xend = 19.5, yend = 62),
+ inherit.aes = FALSE
+ ) +
+ geom_point(
+ data = hospitals_df, aes(x = long, y = lat),
+ inherit.aes = FALSE,
+ shape = 18, color = "firebrick",
+ size = 4, show.legend = FALSE
+ ) +
+ ggrepel::geom_label_repel(
+ data = hospitals_df, aes(x = long, y = lat, label = location),
+ size = 2, alpha = 0.75, label.size = 0, inherit.aes = FALSE
+ ) +
+ labs(
+ x = NULL,
+ y = NULL,
+ fill = NULL
+ ) +
+ scale_fill_manual(
+ values = county_colors,
+ guide = guide_legend(override.aes = list(
+ size = 4, shape = 21,
+ color = "transparent"
+ ))
+ ) +
+ coord_map(projection = "conic", lat0 = 60) +
theme_void() +
- theme(legend.position = c(0.2,0.7),
- legend.text = element_text(size = 5),
- legend.key.height = unit(10,"pt"))
+ theme(
+ legend.position = c(0.2, 0.7),
+ legend.text = element_text(size = 5),
+ legend.key.height = unit(10, "pt")
+ )
```
@@ -273,7 +339,9 @@ Let's take it a step further and now look at how we can combine our map with dat
#| label: load-area-use
#| message: false
-area_use <- read_delim("areal.csv", delim = ";", skip = 2) |>
+area_use <- read_delim("./data/areal.csv",
+ delim = ";", skip = 2
+) |>
janitor::clean_names()
print(area_use)
@@ -284,17 +352,19 @@ You can see there's 356 rows, each representing a different kommune in Norway. T
```{r}
#| label: wrangle-area-use
-area_use <- area_use |>
- mutate(total_area = rowSums(across(where(is.numeric))),
- kommune_code = parse_number(region),
- kommune_code = format(kommune_code, digits = 4),
- kommune_code = str_replace_all(kommune_code, " ", "0"),
- location_code = str_glue("municip_nor{kommune_code}"),
- perc_rocks = area_2021_bare_rock_gravel_and_blockfields / total_area,
- perc_wetland = area_2021_wetland / total_area,
- perc_forest = area_2021_forest / total_area,
- perc_open_ground = area_2021_open_firm_ground / total_area) |>
- arrange(kommune_code) |>
+area_use <- area_use |>
+ mutate(
+ total_area = rowSums(across(where(is.numeric))),
+ kommune_code = parse_number(region),
+ kommune_code = format(kommune_code, digits = 4),
+ kommune_code = str_replace_all(kommune_code, " ", "0"),
+ location_code = str_glue("municip_nor{kommune_code}"),
+ perc_rocks = area_2021_bare_rock_gravel_and_blockfields / total_area,
+ perc_wetland = area_2021_wetland / total_area,
+ perc_forest = area_2021_forest / total_area,
+ perc_open_ground = area_2021_open_firm_ground / total_area
+ ) |>
+ arrange(kommune_code) |>
glimpse()
```
@@ -303,51 +373,77 @@ Then the next step is very similar to what we've done before. We'll use `left_jo
```{r}
#| label: plot-kommune-faceted
-nor_municip_map_b2020_split_dt |>
- left_join(area_use, by = "location_code") |>
- pivot_longer(cols = starts_with("perc"), names_to = "land_type", values_to = "percentage") |>
- mutate(land_type_label = case_when(str_detect(land_type, "rocks") ~ "Bare rock, gravel and rockfields",
- str_detect(land_type, "wetland") ~ "Wetland",
- str_detect(land_type, "forest") ~ "Forest",
- str_detect(land_type, "open_ground") ~ "Open firm ground")) |>
- ggplot(aes(x = long, y = lat, group = group, fill = percentage)) +
- geom_polygon() +
+nor_municip_map_b2020_split_dt |>
+ left_join(area_use, by = "location_code") |>
+ pivot_longer(
+ cols = starts_with("perc"),
+ names_to = "land_type", values_to = "percentage"
+ ) |>
+ mutate(land_type_label = case_when(
+ str_detect(land_type, "rocks") ~ "Bare rock, gravel and rockfields",
+ str_detect(land_type, "wetland") ~ "Wetland",
+ str_detect(land_type, "forest") ~ "Forest",
+ str_detect(land_type, "open_ground") ~ "Open firm ground"
+ )) |>
+ ggplot(aes(
+ x = long, y = lat,
+ group = group, fill = percentage
+ )) +
+ geom_polygon() +
labs(fill = "Percentage") +
- scico::scale_fill_scico(palette = "acton", labels = scales::label_percent(), limits = c(0,1),
- guide = guide_colorbar(barheight = 0.5, barwidth = 12,
- ticks = FALSE, direction = "horizontal",
- title.position = "top", title.hjust = 0.5)) +
- facet_wrap(~ land_type_label) +
- coord_map(projection = "conic", lat0 = 60) +
+ scico::scale_fill_scico(
+ palette = "acton",
+ labels = scales::label_percent(),
+ limits = c(0, 1),
+ guide = guide_colorbar(
+ barheight = 0.5, barwidth = 12,
+ ticks = FALSE, direction = "horizontal",
+ title.position = "top", title.hjust = 0.5
+ )
+ ) +
+ facet_wrap(~land_type_label) +
+ coord_map(projection = "conic", lat0 = 60) +
theme_void() +
- theme(legend.position = "bottom",
- strip.text.x = element_textbox_simple(size = rel(1.25), halign = 0.5,
- margin = margin(10,0,10,0, "pt")))
+ theme(
+ legend.position = "bottom",
+ strip.text.x = element_textbox_simple(
+ size = rel(1.25), halign = 0.5,
+ margin = margin(10, 0, 10, 0, "pt")
+ )
+ )
```
## Oslo
The last thing I want to show is a map of Oslo! `{csmaps}` contains a detailed map of the bydeler that we will use. Now, these bydeler are again coded, and `{csdata}` (since its update) contains a [data frame](https://www.csids.no/csmaps/articles/customization.html#add-location-name-for-ward-and-population-for-oslo-map) with the corresponding names for all geography levels (fylker, municipalities, bydeler, etc.). We could get our data from there, but we also need something to visualize, so we'll scrape a Wikipedia article on Oslo's bydeler, which contains a table with the bydel numbers, the names, and some data we can use for visualization. We'll extract the table from the website using `{rvest}`, do some data wrangling, and prepare it for merging into the data frame with the map. I won't go into the wrangling much here; we're interested mainly in the plotting of the data right now.
+{{{< sidenote >}}}
+I won't go into much detail on web scraping here, you can see one of the [other posts](https://danielroelfs.com/blog/) for more details
+{{{< /sidenote >}}}
+
```{r}
#| label: scrape-bydel-names
-bydel_data <- "https://en.wikipedia.org/wiki/List_of_boroughs_of_Oslo" |>
- rvest::read_html() |>
- rvest::html_table() |>
- pluck(1) |>
- janitor::clean_names() |>
- mutate(inhabitants = str_remove_all(residents, "[[:blank:]]"),
- inhabitants = as.numeric(inhabitants),
- area = str_remove_all(area, "km2"),
- area = str_replace_all(area, ",", "."),
- area = str_squish(area),
- area = as.numeric(area),
- pop_density = inhabitants / area) |>
- arrange(number) |>
- mutate(bydel_nr = format(number, digits = 2),
- bydel_nr = str_replace_all(bydel_nr, " ", "0"),
- location_code = str_glue("wardoslo_nor0301{bydel_nr}"))
+bydel_data <- "https://en.wikipedia.org/wiki/List_of_boroughs_of_Oslo" |>
+ rvest::read_html() |>
+ rvest::html_table() |>
+ pluck(1) |>
+ janitor::clean_names() |>
+ mutate(
+ inhabitants = str_remove_all(residents, "[[:blank:]]"),
+ inhabitants = as.numeric(inhabitants),
+ area = str_remove_all(area, "km2"),
+ area = str_replace_all(area, ",", "."),
+ area = str_squish(area),
+ area = as.numeric(area),
+ pop_density = inhabitants / area
+ ) |>
+ arrange(number) |>
+ mutate(
+ bydel_nr = format(number, digits = 2),
+ bydel_nr = str_replace_all(bydel_nr, " ", "0"),
+ location_code = str_glue("wardoslo_nor0301{bydel_nr}")
+ )
print(bydel_data)
```
@@ -367,19 +463,34 @@ Then we'll create the final plot. This will be more-or-less identical to what we
#| label: plot-oslo
oslo_ward_map_b2020_default_dt |>
- left_join(bydel_data, by = "location_code") |>
+ left_join(bydel_data, by = "location_code") |>
ggplot(aes(x = long, y = lat, group = group)) +
- geom_polygon(aes(color = pop_density, fill = pop_density)) +
- geom_label(data = bydel_centres, aes(label = borough, group = 1), alpha = 0.5, label.size = 0) +
- labs(fill = "No of inhabitants per km2") +
- scico::scale_color_scico(palette = "turku", limits = c(0,1.5e4), guide = "none") +
- scico::scale_fill_scico(palette = "turku", limits = c(0,1.5e4), labels = scales::number_format(),
- guide = guide_colorbar(title.position = "top", title.hjust = 0.5,
- barwidth = 15, barheight = 0.75, ticks = FALSE)) +
- theme_void() +
- theme(legend.position = "bottom",
- legend.title = element_markdown(),
- legend.direction = "horizontal")
+ geom_polygon(aes(color = pop_density, fill = pop_density)) +
+ geom_label(
+ data = bydel_centres, aes(label = borough, group = 1),
+ alpha = 0.5, label.size = 0
+ ) +
+ labs(fill = "No of inhabitants per km2") +
+ scico::scale_color_scico(
+ palette = "turku",
+ limits = c(0, 1.5e4),
+ guide = "none"
+ ) +
+ scico::scale_fill_scico(
+ palette = "turku",
+ limits = c(0, 1.5e4),
+ labels = scales::number_format(),
+ guide = guide_colorbar(
+ title.position = "top", title.hjust = 0.5,
+ barwidth = 15, barheight = 0.75, ticks = FALSE
+ )
+ ) +
+ theme_void() +
+ theme(
+ legend.position = "bottom",
+ legend.title = element_markdown(),
+ legend.direction = "horizontal"
+ )
```
@@ -394,11 +505,11 @@ An example of when you might use the `sf` data format is in interactive maps. He
library(leaflet)
map_sf <- csmaps::nor_county_map_b2020_default_sf |>
- sf::st_as_sf() |>
+ sf::st_as_sf() |>
left_join(county_names, by = "location_code")
-map_sf |>
- leaflet() |>
+map_sf |>
+ leaflet() |>
addProviderTiles(providers$Esri.WorldStreetMap) |>
addPolygons(
fillColor = unname(county_colors),
@@ -407,7 +518,8 @@ map_sf |>
fillOpacity = 0.75,
highlightOptions = highlightOptions(
color = "#333", bringToFront = TRUE,
- weight = 2, opacity = 1)
+ weight = 2, opacity = 1
+ )
)
```
@@ -416,8 +528,8 @@ map_sf |>
#| echo: false
#| eval: false
-map <- map_sf |>
- leaflet() |>
+map <- map_sf |>
+ leaflet() |>
addProviderTiles(providers$Esri.WorldStreetMap) |>
addPolygons(
fillColor = unname(county_colors),
@@ -426,10 +538,11 @@ map <- map_sf |>
fillOpacity = 0.75,
highlightOptions = highlightOptions(
color = "#333", bringToFront = TRUE,
- weight = 2, opacity = 1)
+ weight = 2, opacity = 1
+ )
)
-htmlwidgets::saveWidget(map, str_glue("{here::here('content','blog','2021-easy-map-norway')}/leafMap.html"))
+htmlwidgets::saveWidget(map, "leafMap.html")
```
@@ -437,4 +550,8 @@ htmlwidgets::saveWidget(map, str_glue("{here::here('content','blog','2021-easy-m
**EDIT (2022-09-04)**: Updated the blogpost to replace usage of the retiring `{fhimaps}` and `{fhidata}` packages with the newer `{splmaps}` and `{spldata}` packages from FHI. The `{fhidata}` package included a dataset on vaccination rates in Norway, but since this isn't incorporated in the new `{spldata}` package I replaced that plot with a plot about age distribution.
In addition, I also replaced all usage of the NORMENT internal `{normentR}` package with the `{scico}` package which is available on CRAN.
-**EDIT (2023-06-18)**: Updated the blogpost to reflect the deprecation of the `{splmaps}` and `{spldata}` packages. Replaced them with the replacement package `{csmaps}` available on CRAN.
\ No newline at end of file
+**EDIT (2023-06-18)**: Updated the blogpost to reflect the deprecation of the `{splmaps}` and `{spldata}` packages. Replaced them with the replacement package `{csmaps}` available on CRAN.
+
+
+
+
diff --git a/content/blog/2021-easy-map-norway/leafMap.html b/content/blog/2021-easy-map-norway/leafMap.html
index adfc58d..43b2e78 100644
--- a/content/blog/2021-easy-map-norway/leafMap.html
+++ b/content/blog/2021-easy-map-norway/leafMap.html
@@ -4917,4 +4917,4 @@