how_is_this_computed.Rmd

---
title: "How is this computed?"
description: |
  A quick look at different approaches
author:
  - name: Giorgio Comai 
    url: https://giorgiocomai.eu
    affiliation: OBCT/EDJNet
    affiliation_url: https://www.europeandatajournalism.eu/
date: "`r Sys.Date()`"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library("tidyverse", quietly = TRUE)
library("sf", quietly = TRUE)
library("latlon2map")
library("RSQLite") # for caching
library("patchwork")
options(timeout = 60000)

cache_folder <- fs::path(fs::path_home_r(),
                         "R",
                         "ll_data")

fs::dir_create(cache_folder)

ll_set_folder(path = cache_folder)

## set db
db <- DBI::dbConnect(
  drv = RSQLite::SQLite(),
  fs::path(cache_folder, "pop_weighted_centre.sqlite")
)

source("functions_quality_checks.R")
```


# Basic example with a medium-sized town - Full intersection

At the most basic, the appoach at the base of the dataset takes all the cells of the population grid that intersects with a given municipality. In the population grid, each cell has a value, that corresponds to an estimate of how many people live within that given square kilometer. If a given cell of the population grid overlaps only partially with a given LAU, then the value of that cell is adjusted proportionally to the overlapping area. 

It then calculates a weighted average of the centroids of each cell of the population grid. 

As the point is to identify a point close the place where most people live, as an optional additional step, the value for each grid is raised to the power of 2, to "push" the point towards the most densely populated areas rather than in the periphery when, for example, more centres are included within the same municipality.

This is how it plays out in a rather typical town, with most residents clustered in an area, and few living along the administrative boundary.

The centroid points at some location in the mountains, while the population-weighted centre falls nicely into town. 

```{r}
gisco_id <- "IT_007003"

current_lau_sf <- ll_get_lau_eu(gisco_id = gisco_id)

intersect_sf <- ll_get_population_grid(
  year = 2018,
  match_sf = current_lau_sf,
  match_name = stringr::str_c(gisco_id,
                              "lau_2020",
                              "pop_grid_2018",
                              "intersects",
                              sep = "-"),
  join = sf::st_intersects,
  silent = TRUE
)


custom_ll_map_pop_grid(gisco_id = gisco_id,
                       pop_grid_sf = intersect_sf,
                       use_leaflet = TRUE,
                       differentiate = FALSE) 
```

As highlighted in the [original post outlining this method](https://medium.com/european-data-journalism-network/how-to-find-the-population-weighted-centre-of-local-administrative-units-a0d198fc91f7), there are however a number of situations that may complicate things, including poly-centric municipalities, municipalities with non-contiguous territories, islands, coastal areas, etc. 


## Examples of cases where the centre is not quite right

There are however cases when a large share of the population lives in cells that cut across the administrative boundary line, and it is unclear on which side of the line they live. In larger centres few will live within 1km of the administrative boundaries, but in smaller location this can be an issue. 

Let's look at the small Portuguese municipality of "Urros e Peredo dos Castelhanos" (PT_040921): with a total population of 376 residents, mostly living in one of the two locations that give the municipality its name. It so happens that the neighbouring village of Ligares falls largely into one of the cells cutting the boundary, skewing the population weighted centre (in blue). Notice that, by chance, the centroid - in purple, would be fine in this case.

To reduce the risk that grid cells crossing the boundary skew the result, this repository includes additional datasets, marked as "adjusted_intersection", that takes the cells crossing the boundary, and divide the value of the population grid according to the surface of that specific cell that actually falls into the boundary. As you see (green pin), the population-weighted centre thus adjusted falls closer to where we'd want it to be. The difference is negligible in the vast majority of cases, but in particular in mountain locations it is not rare to have large municipalities with lots of residents living close to the boundary lines... in such cases, the adjusted approach is very likely preferrable. It may, however, slightly decrease accuracy in some coastal locations, if grid cells with a substantial number of residents include also sea (in principle, this may be adjusted by ignoring the adjustment in coastal locations, but this is not yet included in the current scripts).

```{r}
gisco_id <- "PT_040921"
current_lau_sf <- ll_get_lau_eu(gisco_id = gisco_id)

intersect_sf <- ll_get_population_grid(
  year = 2018,
  match_sf = current_lau_sf,
  match_name = stringr::str_c(gisco_id,
                              "lau_2020",
                              "pop_grid_2018",
                              "intersects",
                              sep = "-"),
  join = sf::st_intersects,
  silent = TRUE
)


custom_ll_map_pop_grid(gisco_id = gisco_id,
                       pop_grid_sf = intersect_sf,
                       use_leaflet = TRUE) %>% 
  addAwesomeMarkers(data = 
ll_find_pop_centre(sf_location = current_lau_sf,
                   sf_population_grid = intersect_sf,
                   adjusted = TRUE),
                        popup="Adjusted pop-weighted",
                        icon = makeAwesomeIcon(icon = "home",
                                               markerColor = "green",
                                               library="ion"))
```

# Final notes

Alternative solutions for finding the centre include:

- increasing the disproportionate weight given to cells with larger population size (e.g. using power 5 instead of power 2), in order to facilitate coalescing on the most popoulous centre when more than one included
- use higher resolution population grids