discussion about buffer vs st_is_within_distance #16

defuneste · 2021-11-02T13:32:16Z

Not really an issue just a start of discussion about buffer versus sf::st_is_within_distance.

Context : reproducing an analysis done in PostGIS with R & sf

PostGIS in Action by Leo Hsu and Regina Obe came out with its third edition this year. My SQL and PostGIS were rusty so it was a good time to read it again and try to reproduce some of it in R. When I read a technical book, I like to try the exercises in another language I am more familiar with. It takes more time, but I feel like this is a more active way of learning that fits me better.

The Hello world of this book provides you with restaurants and roads in the US and asks "to find the number of fast-food restaurants within one mile of a highway". Restaurants will be represented as points and highways as linestrings. This can be applied to a lot of other topics (pollution sites and rivers for example).

You can download the data here. The first chapter ("What is a spatial database") is also available for free on the editor's website.

In this post, we will compare two ways of doing this task in R: using buffers and using sf::st_is_within_distance().

Loading library and data

You have two options to load the data. The first option is to use PostgreSQL/PostGIS to do it and put it in a database (DB) and then access it with the help of R's packages DBI and RporstgreSQL. The second option is simply to use R and sf.

We will start with the second, because it doesn't involve using SQL but we will provide you a quick way of using the first one with R afterward if you are curious.

Loading data with R and sf

library(sf)
library(dplyr)

# Loading restaurants,  =======================================================
# CSV doesn't have header
restaurants.dat <- read.csv("data/restaurants.csv",
                        header = FALSE,
                        col.names = c("id",
                                      "lat",
                                      "lon")) 

# classic lat/long column to sf see https://geocompr.robinlovelace.net/spatial-class.html                        
restaurants.shp <- sf::st_as_sf(restaurants.dat,
                                coords = c("lon", "lat"), 
                                crs = 4326)

# Here we used the same CRS recommended in PostGIS in action 
# EPSG:2163 is also a projected CRS which is important to calculate distance and do buffer
# see : https://geocompr.robinlovelace.net/reproj-geo-data.html#which-crs-to-use 
restaurants2163.shp <- sf::st_transform(restaurants.shp, 2163)

# Loading highways ============================================================

roads.shp <- sf::st_read("data/roadtrl020.shp",
                            crs = 4269)

# same projection as our restaurants
roads2163.shp <- sf::st_transform(roads.shp, 2163)

# Creating franchises table ===================================================
# this is not really needed for our example but PostGIS uses it so why not add it!
# this also shows one good DB practice: Normalisation  

franchises.dat <- data.frame(
    id = c("BKG",  "CJR",  "HDE", "INO", "JIB", "KFC", "MCD", "PZH", "TCB", "WDY"),
    franchise = c("Burger King", "Carl''s Jr", 
                  "Hardee", "In-N-Out",
                  "Jack in the Box",  "Kentucky Fried Chicken",
                  "McDonald", "Pizza Hut",
                  "Taco Bell", "Wendys"))

Now we have the 3 tables. You can explore them a bit. restaurants2163.shp has 50 002 features (points) and roads2163.shp has 47 014 features with 10 fields. We have roads but we want highways. In SQL it was done while populating the table with the "WHERE" clause:

INSERT INTO ch01.highways (gid, feature, name, state, geom)
SELECT gid, feature, name, state, ST_Transform(geom, 2163)
FROM ch01.highways_staging
WHERE feature LIKE 'Principal Highway%';

roads2163.shp has a field called feature where the type of road is stored (to mess with your GIS brain, where feature indicates an observation!). This is a bit tricky because you have Principal Highway but also Principal Highway Business Route and a bunch of other options. In PostgreSQL, % is a wildcard used with the LIKE operator to represent 0, 1 or multiple characters. In R we use grepl() with ^Principal Highway to create a vector with all the values needed and then we could use it with the %in% operator to return only the roads that match the pattern. In the end we get 16433 highways.

highway_type <- unique(roads2163.shp$FEATURE)

principal_highway <- highway_type[grepl(pattern = "^Principal Highway", 
                                        highway_type)]

highways2163.shp <- roads2163.shp[roads2163.shp$FEATURE %in% principal_highway,]

If we compare in terms of lines of code, doing it with R is shorter. This is thanks to sf and also because with PostgreSQL/PostGIS we have to define tables and relations between them (ie we have a strict data model). Both options have their pros & cons.

Loading data from PostGIS

If you followed the book's instructions and loaded data in a spatial database you can also access it with R and sf. Feel free to skip this part if you prefer to read about buffer versus sf::st_is_within_distance().

I usually create a file called code.R where I store the DB connection information (user, password, etc) and I add this file in .gitignore. Then you can create a small function that calls it and creates a connection to the DB. Here is an example (it can be improved!):

connect_to_db  <- function() {
# 1 Load library and code =====================================================
source("code.R", local = TRUE) # local = TRUE load in function envt

library(RPostgreSQL) 

# 2 establish connection  =====================================================

drv <- dbDriver("PostgreSQL")

# db, adress, usr, pwd are all defined in code.R 
con <- dbConnect(drv, 
                 dbname = db,
                 host = adress , 
                 # the default port for postgres I usually change it when the DB is not on localhost
                 port = 5432, 
                 user = usr, 
                 password = pwd) 

# The connection needs to be in globalenv to be used 
# If you have a better way of doing it I will take it! 
assign("con", con, envir = .GlobalEnv)

# 3 Print useful stuff  =====================================================

# list tables in the DB and check connection 
print(DBI::dbListTables(con))

# a reminder 
print("Reminder: use DBI::dbDisconnect(con)!")
}

After that, you can use your function to create a connection to the DB and query the data using sf.

connect_to_db()
# sf is awesome! This also illustrates how easily R interacts with other languages/systems. 
restaurants2163.shp <- sf::st_read(con, query = "select * from ch01.restaurants;")
highways2163.shp  <- sf::st_read(con, query = "select * from ch01.highways;")
franchises.dat <- DBI::dbGetQuery(con, statement = "select * from ch01.lu_franchises;")

# Be friendly with your sys admin and disconnect!
dbDisconnect(con)

Finding the number of fast-food restaurants within one mile of a highway!

The PostGIS way :

Let's first see the PostGIS way of doing it:

SELECT f.franchise
 , COUNT(DISTINCT r.id) As total -- <1>
FROM ch01.restaurants As r 
  INNER JOIN ch01.lu_franchises As f ON r.franchise = f.id
    INNER JOIN ch01.highways As h 
      ON ST_DWithin(r.geom, h.geom, 1609) -- <2>
GROUP BY f.franchise
ORDER BY total DESC;

The real "magic" is in <2> with the INNER JOIN. The ST_DWithin() function will return TRUE or FALSE if the geometry is within the distance. Yes, 1609 m is one mile for those like me who live in the metric system utopia. You have a nice trick on <1> with a DISTINCT inside a COUNT() to remove duplicate id's because one restaurant can be counted twice (or more) if it meets the condition on more than one highway. The rest is just either classic SQL organizing data or some aliasing needed in PostgreSQL.

This is the result:

franchise	total
McDonald	5343
Burger King	3049
Pizza Hut	2920
Wendys	2446
Taco Bell	2428
Kentucky Fried chicken	2371
Hardee	1077
Jack in the Box	509
Carl's Jr	224
In-N-Out	44

Thanks to \timing we know that this was done in 539,606 ms on my three year old laptop with 8GB of RAM. This is really quick but keep in mind most of the work, like adding a spatial index, was done before.

Doing it with R

At first we tried to do it with the sf function st_is_within_distance()

dist_to_check <- 1609

# I have run this code but you shouldn't unless you have plenty of time!
# timer <- Sys.time()
# the difficult part is the use of lengthS() <- see the S
# see: https://geocompr.robinlovelace.net/spatial-operations.html#topological-relations 
# near_or_not <- lengths(sf::st_is_within_distance(restaurants2163.shp,
#                                                              highways2163.shp,
#                                                              dist_to_check)) > 0
# 
# resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
# Timing_result <- difftime(Sys.time(),  timer)

It take us a bit less than 13 mins. A lot of time and it also crashed Rstudio, probably because I didn't have enough memory.

When you just care about how many points are within a specific distance you can also use some buffer: we can buffer the points and see which ones intersect our lines. We could also have buffered the road and counted points in it.

timer <- Sys.time()
# step 1 buffer
buffered_restaurants <- sf::st_buffer(restaurants2163.shp, dist = dist_to_check) 
# step 2 buffered restaurant intersecting a highway
resto_near_road <- buffered_restaurants[highways2163.shp,]
Timing_result <- difftime(Sys.time(),  timer)

This was just a bit more than 6 seconds. Yes, what a difference!

We then just need to use some dplyr and we are done:

# this is just tidyverse go brrr!
resto_near_road %>% 
    sf::st_drop_geometry() %>% 
    dplyr::left_join(franchises.dat, 
                     by = c("franchise" = "id")) %>% 
    dplyr::group_by(franchise) %>% 
    dplyr::summarise(total = dplyr::n()) %>% 
    dplyr::arrange(dplyr::desc(total))

If we compare that result with the one we get with Postgis and the one we get from st_is_within_distance().We have one more Wendy's restaurants in st_is_within_distance() and postgis than with the buffer's way.

A more accurate benchmark?

To produce a bit more accurate benchmark we will need to take a sample of this data set. Prototyping our scripts on a smaller subset is also a good practice. For that we will use the library spData to get the US States and just look at the fast-foods restaurants and highways in Utah.

library(spData)

Utah <- us_states[us_states$NAME == "Utah",]
Utah <- sf::st_transform(Utah, 2163)

resto_utah.shp <- restaurants2163.shp[Utah,]
road_utah.shp <- highways2163.shp[Utah,]

Now we can use the microbenchmark package to get better indicators than our sys.time() tricks.

library(microbenchmark)

microbenchmark(
    buffer = {
        buffered_restaurants <- sf::st_buffer(resto_utah.shp, dist = dist_to_check) 
        resto_near_road <- buffered_restaurants[road_utah.shp,]
    },
    within_distance ={
        near_or_not <- lengths(sf::st_is_within_distance(resto_utah.shp,
                                                             road_utah.shp,
                                                             dist_to_check)) > 0
        resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
    },
    times = 100
 )

expr	min	lq	mean	median	uq	max	neval
buffer	118.7830	123.9558	132.3502	125.8452	132.0432	235.9454	100
with_distance	618.5249	633.8456	672.2927	640.7561	668.1274	1149.8452	100

The buffer version is 5 times quicker! So if you just want to know if some points are close enough to other objects, it is better to do it with a buffer. If you need to have a list containing each index of objects matching the predicate for every point (way more information), you should use the st_is_within_distance() functions.

The text was updated successfully, but these errors were encountered:

defuneste · 2022-03-14T15:29:24Z

Stuff TODO to improve it:

R code to download data
sql script to create table with timing
sql script to load data with timing
Use EXPLAIN and see if we can do better with postgis see https://blog.crunchydata.com/blog/performance-and-spatial-joins
Loading data in R, add {sfeader} and add test with {sp} -> {sf} see https://dewey.dunnington.ca/post/2022/profiling-point-in-polygon-joins-in-r/
see if we can improve either the buffer or st_is_within_distance path with {geos}

Robinlovelace · 2022-04-02T20:35:39Z

Many thanks for starting this discussion @defuneste and apologies for the slow reply. I think this is worthy of a blog post for sure and am happy to take steps in that direction, the results are fascinating. I plan to draft a PR implementing your code, does that sound like a plan?

Robinlovelace · 2022-04-03T09:25:20Z

Progress update, in the buffer-post branch I've got this going, think it will make a great blog post. Could do a quick audio chat/pair programming session to get it moving at some point. Looking forward to it, will be a learning experience!

Nowosad assigned Robinlovelace Nov 18, 2021

Robinlovelace removed their assignment Apr 2, 2022

Robinlovelace added a commit that referenced this issue Apr 2, 2022

Buffer post for #16

914f3d7

Robinlovelace added a commit that referenced this issue Apr 5, 2022

Big progress on #16

2d7df2d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

discussion about buffer vs st_is_within_distance #16

discussion about buffer vs st_is_within_distance #16

defuneste commented Nov 2, 2021

defuneste commented Mar 14, 2022

Robinlovelace commented Apr 2, 2022

Robinlovelace commented Apr 3, 2022

discussion about buffer vs st_is_within_distance #16

discussion about buffer vs st_is_within_distance #16

Comments

defuneste commented Nov 2, 2021

Context : reproducing an analysis done in PostGIS with R & sf

Loading library and data

Loading data with R and sf

Loading data from PostGIS

Finding the number of fast-food restaurants within one mile of a highway!

The PostGIS way :

Doing it with R

A more accurate benchmark?

defuneste commented Mar 14, 2022

Robinlovelace commented Apr 2, 2022

Robinlovelace commented Apr 3, 2022