Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discussion about buffer vs st_is_within_distance #16

Open
defuneste opened this issue Nov 2, 2021 · 3 comments
Open

discussion about buffer vs st_is_within_distance #16

defuneste opened this issue Nov 2, 2021 · 3 comments

Comments

@defuneste
Copy link

Not really an issue just a start of discussion about buffer versus sf::st_is_within_distance.

Context : reproducing an analysis done in PostGIS with R & sf

PostGIS in Action by Leo Hsu and Regina Obe came out with its third edition this year. My SQL and PostGIS were rusty so it was a good time to read it again and try to reproduce some of it in R. When I read a technical book, I like to try the exercises in another language I am more familiar with. It takes more time, but I feel like this is a more active way of learning that fits me better.

The Hello world of this book provides you with restaurants and roads in the US and asks "to find the number of fast-food restaurants within one mile of a highway". Restaurants will be represented as points and highways as linestrings. This can be applied to a lot of other topics (pollution sites and rivers for example).

You can download the data here. The first chapter ("What is a spatial database") is also available for free on the editor's website.

In this post, we will compare two ways of doing this task in R: using buffers and using sf::st_is_within_distance().

Loading library and data

You have two options to load the data. The first option is to use PostgreSQL/PostGIS to do it and put it in a database (DB) and then access it with the help of R's packages DBI and RporstgreSQL. The second option is simply to use R and sf.

We will start with the second, because it doesn't involve using SQL but we will provide you a quick way of using the first one with R afterward if you are curious.

Loading data with R and sf

library(sf)
library(dplyr)

# Loading restaurants,  =======================================================
# CSV doesn't have header
restaurants.dat <- read.csv("data/restaurants.csv",
                        header = FALSE,
                        col.names = c("id",
                                      "lat",
                                      "lon")) 

# classic lat/long column to sf see https://geocompr.robinlovelace.net/spatial-class.html                        
restaurants.shp <- sf::st_as_sf(restaurants.dat,
                                coords = c("lon", "lat"), 
                                crs = 4326)

# Here we used the same CRS recommended in PostGIS in action 
# EPSG:2163 is also a projected CRS which is important to calculate distance and do buffer
# see : https://geocompr.robinlovelace.net/reproj-geo-data.html#which-crs-to-use 
restaurants2163.shp <- sf::st_transform(restaurants.shp, 2163)

# Loading highways ============================================================

roads.shp <- sf::st_read("data/roadtrl020.shp",
                            crs = 4269)

# same projection as our restaurants
roads2163.shp <- sf::st_transform(roads.shp, 2163)

# Creating franchises table ===================================================
# this is not really needed for our example but PostGIS uses it so why not add it!
# this also shows one good DB practice: Normalisation  

franchises.dat <- data.frame(
    id = c("BKG",  "CJR",  "HDE", "INO", "JIB", "KFC", "MCD", "PZH", "TCB", "WDY"),
    franchise = c("Burger King", "Carl''s Jr", 
                  "Hardee", "In-N-Out",
                  "Jack in the Box",  "Kentucky Fried Chicken",
                  "McDonald", "Pizza Hut",
                  "Taco Bell", "Wendys"))

Now we have the 3 tables. You can explore them a bit. restaurants2163.shp has 50 002 features (points) and roads2163.shp has 47 014 features with 10 fields. We have roads but we want highways. In SQL it was done while populating the table with the "WHERE" clause:

INSERT INTO ch01.highways (gid, feature, name, state, geom)
SELECT gid, feature, name, state, ST_Transform(geom, 2163)
FROM ch01.highways_staging
WHERE feature LIKE 'Principal Highway%';

roads2163.shp has a field called feature where the type of road is stored (to mess with your GIS brain, where feature indicates an observation!). This is a bit tricky because you have Principal Highway but also Principal Highway Business Route and a bunch of other options. In PostgreSQL, % is a wildcard used with the LIKE operator to represent 0, 1 or multiple characters. In R we use grepl() with ^Principal Highway to create a vector with all the values needed and then we could use it with the %in% operator to return only the roads that match the pattern. In the end we get 16433 highways.

highway_type <- unique(roads2163.shp$FEATURE)

principal_highway <- highway_type[grepl(pattern = "^Principal Highway", 
                                        highway_type)]

highways2163.shp <- roads2163.shp[roads2163.shp$FEATURE %in% principal_highway,] 

If we compare in terms of lines of code, doing it with R is shorter. This is thanks to sf and also because with PostgreSQL/PostGIS we have to define tables and relations between them (ie we have a strict data model). Both options have their pros & cons.

Loading data from PostGIS

If you followed the book's instructions and loaded data in a spatial database you can also access it with R and sf. Feel free to skip this part if you prefer to read about buffer versus sf::st_is_within_distance().

I usually create a file called code.R where I store the DB connection information (user, password, etc) and I add this file in .gitignore. Then you can create a small function that calls it and creates a connection to the DB. Here is an example (it can be improved!):

connect_to_db  <- function() {
# 1 Load library and code =====================================================
source("code.R", local = TRUE) # local = TRUE load in function envt

library(RPostgreSQL) 

# 2 establish connection  =====================================================

drv <- dbDriver("PostgreSQL")

# db, adress, usr, pwd are all defined in code.R 
con <- dbConnect(drv, 
                 dbname = db,
                 host = adress , 
                 # the default port for postgres I usually change it when the DB is not on localhost
                 port = 5432, 
                 user = usr, 
                 password = pwd) 

# The connection needs to be in globalenv to be used 
# If you have a better way of doing it I will take it! 
assign("con", con, envir = .GlobalEnv)

# 3 Print useful stuff  =====================================================

# list tables in the DB and check connection 
print(DBI::dbListTables(con))

# a reminder 
print("Reminder: use DBI::dbDisconnect(con)!")
}

After that, you can use your function to create a connection to the DB and query the data using sf.

connect_to_db()
# sf is awesome! This also illustrates how easily R interacts with other languages/systems. 
restaurants2163.shp <- sf::st_read(con, query = "select * from ch01.restaurants;")
highways2163.shp  <- sf::st_read(con, query = "select * from ch01.highways;")
franchises.dat <- DBI::dbGetQuery(con, statement = "select * from ch01.lu_franchises;")

# Be friendly with your sys admin and disconnect!
dbDisconnect(con)

Finding the number of fast-food restaurants within one mile of a highway!

The PostGIS way :

Let's first see the PostGIS way of doing it:

SELECT f.franchise
 , COUNT(DISTINCT r.id) As total -- <1>
FROM ch01.restaurants As r 
  INNER JOIN ch01.lu_franchises As f ON r.franchise = f.id
    INNER JOIN ch01.highways As h 
      ON ST_DWithin(r.geom, h.geom, 1609) -- <2>
GROUP BY f.franchise
ORDER BY total DESC;

The real "magic" is in <2> with the INNER JOIN. The ST_DWithin() function will return TRUE or FALSE if the geometry is within the distance. Yes, 1609 m is one mile for those like me who live in the metric system utopia. You have a nice trick on <1> with a DISTINCT inside a COUNT() to remove duplicate id's because one restaurant can be counted twice (or more) if it meets the condition on more than one highway. The rest is just either classic SQL organizing data or some aliasing needed in PostgreSQL.

This is the result:

franchise total
McDonald 5343
Burger King 3049
Pizza Hut 2920
Wendys 2446
Taco Bell 2428
Kentucky Fried chicken 2371
Hardee 1077
Jack in the Box 509
Carl's Jr 224
In-N-Out 44

Thanks to \timing we know that this was done in 539,606 ms on my three year old laptop with 8GB of RAM. This is really quick but keep in mind most of the work, like adding a spatial index, was done before.

Doing it with R

At first we tried to do it with the sf function st_is_within_distance()

dist_to_check <- 1609

# I have run this code but you shouldn't unless you have plenty of time!
# timer <- Sys.time()
# the difficult part is the use of lengthS() <- see the S
# see: https://geocompr.robinlovelace.net/spatial-operations.html#topological-relations 
# near_or_not <- lengths(sf::st_is_within_distance(restaurants2163.shp,
#                                                              highways2163.shp,
#                                                              dist_to_check)) > 0
# 
# resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
# Timing_result <- difftime(Sys.time(),  timer)

It take us a bit less than 13 mins. A lot of time and it also crashed Rstudio, probably because I didn't have enough memory.

When you just care about how many points are within a specific distance you can also use some buffer: we can buffer the points and see which ones intersect our lines. We could also have buffered the road and counted points in it.

timer <- Sys.time()
# step 1 buffer
buffered_restaurants <- sf::st_buffer(restaurants2163.shp, dist = dist_to_check) 
# step 2 buffered restaurant intersecting a highway
resto_near_road <- buffered_restaurants[highways2163.shp,]
Timing_result <- difftime(Sys.time(),  timer)

This was just a bit more than 6 seconds. Yes, what a difference!

We then just need to use some dplyr and we are done:

# this is just tidyverse go brrr!
resto_near_road %>% 
    sf::st_drop_geometry() %>% 
    dplyr::left_join(franchises.dat, 
                     by = c("franchise" = "id")) %>% 
    dplyr::group_by(franchise) %>% 
    dplyr::summarise(total = dplyr::n()) %>% 
    dplyr::arrange(dplyr::desc(total))

If we compare that result with the one we get with Postgis and the one we get from st_is_within_distance().We have one more Wendy's restaurants in st_is_within_distance() and postgis than with the buffer's way.

A more accurate benchmark?

To produce a bit more accurate benchmark we will need to take a sample of this data set. Prototyping our scripts on a smaller subset is also a good practice. For that we will use the library spData to get the US States and just look at the fast-foods restaurants and highways in Utah.

library(spData)

Utah <- us_states[us_states$NAME == "Utah",]
Utah <- sf::st_transform(Utah, 2163)

resto_utah.shp <- restaurants2163.shp[Utah,]
road_utah.shp <- highways2163.shp[Utah,]

Now we can use the microbenchmark package to get better indicators than our sys.time() tricks.

library(microbenchmark)

microbenchmark(
    buffer = {
        buffered_restaurants <- sf::st_buffer(resto_utah.shp, dist = dist_to_check) 
        resto_near_road <- buffered_restaurants[road_utah.shp,]
    },
    within_distance ={
        near_or_not <- lengths(sf::st_is_within_distance(resto_utah.shp,
                                                             road_utah.shp,
                                                             dist_to_check)) > 0
        resto_near_roadv2 <- restaurants2163.shp[near_or_not,]
    },
    times = 100
 )
expr min lq mean median uq max neval
buffer 118.7830 123.9558 132.3502 125.8452 132.0432 235.9454 100
with_distance 618.5249 633.8456 672.2927 640.7561 668.1274 1149.8452 100

The buffer version is 5 times quicker! So if you just want to know if some points are close enough to other objects, it is better to do it with a buffer. If you need to have a list containing each index of objects matching the predicate for every point (way more information), you should use the st_is_within_distance() functions.

@defuneste
Copy link
Author

Stuff TODO to improve it:

@Robinlovelace Robinlovelace removed their assignment Apr 2, 2022
@Robinlovelace
Copy link
Member

Many thanks for starting this discussion @defuneste and apologies for the slow reply. I think this is worthy of a blog post for sure and am happy to take steps in that direction, the results are fascinating. I plan to draft a PR implementing your code, does that sound like a plan?

Robinlovelace added a commit that referenced this issue Apr 2, 2022
@Robinlovelace
Copy link
Member

Progress update, in the buffer-post branch I've got this going, think it will make a great blog post. Could do a quick audio chat/pair programming session to get it moving at some point. Looking forward to it, will be a learning experience!

image

Robinlovelace added a commit that referenced this issue Apr 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants