discussion about buffer vs st_is_within_distance #16
Comments
Stuff TODO to improve it:
Many thanks for starting this discussion @defuneste and apologies for the slow reply. I think this is worthy of a blog post for sure and am happy to take steps in that direction, the results are fascinating. I plan to draft a PR implementing your code, does that sound like a plan?
Not really an issue, just the start of a discussion about buffer versus sf::st_is_within_distance().
Context: reproducing an analysis done in PostGIS with R & sf
PostGIS in Action by Leo Hsu and Regina Obe came out with its third edition this year. My SQL and PostGIS were rusty so it was a good time to read it again and try to reproduce some of it in R. When I read a technical book, I like to try the exercises in another language I am more familiar with. It takes more time, but I feel like this is a more active way of learning that fits me better.
The Hello world of this book provides you with restaurants and roads in the US and asks "to find the number of fast-food restaurants within one mile of a highway". Restaurants will be represented as points and highways as linestrings. This can be applied to a lot of other topics (pollution sites and rivers for example).
You can download the data here. The first chapter ("What is a spatial database") is also available for free on the publisher's website.
In this post, we will compare two ways of doing this task in R: using buffers and using sf::st_is_within_distance().
Loading library and data
You have two options to load the data. The first option is to use PostgreSQL/PostGIS to load it into a database (DB) and then access it with the help of the R packages DBI and RPostgreSQL. The second option is simply to use R and sf.
We will start with the second, because it doesn't involve using SQL but we will provide you a quick way of using the first one with R afterward if you are curious.
Loading data with R and sf
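The original loading code is not reproduced here, so what follows is only a minimal sketch of the idea: read a shapefile with sf::st_read() and keep the highway types with the grepl() and %in% filter described further down. To stay runnable without the book's files, it first writes a tiny toy roads layer to a temporary folder; the folder, the toy geometries and the "Local Road" value are stand-ins for illustration, not real data.

```r
library(sf)

# Stand-in for the unzipped book data: a tiny roads layer written to a
# temporary folder so the sketch runs without the real files (toy values).
data_dir <- tempfile("toy_roads_")
dir.create(data_dir)
toy_roads <- st_sf(
  feature = c("Principal Highway",
              "Principal Highway Business Route",
              "Local Road"),
  geometry = st_sfc(
    st_linestring(rbind(c(0, 0), c(1, 1))),
    st_linestring(rbind(c(1, 1), c(2, 1))),
    st_linestring(rbind(c(2, 1), c(3, 3)))
  )
)
st_write(toy_roads, file.path(data_dir, "roads2163.shp"), quiet = TRUE)

# Read the shapefile the same way one would read the real roads2163.shp.
roads <- st_read(file.path(data_dir, "roads2163.shp"), quiet = TRUE)

# Collect every road type that starts with "Principal Highway"
# (the same idea as a LIKE pattern ending in % in SQL) ...
road_types <- unique(roads$feature)
highway_types <- road_types[grepl("^Principal Highway", road_types)]

# ... and keep only the roads whose type is in that vector.
highways <- roads[roads$feature %in% highway_types, ]
nrow(highways)
```

With the real data, data_dir would simply point at the folder holding the downloaded shapefiles, and the write step would disappear.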
Now we have the 3 tables. You can explore them a bit. restaurants2163.shp has 50,002 features (points) and roads2163.shp has 47,014 features with 10 fields.
We have roads but we want highways. In SQL this was done while populating the table, with a WHERE clause.
roads2163.shp has a field called feature where the type of road is stored (to mess with your GIS brain, where a feature usually indicates an observation!). This is a bit tricky because you have Principal Highway but also Principal Highway Business Route and a bunch of other options. In PostgreSQL, % is a wildcard used with the LIKE operator to represent zero, one or multiple characters. In R we use grepl() with ^Principal Highway to create a vector with all the values needed, and then use it with the %in% operator to return only the roads that match the pattern. In the end we get 16,433 highways.
If we compare in terms of lines of code, doing it with R is shorter. This is thanks to sf and also because with PostgreSQL/PostGIS we have to define tables and the relations between them (i.e. we have a strict data model). Both options have their pros and cons.
Loading data from PostGIS
If you followed the book's instructions and loaded the data in a spatial database, you can also access it with R and sf. Feel free to skip this part if you prefer to read about buffer versus sf::st_is_within_distance().
I usually create a file called code.R where I store the DB connection information (user, password, etc.) and I add this file to .gitignore. Then you can create a small function that calls it and creates a connection to the DB. Here is an example (it can be improved!):
After that, you can use your function to create a connection to the DB and query the data using sf.
Finding the number of fast-food restaurants within one mile of a highway!
The PostGIS way:
Let's first see the PostGIS way of doing it:
The real "magic" is in <2> with the INNER JOIN. The ST_DWithin() function returns TRUE if the two geometries are within the given distance of each other, and FALSE otherwise. Yes, 1609 m is one mile for those like me who live in the metric system utopia. There is a nice trick in <1>, with a DISTINCT inside a COUNT() to remove duplicate ids, because one restaurant can be counted twice (or more) if it meets the condition for more than one highway. The rest is either classic SQL for organizing data or some aliasing needed in PostgreSQL.
This is the result:
Thanks to \timing we know that this was done in 539.606 ms (about half a second) on my three-year-old laptop with 8 GB of RAM. This is really quick, but keep in mind that most of the work, like adding a spatial index, was done beforehand.
Doing it with R
At first we tried to do it with the sf function st_is_within_distance().
It took a bit less than 13 minutes. That is a lot of time, and it also crashed RStudio, probably because I didn't have enough memory.
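As a minimal sketch of this approach (not the original code, which ran on the full data set), here is st_is_within_distance() on three toy restaurants and two toy highways. The coordinates are made up and treated as planar meters, and the object names are assumptions for illustration.

```r
library(sf)

# Toy data in planar coordinates, read as meters (an assumption).
restaurants <- st_sf(
  id = 1:3,
  geometry = st_sfc(
    st_point(c(0, 500)),      # 500 m from highway A only
    st_point(c(1000, 1000)),  # 1000 m from both highways
    st_point(c(0, 5000))      # far from both highways
  )
)
highways <- st_sf(
  name = c("A", "B"),
  geometry = st_sfc(
    st_linestring(rbind(c(-2000, 0), c(2000, 0))),
    st_linestring(rbind(c(2000, 0), c(2000, 3000)))
  )
)

# For each restaurant, the indices of the highways within 1609 m (one mile).
near <- st_is_within_distance(restaurants, highways, dist = 1609)

# A restaurant counts once, however many highways it is close to.
is_near <- lengths(near) > 0
sum(is_near)
```

The second restaurant matches both highways, so lengths(near) is 2 for it. Testing lengths(near) > 0 plays the same role as the DISTINCT inside COUNT() in the SQL version.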
When you just care about how many points are within a specific distance, you can also use buffers: we can buffer the points and see which ones intersect our lines. We could also have buffered the roads and counted the points inside them.
This took just a bit more than 6 seconds. Yes, what a difference!
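The buffer idea can be sketched on toy data as follows. This is an illustration under assumed names and made-up planar coordinates (read as meters), not the code that produced the timing above.

```r
library(sf)

# Toy restaurants (points) and highways (lines); coordinates are made up.
restaurants <- st_sf(
  id = 1:3,
  geometry = st_sfc(
    st_point(c(0, 500)),
    st_point(c(1000, 1000)),
    st_point(c(0, 5000))
  )
)
highways <- st_sf(
  name = c("A", "B"),
  geometry = st_sfc(
    st_linestring(rbind(c(-2000, 0), c(2000, 0))),
    st_linestring(rbind(c(2000, 0), c(2000, 3000)))
  )
)

# Turn each restaurant into a 1609 m disc (one mile) ...
buffers <- st_buffer(restaurants, dist = 1609)

# ... and keep the restaurants whose disc touches at least one highway.
hits <- st_intersects(buffers, highways)
restaurants_near <- restaurants[lengths(hits) > 0, ]
nrow(restaurants_near)
```

Buffering the highways instead and testing which points fall inside the resulting polygons would be the mirror image of this sketch.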
We then just need to use some dplyr and we are done:
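A sketch of what that dplyr step could look like, assuming the restaurants kept by the buffer step carry a franchise column. The column name, the output column total and the toy rows are assumptions for illustration.

```r
library(sf)
library(dplyr)

# Hypothetical output of the buffer step: restaurants near a highway,
# with an assumed `franchise` column (toy rows, not real results).
restaurants_near <- st_sf(
  franchise = c("Wendy's", "Wendy's", "Burger King"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# Drop the geometry and count restaurants per franchise, largest first.
counts <- restaurants_near %>%
  st_drop_geometry() %>%
  count(franchise, name = "total", sort = TRUE)
counts
```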
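If we compare that result with the ones we get from PostGIS and from st_is_within_distance(), we have one more Wendy's restaurant with st_is_within_distance() and PostGIS than with the buffer approach. One plausible cause is that st_buffer() approximates each circle with a polygon (30 segments per quadrant by default), so the buffer is very slightly smaller than a true 1609 m circle.
A more accurate benchmark?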
To produce a slightly more accurate benchmark, we will take a sample of this data set. Prototyping our scripts on a smaller subset is also good practice. For that we will use the spData package to get the US states and just look at the fast-food restaurants and highways in Utah.
Now we can use the microbenchmark package to get better indicators than our Sys.time() tricks.
The buffer version is 5 times quicker! So if you just want to know whether some points are close enough to other objects, it is better to do it with a buffer. If you need a list containing, for every point, the index of each object matching the predicate (way more information), you should use the st_is_within_distance() function.
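For readers who want to try the comparison themselves, here is a minimal sketch of a microbenchmark run on random toy data. The points, lines and helper function names are made up for illustration; timings will depend on the machine and on the installed versions of sf and GEOS, and need not match the ratio reported above.

```r
library(sf)
library(microbenchmark)

# Toy data: 200 random points and two lines, planar coordinates in meters.
set.seed(42)
pts <- st_as_sf(
  data.frame(x = runif(200, 0, 10000), y = runif(200, 0, 10000)),
  coords = c("x", "y")
)
lines <- st_sf(geometry = st_sfc(
  st_linestring(rbind(c(0, 2000), c(10000, 2500))),
  st_linestring(rbind(c(3000, 0), c(3500, 10000)))
))

# Both helpers return one TRUE/FALSE per point (hypothetical names).
within_distance <- function() {
  lengths(st_is_within_distance(pts, lines, dist = 1609)) > 0
}
buffer_intersects <- function() {
  lengths(st_intersects(st_buffer(pts, dist = 1609), lines)) > 0
}

# Run each approach 10 times and summarise the timings.
bench <- microbenchmark(
  within_distance = within_distance(),
  buffer = buffer_intersects(),
  times = 10
)
print(bench)
```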