diff --git a/02-Feature-representation_files/figure-html/unnamed-chunk-4-1.png b/02-Feature-representation_files/figure-html/unnamed-chunk-4-1.png index 6122b11..95c69a5 100755 Binary files a/02-Feature-representation_files/figure-html/unnamed-chunk-4-1.png and b/02-Feature-representation_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/10-Map-Algebra_files/figure-html/f10-local02-1.png b/10-Map-Algebra_files/figure-html/f10-local02-1.png index b191957..4f023b9 100755 Binary files a/10-Map-Algebra_files/figure-html/f10-local02-1.png and b/10-Map-Algebra_files/figure-html/f10-local02-1.png differ diff --git a/14-Spatial-Interpolation_files/figure-html/f14-idw1-1.png b/14-Spatial-Interpolation_files/figure-html/f14-idw1-1.png index bf6effd..3f69b3a 100755 Binary files a/14-Spatial-Interpolation_files/figure-html/f14-idw1-1.png and b/14-Spatial-Interpolation_files/figure-html/f14-idw1-1.png differ diff --git a/14-Spatial-Interpolation_files/figure-html/f14-idw2-1.png b/14-Spatial-Interpolation_files/figure-html/f14-idw2-1.png index a61e000..fb75bcb 100755 Binary files a/14-Spatial-Interpolation_files/figure-html/f14-idw2-1.png and b/14-Spatial-Interpolation_files/figure-html/f14-idw2-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-21-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-21-1.png index 59ff8cf..dba7671 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-21-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-21-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-22-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-22-1.png index 733c87d..2af2d0b 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-22-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-22-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-38-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-38-1.png index c07e8be..023804c 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-38-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-38-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-39-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-39-1.png index 9843cc9..be9c30c 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-39-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-39-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-41-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-41-1.png index 95c1981..3469d3f 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-41-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-41-1.png differ diff --git a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-42-1.png b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-42-1.png index a996964..3da6efb 100755 Binary files a/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-42-1.png and b/A08-Point-pattern-analysis_files/figure-html/unnamed-chunk-42-1.png differ diff --git a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-16-1.png b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-16-1.png index 75d5356..12a7cb2 100755 Binary files 
a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-16-1.png and b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-20-1.png b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-20-1.png index f21c2f7..9020681 100755 Binary files a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-20-1.png and b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-20-1.png differ diff --git a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-25-1.png b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-25-1.png index ab0079b..0980a91 100755 Binary files a/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-25-1.png and b/A09-Spatial-Autocorrelation_files/figure-html/unnamed-chunk-25-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-10-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-10-1.png index 208ed50..0aaee0a 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-10-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-10-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-12-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-12-1.png index 13a7f16..aa10b3f 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-12-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-12-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-13-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-13-1.png index 0031fd6..e904c3c 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-13-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-13-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-14-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-14-1.png index 19ba3ff..e7dcc61 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-14-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-14-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-5-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-5-1.png index 513cf51..a1501ac 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-5-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/A10-Interpolation_files/figure-html/unnamed-chunk-9-1.png b/A10-Interpolation_files/figure-html/unnamed-chunk-9-1.png index 99d6272..5189014 100755 Binary files a/A10-Interpolation_files/figure-html/unnamed-chunk-9-1.png and b/A10-Interpolation_files/figure-html/unnamed-chunk-9-1.png differ diff --git a/coordinate-systems-in-r.html b/coordinate-systems-in-r.html index b442ba9..ff8cb3e 100755 --- a/coordinate-systems-in-r.html +++ b/coordinate-systems-in-r.html @@ -1332,8 +1332,8 @@
Next, we’ll explore other transformations using a tmap
dataset of the world
library(tmap)
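The hunk above only shows surrounding context, so the actual code change is not visible. As a rough, hedged illustration of the kind of transformation being discussed (not the book's actual code; the Robinson projection is an arbitrary choice for the sketch), reprojecting the World dataset bundled with tmap could look like this:

library(sf)
library(tmap)
data(World)                                   # sf polygon layer of countries shipped with tmap
st_crs(World)                                 # inspect the layer's current coordinate system
World_robin <- st_transform(World, crs = "+proj=robin")  # reproject to a Robinson projection (illustrative choice)
tm_shape(World_robin) + tm_polygons()         # quick visual check of the reprojected layer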
diff --git a/index.html b/index.html
index b7796e4..71e4752 100755
--- a/index.html
+++ b/index.html
@@ -580,7 +580,7 @@
Intro to GIS and Spatial Analysis
-Last edited on 2023-11-06
+Last edited on 2023-11-13
Preface
diff --git a/point-pattern-analysis-in-r.html b/point-pattern-analysis-in-r.html
index 1ec52f7..3f6d142 100755
--- a/point-pattern-analysis-in-r.html
+++ b/point-pattern-analysis-in-r.html
@@ -802,7 +802,7 @@ Kernel density adjusted for covariate plot(K1_vs_pred$pred ~ K1_vs_pred$K1, pch=20,
diff --git a/reading-and-writing-spatial-data-in-r.html b/reading-and-writing-spatial-data-in-r.html
index 35f3a52..1298a92 100755
--- a/reading-and-writing-spatial-data-in-r.html
+++ b/reading-and-writing-spatial-data-in-r.html
@@ -956,7 +956,7 @@ Geocoding street addresses … geocoding service for creating lat/lon values from a file of US street addresses. This needs to be completed via their web interface and the resulting data table (a CSV file) would then need to be loaded into R as a data frame.
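Since the hunk above shows only context, here is a minimal sketch (the file name and column names are assumptions, not from the book) of loading such a geocoded CSV into R and promoting it to a spatial point layer with sf:

library(sf)
addr <- read.csv("geocoded_addresses.csv")     # hypothetical CSV returned by the geocoding service
addr_sf <- st_as_sf(addr,
                    coords = c("lon", "lat"),  # assumed longitude/latitude column names
                    crs = 4326)                # WGS84 geographic coordinates
head(addr_sf)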
diff --git a/search_index.json b/search_index.json
index ad1975c..daa37d7 100755
--- a/search_index.json
+++ b/search_index.json
@@ -1 +1 @@
-[["index.html", "Intro to GIS and Spatial Analysis Preface", " Intro to GIS and Spatial Analysis Manuel Gimond Last edited on 2023-11-06 Preface 2023 UPDATE: Removed dependence on rgdal and maptools in Appendices Added Statistical Maps chapter (wrapped confidence maps into this chapter) 2021 UPDATE: This book has been updated for the 2021-2022 academic year. Most changes are in the Appendix and pertain to the sf ecosystem. This includes changes in the mapping appendix, and coordinate systems appendix. This also includes a new appendix that describes the simple feature anatomy and step-by-step instructions on creating new geometries from scratch. These pages are a compilation of lecture notes for my Introduction to GIS and Spatial Analysis course (ES214). They are ordered in such a way to follow the course outline, but most pages can be read in any desirable order. The course (and this book) is split into two parts: data manipulation & visualization and exploratory spatial data analysis. The first part of this book is usually conducted using ArcGIS Desktop whereas the latter part of the book is conducted in R. ArcGIS was chosen as the GIS data manipulation environment because of its “desirability” in job applications for undergraduates in the Unites States. But other GIS software environments, such as the open source software QGIS, could easily be adopted in lieu of ArcGIS–even R can be used to perform many spatial data manipulations such as clipping, buffering and projecting. Even though some of the chapters of this book make direct reference to ArcGIS techniques, most chapters can be studied without access to the software. The latter part of this book (and the course) make heavy use of R because of a) its broad appeal in the world of data analysis b) its rich (if not richest) array of spatial analysis and spatial statistics packages c) its scripting environment (which facilitates reproducibility) d) and its very cheap cost (it’s completely free and open source!). But R can be used for many traditional “GIS” application that involve most data manipulation operations–the only benefit in using a full-blown GIS environment like ArcGIS or QGIS is in creating/editing spatial data, rendering complex maps and manipulating spatial data. The Appendix covers various aspects of spatial data manipulation and analysis using R. The course only focuses on point pattern analysis and spatial autocorrelation using R, but I’ve added other R resources for students wishing to expand their GIS skills using R. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. "],["introGIS.html", "Chapter 1 Introduction to GIS 1.1 What is a GIS? 1.2 What is Spatial Analysis? 1.3 What’s in an Acronym?", " Chapter 1 Introduction to GIS 1.1 What is a GIS? A Geographic Information System is a multi-component environment used to create, manage, visualize and analyze data and its spatial counterpart. It’s important to note that most datasets you will encounter in your lifetime can all be assigned a spatial location whether on the earth’s surface or within some arbitrary coordinate system (such as a soccer field or a gridded petri dish). So in essence, any dataset can be represented in a GIS: the question then becomes “does it need to be analyzed in a GIS environment?” The answer to this question depends on the purpose of the analysis. 
If, for example, we are interested in identifying the ten African countries with the highest conflict index scores for the 1966-78 period, a simple table listing those scores by country is all that is needed. Table 1.1: Index of total African conflict for the 1966-78 period (Anselin and O’Loughlin 1992). Country Conflicts Country Conflicts EGYPT 5246 LIBERIA 980 SUDAN 4751 SENEGAL 933 UGANDA 3134 CHAD 895 ZAIRE 3087 TOGO 848 TANZANIA 2881 GABON 824 LIBYA 2355 MAURITANIA 811 KENYA 2273 ZIMBABWE 795 SOMALIA 2122 MOZAMBIQUE 792 ETHIOPIA 1878 IVORY COAST 758 SOUTH AFRICA 1875 MALAWI 629 MOROCCO 1861 CENTRAL AFRICAN REPUBLIC 618 ZAMBIA 1554 CAMEROON 604 ANGOLA 1528 BURUNDI 604 ALGERIA 1421 RWANDA 487 TUNISIA 1363 SIERRA LEONE 423 BOTSWANA 1266 LESOTHO 363 CONGO 1142 NIGER 358 NIGERIA 1130 BURKINA FASO 347 GHANA 1090 MALI 299 GUINEA 1015 THE GAMBIA 241 BENIN 998 SWAZILAND 147 Data source: Anselin, L. and John O’Loughlin. 1992. Geography of international conflict and cooperation: spatial dependence and regional context in Africa. In The New Geopolitics, ed. M. Ward, pp. 39-75. A simple sort on the Conflicts column reveals that EGYPT, SUDAN, UGANDA, ZAIRE, TANZANIA, LIBYA, KENYA, SOMALIA, ETHIOPIA, SOUTH AFRICA are the top ten countries. What if we are interested in knowing whether countries with a high conflict index score are geographically clustered? Does the above table provide us with enough information to help answer this question? The answer, of course, is no. We need additional data pertaining to the geographic location and shape of each country. A map of the countries would be helpful. Figure 1.1: Choropleth representation of African conflict index scores. Countries for which a score was not available are not mapped. Maps are ubiquitous: available online and in various print media. But how often do we ask how the boundaries of the map features are encoded in a computing environment? After all, if we expect software to assist us in the analysis, the spatial elements of our data should be readily accessible in a digital form. Spending a few minutes thinking through this question will make you realize that simple tables or spreadsheets are not up to this task. A more complex data storage mechanism is required. This is the core of a GIS environment: a spatial database that facilitates the storage and retrieval of data that define the spatial boundaries, lines or points of the entities we are studying. This may seem trivial, but without a spatial database, most spatial data exploration and analysis would not be possible! 1.1.1 GIS software Many GIS software applications are available–both commercial and open source. Two popular applications are ArcGIS and QGIS. 1.1.1.1 ArcGIS A popular commercial GIS software is ArcGIS, developed by ESRI. ESRI (pronounced ez-ree) was once a small land-use consulting firm that did not start developing GIS software until the mid-1970s. The ArcGIS desktop environment encompasses a suite of applications which include ArcMap, ArcCatalog, ArcScene and ArcGlobe. ArcGIS comes in three different license levels (basic, standard and advanced) and can be purchased with additional add-on packages. As such, a single license can range from a few thousand dollars to well over ten thousand dollars. In addition to software licensing costs, ArcGIS is only available for Windows operating systems; so if your workplace is a Mac-only environment, the purchase of a Windows PC would add to the expense. 1.1.2 QGIS A very capable open source (free) GIS software is QGIS.
It encompasses most of the functionality included in ArcGIS. If you are looking for a GIS application for your Mac or Linux environment, QGIS is a wonderful choice given its multi-platform support. Built into the current versions of QGIS are functions from another open source software: GRASS. GRASS has been around since the 1980s and has many advanced GIS data manipulation functions; however, its use is not as intuitive as that of QGIS or ArcGIS (hence the preferred QGIS alternative). 1.2 What is Spatial Analysis? A distinction is made in this course between GIS and spatial analysis. In the context of mainstream GIS software, the term analysis refers to data manipulation and data querying. In the context of spatial analysis, the analysis focuses on the statistical analysis of patterns and underlying processes; or, more generally, spatial analysis addresses the question “what could have been the genesis of the observed spatial pattern?” It’s an exploratory process whereby we attempt to quantify the observed pattern, then explore the processes that may have generated the pattern. For example, you record the location of each tree in a well-defined study area. You then map the location of each tree (a GIS task). At this point, you might be inclined to make inferences about the observed pattern. Are the trees clustered or dispersed? Is the tree density constant across the study area? Could soil type or slope have led to the observed pattern? Those are questions that are addressed in spatial analysis using quantitative and statistical techniques. Figure 1.2: Distribution of Maple trees in a 1,000 x 1,000 ft study area. What you will learn in this course is that popular GIS software like ArcGIS are great tools to create and manipulate spatial data, but if one wishes to go beyond the data manipulation and analyze patterns and processes that may have led to these patterns, other quantitative tools are needed. One such tool we will use in this class is R: an open source (free) data analysis environment. R has one of the richest, if not the richest, sets of spatial data analysis and statistics tools available today. Learning the R programming environment will prove to be quite beneficial given that many of the operations learnt are transferable across many other (non-spatial) quantitative analysis projects. R can be installed on both Windows and Mac operating systems. Another related piece of software that you might find useful is RStudio, which offers a nice interface to R. To learn more about data analysis in R, visit the ES218 course website. 1.3 What’s in an Acronym? GIS is a ubiquitous technology. Many of you are taking this course in part because you have seen GIS listed as a “desirable” or “required” skill in job postings. Many of you will think of GIS as a “map making” environment, as do many ancillary users of GIS in the workforce. While “visualizing” data is an important feature of a GIS, one must not lose sight of what data is being visualized and for what purpose. O’Sullivan and Unwin (O’Sullivan and Unwin 2010) use the term accidental geographer to refer to those “whose understanding of geographic science is based on the operations made possible by GIS software”. We can expand on this idea and define the accidental data analyst as one whose understanding of data and its analysis is limited to the point-and-click environment of popular pieces of software such as spreadsheet environments, statistical packages and GIS software.
The aggressive marketing of GIS technology has the undesirable effect of placing the technology before purpose and theory. This is not unique to GIS, however. Such concerns were shared decades ago when personal computers made it easier for researchers and employees to graph non-spatial data as well as perform many statistical procedures. The different purposes of mapping spatial data have strong parallels to that of graphing (or plotting) non-spatial data. John Tukey (Tukey 1972) offers three broad classes of the latter: “Graphs from which numbers are to be read off- substitutes for tables. Graphs intended to show the reader what has already been learned (by some other technique)–these we shall sometimes impolitely call propaganda graphs. Graphs intended to let us see what may be happening over and above what we have already described- these are the analytical graphs that are our main topic.” A GIS world analogy is proposed here: Reference maps (USGS maps, hiking maps, road maps). Such maps are used to navigate landscapes or identify locations of points-of-interest. Presentation maps presented in the press such as the NY Times and the Wall Street Journal, but also maps presented in journals. Such maps are designed to convey a very specific narrative of the author’s choosing. (Here we’ll avoid Tukey’s harsh description of such visual displays, but the idea that maps can be used as propaganda is not farfetched). Statistical maps whose purpose it is to manipulate the raw data in such a way to tease out patterns otherwise not discernable in its original form. This usually requires multiple data manipulation operations and visualization and can sometimes benefit from being explored outside of a spatial context. This course will focus on the last two spatial data visualization purposes with a strong emphasis on the latter (Statistical maps). References "],["chp02_0.html", "Chapter 2 Feature Representation 2.1 Vector vs. Raster 2.2 Object vs. Field 2.3 Scale 2.4 Attribute Tables", " Chapter 2 Feature Representation 2.1 Vector vs. Raster To work in a GIS environment, real world observations (objects or events that can be recorded in 2D or 3D space) need to be reduced to spatial entities. These spatial entities can be represented in a GIS as a vector data model or a raster data model. Figure 2.1: Vector and raster representations of a river feature. 2.1.1 Vector Vector features can be decomposed into three different geometric primitives: points, polylines and polygons. 2.1.1.1 Point Figure 2.2: Three point objects defined by their X and Y coordinate values. A point is composed of one coordinate pair representing a specific location in a coordinate system. Points are the most basic geometric primitives having no length or area. By definition a point can’t be “seen” since it has no area; but this is not practical if such primitives are to be mapped. So points on a map are represented using symbols that have both area and shape (e.g. circle, square, plus signs). We seem capable of interpreting such symbols as points, but there may be instances when such interpretation may be ambiguous (e.g. is a round symbol delineating the area of a round feature on the ground such as a large oil storage tank or is it representing the point location of that tank?). 2.1.1.2 Polyline Figure 2.3: A simple polyline object defined by connected vertices. A polyline is composed of a sequence of two or more coordinate pairs called vertices. 
A vertex is defined by coordinate pairs, just like a point, but what differentiates a vertex from a point is its explicitly defined relationship with neighboring vertices. A vertex is connected to at least one other vertex. Like a point, a true line can’t be seen since it has no area. And like a point, a line is symbolized using shapes that have a color, width and style (e.g. solid, dashed, dotted, etc…). Roads and rivers are commonly stored as polylines in a GIS. 2.1.1.3 Polygon Figure 2.4: A simple polygon object defined by an area enclosed by connected vertices. A polygon is composed of three or more line segments whose starting and ending coordinate pairs are the same. Sometimes you will see the words lattice or area used in lieu of ‘polygon’. Polygons represent both length (i.e. the perimeter of the area) and area. They also embody the idea of an inside and an outside; in fact, the area that a polygon encloses is explicitly defined in a GIS environment. If it isn’t, then you are working with a polyline feature. If this does not seem intuitive, think of three connected lines defining a triangle: they can represent three connected road segments (thus polyline features), or they can represent the grassy strip enclosed by the connected roads (in which case an ‘inside’ is implied thus defining a polygon). 2.1.2 Raster Figure 2.5: A simple raster object defined by a 10x10 array of cells or pixels. A raster data model uses an array of cells, or pixels, to represent real-world objects. Raster datasets are commonly used for representing and managing imagery, surface temperatures, digital elevation models, and numerous other entities. A raster can be thought of as a special case of an area object where the area is divided into a regular grid of cells. But a regularly spaced array of marked points may be a better analogy since rasters are stored as an array of values where each cell is defined by a single coordinate pair inside of most GIS environments. Implicit in a raster data model is a value associated with each cell or pixel. This is in contrast to a vector model that may or may not have a value associated with the geometric primitive. Also note that a raster data structure is square or rectangular. So, if the features in a raster do not cover the full square or rectangular extent, their pixel values will be set to no data values (e.g. NULL or NoData). 2.2 Object vs. Field The traditional vector/raster perspective of our world is one that has been driven by software and data storage environments. But this perspective is not particularly helpful if one is interested in analyzing the pattern. In fact, it can mask some important properties of the entity being studied. An object vs. field view of the world proves to be more insightful even though it may seem more abstract. 2.2.1 Object View An object view of the world treats entities as discrete objects; they need not occur at every location within a study area. Point locations of cities would be an example of an object. So would be polygonal representations of urban areas which may be non-contiguous. 2.2.2 Field View A field view of the world treats entities as a scalar field. This is a mathematical concept in which a scalar is a quantity having a magnitude. It is measurable at every location within the study region. Two popular examples of a scalar field are surface elevation and surface temperature. Each represents a property that can be measured at any location. Another example of a scalar field is the presence and absence of a building. 
This is a binary scalar where a value of 0 is assigned to a location devoid of buildings and a value of 1 is assigned to locations having one or more buildings. A field representation of buildings may not seem intuitive; given the definition of an object view of the world in the last section, it would seem only fitting to view buildings as objects. In fact, buildings can be viewed as either fields or objects. The context of the analysis is ultimately what will dictate which view to adopt. If we’re interested in studying the distribution of buildings over a study area, then an object view of the features makes sense. If, on the other hand, we are interested in identifying all locations where buildings don’t exist, then a binary field view of these entities would make sense. 2.3 Scale How one chooses to represent a real-world entity will be in large part dictated by the scale of the analysis. In a GIS, scale has a specific meaning: it’s the ratio of distance on the map to that in the real world. So a large scale map implies a relatively large ratio and thus a small extent. This is counter to the layperson’s interpretation of large scale, which focuses on the scope or extent of a study; so a large scale analysis would imply one that covers a large area. The following two maps represent the same entity: the Boston region. At a small scale (e.g. 1:10,000,000), Boston and other cities may be best represented as points. At a large scale (e.g. 1:34,000), Boston may be best represented as a polygon. Note that at this large scale, roads may also be represented as polygon features instead of polylines. Figure 2.6: Map of the Boston area at a 1:10,000,000 scale. Note that in geography, this is considered small scale whereas in layperson terms, this extent is often referred to as a large scale (i.e. covering a large area). Figure 2.7: Map of the Boston area at a 1:34,000 scale. Note that in geography, this is considered large scale whereas in layperson terms, this extent is often referred to as a small scale (i.e. covering a small area). 2.4 Attribute Tables Non-spatial information associated with a spatial feature is referred to as an attribute. A feature on a GIS map is linked to its record in the attribute table by a unique numerical identifier (ID). Every feature in a layer has an identifier. It is important to understand the one-to-one or many-to-one relationship between features and attribute records. Because features on the map are linked to their records in the table, many GIS software packages will allow you to click on a map feature and see its related attributes in the table. Raster data can also have attributes, but only if pixels are represented using a small set of unique integer values. Raster datasets that contain attribute tables typically have cell values that represent or define a class, group, category, or membership. NOTE: not all GIS raster data formats can store attribute information; in fact, most raster datasets you will work with in this course will not have attribute tables. 2.4.1 Measurement Levels Attribute data can be broken down into four measurement levels: Nominal data have no implied order, size or quantitative information (e.g. paved and unpaved roads). Ordinal data have an implied order (e.g. ranked scores); however, we cannot quantify the difference between levels since a linear scale is not implied. Interval data are numeric and have a linear scale; however, they do not have a true zero and can therefore not be used to measure relative magnitudes.
For example, one cannot say that 60°F is twice as warm as 30°F since, when presented in °C, the temperature values are 15.5°C and -1.1°C respectively (and 15.5 is clearly not twice as big as -1.1). Ratio scale data are interval data with a true zero such as monetary value (e.g. $1, $20, $100). 2.4.2 Data type Another way to categorize an attribute is by its data type. ArcGIS supports several data types such as integer, float, double and text. Knowing your data type and measurement level should dictate how the data are stored in a GIS environment. The following table lists popular data types available in most GIS applications. Type Stored values Note Short integer -32,768 to 32,767 Whole numbers Long integer -2,147,483,648 to 2,147,483,647 Whole numbers Float -3.4E-38 to 1.2E38 Real numbers Double -2.2E-308 to 1.8E308 Real numbers Text Up to 64,000 characters Letters and words While whole numbers can be stored as a float or double (i.e. we can store the number 2 as 2.0), doing so comes at a cost: an increase in storage space. This may not be a big deal if the dataset is small, but if it consists of tens of thousands of records, the increase in file size and processing time may become an issue. While storing an integer value as a float may not have dire consequences, the same cannot be said of storing a float as an integer. For example, if your values consist of 0.2, 0.01, 0.34, 0.1 and 0.876, their integer counterparts would be 0, 0, 0, 0 and 1 (i.e. values rounded to the nearest whole number). This can have a significant impact on a map as shown in the following example. Figure 2.8: Map of data represented as decimal (float) values. Figure 2.9: Map of the same data represented as integers instead of floats. "],["gis-data-management.html", "Chapter 3 GIS Data Management 3.1 GIS File Data Formats 3.2 Managing GIS Files in ArcGIS 3.3 Managing a Map Project in ArcGIS", " Chapter 3 GIS Data Management 3.1 GIS File Data Formats In the GIS world, you will encounter many different GIS file formats. Some file formats are unique to specific GIS applications; others are universal. For this course, we will focus on a subset of spatial data file formats: shapefiles for vector data, Imagine and GeoTiff files for rasters, and file geodatabases and geopackages for both vector and raster data. 3.1.1 Vector Data File Formats 3.1.1.1 Shapefile A shapefile is a file-based data format native to ArcView 3.x software (a much older version of ArcMap). Conceptually, a shapefile is a feature class–it stores a collection of features that have the same geometry type (point, line, or polygon), the same attributes, and a common spatial extent. Despite what its name may imply, a “single” shapefile is actually composed of at least three files, and as many as eight. Each file that makes up a “shapefile” has a common filename but a different extension type. The list of files that define a “shapefile” is shown in the following table. Note that each file has a specific role in defining a shapefile. File extension Content .dbf Attribute information .shp Feature geometry .shx Feature geometry index .aih Attribute index .ain Attribute index .prj Coordinate system information .sbn Spatial index file .sbx Spatial index file 3.1.1.2 File Geodatabase A file geodatabase is a relational database storage format. It’s a far more complex data structure than the shapefile and consists of a .gdb folder housing dozens of files.
Its complexity renders it more versatile allowing it to store multiple feature classes and enabling topological definitions (i.e. allowing the user to define rules that govern the way different feature classes relate to one another). An example of the contents of a geodatabase is shown in the following figure. Figure 3.1: Sample content of an ArcGIS file geodatabase. (src: esri) 3.1.1.3 GeoPackage This is a relatively new data format that follows open format standards (i.e. it is non-proprietary). It’s built on top of SQLite (a self-contained relational database). Its one big advantage over many other vector formats is its compactness–coordinate value, metadata, attribute table, projection information, etc…, are all stored in a single file which facilitates portability. Its filename usually ends in .gpkg. Applications such as QGIS (2.12 and up), R and ArcGIS will recognize this format (ArcGIS version 10.2.2 and above will read the file from ArcCatalog but requires a script to create a GeoPackage). 3.1.2 Raster Data File Formats Rasters are in part defined by their pixel depth. Pixel depth defines the range of distinct values the raster can store. For example, a 1-bit raster can only store 2 distinct values: 0 and 1. Figure 3.2: Examples of different pixel depths. There is a wide range of raster file formats used in the GIS world. Some of the most popular ones are listed below. 3.1.2.1 Imagine The Imagine file format was originally created by an image processing software company called ERDAS. This file format consists of a single .img file. This is a simpler file format than the vector shapefile. It is sometimes accompanied by an .xml file which usually stores metadata information about the raster layer. 3.1.2.2 GeoTiff A popular public domain raster data format is the GeoTIFF format. If maximum portability and platform independence is important, this file format may be a good choice. 3.1.2.3 File Geodatabase A raster file can also be stored in a file geodatabase alongside vector files. Geodatabases have the benefit of defining image mosaic structures thus allowing the user to create “stitched” images from multiple image files stored in the geodatabase. Also, processing very large raster files can be computationally more efficient when stored in a file geodatabase as opposed to an Imagine or GeoTiff file format. 3.2 Managing GIS Files in ArcGIS Unless you are intimately familiar with the file structure of a GIS file, it is best to copy/move/delete GIS files from within the software environment. Figure 3.3: Windows File Explorer view vs. ArcGIS Catalog view. Note, for example, how the many files that make up the Cities shapefile (as viewed in a Windows file manager environment) appears as a single entry in the Catalog view. This makes it easier to rename the shapefile since it needs to be done only for a single entry in the GIS software (as opposed to renaming the Cities files seven times in the Windows file manager environment). 3.3 Managing a Map Project in ArcGIS Unlike many other software environments such as word processors and spreadsheets, a GIS map project is not self-contained in a single file. A GIS map consists of many files: ArcGIS’ .aprx file and the various vector and/or raster files used in the map project. The .aprx file only stores information about how the different layers are to be symbolized and the GIS file locations these layers point to. 
Because of the complex data structure associated with GIS maps, it’s usually best to store the .aprx and all associated GIS files under a single project directory. Then, when you are ready to share your map project with someone else, just pass along that project folder as is or compressed in a zip or tar file. Because .aprx map files read data from GIS files, they must know where to find these files on your computer or across the network. There are two ways in which a map document can store the location of the GIS files: as a relative pathname or a full pathname. In older esri GIS applications, like ArcMap, the user had the choice to save a project using relative or full pathnames. Note that ArcMap is a legacy GIS software replaced by ArcGIS Pro. What follows pertains to the ArcMap software environment and not the ArcGIS Pro software environment. A relative pathname defines the location of the GIS files relative to the location of the map file on your computer. For example, let’s say that you created a project folder called HW05 under D:/Username/. In that folder, you have an ArcMap map document, Map.mxd (ArcMap map documents have an .mxd extension and not an .aprx extension). The GIS document displays two layers stored in the GIS files Roads.shp and Cities.shp. In this scenario, the .mxd document and shapefiles are in the same project folder. If you set the Pathnames parameter to “Store relative pathnames to data sources” (accessed from ArcMap’s File >> Map Document Properties menu), ArcMap will not need to know the entire directory structure above the HW05/ folder to find the two shapefiles, as illustrated below. If the “Store relative pathnames to data sources” option is not checked in the map’s document properties, then ArcMap will need to know the entire directory structure leading to the HW05/ folder, as illustrated below. Your choice of full vs relative pathnames matters if you find yourself having to move or copy your project folder to another directory structure. For example, if you share your HW05/ project folder with another user and that user places the project folder under a different directory structure such as C:/User/Jdoe/GIS/, ArcMap will not find the shapefiles if pathnames are set to full (i.e. the Store relative pathnames option is not checked). This will result in exclamation marks in your map document TOC. This problem can be avoided by making sure that the map document is set to use relative pathnames and by placing all GIS files (raster and vector) in a common project folder. NOTE: Exclamation marks in your map document indicate that the GIS files are missing or that the directory structure has changed. Figure 3.4: In ArcGIS, an exclamation mark next to a layer indicates that the GIS file the layer is pointing to cannot be found. "],["symbolizing-features.html", "Chapter 4 Symbolizing features 4.1 Color 4.2 Color Space 4.3 Classification 4.4 So how do I find a proper color scheme for my data? 4.5 Classification Intervals", " Chapter 4 Symbolizing features 4.1 Color Each color is a combination of three perceptual dimensions: hue, lightness and saturation. 4.1.1 Hue Hue is the perceptual dimension associated with color names. Typically, we use different hues to represent different categories of data. Figure 4.1: An example of eight different hues. Hues are associated with color names such as green, red or blue. Note that magentas and purples are not part of the natural visible light spectrum; instead they are a mix of reds and blues (or violets) from the spectrum’s tail ends.
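As a side illustration (not part of the original text), base R's hcl() color specification maps directly onto these three perceptual dimensions, with h for hue, c for chroma (saturation) and l for luminance (lightness):

# Eight hues at a fixed chroma (saturation) and luminance (lightness),
# mirroring the idea of using different hues for different categories.
hues  <- hcl(h = seq(0, 315, by = 45), c = 60, l = 70)
# One hue with decreasing lightness, then one hue with decreasing saturation.
light <- hcl(h = 240, c = 60, l = seq(90, 30, by = -10))
satur <- hcl(h = 240, c = seq(90, 0, by = -15), l = 70)
barplot(rep(1, length(hues)), col = hues, border = NA, axes = FALSE)  # quick look at the eight hues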
4.1.2 Lightness Lightness (sometimes referred to as value) describes how much light is reflected (or emitted) off of a surface. Lightness is an important dimension for representing ordinal/interval/ratio data. Figure 4.2: Eight different hues (across columns) with decreasing lightness values (across rows). 4.1.3 Saturation Saturation (sometimes referred to as chroma) is a measure of a color’s vividness. You can use saturated colors to help distinguish map symbols. But be careful when manipulating saturation; this property should be modified sparingly in most maps. Figure 4.3: Eight different hues (across columns) with decreasing saturation values (across rows). 4.2 Color Space The three perceptual dimensions of color can be used to construct a 3D color space. This 3D space need not be a cube (as one would expect given that we are combining three dimensions) but a cone where lightness, saturation and hue are the cone’s height, radius and circumference respectively. Figure 4.4: This is how the software defines the color space. But does this match our perception of color space? The cone shape reflects the fact that as one decreases saturation, the distinction between different hues disappears, leading to a grayscale color (the central axis of the cone). So if one sets the saturation value of a color to 0, the hue ends up being some shade of grey. The color space implemented in most software is symmetrical about the value/lightness axis. However, this is not how we “perceive” color space: our perceptual view of the color space is not perfectly symmetrical. Let’s examine a slice of the symmetrical color space along the blue/yellow hue axis at a lightness value of about 90%. Figure 4.5: A cross section of the color space with constant hues and lightness values and decreasing saturation values where the two hues merge. Now, how many distinct yellows can you make out? How many distinct blues can you make out? Do the numbers match? Unless you have incredible color perception, you will probably observe that the numbers of distinct colors do not match when in fact they do! There are exactly 30 distinct blues and 30 distinct yellows. Let’s add a border to each color to convince ourselves that the software did indeed generate the same number of distinct colors. Figure 4.6: A cross section of the color space with each color distinctly outlined. It should be clear by now that a symmetrical color space does not reflect the way we “perceive” colors. There are more rigorously designed color spaces such as CIELAB and Munsell that depict the color space as a non-symmetrical object as perceived by humans. For example, in a Munsell color space, a vertical slice of the cone along the blue/yellow axis looks like this. Figure 4.7: A slice of the Munsell color space. Note that based on the Munsell color space, we can make out fewer yellows than blues across all lightness values. In fact, for these two hues, we can make out only 29 different shades of yellow (we do not include the gray levels where saturation = 0) vs 36 shades of blue. So how do we leverage our understanding of color spaces when choosing colors for our map features? The next section highlights three different color schemes: qualitative, sequential and divergent. 4.3 Classification 4.3.1 Qualitative color scheme Qualitative schemes are used to symbolize data having no inherent order (i.e. categorical data). Different hues with equal lightness and saturation values are normally used to distinguish different categorical values.
Figure 4.8: Example of four different qualitative color schemes. Color hex numbers are superimposed on each palette. Election results is an example of a dataset that can be displayed using a qualitative color scheme. But be careful in your choice of hues if a cultural bias exists (i.e. it may not make sense to assign “blue” to republican or “red”” to democratic regions). Figure 4.9: Map of 2012 election results shown in a qualitative color scheme. Note the use of three hues (red, blue and gray) of equal lightness and saturation. Most maps created in this course will be generated from polygon layers where continuous values will be assigned discrete color swatches. Such maps are referred to as choropleth maps. The choice of classification schemes for choropleth maps are shown next. 4.3.2 Sequential color scheme Sequential color schemes are used to highlight ordered data such as income, temperature, elevation or infection rates. A well designed sequential color scheme ranges from a light color (representing low attribute values) to a dark color (representing high attribute values). Such color schemes are typically composed of a single hue, but may include two hues as shown in the last two color schemes of the following figure. Figure 4.10: Example of four different sequential color schemes. Color hex numbers are superimposed on each palette. Distribution of income is a good example of a sequential map. Income values are interval/ratio data which have an implied order. Figure 4.11: Map of household income shown in a sequential color scheme. Note the use of a single hue (green) and 7 different lightness levels. 4.3.3 Divergent color scheme Divergent color schemes apply to ordered data as well. However, there is an implied central value about which all values are compared. Typically, a divergent color scheme is composed of two hues–one for each side of the central value. Each hue’s lightness/saturation value is then adjusted symmetrically about the central value. Examples of such a color scheme follows: Figure 4.12: Example of four different divergent color schemes. Color hex numbers are superimposed onto each palette. Continuing with the last example, we now focus on the divergence of income values about the median value of $36,641. We use a brown hue for income values below the median and a green/blue hue for values above the median. Figure 4.13: This map of household income uses a divergent color scheme where two different hues (brown and blue-green) are used for two sets of values separated by the median income of 36,641 dollars. Each hue is then split into three separate colors using decreasing lightness values away from the median. 4.4 So how do I find a proper color scheme for my data? Fortunately, there is a wonderful online resource that will guide you through the process of picking a proper set of color swatches given the nature of your data (i.e. sequential, diverging, and qualitative) and the number of intervals (aka classes). The website is http://colorbrewer2.org/ and was developed by Cynthia Brewer et. al at the Pennsylvania State University. You’ll note that the ColorBrewer website limits the total number of color swatches to 12 or less. There is a good reason for this in that our eyes can only associate so many different colors with value ranges/bins. Try matching 9 different shades of green in a map to the legend box swatches! 
Additional features available on that website include choosing colorblind safe colors and color schemes that translate well into grayscale colors (useful if your work is to be published in journals that do not offer color prints). 4.5 Classification Intervals You may have noticed the use of different classification breaks in the last two maps. For the sequential color scheme map, an equal interval classification scheme was used where the full range of values in the map are split equally into 7 intervals so that each color swatch covers an equal range of values. The divergent color scheme map adopts a quantile interval classification where each color swatch is represented an equal number of times across each polygon. Using different classification intervals will result in different looking maps. In the following figure, three maps of household income (aggregated at the census tract level) are presented using different classification intervals: quantile, equal and Jenks. Note the different range of values covered by each color swatch. Figure 4.14: Three different representations of the same spatial data using different classification intervals. The quantile interval scheme ensures that each color swatch is represented an equal number of times. If we have 20 polygons and 5 classes, the interval breaks will be such that each color is assigned to 4 different polygons. The equal interval scheme breaks up the range of values into equal interval widths. If the polygon values range from 10,000 to 25,000 and we have 5 classes, the intervals will be [10,000 ; 13,000], [13,000 ; 16,000], …, [22,000 ; 25,000]. The Jenks interval scheme (aka natural breaks) uses an algorithm that identifies clusters in the dataset. The number of clusters is defined by the desired number of intervals. It may help to view the breaks when superimposed on top of a distribution of the attribute data. In the following graphics the three classification intervals are superimposed on a histogram of the per-household income data. The histogram shows the distribution of values as “bins” where each bin represents a range of income values. The y-axis shows the frequency (or number of occurrences) for values in each bin. Figure 4.15: Three different classification intervals used in the three maps. Note how each interval scheme encompasses different ranges of values (hence the reason all three maps look so different). 4.5.1 An Interactive Example The following interactive frame demonstrates the different “looks” a map can take given different combinations of classification schemes and class numbers. "],["statistical-maps.html", "Chapter 5 Statistical maps 5.1 Statistical distribution maps 5.2 Mapping uncertainty", " Chapter 5 Statistical maps 5.1 Statistical distribution maps The previous chapter demonstrated how the choice of a classification scheme can generate different looking maps. Your choice of classification breaks should be driven by the data. This chapter will focus on statistical approaches to generating classification breaks. Many spatial datasets consist of continuous values. As such, one can have as many unique values as there are unique polygons in a data layer. For example, a Massachusetts median household income map where a unique color is assigned to each unique value will look like this: Figure 5.1: Example of a continuous color scheme applied to a choropleth map. Such a map may not be as informative as one would like it to be. 
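To make the classification intervals described above concrete, here is a small sketch (the income vector is simulated purely for illustration, not the book's data) that computes equal, quantile and Jenks breaks with the classInt package and pairs them with a ColorBrewer sequential palette:

library(classInt)
library(RColorBrewer)

set.seed(1)
income <- rlnorm(100, meanlog = 10, sdlog = 0.4)   # simulated stand-in for median household income

brks_equal    <- classIntervals(income, n = 7, style = "equal")     # equal interval breaks
brks_quantile <- classIntervals(income, n = 7, style = "quantile")  # quantile breaks
brks_jenks    <- classIntervals(income, n = 7, style = "jenks")     # Jenks (natural breaks)

pal <- brewer.pal(7, "Greens")                     # single-hue sequential palette
col_quantile <- findColours(brks_quantile, pal)    # assign each value to a color swatch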
In statistics, we seek to reduce large sets of continuous values to discrete entities to help us better “handle” the data. Discretization of values can take on the form of a histogram where values are assigned to one of several equal-width bins. A choropleth map classification equivalent is the equal interval classification scheme. Figure 5.2: An equal interval choropleth map using 10 bins. The histogram in the above figure is “flipped” so as to match the bins with the color swatches. The length of each grey bin reflects the number of polygons assigned their respective color swatches. An equal interval map benefits from having each color swatch covering an equal range of values. This makes it easier to compare differences between pairs of swatches. Note that a sequential color scheme is used since there is no implied central value in this classification scheme. 5.1.1 Quantile map While an equal interval map benefits from its intuitiveness, it may not be well suited for data that are not uniformly distributed across their range (note the disproportionate distribution of observations in each color bin in the above figure). Quantiles define ranges of values that have an equal number of observations. For example, the following plot groups the data into six quantiles with each quantile representing the same number of observations (exceptions exist when multiple observations share the exact same value). Figure 5.3: Example of a quantile map. You’ll note the differing color swatch lengths in the color bar reflecting the different ranges of values covered by each color swatch. For example, the darkest color swatch covers the largest range of values, [131944, 250001], yet it is applied to the same number of polygons as most other color swatches in this classification scheme. 5.1.2 Boxplot map The discretization of continuous values can also include measures of centrality (e.g. the mean and the median) and measures of spread (e.g. standard deviation units) with the goal of understanding the nature of the distribution such as its shape (e.g. symmetrical, skewed, etc…) and range. The boxplot is an example of a statistical plot that offers both. This plot reduces the data to five summary statistics, including the median, the upper and lower quartiles (within which 50% of the data lie–also known as the interquartile range, IQR), and upper and lower “whiskers” that encompass 1.5 times the interquartile range. The boxplot may also display “outliers”–data points that may be deemed unusual or not characteristic of the bulk of the data. Figure 5.4: Example of a boxplot map. Here, we make use of a divergent color scheme to take advantage of the implied measure of centrality (i.e. the median). 5.1.3 IQR map The IQR map is a reduction of the boxplot map whereby we reduce the classes to just three: the interquartile range (IQR) and the upper and lower extremes. The map’s purpose is to highlight the polygons covering the mid 50% range of values. This mid range usually benefits from a darker hue to help distinguish it from the upper and lower sets of values. Figure 5.5: Example of an IQR map. The IQR map differs from the preceding maps shown in this chapter in that upper and lower values are no longer emphasized–whether implicitly or explicitly.
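As a quick numeric sketch of the boxplot classes described above (using a simulated vector in place of the income data), the quartiles and the 1.5 × IQR fences can be computed directly in base R:

set.seed(42)
income <- rlnorm(200, meanlog = 10, sdlog = 0.5)      # simulated stand-in for the income values

q   <- quantile(income, probs = c(0.25, 0.50, 0.75))  # lower quartile, median, upper quartile
iqr <- IQR(income)                                    # interquartile range (Q3 - Q1)
lower_fence <- q[[1]] - 1.5 * iqr                     # lower whisker limit
upper_fence <- q[[3]] + 1.5 * iqr                     # upper whisker limit

outliers <- income[income < lower_fence | income > upper_fence]  # values mapped as outliers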
While these maps consistently highlighted the prominent east-west gradient in income values, with the higher values occurring in the east and the lower values occurring in the west, the IQR map reveals that the distribution of middle-income households follows a pattern that is more dispersed across the state of Massachusetts. 5.1.4 Standard deviation map If the data distribution can be approximated by a Normal distribution (a theoretical distribution defined by a mathematical function), the classification scheme can be broken up into different standard deviation units. Figure 5.6: Example of a standard deviation map. You’ll note from the figure that the income data do not follow a Normal distribution exactly–they have a slight skew toward higher values. This results in more polygons being assigned higher class breaks than lower ones. 5.1.5 Outlier maps So far, emphasis has been placed on depicting the full range of values in the distribution. However, there may be times when we want to place emphasis on the extreme values. For example, we may want to generate a map that identifies the regions with unusually high or unusually low values. What constitutes an outlier can be subjective. For this reason, we will rely on the statistical techniques covered in the last section to help characterize regions with unusually high and/or low values. 5.1.5.1 Boxplot outlier map We can tweak the boxplot map from the last section by assigning darker hues to observations outside the whiskers (outliers) and a single light-colored hue to all other values. By minimizing the range of color swatches, we place emphasis on the outliers. Figure 5.7: Example of a boxplot outlier choropleth map. You’ll note the asymmetrical distribution of outliers, with a bit more than a dozen regions with unusually high income values and just one region with unusually low income values. 5.1.5.2 Standard deviation outliers In this next example, we use the +/- 2 standard deviation bounds from the Normal distribution to identify outliers in the income data. Hence, if the data were to follow a perfectly Normal distribution, this would translate to roughly the top 2.5% and bottom 2.5% of the distribution. Figure 5.8: Example of a standard deviation outlier choropleth map. 5.1.5.3 Quantile outliers In this last example, we’ll characterize the top and bottom 2.5% of values as outliers by splitting the data into 40 quantiles, then mapping the top and bottom quantiles to capture the 2.5% of values. Figure 5.9: Example of a quantile outlier choropleth map where the top and bottom 2.5% regions are characterized as outliers. 5.2 Mapping uncertainty Many census datasets such as the U.S. Census Bureau’s American Community Survey (ACS) data are based on surveys from small samples. This means that the variables provided by the Census Bureau are only estimates with a level of uncertainty often provided as a margin of error (MoE) or a standard error (SE). Note that the Bureau’s MoE encompasses a 90% confidence interval (i.e. there is a 90% chance that the MoE range covers the true value being estimated). This poses a challenge to both the visual exploration of the data as well as any statistical analyses of that data. One approach to mapping both estimates and SE’s is to display both as side-by-side maps. Figure 5.10: Maps of income estimates (left) and associated standard errors (right).
While there is nothing inherently wrong in doing this, it can prove to be difficult to mentally process the two maps, particularly if the data consist of hundreds or thousands of small polygons. Another approach is to overlay the measure of uncertainty (SE or MoE) as a textured layer on top of the income layer. Figure 5.11: Map of estimated income (in shades of green) superimposed with different hash marks representing the ranges of income SE. Or, one could map both the upper and lower ends of the MoE range side by side. Figure 5.12: Maps of top end of 90 percent income estimate (left) and bottom end of 90 percent income estimate (right). 5.2.1 Problems in mapping uncertainty Attempting to convey uncertainty using the aforementioned maps fails to highlight the reason one chooses to map values in the first place: that is to compare values across a spatial domain. More specifically, we are interested in identifying spatial patterns of high or low values. What is implied in the above maps is that the estimates will always maintain their order across the polygons. In other words, if one polygon’s estimate is greater than all neighboring estimates, this order will always hold true if another sample was surveyed. But this assumption is incorrect. Each polygon (or county in the above example) can derive different estimates independently from its neighboring polygon. Let’s look at a bar plot of our estimates. Figure 5.13: Income estimates by county with 90 percent confidence interval. Note that many counties have overlapping estimate ranges. Note, for example, how Piscataquis county’s income estimate (grey point in the graphic) is lower than that of Oxford county. If another sample of the population was surveyed in each county, the new estimates could place Piscataquis above Oxford county in income rankings as shown in the following example: Figure 5.14: Example of income estimates one could expect to sample based on the 90 percent confidence interval shown in the previous plot. Note how, in this sample, Oxford’s income drops in ranking below that of Piscataquis and Franklin counties. A similar change in ranking is observed for Sagadahoc county which drops down two counties: Hancock and Lincoln. How does the estimated income map compare with the simulated income map? Figure 5.15: Original income estimate (left) and realization of a simulated sample (right). A few more simulated samples (using the 90% confidence interval) are shown below: Figure 5.16: Original income estimate (left) and realizations from simulated samples (R2 through R5). 5.2.2 Class comparison maps There is no single solution to effectively convey both estimates and associated uncertainty in a map. Sun and Wong (Sun and Wong 2010) offer several suggestions dependent on the context of the problem. One approach adopts a class comparison method whereby a map displays both the estimate and a measure of whether the MoE surrounding that estimate extends beyond the assigned class. For example, if we adopt the classification breaks [0 , 20600 , 22800 , 25000 , 27000 , 34000 ], we will find that many of the estimates’ MoE extend beyond the classification breaks assigned to them. Figure 5.17: Income estimates by county with 90 percent confidence interval. Note that many of the counties’ MoE have ranges that cross into an adjacent class. Take Piscataquis county, for example. 
Its estimate is assigned the second classification break (20600 to 22800), yet its lower confidence interval stretches into the first classification break indicating that we cannot be 90% confident that the estimate is assigned the proper class (i.e. its true value could fall into the first class). Other counties such as Cumberland and Penobscot don’t have that problem since their 90% confidence intervals fall inside the classification breaks. This information can be mapped as a hatch mark overlay. For example, income could be plotted using varying shades of green with hatch symbols indicating if the lower interval crosses into a lower class (135° hatch), if the upper interval crosses into an upper class (45° hatch), if both interval ends cross into a different class (90°-vertical-hatch) or if both interval ends remain inside the estimate’s class (no hatch). Figure 5.18: Plot of income with class comparison hatches. 5.2.3 Problem when performing bivariate analysis Data uncertainty issues do not only affect choropleth map presentations but also affect bivariate or multivariate analyses where two or more variables are statistically compared. One popular method in comparing variables is the regression analysis where a line is best fit to a bivariate scatterplot. For example, one can regress “percent not schooled” against “income” as follows: Figure 5.19: Regression between percent not having completed any school grade and median per capita income for each county. The \\(R^2\\) value associated with this regression analysis is 0.2 and the p-value is 0.081. But another realization of the survey could produce the following output: Figure 5.20: Example of what a regression line could look like had another sample been surveyed for each county. With this new (simulated) sample, the \\(R^2\\) value dropped to 0.07 and the p-value is now 0.322–a much less significant relationship than computed with the original estimate! In fact, if we were to survey 1000 different samples within each county we would get the following range of regression lines: Figure 5.21: A range of regression lines computed from different samples from each county. These overlapping lines define a type of confidence interval (aka confidence envelope). In other words, the true regression line between both variables lies somewhere within the dark region delineated by this interval. References Chapter 6 Pitfalls to avoid 6.1 Representing Count Let’s define a 5km x 5km area and map the location of each individual inside the study area. Let’s assume, for the sake of argument, that individuals are laid out in a perfect grid pattern. Now let’s define two different zoning schemes: one which follows a uniform grid pattern and another that does not. The layout of individuals relative to both zonal schemes is shown in Figure 6.1. Figure 6.1: Figure shows the layout of individuals inside two different zonal unit configurations. If we sum the number of individuals in each polygon, we get two maps that appear to be giving us two completely different population distribution patterns: Figure 6.2: Count of individuals in each zonal unit. Note how an underlying point distribution can generate vastly different looking choropleth maps given different aggregation schemes.
The maps highlight how non-uniform areal units can fool us into thinking a pattern exists when in fact this is just an artifact of the aggregation scheme. A solution to this problem is to represent counts as ratios such as number of deaths per number of people or number of people per square kilometer. In Figure 6.3, we opt for the latter ratio (number of people per square kilometer). Figure 6.3: Point density choropleth maps. The sample study extent is 20x20 units which generates a uniform point density of 1. The slight discrepancy in values for the map on the right is to be expected given that the zonal boundaries do not split the distance between points exactly. 6.2 MAUP Continuing with the uniform point distribution from the last section, let’s assume that as part of the survey, two variables (v1 and v2) were recorded for each point (symbolized as varying shades of green and reds in the two left-hand maps of Figure 6.4). We might be interested in assessing if the variables v1 and v2 are correlated (i.e. as variable v1 increases in value, does this trigger a monotonic increase or decrease in variable v2?). One way to visualize the relationship between two variables is to generate a bivariate scatter plot (right plot of Figure 6.4). Figure 6.4: Plots of variables v1 and v2 for each individual in the survey. The color scheme is sequential with darker colors depicting higher values and lighter colors depicting lower values. It’s obvious from the adjoining scatter plot that there is little to no correlation between variables v1 and v2 at the individual level; both the slope and coefficient of determination, \\(R^2\\), are close to \\(0\\). But many datasets (such as the US census data) are provided to us not at the individual level but at various levels of aggregation units such as the census tract, the county or the state levels. When aggregated, the relationship between variables under investigation may change. For example, if we aggregated v1 and v2 using the uniform aggregation scheme highlighted earlier we get the following relationship. Figure 6.5: Data summarized using a uniform aggregation scheme. The resulting regression analysis is shown in the right-hand plot. Note the slight increase in slope and \\(R^2\\) values. If we aggregate the same point data using the non-homogeneous aggregation scheme, we get yet another characterization of the relationship between v1 and v2. Figure 6.6: Data summarized using a non-uniform aggregation scheme. The resulting regression analysis is shown in the right-hand plot. Note the high \\(R^2\\) value, yet the underlying v1 and v2 variables from which the aggregated values were computed were not at all correlated! It should be clear by now that different aggregation schemes can result in completely different analysis outcomes. In fact, it would be entirely possible to come up with an aggregation scheme that produces a near perfect correlation between variables v1 and v2. This problem is often referred to as the modifiable areal unit problem (MAUP) and has, as you can well imagine by now, some serious implications. Unfortunately, this problem is often overlooked in many analyses that involve aggregated data. 6.3 Ecological Fallacy But, as is often the case, our analysis is constrained by the data at hand. So when analyzing aggregated data, you must be careful in how you frame the results.
For example, if your analysis was conducted with the data summarized using the non-uniform aggregation scheme shown in Figure 6.6, you might be tempted to state that there is a strong relationship between variables v1 and v2 at the individual level. But doing so leads to the ecological fallacy where the statistical relationship at one level of aggregation is (wrongly) assumed to hold at any other level of aggregation (including at the individual level). In fact, all you can really say is that “at this level of aggregation, we observe a strong relationship between v1 and v2” and nothing more! 6.4 Mapping rates One of the first pitfalls you’ve been taught to avoid is the mapping of counts when the areal units associated with these values are not uniform in size and shape. Two options in resolving this problem are: normalizing counts to area or normalizing counts to some underlying population count. An example of the latter is the mapping of infection rates or mortality rates. For example, the following map displays the distribution of kidney cancer death rates (by county) for the period 1980 to 1984. Figure 6.7: Kidney cancer death rates for the period spanning 1980-1984. Now let’s look at the top 10% of counties with the highest death rates. Figure 6.8: Top 10% of counties with the highest kidney cancer death rates. And now let’s look at the bottom 10% of counties with the lowest death rates. Figure 6.9: Bottom 10% of counties with the lowest kidney cancer death rates. A quick glance at these maps suggests clustering of high and low rates around the same parts of the country. In fact, if you were to explore these maps in a GIS, you would note that many of the bottom 10% counties are adjacent to the top 10% counties! If local environmental factors are to blame for kidney cancer deaths, why would they be present in one county and not in an adjacent county? Could differences in regulations between counties be the reason? These are hypotheses that one would probably want to explore, but before pursuing these hypotheses, it would behoove us to look a bit more closely at the batch of numbers we are working with. Let’s first look at a population count map (note that we are purposely not normalizing the count data). Figure 6.10: Population count for each county. Note that a quantile classification scheme is adopted forcing a large range of values to be assigned a single color swatch. The central part of the states where we are observing both very high and very low cancer death rates seems to have low population counts. Could population count have something to do with this odd congruence of high and low cancer rates? Let’s explore the relationship between death rates and population counts outside of a GIS environment and focus solely on the two batches of numbers. The following plot is a scatterplot of death rates and population counts. Figure 6.11: Plot of rates vs population counts. Note the skewed nature of both data batches. Transforming both variables reveals much more about the relationship between them. Figure 6.12: Plot of rates vs population counts on log scales. One quickly notices a steady decrease in death rate variability about some central value of ~0.000045 (or 4.5e-5) as the population count increases. This is because lower population counts tend to generate the very high and very low rates observed in our data. This raises the question: does low population count cause very high and low cancer death rates, or is this simply a numerical artifact?
To answer this question, let’s simulate some data. Let’s assume that the real death rate is 5 per 100,000 people. If a county has a population of 1000, then \\(1000 \\times 5 \\times 10^{-5} = 0.05\\) persons would die of kidney cancer; when rounded to the nearest whole person, that translates to \\(0\\) deaths in that county. Now, there is still the possibility that a county of 1000 could have one person succumb to the disease in which case the death rate for that county would be \\(1/1000=0.001\\) or 1 in 1000, a rate much greater than the expected rate of 5 in 100,000! This little exercise reveals that you could never calculate a rate of 5 in 100,000 with a population count of just 1000. You either compute a rate of \\(0\\) or a rate of \\(0.001\\) (or more). In fact, you would need a large population count to accurately estimate the real death rate. Turning our attention back to our map, you will notice that a large majority of the counties have a small population count (about a quarter have a population count of 22,000 or less). This explains the wide range of rates observed for these smaller counties; the larger counties don’t have such a wide swing in values because they have a larger sample size which can more accurately reflect the true death rate. Rates that are computed using relatively small “at risk” population counts are deemed unstable. 6.5 Coping with Unstable Rates To compensate for the small population counts, we can minimize the influence those counties have on the representation of the spatial distribution of rates. One such technique, the empirical Bayes (EB) method, does just that. Where county population counts are small, the “rates” are modified to match the overall expected rate (which is an average value of all rates in the map). This minimizes the counties’ influence on the range of rate values. EB techniques for rate smoothing aren’t available in ArcGIS but are available in a couple of free and open source applications such as GeoDa and R. An example implementation in R is shown in the Appendix section. An EB smoothed representation of kidney cancer deaths gives us the following rate vs population plot: Figure 6.13: Plot of EB smoothed rates vs population counts on log scales. The variability in rates for smaller counties has decreased. The range of rate values has dropped from 0.00045 to 0.00023. Variability is still greater for smaller counties than larger ones, but not as pronounced as it was with the raw rates. Maps of the top 10% and bottom 10% EB smoothed rates are shown in the next two figures. Figure 6.14: Top 10% of counties with the highest kidney cancer death rates using EB smoothing techniques. Figure 6.15: Bottom 10% of counties with the lowest kidney cancer death rates using EB smoothing technique. Note the differences in rate distribution. For example, higher rates now show up in Florida which would be expected given the large retirement population, and clusters are now contiguous which could suggest local effects. But it’s important to remember that EB smoothing does not reveal the true underlying rate; it only masks those that are unreliable. Also, EB smoothing does not completely eliminate unstable rates–note the slightly higher rates for low population counts in Figure 6.15. Other solutions to the unstable rate problem include: Grouping small counties into larger ones–thus increasing population sample size. Increasing the study’s time interval.
In this example, data were aggregated over the course of 5 years (1980-1984) but could be increased by adding 5 more years, thus increasing sample sizes in each county. Grouping small counties AND increasing the study’s time interval. These solutions do have their downside in that they decrease the spatial and/or temporal resolutions. It should be clear by now that there is no single one-size-fits-all solution to the unstable rate problem. A sound analysis will usually require that one or more of the aforementioned solutions be explored. Chapter 7 Good Map Making Tips 7.1 Elements of a map A map can be composed of many different map elements. They may include: Main map body, legend, title, scale indicator, orientation indicator, inset map, and source and ancillary information. Not all elements need to be present in a map. In fact, in some cases they may not be appropriate at all. A scale bar, for instance, may not be appropriate if the coordinate system used does not preserve distance across the map’s extent. Knowing why and for whom a map is being made will dictate its layout. If it’s to be included in a paper as a figure, then parsimony should be the guiding principle. If it’s intended to be a standalone map, then additional map elements may be required. Knowing the intended audience should also dictate what you will convey and how. If it’s a general audience with little technical expertise then a simpler presentation may be in order. If the audience is well versed in the topic, then the map may be more complex. Figure 7.1: Map elements. Note that not all elements are needed, nor are they appropriate in some cases. Can you identify at least one element that does not belong in the map (hint: note the orientation of the longitudinal lines; are they parallel to one another? What implication does this have on the North direction and the placement of the North arrow?) 7.2 How to create a good map Here’s an example of a map layout that showcases several bad practices. Figure 7.2: Example of a bad map. Can you identify the problematic elements in this map? A good map establishes a visual hierarchy that ensures that the most important elements are at the top of this hierarchy and the least important are at the bottom. Typically, the top elements should consist of the main map body, the title (if this is a standalone map) and a legend (when appropriate). When showcasing choropleth maps, it’s best to limit the color swatches to fewer than a dozen–it becomes difficult for the viewer to tie too many different colors in a map to a color swatch element in the legend. Also, classification breaks should not be chosen at random but should be chosen carefully; for example, adopting a quantile classification scheme to maximize the inclusion of the different color swatches in the map, or a classification scheme based on logical breaks (or easy to interpret breaks) when dictated by theory or cultural predisposition. Scale bars and north arrows should be used judiciously and need not be present in every map. These elements are used to measure orientation and distances. Such elements are critical in reference maps such as USGS Topo maps and navigation maps but serve little purpose in a thematic map where the goal is to highlight differences between areal units.
If, however, these elements are to be placed in a thematic map, reduce their visual prominence (see Figure 7.3 for examples of scale bars). The same principle applies to the selection of an orientation indicator (north arrow) element. Use a small north arrow design if it is to be placed low in the hierarchy, larger if it is to be used as a reference (such as a nautical chart). Figure 7.3: Scale bar designs from simplest (top) to more complex (bottom). Use the simpler design if it’s to be placed low in the visual hierarchy. Title and other text elements should be concise and to the point. If the map is to be embedded in a write-up such as a journal article, book or web page, title and text elements should be omitted in favor of figure captions and written description in the accompanying text. Following the aforementioned guidelines can go a long way in producing a good map. Here, a divergent color scheme is chosen whereby the two hues converge to the median income value. A coordinate system that minimizes distance error measurements and that preserves “north” orientation across the main map’s extent is chosen since a scale bar and north arrow are present in the map. The inset map (lower left map body) is placed lower in the visual hierarchy and could be omitted if the intended audience was familiar with the New England area. A unique (and unconventional) legend orders the color swatches in the order in which they appear in the map (i.e. following a strong north-south income gradient). Figure 7.4: Example of an improved map. 7.3 Typefaces and Fonts Maps may include text elements such as labels and ancillary text blocks. The choice of typeface (font family) and font (size, weight and style of a typeface) can impact the legibility of the map. A rule of thumb is to limit the number of fonts to two: a serif and a sans serif font. Figure 7.5: Serif fonts are characterized by brush strokes at the letter tips (circled in red in the figure). Sans Serif fonts are devoid of brush strokes. Serif fonts are generally used to label natural features such as mountain ridges and water body names. Sans serif fonts are usually used to label anthropogenic features such as roads, cities and countries. Varying the typeset size across the map should be avoided unless a visual hierarchy of labels is desired. You also may want to stick with a single font color across the map unless the differences in categories need to be emphasized. In the following example, a snapshot of a map before (left) and after (right) highlights how manipulating typeset colors and styles (i.e. italic, bold) can have a desirable effect if done properly. Figure 7.6: The lack of typeset differences makes it difficult to differentiate county names from lake/river names in the map on the left. The judicious use of font colors and style on the right facilitates the separation of features. Chapter 8 Spatial Operations and Vector Overlays 8.1 Selection by Attribute Features in a GIS layer can be selected graphically or by querying attribute values. For example, if a GIS layer represents land parcels, one could use the Area field to select all parcels having an area greater than 2.0 acres. Set algebra is used to define conditions that are to be satisfied while Boolean algebra is used to combine a set of conditions.
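Outside of a desktop GIS, the same logic can be expressed directly on a layer’s attribute table. The sketch below uses R’s sf package with a small made-up cities layer (the POP and FIPS_CNTRY field names mirror those used later in this chapter); it is only meant to illustrate how a set algebra condition and a Boolean combination translate into code, and the syntax is discussed in detail in the subsections that follow.

```r
library(sf)

# Toy stand-in for a cities point layer (made-up values, for illustration only)
cities <- st_sf(
  NAME       = c("A", "B", "C"),
  POP        = c(75000, 12000, 250000),
  FIPS_CNTRY = c("US", "CA", "US"),
  geometry   = st_sfc(st_point(c(-70, 44)), st_point(c(-66, 45)),
                      st_point(c(-71, 42)), crs = 4326)
)

# Set algebra condition (POP > 50000) combined with a Boolean AND
cities[cities$POP > 50000 & cities$FIPS_CNTRY == "US", ]
```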
8.1.1 Set Algebra Set algebra consists of four basic operators: less than (<), greater than (>), equal to (=), and not equal to (<>). In some programming environments (such as R and Python), the equality condition is presented as two equal signs, ==, and not one. In such an environment x = 3 is interpreted as “pass the value 3 to x” and x == 3 is interpreted as “is x equal to 3?”. If you have a GIS layer of major cities and you want to identify all cities having a population count greater than 50000, you would write the expression as "POP" > 50000 (of course, this assumes that the attribute field name for population count is POP). Figure 8.1: An example of the Select Layer by Attributes tool in ArcGIS Pro where the pull-down menu is used to define the selection. Figure 8.2: An example of the Select Layer by Attributes tool in ArcGIS Pro where the SQL syntax is used to define the selection. Figure 8.3: Selected cities meeting the criterion are shown in cyan color in ArcGIS Pro. The result of this operation is a selected subset of the Cities point layer. Note that in most GIS applications the selection process does not create a new layer. 8.1.2 Boolean Algebra You can combine conditions from set algebra operations using the following Boolean algebra operators: or (at least one condition must be met), and (both conditions must be met), not (condition must not be met). Following up with the last example, let’s now select cities having a population greater than 50000 and that are in the US (and not Canada or Mexico). Assuming that the country field is labeled FIPS_CNTRY we could set up the expression as "POP" > 50000 AND "FIPS_CNTRY" = 'US'. Note that a value need not be numeric. In this example we are asking that an attribute value equal a specific string value (i.e. that it equal the string 'US'). Figure 8.4: Selected cities meeting the POP > 50000 AND FIPS_CNTRY = 'US' criteria are shown in cyan color. 8.2 Selection by location We can also select features from one GIS layer based on their spatial association with another GIS layer. This type of spatial association can be broken down into four categories: adjacency (whether features from one layer share a boundary with features of another), containment (whether features from one layer are inside features of another), intersection (whether features of one layer intersect features of another), and distance (whether one feature is within a certain distance from another). Continuing with our working example, we might be interested in Cities that are within 100 miles of earthquakes. The earthquake points are from another GIS layer called Earthquakes. Figure 8.5: An example of a Select Layer by Location tool in ArcGIS Pro. The spatial association chosen is distance. Figure 8.6: Selected cities meeting the criterion are shown in cyan color. 8.3 Vector Overlay The concept of vector overlay is not new and goes back many decades–even before GIS became ubiquitous. It was once referred to as sieve mapping by land use planners who combined different layers–each mapped onto separate transparencies–to isolate or eliminate areas that did or did not meet a set of criteria. Map overlay refers to a group of procedures and techniques used in combining information from different data layers. This is an important capability of most GIS environments. Map overlays involve at least two input layers and result in at least one new output layer. A basic set of overlay tools includes clipping, intersecting and unioning.
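As a complement to the ArcGIS workflow described in the subsections that follow, here is a minimal sketch of the clip and intersect overlays using R’s sf package and two toy polygons. Note that a GIS-style overlay union (all pieces from both layers, with attributes) can be assembled from st_intersection() and st_difference(); sf’s st_union() by itself only dissolves geometries.

```r
library(sf)

# Two toy polygon layers standing in for the counties and circle layers used below
counties <- st_sf(cnty = "A",
  geometry = st_sfc(st_polygon(list(rbind(c(0, 0), c(4, 0), c(4, 4), c(0, 4), c(0, 0))))))
clip_ft  <- st_sf(id = 1,
  geometry = st_sfc(st_polygon(list(rbind(c(2, 2), c(6, 2), c(6, 6), c(2, 6), c(2, 2))))))

# Clip: only the input layer's geometry and attributes are carried through
clipped <- st_intersection(counties, st_geometry(clip_ft))

# Intersect: the output inherits attributes from both input layers
intersected <- st_intersection(counties, clip_ft)
```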
8.3.1 Clip Clipping takes one GIS layer (the clip feature) and another GIS layer (the to-be-clipped input feature). The output is a clipped version of the original input layer. The output attributes table is a subset of the original attributes table where only records for the clipped polygons are preserved. Figure 8.7: The Maine counties polygon layer is clipped to the circle polygon. Note that the output layer is limited to the county polygon geometry and its attributes (and does not include the clipping circle polygon). 8.3.2 Intersect Intersecting takes both layers as inputs then outputs the features from both layers that share the same spatial extent. Note that the output attribute table inherits attributes from both input layers (this differs from clipping where attributes from just one layer are carried through). Figure 8.8: The Maine counties polygon layer is intersected with the circle polygon. The output layer combines both intersecting geometries and attributes. 8.3.3 Union Unioning overlays both input layers and outputs all features from the two layers. Features that overlap are intersected creating new polygons. This overlay usually produces more polygons than are present in both input layers combined. The output attributes table contains attribute values from both input features (note that only a subset of the output attributes table is shown in the following figure). Figure 8.9: The Maine counties polygon layer is unioned with the circle polygon. The output layer combines both (complete) geometries and attributes. Where spatial overlaps do not occur, most software will either assign a NULL value or a 0. Chapter 9 Coordinate Systems Implicit with any GIS data is a spatial reference system. It can consist of a simple arbitrary reference system such as a 10 m x 10 m sampling grid in a wood lot or the boundaries of a soccer field, or it can consist of a geographic reference system, i.e. one where the spatial features are mapped to an earth based reference system. The focus of this topic is on earth reference systems which can be based on a Geographic Coordinate System (GCS) or a Projected Coordinate System (PCS). 9.1 Geographic Coordinate Systems A geographic coordinate system is a reference system for identifying locations on the curved surface of the earth. Locations on the earth’s surface are measured in angular units from the center of the earth relative to two planes: the plane defined by the equator and the plane defined by the prime meridian (which crosses Greenwich, England). A location is therefore defined by two values: a latitudinal value and a longitudinal value. Figure 9.1: Examples of latitudinal lines are shown on the left and examples of longitudinal lines are shown on the right. The 0° reference lines for each are shown in red (equator for latitudinal measurements and prime meridian for longitudinal measurements). A latitude measures the angle from the equatorial plane to the location on the earth’s surface. A longitude measures the angle between the prime meridian plane and the north-south plane that intersects the location of interest. For example, Colby College is located at around 45.56° North and 69.66° West. In a GIS, the North-South and East-West directions are encoded as signs. North and East are assigned a positive (+) sign and South and West are assigned a negative (-) sign.
Colby College’s location is therefore encoded as +45.56° and -69.66°. Figure 9.2: A slice of earth showing the latitude and longitude measurements. A GCS is defined by an ellipsoid, geoid and datum. These elements are presented next. 9.1.1 Sphere and Ellipsoid Assuming that the earth is a perfect sphere greatly simplifies mathematical calculations and works well for small-scale maps (maps that show a large area of the earth). However, when working at larger scales, an ellipsoid representation of earth may be desired if accurate measurements are needed. An ellipsoid is defined by two radii: the semi-major axis (the equatorial radius) and the semi-minor axis (the polar radius). The reason the earth has a slightly ellipsoidal shape has to do with its rotation, which induces a centrifugal effect along the equator. This results in an equatorial axis that is roughly 21 km longer than the polar axis. Figure 9.3: The earth can be mathematically modeled as a simple sphere (left) or an ellipsoid (right). Our estimate of these radii is quite precise thanks to satellite and computational capabilities. The semi-major axis is 6,378,137 meters and the semi-minor axis is 6,356,752 meters. Differences in distance measurements along the surfaces of an ellipsoid vs. a sphere are small but measurable (the difference can be as high as 20 km) as illustrated in the following lattice plots. Figure 9.4: Differences in distance measurements between the surface of a sphere and an ellipsoid. Each graphic plots the differences in distance measurements made from a single point location along the 0° meridian identified by the green colored box (latitude value) to various latitudinal locations along a longitude (whose value is listed in the bisque colored box). For example, the second plot from the top-left corner shows the differences in distance measurements made from a location at 90° north (along the prime meridian) to a range of latitudinal locations along the 45° meridian. 9.1.2 Geoid Representing the earth’s true shape, the geoid, as a mathematical model is crucial for a GIS environment. However, the earth’s shape is not a perfectly smooth surface. It has undulations resulting from changes in gravitational pull across its surface. These undulations may not be visible with the naked eye, but they are measurable and can influence locational measurements. Note that we are not including mountains and ocean bottoms in our discussion; instead, we are focusing solely on the earth’s gravitational potential which can be best visualized by imagining the earth’s surface completely immersed in water and measuring the distance from the earth’s center to the water surface over the entire earth surface. Figure 9.5: Earth’s EGM 2008 geoid. The undulations depicted in the graphics are exaggerated x4000. The earth’s gravitational field is dynamic and is tied to the flow of the earth’s hot and fluid core. Hence its geoid is constantly changing, albeit at a large temporal scale. The measurement and representation of the earth’s shape is at the heart of geodesy–a branch of applied mathematics. 9.1.3 Datum So how are we to reconcile our need to work with a (simple) mathematical model of the earth’s shape with the undulating nature of the earth’s surface (i.e. its geoid)? The solution is to align the geoid with the ellipsoid (or sphere) representation of the earth and to map the earth’s surface features onto this ellipsoid/sphere.
The alignment can be local where the ellipsoid surface is closely fit to the geoid at a particular location on the earth’s surface (such as the state of Kansas) or geocentric where the ellipsoid is aligned with the center of the earth. How one chooses to align the ellipsoid to the geoid defines a datum. Figure 9.6: Alignment of a geoid with a spheroid or ellipsoid helps define a datum. 9.1.3.1 Local Datum Figure 9.7: A local datum couples a geoid with the ellipsoid at a location on each element’s surface. There are many local datums to choose from; some are old while others are more recently defined. The choice of datum is largely driven by the location of interest. For example, when working in the US, a popular local datum to choose from is the North American Datum of 1927 (or NAD27 for short). NAD27 works well for the US but it’s not well suited for other parts of the world. For example, a far better local datum for Europe is the European Datum of 1950 (ED50 for short). Examples of common local datums are shown in the following table: Local datum Acronym Best for Comment North American Datum of 1927 NAD27 Continental US This is an old datum but still prevalent because of the wide use of older maps. European Datum of 1950 ED50 Western Europe Developed after World War II and still quite popular today. Not used in the UK. World Geodetic System 1972 WGS72 Global Developed by the Department of Defense. 9.1.3.2 Geocentric Datum Figure 9.8: A geocentric datum couples a geoid with the ellipsoid at each element’s center of mass. Many of the modern datums use a geocentric alignment. These include the popular World Geodetic System 1984 (WGS84) and the North American Datum of 1983 (NAD83). Most of the popular geocentric datums use the WGS84 ellipsoid or the GRS80 ellipsoid. These two ellipsoids share nearly identical semi-major and semi-minor axes: 6,378,137 meters and 6,356,752 meters respectively. Examples of popular geocentric datums are shown in the following table: Geocentric datum Acronym Best for Comment North American Datum of 1983 NAD83 Continental US This is one of the most popular modern datums for the contiguous US. European Terrestrial Reference System 1989 ETRS89 Western Europe This is the most popular modern datum for much of Europe. World Geodetic System 1984 WGS84 Global Developed by the Department of Defense. 9.1.4 Building the Geographic Coordinate System A Geographic Coordinate System (GCS) is defined by the ellipsoid model and by the way this ellipsoid is aligned with the geoid (thus defining the datum). It is important to know which GCS is associated with a GIS file or a map document reference system. This is particularly true when the overlapping layers are tied to different datums (and therefore GCS’). This is because a location on the earth’s surface can take on different coordinate values. For example, a location recorded in an NAD 1927 GCS having a coordinate pair of 44.56698° north and 69.65939° west will register a coordinate value of 44.56704° north and 69.65888° west in a NAD83 GCS and a coordinate value of 44.37465° north and 69.65888° west in a sphere based WGS84 GCS. If the coordinate systems for these point coordinate values were not properly defined, then they could be misplaced on a map. This is analogous to recording temperature using different units of measure (degrees Celsius, Fahrenheit and Kelvin)–each unit of measure will produce a different numeric value.
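The coordinate shift described above can be reproduced in R with the sf package; a minimal sketch is shown below. The exact values returned depend on the datum transformation (grid) that the underlying PROJ library selects.

```r
library(sf)

# The Colby-area location expressed in NAD27 (EPSG:4267), then transformed to NAD83 (EPSG:4269)
p_nad27 <- st_sfc(st_point(c(-69.65939, 44.56698)), crs = 4267)
p_nad83 <- st_transform(p_nad27, 4269)
st_coordinates(p_nad83)  # note the small longitude/latitude shift relative to the NAD27 values
```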
Figure 9.9: Map of the Colby flagpole in two different geographic coordinate systems (GCS NAD 1983 on the left and GCS NAD 1927 on the right). Note the offset in the 44.5639° line of latitude relative to the flagpole. Also note the 0.0005° longitudinal offset between both reference systems. 9.2 Projected Coordinate Systems The surface of the earth is curved but maps are flat. A projected coordinate system (PCS) is a reference system for identifying locations and measuring features on a flat (map) surface. It consists of lines that intersect at right angles, forming a grid. Projected coordinate systems (which are based on Cartesian coordinates) have an origin, an x axis, a y axis, and a linear unit of measure. Going from a GCS to a PCS requires mathematical transformations. The myriad of projection types can be aggregated into three groups: planar, cylindrical and conical. 9.2.1 Planar Projections A planar projection (aka Azimuthal projection) maps the earth surface features to a flat surface that touches the earth’s surface at a point (tangent case), or along a line of tangency (a secant case). This projection is often used in mapping polar regions but can be used for any location on the earth’s surface (in which case they are called oblique planar projections). Figure 9.10: Examples of three planar projections: orthographic (left), gnomonic (center) and equidistant (right). Each covers a different spatial range (with the latter covering both northern and southern hemispheres) and each preserves a unique set of spatial properties. 9.2.2 Cylindrical Projection A cylindrical map projection maps the earth surface onto a map rolled into a cylinder (which can then be flattened into a plane). The cylinder can touch the surface of the earth along a single line of tangency (a tangent case), or along two lines of tangency (a secant case). The cylinder can be tangent to the equator or it can be oblique. A special case is the transverse aspect, which is tangent to lines of longitude. This is a popular projection used in defining the Universal Transverse Mercator (UTM) and State Plane coordinate systems. The UTM PCS covers the entire globe and is a popular coordinate system in the US. It’s important to note that the UTM PCS is broken down into zones that are 6° wide, which limits the extent covered by each zone. For example, the State of Maine (USA) uses the UTM coordinate system (Zone 19 North) for most of its statewide GIS maps. Most USGS quad maps are also presented in a UTM coordinate system. Popular datums tied to the UTM coordinate system in the US include NAD27 and NAD83. There is also a WGS84 based UTM coordinate system. Distortion is minimized along the tangent or secant lines and increases as the distance from these lines increases. Figure 9.11: Examples of two cylindrical projections: Mercator (preserves shape but distorts area and distance) and equal-area (preserves area but distorts shape). 9.2.3 Conical Projection A conical map projection maps the earth surface onto a map rolled into a cone. Like the cylindrical projection, the cone can touch the surface of the earth along a single line of tangency (a tangent case), or along two lines of tangency (a secant case). Distortion is minimized along the tangent or secant lines and increases as the distance from these lines increases. When distance or area measurements are needed for the contiguous 48 states, use one of the conical projections such as Equidistant Conic (distance preserving) or Albers Equal Area Conic (area preserving).
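As a quick illustration of this advice, the sketch below (using R’s sf package) re-projects a toy one-degree cell to EPSG:5070 (NAD83 / Conus Albers, an Albers Equal Area Conic PCS) before measuring its area; the cell coordinates are arbitrary and used only for demonstration.

```r
library(sf)

# A toy 1-degree cell over the central US, defined in a geographic CRS (WGS84)
cell <- st_sfc(st_polygon(list(rbind(c(-99, 38), c(-98, 38), c(-98, 39),
                                     c(-99, 39), c(-99, 38)))), crs = 4326)

# Re-project to an equal-area conic PCS before measuring area
cell_aea <- st_transform(cell, 5070)  # EPSG:5070 = NAD83 / Conus Albers
st_area(cell_aea)                     # area in square meters, measured on the conic projection
```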
Conical projections are also popular PCS’ in European maps such as Europe Albers Equal Area Conic and Europe Lambert Conformal Conic. Figure 9.12: Examples of three conical projections: Albers equal area (preserves area), equidistant (preserves distance) and conformal (preserves shape). 9.3 Spatial Properties All projections distort real-world geographic features to some degree. The four spatial properties that are subject to distortion are: shape, area, distance and direction. A map that preserves shape is called conformal; one that preserves area is called equal-area; one that preserves distance is called equidistant; and one that preserves direction is called azimuthal. For most GIS applications (e.g. ArcGIS and QGIS), many of the built-in projections are named after the spatial properties they preserve. Each map projection is good at preserving only one or two of the four spatial properties. So when working with small-scale (large area) maps and when multiple spatial properties are to be preserved, it is best to break the analyses across different projections to minimize errors associated with spatial distortion. If you want to assess a projection’s spatial distortion across your study region, you can generate Tissot indicatrix (TI) ellipses. The idea is to project a small circle (i.e. small enough so that the distortion remains relatively uniform across the circle’s extent) and to measure its distorted shape on the projected map. For example, in assessing the type of distortion one could expect with a Mollweide projection across the continental US, a grid of circles could be generated at regular latitudinal and longitudinal intervals. Note the varying levels of distortion type and magnitude across the region. Let’s explore a Tissot circle at 44.5°N and 69.5°W (near Waterville, Maine): The plot shows a perfect circle (displayed in a filled bisque color) that one would expect to see if no distortion were at play. The blue distorted ellipse (the indicatrix) is the transformed circle for this particular projection and location. The green and red lines show the magnitude and direction of the ellipse’s major and minor axes respectively. These lines can also be used to assess scale distortion (note that scale distortion can vary as a function of bearing). The green line shows maximum scale distortion and the red line shows minimum scale distortion–these are sometimes referred to as the principal directions. In this working example, the principal directions are 1.1293 and 0.8856. A scale value of 1 indicates no distortion. A value less than 1 indicates a smaller-than-true scale and a value greater than 1 indicates a greater-than-true scale. Projections can distort scale, but this does not necessarily mean that area is distorted. In fact, for this particular projection, area is relatively well preserved despite distortion in principal directions. Area distortion can easily be computed by taking the product of the two aforementioned principal directions. In this working example, area distortion is 1.0001 (i.e. negligible). The north-south dashed line in the graphic shows the orientation of the meridian. The east-west dotted line shows the orientation of the parallel. It’s important to recall that these distortions occur at the point where the TI is centered and not necessarily across the region covered by the TI circle.
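The scale and area distortion figures quoted above are related by simple arithmetic, as the short computation below shows (the two scale factors are taken directly from the Mollweide example in the text).

```r
# Principal scale factors reported for the Tissot indicatrix at 44.5N, 69.5W (Mollweide example)
a <- 1.1293  # maximum scale distortion (major axis)
b <- 0.8856  # minimum scale distortion (minor axis)
a * b        # area distortion: ~1.0001, i.e. area is nearly preserved despite scale distortion
```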
9.4 Geodesic geometries The reason projected coordinate systems introduce errors in their geometric measurements has to do with the nature of the projection whereby the distance between two points on a sphere or ellipsoid will be difficult to replicate on a projected coordinate system unless these points are relatively close to one another. In most cases, such errors can be tolerated if the expected level of precision is met; many other sources of error in the spatial representation of the features can often outweigh any measurement errors made in a projected coordinate system. However, if the scale of analysis is small (i.e. the spatial extent covers a large proportion of the earth’s surface such as the North American continent), then the measurement errors associated with a projected coordinate system may no longer be acceptable. A way to circumvent projected coordinate system limitations is to adopt a geodesic solution. A geodesic distance is the shortest distance between two points on an ellipsoid (or spheroid). Likewise, a geodesic area measurement is one that is measured on an ellipsoid. Such measurements are independent of the underlying projected coordinate system. The Tissot circles presented in figures from the last section were all generated using geodesic geometry. If you are not convinced of the benefits afforded by geodesic geometry, compare the distances measured between two points located on either side of the Atlantic in the following map. The blue solid line represents the shortest distance between the two points on a planar coordinate system. The red dashed line represents the shortest distance between the two points as measured on a spheroid. At first glance, the geodesic distance may seem nonsensical given its curved appearance on the projected map. However, this curvature is a byproduct of the current reference system’s increasing distance distortion as one progresses poleward. If you are still not convinced, you can display the geodesic and planar distance layers on a 3D globe (or a projection that mimics the view of the 3D earth as viewed from space, centered on the mid-point of the geodesic line segment). So if a geodesic measurement is more precise than a planar measurement, why not perform all spatial operations using geodesic geometry? In many cases, a geodesic approach to spatial operations can be perfectly acceptable and is even encouraged. The downside is in its computational requirements. It’s far more computationally efficient to compute area/distance on a plane than it is on a spheroid. This is because geodesic calculations have no simple algebraic solutions and involve approximations that may require iterative solutions. So this may be a computationally taxing approach if processing millions of line segments. Note that not all geodesic measurement implementations are equal. Some more efficient algorithms that minimize computation time may reduce precision in the process. Some of ArcGIS’s functions offer the option to compute geodesic distances and areas. The data analysis environment R has several packages that will compute geodesic measurements including geosphere (which implements well defined geodesic measurement algorithms adopted from the authoritative GeographicLib library), lwgeom, and an implementation of Google’s spherical measurement library called s2.
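The transatlantic comparison can be reproduced with a few lines of R; the sketch below uses geosphere for the geodesic measurement and sf for a planar measurement on Web Mercator (EPSG:3857). The two coordinate pairs are approximate locations near Boston and Paris, chosen only for illustration.

```r
library(sf)
library(geosphere)

# Approximate (lon, lat) coordinates near Boston and Paris
p1 <- c(-71.06, 42.36)
p2 <- c(2.35, 48.86)

# Geodesic distance on the WGS84 ellipsoid (meters)
distGeo(p1, p2)

# Planar distance measured on a projected CRS (Web Mercator) for comparison
pts <- st_sfc(st_point(p1), st_point(p2), crs = 4326)
st_distance(st_transform(pts, 3857))[1, 2]
```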
"],["chp10_0.html", "Chapter 10 Map Algebra 10.1 Local operations and functions 10.2 Focal operations and functions 10.3 Zonal operations and functions 10.4 Global operations and functions 10.5 Operators and functions", " Chapter 10 Map Algebra Dana Tomlin (Tomlin 1990) is credited with defining a framework for the analysis of field data stored as gridded values (i.e. rasters). He coined this framework map algebra. Though gridded data can be stored in a vector format, map algebra is usually performed on raster data. Map algebra operations and functions are broken down into four groups: local, focal, zonal and global. Each is explored in the following sections. 10.1 Local operations and functions Local operations and functions are applied to each individual cell and only involve those cells sharing the same location. For example, if we start off with an original raster, then multiply it by 2 then add 1, we get a new raster whose cell values reflect the series of operations performed on the original raster cells. This is an example of a unary operation where just one single raster is involved in the operation. Figure 10.1: Example of a local operation where output=(2 * raster + 1). More than one raster can be involved in a local operation. For example, two rasters can be summed (i.e. each overlapping pixels are summed) to generate a new raster. Figure 10.2: Example of a local operation where output=(raster1+raster2). Note how each cell output only involves input raster cells that share the same exact location. Local operations also include reclassification of values. This is where a range of values from the input raster are assigned a new (common) value. For example, we might want to reclassify the input raster values as follows: Original values Reclassified values 0-25 25 26-50 50 51-75 75 76-100 100 Figure 10.3: Example of a local operation where the output results from the reclassification of input values. 10.2 Focal operations and functions Focal operations are also referred to as neighborhood operations. Focal operations assign to the output cells some summary value (such as the mean) of the neighboring cells from the input raster. For example, a cell output value can be the average of all 9 neighboring input cells (including the center cell); this acts as a smoothing function. Figure 10.4: Example of a focal operation where the output cell values take on the average value of neighboring cells from the input raster. Focal cells surrounded by non-existent cells are assigned an NA in this example. Notice how, in the above example, the edge cells from the output raster have been assigned a value of NA (No Data). This is because cells outside of the extent have no value. Some GIS applications will ignore the missing surrounding values and just compute the average of the available cells as demonstrated in the next example. Figure 10.5: Example of a focal operation where the output cell values take on the average value of neighboring cells from the input raster. Surrounding non-existent cells are ignored. Focal (or neighbor) operations require that a window region (a kernel) be defined. In the above examples, a simple 3 by 3 kernel (or window) was used in the focal operations. The kernel can take on different dimensions and shape such as a 3 by 3 square where the central pixel is ignored (thus reducing the number of neighbors to 8) or a circular neighbor defined by a radius. 
Figure 10.6: Example of a focal operation where the kernel is defined by a 3 by 3 cell without the center cell and whose output cell takes on the average value of those neighboring cells. In addition to defining the neighborhood shape and dimension, a kernel also defines the weight each neighboring cell contributes to the summary statistic. For example, all cells in a 3 by 3 neighborhood could each contribute 1/9th of their value to the summarized value (i.e. equal weight). But the weight can take on a more complex form defined by a function; such weights are defined by a kernel function. One popular function is a Gaussian weighted function which assigns greater weight to nearby cells than those further away. Figure 10.7: Example of a focal operation where the kernel is defined by a Gaussian function whereby the closest cells are assigned a greater weight. 10.3 Zonal operations and functions A zonal operation computes a new summary value (such as the mean) from cells aggregated for some zonal unit. In the following example, the cell values from the raster layer are aggregated into three zones whose boundaries are delineated in red. Each output zone shows the average value of the cells within that zone. Figure 10.8: Example of a zonal operation where the cell values are averaged for each of the three zones delineated in red. This technique is often used with rasters derived from remotely sensed imagery (e.g. NDVI) where areal units (such as counties or states) are used to compute the average cell values from the raster. 10.4 Global operations and functions Global operations and functions may make use of some or all input cells when computing an output cell value. An example of a global function is the Euclidean Distance tool which computes the shortest distance between a pixel and a source (or destination) location. In the following example, a new raster assigns to each cell a distance value to the closest cell having a value of 1 (there are just two such cells in the input raster). Figure 10.9: Example of a global function: the Euclidean distance. Each pixel is assigned its closest distance to one of the two source locations (defined in the input layer). Global operations and functions can also generate single value outputs such as the overall pixel mean or standard deviation. Another popular use of global functions is in the mapping of least-cost paths where a cost surface raster is used to identify the shortest path between two locations which minimizes cost (in time or money). 10.5 Operators and functions Operations and functions applied to gridded data can be broken down into three groups: mathematical, logical comparison and Boolean. 10.5.1 Mathematical operators and functions Two mathematical operators have already been demonstrated in earlier sections: the multiplication and addition operators. Other operators include division and the modulo (aka the modulus) which is the remainder of a division. Mathematical functions can also be applied to gridded data manipulation. Examples are square root and sine functions. The following table showcases a few examples with ArcGIS and R syntax. Operation ArcGIS Syntax R Syntax Example Addition + + R1 + R2 Subtraction - - R1 - R2 Division / / R1 / R2 Modulo Mod() %% Mod(R1, 100), R1 %% 10 Square root SquareRoot() sqrt() SquareRoot(R1), sqrt(R1) 10.5.2 Logical comparison The logical comparison operators evaluate a condition then output a value of 1 if the condition is true and 0 if the condition is false.
Logical comparison operators consist of greater than, less than, equal and not equal. Logical comparison Syntax Greater than > Less than < Equal == Not equal != For example, the following figure shows the output of the comparison between two rasters where we are assessing if cells in R1 are greater than those in R2 (on a cell-by-cell basis). Figure 10.10: Output of the operation R1 > R2. A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. When assessing whether two cells are equal, some programming environments such as R and ArcMap’s Raster Calculator require the use of the double equality syntax, ==, as in R1 == R2. In these programming environments, the single equality syntax is usually interpreted as an assignment operator so R1 = R2 would instruct the computer to assign the cell values in R2 to R1 (which is not what we want to do here). Some applications make use of special functions to test a condition. For example, ArcMap has a function called Con(condition, out1, out2) which assigns the value out1 if the condition is met and a value of out2 if it’s not. For example, ArcMap’s raster calculator expression Con( R1 > R2, 1, 0) outputs a value of 1 if R1 is greater than R2 and 0 if not. It generates the same output as the one shown in the above figure. Note that in most programming environments (including ArcMap), the expression R1 > R2 produces the same output because the value 1 is the numeric representation of TRUE and 0 that of FALSE. 10.5.3 Boolean (or Logical) operators In map algebra, Boolean operators are used to compare conditional states of a cell (i.e. TRUE or FALSE). The three Boolean operators are AND, OR and NOT. Boolean ArcGIS R Example AND & & R1 & R2 OR | | R1 | R2 NOT ~ ! ~R2, !R2 A “TRUE” state is usually encoded as a 1 or any non-zero integer while a “FALSE” state is usually encoded as a 0. For example, if cell1=0 and cell2=1, the Boolean operation cell1 AND cell2 results in a FALSE (or 0) output cell value. This Boolean operation can be translated into plain English as “are the cells 1 and 2 both TRUE?” to which we answer “No they are not” (cell1 is FALSE). The OR operator can be interpreted as “is x or y TRUE?” so that cell1 OR cell2 would return TRUE. The NOT operator can be interpreted as “is x not TRUE?” so that NOT cell1 would return TRUE. Figure 10.11: Output of the operation R1 AND R2. A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. Note that many programming environments treat any non-zero values as TRUE so that -3 AND -4 will return TRUE. Figure 10.12: Output of the operation NOT R2. A value of 1 in the output raster indicates that the input cell is NOT TRUE (i.e. has a value of 0). 10.5.4 Combining operations Both comparison and Boolean operations can be combined into a single expression. For example, we may wish to find locations (cells) that satisfy requirements from two different raster layers: e.g. 0<R1<4 AND R2>0. To satisfy the first requirement, we can write out the expression as (R1>0) & (R1<4). Both comparisons (delimited by parentheses) return a 0 (FALSE) or a 1 (TRUE). The ampersand, &, is a Boolean operator that checks that both conditions are met and returns a 1 if yes or a 0 if not. This expression is then combined with another comparison using another ampersand operator that assesses the criterion R2>0. The amalgamated expression is thus ((R1>0) & (R1<4)) & (R2>0).
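In R’s terra package the amalgamated expression can be written almost verbatim; the sketch below uses two small random rasters purely for illustration.

```r
library(terra)

set.seed(42)
R1 <- rast(nrows = 4, ncols = 4, vals = sample(0:5, 16, replace = TRUE))
R2 <- rast(nrows = 4, ncols = 4, vals = sample(-1:1, 16, replace = TRUE))

# 1 where 0 < R1 < 4 AND R2 > 0, 0 elsewhere
out <- (R1 > 0) & (R1 < 4) & (R2 > 0)
```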
Figure 10.13: Output of the operation ((R1>0) & (R1<4)) & (R2>0). A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. Note that most software environments assign the ampersand character, &, to the AND Boolean operator. References Chapter 11 Point Pattern Analysis 11.1 Centrography A very basic form of point pattern analysis involves summary statistics such as the mean center, standard distance and standard deviational ellipse. These point pattern analysis techniques were popular before computers were ubiquitous since hand calculations are not too involved, but these summary statistics are too concise and hide much of the valuable information about the observed pattern. More powerful analysis methods can be used to explore point patterns. These methods can be classified into two groups: density based approaches and distance based approaches. 11.2 Density based analysis Density based techniques characterize the pattern in terms of its distribution vis-a-vis the study area–a first-order property of the pattern. A first order property of a pattern concerns itself with the variation of the observations’ density across a study area. For example, the distribution of oaks will vary across a landscape based on underlying soil characteristics (resulting in some areas having dense clusters of oaks and others not). In these lecture notes, we’ll make a distinction between the intensity of a spatial process and the observed density of a pattern under study. A point pattern can be thought of as a “realization” of an underlying process whose intensity \\(\\lambda\\) is estimated from the observed point pattern’s density (which is sometimes denoted as \\(\\widehat{\\lambda}\\) where the caret \\(\\verb!^!\\) refers to the fact that the observed density is an estimate of the underlying process’ intensity). Density measurements can be broken down into two categories: global and local. 11.2.1 Global density A basic measure of a pattern’s density \\(\\widehat{\\lambda}\\) is its overall, or global, density. This is simply the ratio of the observed number of points, \\(n\\), to the study region’s surface area, \\(a\\), or: \\(\\begin{equation} \\widehat{\\lambda} = \\frac{n}{a} \\label{eq:global-density} \\end{equation}\\) Figure 11.1: An example of a point pattern where n = 31 and the study area (defined by a square boundary) is 100 square units. The point density is thus 31/100 = 0.31 points per unit area. 11.2.2 Local density A point pattern’s density can be measured at different locations within the study area. Such an approach helps us assess if the density–and, by extension, the underlying process’ local (modeled) intensity \\(\\widehat{\\lambda}_i\\)–is constant across the study area. This can be an important property of the data since it may need to be accounted for when using the distance based analysis tools covered later in this chapter. Several techniques for measuring local density are available; here we will focus on two such methods: quadrat density and kernel density. 11.2.2.1 Quadrat density This technique requires that the study area be divided into sub-regions (aka quadrats). Then, the point density is computed for each quadrat by dividing the number of points in each quadrat by the quadrat’s area.
Quadrats can take on many different shapes such as hexagons and triangles; here we use square shaped quadrats to demonstrate the procedure. Figure 11.2: An example of a quadrat count where the study area is divided into four equally sized quadrats whose area is 25 square units each. The density in each quadrat can be computed by dividing the number of points in each quadrat by that quadrat’s area. The choice of quadrat number and quadrat shape can influence the measure of local density and must be made with care. If very small quadrat sizes are used, you risk having many quadrats with no points, which may prove uninformative. If very large quadrat sizes are used, you risk missing subtle changes in spatial density distributions such as the east-west gradient in density values in the above example. Quadrat regions do not have to take on a uniform pattern across the study area; they can also be defined based on a covariate. For example, if it’s believed that the underlying point pattern process is driven by elevation, quadrats can be defined by sub-regions such as different ranges of elevation values (labeled 1 through 4 on the right-hand plot in the following example). This can result in quadrats having non-uniform shape and area. Converting a continuous field into discretized areas is sometimes referred to as tessellation. The end product is a tessellated surface. Figure 11.3: Example of a covariate. Figure on the left shows the elevation map. Figure on the right shows elevation broken down into four sub-regions (a tessellated surface) for which local density values will be computed. If the local intensity changes across the tessellated covariate, then there is evidence of a dependence between the process that generated the point pattern and the covariate. In our example, sub-regions 1 through 4 have surface areas of 23.54, 25.2, 25.21 and 26.06 square map units, respectively. To compute these regions’ point densities, we simply divide the number of points by the respective area values. Figure 11.4: Figure on the left displays the number of points in each elevation sub-region (sub-regions are coded as values ranging from 1 to 4). Figure on the right shows the density of points (number of points divided by the area of the sub-region). We can plot the relationship between point density and elevation regions to help assess any dependence between the variables. Figure 11.5: Plot of point density vs elevation regions. It’s important to note that how one chooses to tessellate a surface can have an influence on the resulting density distribution. For example, dividing the elevation into seven sub-regions produces the following density values. Figure 11.6: Same analysis as last figure using different sub-regions. Note the difference in density distribution. While the high density in the western part of the study area remains, the density values to the east are no longer consistent across the remaining regions. The quadrat analysis approach has its advantages in that it is easy to compute and interpret; however, it does suffer from the modifiable areal unit problem (MAUP), as highlighted in the last two examples. Another density based approach that will be explored next (and that is less susceptible to the MAUP) is the kernel density.
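Before moving on to the kernel approach, here is a minimal sketch of how the quadrat densities above might be computed in R with the spatstat package. All object names are hypothetical stand-ins: pts is a simulated point pattern and elev stands for an elevation image (im object) like the one shown in the figures.

library(spatstat)

# Hypothetical point pattern in a 10 x 10 unit window
pts <- runifpoint(31, win = owin(c(0, 10), c(0, 10)))

# Global density: number of points divided by the study area
intensity(pts)

# Quadrat counts and densities for a 2 x 2 grid of square quadrats
q <- quadratcount(pts, nx = 2, ny = 2)
intensity(q)     # point density in each quadrat

# Quadrats defined from a covariate: cut a (hypothetical) elevation
# image 'elev' into 4 classes, then tally points per class
# zones <- tess(image = cut(elev, 4))
# quadratcount(pts, tess = zones)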
11.2.2.2 Kernel density The kernel density approach is an extension of the quadrat method: like the quadrat density, the kernel approach computes a localized density for subsets of the study area, but unlike its quadrat density counterpart, the sub-regions overlap one another, providing a moving sub-region window. This moving window is defined by a kernel. The kernel density approach generates a grid of density values whose cell size is smaller than that of the kernel window. Each cell is assigned the density value computed for the kernel window centered on that cell. A kernel not only defines the shape and size of the window, but it can also weight the points following a well defined kernel function. The simplest function is a basic kernel where each point in the kernel window is assigned equal weight. Figure 11.7: An example of a basic 3x3 kernel density map (ArcGIS calls this a point density map) where each point is assigned an equal weight. For example, the cell centered at location x = 1.5 and y = 7.5 has one point within a 3x3 unit (pixel) region and thus has a local density of 1/9 = 0.11. Some of the most popular kernel functions assign weights to points that are inversely proportional to their distances to the kernel window center. A few such kernel functions follow a Gaussian or quartic-like distribution function. These functions tend to produce a smoother density map. Figure 11.8: An example of a kernel function is the 3x3 quartic kernel function where each point in the kernel window is weighted based on its proximity to the kernel’s center cell (typically, closer points are weighted more heavily). Kernel functions, like the quartic, tend to generate smoother surfaces. 11.2.2.3 Kernel Density Adjusted for Covariate In the previous section, we learned that we could use a covariate, like elevation, to define the sub-regions (quadrats) within which densities were computed. Here, instead of dividing the study region into discrete sub-regions (as was done with quadrat analysis), we create an intensity function that is dependent on the underlying covariate. This function, which we’ll denote as \\(\\rho\\), can be estimated in one of three different ways: the ratio, re-weight and transform methods. We will not delve into the differences between these methods, but note that there is more than one way to estimate \\(\\rho\\) in the presence of a covariate. In the following example, the elevation raster is used as the covariate in the \\(\\rho\\) function using the ratio method. The right-most plot maps the modeled intensity as a function of elevation. Figure 11.9: An estimate of \\(\\rho\\) using the ratio method. The figure on the left shows the point distribution superimposed on the elevation layer. The middle figure plots the estimated \\(\\rho\\) as a function of elevation. The envelope shows the 95% confidence interval. The figure on the right shows the modeled density of \\(\\widehat{\\lambda}\\) which is a function of the elevation raster (i.e. \\(\\widehat{\\lambda}=\\widehat{\\rho}_{elevation}\\)). We can compare the modeled intensity function to the kernel density function of the observed point pattern via a scatter plot. A red one-to-one diagonal line is added to the plot. While an increase in predicted intensity is accompanied by increasing observed intensity, the relationship is not linear.
This can be explained by the small area covered by these high elevation locations, which results in fewer observation opportunities and thus higher uncertainty for that corner of the study extent. This uncertainty is very apparent in the \\(\\rho\\) vs. elevation plot where the 95% confidence interval envelope widens at higher elevation values (indicating the greater uncertainty in our estimated \\(\\rho\\) value at those higher elevation values). 11.2.3 Modeling intensity as a function of a covariate So far, we have learned techniques that describe the distribution of points across a region of interest. But it is often more interesting to model the relationship between the distribution of points and some underlying covariate by defining that relationship mathematically. This can be done by exploring the changes in point density as a function of a covariate; however, unlike the techniques explored thus far, this approach makes use of a statistical model. One such model is a Poisson point process model which can take on the form of: \\[ \\begin{equation} \\lambda(i) = e^{\\alpha + \\beta Z(i)} \\label{eq:density-covariate} \\end{equation} \\] where \\(\\lambda(i)\\) is the modeled intensity at location \\(i\\), \\(e^{\\alpha}\\) (the exponential of \\(\\alpha\\)) is the base intensity when the covariate is zero and \\(e^{\\beta}\\) is the multiplier by which the intensity increases (or decreases) for each 1 unit increase in the covariate \\(Z(i)\\). This model is closely related to the logistic regression model popular in the field of statistics. This equation implies that the intensity of the process that led to the observed point pattern is a loglinear function of the underlying covariate (i.e. the process’ intensity increases or decreases exponentially as a function of the covariate). Note that taking the log of both sides of the equation yields the more familiar linear regression model where \\(\\alpha + \\beta Z(i)\\) is the linear predictor. Note: The left-hand side of a logistic regression model is often presented as the probability, \\(P\\), of occurrence and is related to \\(\\lambda\\) as \\(\\lambda=P/(1-P)\\), which is the odds of occurrence. Solving for \\(P\\) gives us \\(P = \\lambda/(1 + \\lambda)\\), which yields the following equation: \\[ P(i) = \\frac{e^{\\alpha + \\beta Z(i)}}{1 + e^{\\alpha + \\beta Z(i)}} \\] Let’s work with the point distribution of Starbucks cafes in the state of Massachusetts. The point pattern clearly exhibits a non-random distribution. It might be helpful to compare this distribution to some underlying covariate such as the population density distribution. Figure 11.10: Location of Starbucks relative to population density. Note that the classification scheme follows a log scale to more easily differentiate population density values. We can fit a Poisson point process model to these data where the modeled intensity takes on the form: \\[ \\begin{equation} Starbucks\\ density(i) = e^{\\alpha + \\beta\\ population(i)} \\label{eq:walmart-model} \\end{equation} \\] The parameters \\(\\alpha\\) and \\(\\beta\\) are estimated using a method called maximum likelihood. Its implementation is not covered here but is widely covered in many statistics textbooks. The index \\((i)\\) serves as a reminder that the point density and the population distribution both can vary as a function of location \\(i\\). The estimated value for \\(\\alpha\\) in our example is -18.966.
This is interpreted as stating that given a population density of zero, the base intensity of the point process is \\(e^{-18.966}\\) or 5.79657e-09 cafes per square meter (the units are derived from the points’ reference system)–a number close to zero (as one would expect). The estimated value for \\(\\beta\\) is 0.00017. This is interpreted as stating that for every unit increase in the population density derived from the raster, the intensity of the point process increases by a factor of \\(e^{0.00017}\\) or 1.00017. If we are to plot the relationship between density and population, we get: Figure 11.11: Poisson point process model fitted to the relationship between Starbucks store locations and population density. The model assumes a loglinear relationship. Note that the density is reported in number of stores per map unit area (the map units are in meters). 11.3 Distance based analysis Alternatives to the density based methods explored thus far are the distance based methods for pattern analysis, whereby the interest lies in how the points are distributed relative to one another (a second-order property of the point pattern) as opposed to how the points are distributed relative to the study extent. A second order property of a pattern concerns itself with the observations’ influence on one another. For example, the distribution of oaks will be influenced by the location of parent trees–where parent oaks are present we would expect dense clusters of oaks to emerge. Three distance based approaches are covered next: the average nearest neighbor (ANN), the K and L functions, and the pair correlation function. 11.3.1 Average Nearest Neighbor An average nearest neighbor (ANN) analysis measures the average distance from each point in the study area to its nearest point. In the following example, the average nearest neighbor distance for all points is 1.52 units. Figure 11.12: Distance between each point and its closest point. For example, the point closest to point 1 is point 9, which is 2.32 map units away. An extension of this idea is to plot the ANN values for different order neighbors, that is, for the first closest point, then the second closest point, and so forth. Figure 11.13: ANN values for different neighbor order numbers. For example, the ANN for the first closest neighbor is 1.52 units; the ANN for the 2nd closest neighbor is 2.14 map units; and so forth. The shape of the ANN curve as a function of neighbor order can provide insight into the spatial arrangement of points relative to one another. In the following example, three different point patterns of 20 points are presented. Figure 11.14: Three different point patterns: a single cluster, a dual cluster and a randomly scattered pattern. Each point pattern produces a different ANN vs. neighbor order plot. Figure 11.15: Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern (randomly scattered). The bottom line (black dotted line) indicates that the cluster (left plot) is tight and that the distances between a point and all other points are very short. This is in stark contrast with the top line (red dotted line), which indicates that the distances between points are much greater. Note that the way we describe these patterns is heavily influenced by the size and shape of the study region.
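Before turning to the role of the study region, here is a quick sketch of how the ANN values described above might be computed in R with spatstat’s nndist() function; pts is a made-up stand-in for one of the 20-point patterns.

library(spatstat)

pts <- runifpoint(20, win = owin(c(0, 10), c(0, 10)))   # hypothetical pattern

# ANN for the first nearest neighbor
mean(nndist(pts, k = 1))

# ANN for neighbor orders 1 through 10 (one mean distance per order)
ann <- apply(nndist(pts, k = 1:10), 2, mean)
plot(1:10, ann, type = "b", xlab = "Neighbor order", ylab = "ANN")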
If the region was defined as the smallest rectangle encompassing the cluster of points, the points would no longer look clustered. Figure 11.16: The same point pattern presented with two different study areas. How differently would you describe the point pattern in both cases? An important assumption that underlies our interpretation of the ANN results is that of stationarity of the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). If the point pattern is not stationary, then it will be difficult to assess if the results from the ANN analysis are due to interactions between the points or due to changes in some underlying factor that varies as a function of location. Correcting for lack of stationarity when performing hypothesis tests is described in the next chapter. 11.3.2 K and L functions 11.3.2.1 K function The average nearest neighbor (ANN) statistic is one of many distance based point pattern analysis statistics. Another statistic is the K-function, which summarizes the distances between points across a range of distance lags. The calculation of K is fairly simple: for each distance lag, we count the number of points found within that distance of each point, average these counts across all points, then divide the average by the overall event density. For example, for point \\(S1\\) we draw circles, each of varying radius \\(d\\), centered on that point. We then count the number of points (events) inside each circle. We repeat this for point \\(S2\\) and all other points \\(Si\\). Next, we compute the average number of points in each circle, then divide that number by the overall point density \\(\\hat{\\lambda}\\) (i.e. the total number of events divided by the study area).

Distance band (km)   # events from S1   # events from S2   # events from Si   K
10                   0                  1                  …                  0.012
20                   3                  5                  …                  0.067
30                   9                  14                 …                  0.153
40                   17                 17                 …                  0.269
50                   25                 23                 …                  0.419

We can then plot K and compare that plot to the plot we would expect to get if an IRP/CSR process was at play (\\(K_{expected}\\)). Figure 11.17: The K-function calculated from the Walmart stores point distribution in MA (shown in black) compared to \\(K_{expected}\\) under the IRP/CSR assumption (shown in red). \\(K\\) values greater than \\(K_{expected}\\) indicate clustering of points at a given distance band; K values less than \\(K_{expected}\\) indicate dispersion of points at a given distance band. In our example, the stores appear to be more clustered than expected at distances greater than 12 km. Note that like the ANN analysis, the \\(K\\)-function assumes stationarity in the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). 11.3.2.2 L function One problem with the \\(K\\) function is that the shape of the function tends to curve upward, making it difficult to see small differences between \\(K\\) and \\(K_{expected}\\). A workaround is to transform the values in such a way that the expected values, \\(K_{expected}\\), lie on a horizontal line. The transformation is calculated as follows: \\[ \\begin{equation} L=\\sqrt{\\dfrac{K(d)}{\\pi}}-d \\label{eq:L-function} \\end{equation} \\] The \\(\\hat{K}\\) computed earlier is transformed to the following plot (note how the \\(K_{expected}\\) red line is now perfectly horizontal): Figure 11.18: L-function (a simple transformation of the K-function). This graph makes it easier to compare \\(K\\) with \\(K_{expected}\\) at lower distance values. Values greater than \\(0\\) indicate clustering, while values less than \\(0\\) indicate dispersion.
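A minimal sketch of how \\(\\hat{K}\\) and \\(\\hat{L}\\) might be estimated in R with spatstat; walmart_ppp is a hypothetical ppp object standing in for the store locations.

library(spatstat)

# Hypothetical stand-in for the store locations
walmart_ppp <- runifpoint(44, win = owin(c(0, 200000), c(0, 100000)))

# Empirical K-function along with the theoretical (CSR) curve
K <- Kest(walmart_ppp)
plot(K)

# L-function: Lest() returns sqrt(K/pi); subtracting r flattens the
# expected curve to a horizontal line, as in the figure
L <- Lest(walmart_ppp)
plot(L, . - r ~ r)     # plot L(r) - r against r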
It appears that Walmart locations are more dispersed than expected under CSR/IRP up to a distance of 12 km but more clustered at distances greater than 12 km. 11.3.3 The Pair Correlation Function \\(g\\) A shortcoming of the \\(K\\) function (and by extension the \\(L\\) function) is its cumulative nature, which makes it difficult to know at exactly which distances a point pattern may stray from \\(K_{expected}\\) since all points up to distance \\(r\\) can contribute to \\(K(r)\\). The pair correlation function, \\(g\\), is a modified version of the \\(K\\) function where, instead of counting all points within a distance \\(r\\), only points falling within a narrow distance band (annulus) centered on \\(r\\) are counted. Figure 11.19: Difference in how the \\(K\\) and \\(g\\) functions aggregate points at distance \\(r\\) (\\(r\\) = 30 km in this example). All points up to \\(r\\) contribute to \\(K\\) whereas just the points in the annulus band at \\(r\\) contribute to \\(g\\). The plot of the \\(g\\) function follows. Figure 11.20: \\(g\\)-function of the Massachusetts Walmart point data. Its interpretation is similar to that of the \\(K\\) and \\(L\\) functions. Here, we observe distances between stores greater than expected under CSR up to about 5 km. Note that this cutoff is less than the 12 km cutoff observed with the \\(K\\)/\\(L\\) functions. If \\(g(r)\\) = 1, then the inter-point distances (at and around distance \\(r\\)) are consistent with CSR. If \\(g(r)\\) > 1, then the points are more clustered than expected under CSR. If \\(g(r)\\) < 1, then the points are more dispersed than expected under CSR. Note that \\(g\\) can never be less than 0. Like its \\(K\\) and ANN counterparts, the \\(g\\)-function assumes stationarity in the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). 11.4 First and second order effects The concept of 1st order effects and 2nd order effects is an important one. It underlies the basic principles of spatial analysis. Figure 11.21: Tree distribution can be influenced by 1st order effects such as elevation gradient or spatial distribution of soil characteristics; this, in turn, changes the tree density distribution across the study area. Tree distribution can also be influenced by 2nd order effects such as seed dispersal processes where the process is independent of location and, instead, dependent on the presence of other trees. Density based measurements such as kernel density estimations look at the 1st order property of the underlying process. Distance based measurements such as ANN and K-functions focus on the 2nd order property of the underlying process. It’s important to note that it is seldom feasible to separate out the two effects when analyzing point patterns, thus the importance of relying on a priori knowledge of the phenomena being investigated before drawing any conclusions from the analysis results. Chapter 12 Hypothesis testing 12.1 IRP/CSR Figure 12.1: Could the distribution of Walmart stores in MA have been the result of a CSR/IRP process? Popular spatial analysis techniques compare observed point patterns to ones generated by an independent random process (IRP), also called complete spatial randomness (CSR).
CSR/IRP satisfies two conditions: any event has an equal probability of occurring in any location (a 1st order effect), and the location of one event is independent of the location of another event (a 2nd order effect). In the next section, you will learn how to test for complete spatial randomness. In later sections, you will also learn how to test for other non-CSR processes. 12.2 Testing for CSR with the ANN tool 12.2.1 ArcGIS’ Average Nearest Neighbor Tool ArcMap offers a tool (ANN) that tests whether or not the observed first order average nearest neighbor distance is consistent with a distribution of points one would expect to observe if the underlying process was completely random (i.e. IRP). But as we will learn very shortly, ArcMap’s ANN tool has its limitations. 12.2.1.1 A first attempt Figure 12.2: ArcGIS’ ANN tool. The size of the study area is not defined in this example. ArcGIS’ average nearest neighbor (ANN) tool computes the 1st nearest neighbor mean distance for all points. It also computes an expected mean distance (ANNexpected) under the assumption that the process that led to the observed pattern is completely random. ArcGIS’ ANN tool offers the option to specify the study surface area. If the area is not explicitly defined, ArcGIS will assume that the study area is the smallest area encompassing the points. ArcGIS’ ANN analysis outputs the nearest neighbor ratio computed as: \\[ ANN_{ratio}=\\dfrac{ANN}{ANN_{expected}} \\] Figure 12.3: ANN results indicating that the pattern is consistent with a random process. Note the size of the study area which defaults to the point layer extent. If ANNratio is 1, the pattern is consistent with a random process. If it’s greater than 1, the pattern is dispersed; if it’s less than 1, it’s clustered. In essence, ArcGIS is comparing the observed ANN value to the ANNexpected one would compute if a complete spatial randomness (CSR) process was at play. ArcGIS’ tool also generates a p-value (telling us how confident we should be that our observed ANN value is consistent with a perfectly random process) along with a bell shaped curve in the output graphics window. The curve serves as an infographic that tells us if our point distribution is from a random process (CSR), or is more clustered/dispersed than one would expect under CSR. For example, if we were to run the Massachusetts Walmart point location layer through ArcGIS’ ANN tool, an ANNexpected value of 12,249 m would be computed along with an ANNratio of 1.085. The software would also indicate that the observed distribution is consistent with a CSR process (p-value of 0.28). But is it prudent to let the software define the study area for us? How does it know that the area we are interested in is the state of Massachusetts, since this layer is not part of any input parameters? 12.2.1.2 A second attempt Figure 12.4: ArcGIS’ ANN tool. The size of the study area is defined in this example. Here, we explicitly tell ArcGIS that the study area (Massachusetts) covers 21,089,917,382 m² (note that this is the MA shapefile’s surface area and not necessarily representative of MA’s actual surface area). ArcGIS’ ANN tool now returns a different output with a completely different conclusion. This time, the analysis suggests that the points are strongly dispersed across the state of Massachusetts and the very small p-value (p = 0.006) tells us that there is only about a 0.6% chance that a CSR process could have generated our observed point pattern. (Note that the p-value displayed by ArcMap is for a two-sided test).
Figure 12.5: ArcGIS’ ANN tool output. Note the different output result with the study area size defined. The output indicates that the points are more dispersed than expected under IRP. So how does ArcGIS estimate the ANNexpected value under CSR? It does so by taking the inverse of the square root of the number of points divided by the area, and multiplying this quotient by 0.5. \\[ ANN_{Expected}=\\dfrac{0.5}{\\sqrt{n/A}} \\] In other words, the expected ANN value under a CSR process is solely dependent on the number of points and the study extent’s surface area. Do you see a problem here? Could different shapes encompassing the same point pattern have the same surface area? If so, shouldn’t the shape of our study area be a parameter in our ANN analysis? Unfortunately, ArcGIS’ ANN tool cannot take into account the shape of the study area. An alternative workflow is outlined in the next section. 12.2.2 A better approach: a Monte Carlo test The Monte Carlo technique involves three steps: First, we postulate a process–our null hypothesis, \\(Ho\\). For example, we hypothesize that the distribution of Walmart stores is consistent with a completely random process (CSR). Next, we simulate many realizations of our postulated process and compute a statistic (e.g. ANN) for each realization. Finally, we compare our observed data to the patterns generated by our simulated processes and assess (via a measure of probability) if our pattern is a likely realization of the hypothesized process. Following our working example, we randomly re-position our Walmart points 1000 times (or as many times as is computationally practical) following a completely random process–our hypothesized process, \\(Ho\\)–while making sure to keep the points confined to the study extent (the state of Massachusetts). Figure 12.6: Three different outcomes from simulated patterns following a CSR point process. These maps help answer the question: how would Walmart stores be distributed if their locations were not influenced by the location of other stores or by any local factors (such as population density, population income, road locations, etc.)? For each realization of our process, we compute an ANN value. Each simulated pattern results in a different ANN value. We plot all of these simulated ANN values using a histogram (this is our \\(Ho\\) sample distribution), then compare our observed ANN value of 13,294 m to this distribution. Figure 12.7: Histogram of simulated ANN values (from 1000 simulations). This is the sample distribution of the null hypothesis, ANNexpected (under CSR). The red line shows our observed (Walmart) ANN value. About 32% of the simulated values are greater (more extreme) than our observed ANN value. Note that by using the same study region (the state of Massachusetts) in the simulations we take care of problems like study area boundary and shape issues, since each simulated point pattern is confined to the exact same study area every time. 12.2.2.1 Extracting a \\(p\\)-value from a Monte Carlo test The p-value can be computed from a Monte Carlo test. The procedure is quite simple: it consists of counting the number of simulated test statistic values more extreme than the one observed.
If we are interested in knowing the probability of having simulated values more extreme than ours, we identify the side of the distribution of simulated values closest to our observed statistic, count the number of simulated values more extreme than the observed statistic, then compute \\(p\\) as follows: \\[ \\dfrac{N_{extreme}+1}{N+1} \\] where Nextreme is the number of simulated values more extreme than our observed statistic and N is the total number of simulations. Note that this is for a one-sided test. A practical and more generalized form of the equation looks like this: \\[ \\dfrac{min(N_{greater}+1 , N + 1 - N_{greater})}{N+1} \\] where \\(min(N_{greater}+1 , N + 1 - N_{greater})\\) is the smaller of the two values \\(N_{greater}+1\\) and \\(N + 1 - N_{greater}\\), and \\(N_{greater}\\) is the number of simulated values greater than the observed value. It’s best to implement this form of the equation in a scripting program, thus avoiding the need to visually seek the side of the distribution closest to our observed statistic. For example, if we ran 1000 simulations in our ANN analysis and found that 319 of those were more extreme (on the right side of the simulated ANN distribution) than our observed ANN value, our p-value would be (319 + 1) / (1000 + 1) or p = 0.32. This is interpreted as “there is a 32% probability that we would be wrong in rejecting the null hypothesis Ho.” This suggests that we would be remiss in rejecting the null hypothesis that a CSR process could have generated our observed Walmart point distribution. But this is not to say that the Walmart stores were in fact placed across the state of Massachusetts randomly (it’s doubtful that Walmart executives make such an important decision purely by chance); all we are saying is that a CSR process could have been one of many processes that generated the observed point pattern. If a two-sided test is desired, then the equation for the \\(p\\) value takes on the following form: \\[ 2 \\times \\dfrac{min(N_{greater}+1 , N + 1 - N_{greater})}{N+1} \\] where we are simply multiplying the one-sided p-value by two. 12.3 Alternatives to CSR/IRP Figure 12.8: Walmart store distribution shown on top of a population density layer. Could population density distribution explain the distribution of Walmart stores? The assumption of CSR is a good starting point, but it’s often unrealistic. Most real-world processes exhibit 1st and/or 2nd order effects. We therefore may need to account for a non-stationary underlying process. We can simulate the placement of Walmart stores using the population density layer as the basis of an inhomogeneous point process. We can test this hypothesis by generating random points that follow the population density distribution. Figure 12.9: Examples of two randomly generated point patterns using population density as the underlying process. Note that even though we are not referring to a CSR/IRP point process, we are still treating this as a random point process since the points are randomly located following the underlying population density distribution. Using the same Monte Carlo (MC) techniques used with IRP/CSR processes, we can simulate thousands of point patterns (following the population density) and compare our observed ANN value to those computed from our MC simulations. Figure 12.10: Histogram showing the distribution of ANN values one would expect to get if population density distribution were to influence the placement of Walmart stores.
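A sketch of how such a Monte Carlo simulation might be scripted in R with spatstat. All object names here are hypothetical stand-ins: walmart_ppp plays the role of the observed pattern and pop_im the role of the population density surface (an im object) used as the probability surface for point placement.

library(spatstat)

# Hypothetical stand-ins for the observed pattern and the covariate surface
W           <- owin(c(0, 100), c(0, 100))
pop_im      <- as.im(function(x, y) { x + y }, W = W)   # mock density surface
walmart_ppp <- rpoint(44, f = pop_im)

n       <- npoints(walmart_ppp)
ann_obs <- mean(nndist(walmart_ppp))

# Generate point patterns whose density follows pop_im, computing an ANN
# value for each realization
nsim    <- 999
ann_sim <- numeric(nsim)
for (i in 1:nsim) {
  sim        <- rpoint(n, f = pop_im)   # random points, density ~ pop_im
  ann_sim[i] <- mean(nndist(sim))
}

# One-sided pseudo p-value following the generalized formula in the text
n_greater <- sum(ann_sim > ann_obs)
p <- (min(n_greater, nsim - n_greater) + 1) / (nsim + 1)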
In this example, our observed ANN value falls far to the right of our simulated ANN values, indicating that our points are more dispersed than would be expected had population density distribution been the sole driving process. The percentage of simulated values more extreme than our observed value is 0% (i.e. a p-value \\(\\backsimeq\\) 0.0). Another plausible hypothesis is that median household income could have been the sole factor in deciding where to place the Walmart stores. Figure 12.11: Walmart store distribution shown on top of a median income distribution layer. Running an MC simulation using median income distribution as the underlying density layer yields an ANN distribution where about 16% of the simulated values are more extreme than our observed ANN value (i.e. p-value = 0.16): Figure 12.12: Histogram showing the distribution of ANN values one would expect to get if income distribution were to influence the placement of Walmart stores. Note that we now have two competing hypotheses: a CSR/IRP process and a median income distribution process. Neither can be rejected. This serves as a reminder that a hypothesis test cannot tell us if a particular process is the process involved in the generation of our observed point pattern; instead, it tells us that the hypothesis is one of many plausible processes. It’s important to remember that the ANN tool is a distance based approach to point pattern analysis. Even though we are randomly generating points following some underlying probability distribution map, we are still concerning ourselves with the repulsive/attractive forces that might dictate the placement of Walmarts relative to one another–i.e. we are not addressing the question “can some underlying process explain the X and Y placement of the stores?” (addressed in section 12.5). Instead, we are controlling for the 1st order effect defined by the population density and income distributions. 12.4 Monte Carlo test with K and L functions MC techniques are not unique to average nearest neighbor analysis. In fact, they can be implemented with many other statistical measures, such as the K and L functions. However, unlike the ANN analysis, the K and L functions consist of multiple test statistics (one for each distance \\(r\\)). This results in not one but many simulated distributions (one for each distance \\(r\\)). Typically, these distributions are presented as envelopes superimposed on the estimated \\(K\\) or \\(L\\) functions. However, since we cannot easily display the full distribution at each \\(r\\) interval, we usually limit the envelope to a pre-defined acceptance interval. For example, if we choose a two-sided significance level of 0.05, then we eliminate the smallest and largest 2.5% of the simulated K values computed for each \\(r\\) interval (hence the reason you might sometimes see such envelopes referred to as pointwise envelopes). This tends to generate a saw-tooth like envelope. Figure 12.13: Simulation results for the IRP/CSR hypothesized process. The gray envelope in the plot covers the 95% significance level. If the observed L lies outside of this envelope at distance \\(r\\), then there is less than a 5% chance that our observed point pattern resulted from the simulated process at that distance.
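Such pointwise envelopes can be generated with spatstat’s envelope() function; a minimal sketch follows (walmart_ppp is again a made-up stand-in, and pop_im in the commented line is a hypothetical covariate image).

library(spatstat)

walmart_ppp <- runifpoint(44, win = owin(c(0, 100), c(0, 100)))   # stand-in

# 999 CSR simulations; nrank = 25 trims the 25 lowest and 25 highest
# simulated values at each distance r, i.e. roughly a 95% pointwise envelope
env <- envelope(walmart_ppp, fun = Lest, nsim = 999, nrank = 25)
plot(env, . - r ~ r)

# For an inhomogeneous null (e.g. population-driven placement), a simulate
# expression can be supplied instead of the default CSR simulations, e.g.:
# envelope(walmart_ppp, Lest, nsim = 999, nrank = 25,
#          simulate = expression(rpoint(npoints(walmart_ppp), f = pop_im)))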
The interpretation of these plots is straightforward: if \\(\\hat K\\) or \\(\\hat L\\) lies outside of the envelope at some distance \\(r\\), then this suggests that the point pattern may not be consistent with \\(H_o\\) (the hypothesized process) at distance \\(r\\) at the significance level defined for that envelope (0.05 in this example). One important assumption underlying the K and L functions is that the process is uniform across the region. If there is reason to believe this not to be the case, then the K function analysis needs to be controlled for inhomogeneity in the process. For example, we might hypothesize that population density dictates the density distribution of the Walmart stores across the region. We therefore run an MC test by randomly re-assigning Walmart point locations using the population distribution map as the underlying point density distribution (in other words, we expect the MC simulation to locate a greater proportion of the points where population density is the greatest). Figure 12.14: Simulation results for an inhomogeneous hypothesized process. When controlled for population density, the significance test suggests that the inter-distance of Walmarts is more dispersed than expected under the null up to a distance of 30 km. It may be tempting to scan across the plot looking for distances \\(r\\) for which the deviation from the null is significant for a given significance value, then report these findings as such. For example, given the results in the last figure, we might not be justified in stating that the patterns between \\(r\\) distances of 5 and 30 km are more dispersed than expected at the 5% significance level, but only at a higher significance level instead. This problem is referred to as the multiple comparison problem–details of which are not covered here. 12.5 Testing for a covariate effect The last two sections covered distance based approaches to point pattern analysis. In this section, we explore hypothesis testing on a density based approach to point pattern analysis: the Poisson point process model. Any Poisson point process model can be fit to an observed point pattern, but just because we can fit a model does not imply that the model does a good job in explaining the observed pattern. To test how well a model can explain the observed point pattern, we need to compare it to a base model (such as one where we assume that the points are randomly distributed across the study area–i.e. IRP). The latter is defined as the null hypothesis and the former is defined as the alternate hypothesis. For example, we may want to assess if the Poisson point process model that pits the placement of Walmarts as a function of population distribution (the alternate hypothesis) does a better job than the null model that assumes homogeneous intensity (i.e. a Walmart has no preference as to where it is to be placed). This requires that we first derive estimates for both models. A Poisson point process model (of the Walmart point pattern) implemented in a statistical software environment such as R produces the following output for the null model:

Stationary Poisson process
Fitted to point pattern dataset 'P'
Intensity: 2.1276e-09
             Estimate      S.E.   CI95.lo  CI95.hi Ztest      Zval
log(lambda) -19.96827 0.1507557 -20.26375 -19.6728   *** -132.4545

and the following output for the alternate model:

Nonstationary Poisson process
Fitted to point pattern dataset 'P'
Log intensity: ~pop
Fitted trend coefficients:
  (Intercept)           pop
-2.007063e+01  1.043115e-04

                 Estimate         S.E.       CI95.lo       CI95.hi Ztest
(Intercept) -2.007063e+01 1.611991e-01 -2.038657e+01 -1.975468e+01   ***
pop          1.043115e-04 3.851572e-05  2.882207e-05  1.798009e-04    **
                   Zval
(Intercept) -124.508332
pop            2.708284
Problem:
 Values of the covariate 'pop' were NA or undefined at 0.7% (4 out of 572) of the quadrature points

Thus, the null model (homogeneous intensity) takes on the form: \\[ \\lambda(i) = e^{-19.96} \\] and the alternate model takes on the form: \\[ \\lambda(i) = e^{-20.1 + 1.04\\times 10^{-4}\\ population} \\] The models are then compared using the likelihood ratio test, which produces the following output:

  Npar Df Deviance  Pr(>Chi)
     5 NA       NA        NA
     6  1 4.253072 0.0391794

The value under the heading Pr(>Chi) is the p-value, which gives us the probability we would be wrong in rejecting the null. Here p = 0.039 suggests that there is a 3.9% chance that we would be remiss to reject the base model in favor of the alternate model–put another way, the alternate model may be an improvement over the null. Chapter 13 Spatial Autocorrelation “The first law of geography: Everything is related to everything else, but near things are more related than distant things.” Waldo R. Tobler (Tobler 1970) Mapped events or entities can have non-spatial information attached to them (some GIS software call these attributes). When mapped, these values often exhibit some degree of spatial relatedness at some scale. This is what Tobler was getting at: the idea that values close to one another tend to be similar. In fact, you will be hard-pressed to find mapped features that do not exhibit some kind of non-random pattern. So how do we model spatial patterns? The approach taken will depend on how one chooses to characterize the underlying process–this can be either a spatial trend model or a spatial clustering/dispersion model. This chapter focuses on the latter. 13.1 Global Moran’s I Though our visual senses can, in some cases, discern clustered regions from non-clustered regions, the distinction may not always be so obvious. We must therefore come up with a quantitative and objective approach to quantifying the degree to which similar features cluster or disperse and where such clustering occurs. One popular measure of spatial autocorrelation is the Moran’s I coefficient. 13.1.1 Computing the Moran’s I Let’s start with a working example: 2020 median per capita income for the state of Maine. Figure 13.1: Map of 2020 median per capita income for Maine counties (USA). It may seem apparent that, when aggregated at the county level, the income distribution appears clustered, with high-income counties surrounded by high-income counties and low-income counties surrounded by low-income counties. But a qualitative description may not be sufficient; we might want to quantify the degree to which similar (or dissimilar) counties are clustered. One measure of this type of relationship is the Moran’s I statistic. The Moran’s I statistic is the correlation coefficient for the relationship between a variable (like income) and its neighboring values. But before we go about computing this correlation, we need to come up with a way to define a neighbor. One approach is to define a neighbor as being any contiguous polygon. For example, the northernmost county (Aroostook) has four contiguous neighbors while the southernmost county (York) has just two contiguous counties.
Other neighborhood definitions can include distance bands (e.g. counties within 100 km) and k nearest neighbors (e.g. the 2 closest neighbors). Note that distance bands and k nearest neighbors are usually measured using the polygons’ centroids and not their boundaries. Figure 13.2: Maps show the links between each polygon and their respective neighbor(s) based on the neighborhood definition. A contiguous neighbor is defined as one that shares a boundary or a vertex with the polygon of interest. Orange numbers indicate the number of neighbors for each polygon. Note that the topmost county has no neighbors when a neighborhood definition of a 100 km distance band is used (i.e. no centroids are within a 100 km search radius). Once we’ve defined a neighborhood for our analysis, we identify the neighbors for each polygon in our dataset, then summarize the values for each neighborhood cluster (by computing their mean values, for example). This summarized neighborhood value is sometimes referred to as a spatially lagged value (Xlag). In our working example, we adopt a contiguity neighborhood and compute the average neighboring income value (Incomelag) for each county in our dataset. We then plot Incomelag vs. Income for each county. The Moran’s I coefficient between Incomelag and Income is nothing more than the slope of the least squares regression line that best fits the points after having equalized the spread between both sets of data. Figure 13.3: Scatter plot of spatially lagged income (neighboring income) vs. each county’s income. If we equalize the spread between both axes (i.e. convert to a z-value) the slope of the regression line represents the Moran’s I statistic. If there is no degree of association between Income and Incomelag, the slope will be close to flat (resulting in a Moran’s I value near 0). In our working example, the slope is far from flat, with a Moran’s I value of 0.28. So this begs the question, how significant is this Moran’s I value (i.e. is the computed slope significantly different from 0)? There are two approaches to estimating the significance: an analytical solution and a Monte Carlo solution. The analytical solution makes some restrictive assumptions about the data and thus cannot always be relied upon. Another approach (and the one favored here) is a Monte Carlo test, which makes no assumptions about the dataset, including the shape and layout of each polygon. 13.1.2 Monte Carlo approach to estimating significance In a Monte Carlo test (a permutation test, to be exact), the attribute values are randomly assigned to polygons in the data set and, for each permutation of the attribute values, a Moran’s I value is computed. Figure 13.4: Results from 199 permutations. Plot shows Moran’s I slopes (in gray) computed from each random permutation of income values. The observed Moran’s I slope for the original dataset is shown in red. The output is a sampling distribution of Moran’s I values under the (null) hypothesis that attribute values are randomly distributed across the study area. We then compare our observed Moran’s I value to this sampling distribution. Figure 13.5: Histogram shows the distribution of Moran’s I values for all 199 permutations; red vertical line shows our observed Moran’s I value of 0.28. In our working example, 199 simulations indicate that our observed Moran’s I value of 0.28 is not a value we would expect to compute if the income values were randomly distributed across each county.
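A sketch of how this permutation test might be run in R with the sf and spdep packages. The polygon layer me below is a made-up stand-in (a small grid with a simulated Income attribute) for the Maine counties layer.

library(sf)
library(spdep)

# Hypothetical stand-in for the county layer: a 4 x 4 polygon grid with
# a simulated Income attribute
set.seed(123)
g  <- st_make_grid(st_as_sfc("POLYGON((0 0, 4 0, 4 4, 0 4, 0 0))"), n = c(4, 4))
me <- st_sf(Income = rnorm(16, mean = 50000, sd = 10000), geometry = g)

# Contiguity-based neighbors and row-standardized weights
nb <- poly2nb(me, queen = TRUE)
lw <- nb2listw(nb, style = "W")

# Spatially lagged income (each polygon's neighborhood mean)
inc_lag <- lag.listw(lw, me$Income)

# Moran's I with a permutation (Monte Carlo) test, 199 permutations
mc <- moran.mc(me$Income, lw, nsim = 199)
mc$statistic   # observed Moran's I
mc$p.value     # pseudo p-value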
A (pseudo) p-value can easily be computed from the simulation results: \\[ \\dfrac{N_{extreme}+1}{N+1} \\] where \\(N_{extreme}\\) is the number of simulated Moran’s I values more extreme than our observed statistic and \\(N\\) is the total number of simulations. Here, out of 199 simulations, just three simulated I values were more extreme than our observed statistic, \\(N_{extreme}\\) = 3, so \\(p\\) is equal to (3 + 1) / (199 + 1) = 0.02. This is interpreted as “there is a 2% probability that we would be wrong in rejecting the null hypothesis Ho.” Note that in this permutation example, we shuffled around the observed income values such that all values were present in each permutation outcome–this is sometimes referred to as a randomization option in a few software implementations of the Moran’s I hypothesis test. Note that here, randomization is not to be confused with the way the permutation technique “randomly” assigns values to features in the data layer. Alternatively, one can choose to randomly assign a set of values to each feature in a data layer from a theorized distribution (for example, a Normal distribution). This may result in a completely different set of values for each permutation outcome. Note that you would only adopt this approach if the theorized distribution underpinning the value of interest is known a priori. Another important consideration when computing a p-value from a permutation test is the number of simulations to perform. In the above example we ran 199 permutations; thus, the smallest p-value we could possibly come up with is 1 / (199 + 1), or a p-value of 0.005. You should therefore choose a number of permutations, \\(N\\), large enough to ensure a reliable level of significance. 13.2 Moran’s I at different lags So far we have looked at spatial autocorrelation where we define neighbors as all polygons sharing a boundary with the polygon of interest. We may also be interested in studying the ranges of autocorrelation values as a function of distance. The steps for this type of analysis are straightforward: Compute lag values for a defined set of neighbors. Calculate the Moran’s I value for this set of neighbors. Repeat steps 1 and 2 for a different set of neighbors (at a greater distance, for example). For example, the Moran’s I values for income distribution in the state of Maine at distances of 75, 125, and up to 325 km are presented in the following plot: Figure 13.6: Moran’s I at different spatial lags defined by a 50 km width annulus at 50 km distance increments. Red dots indicate Moran’s I values for which a p-value was 0.05 or less. The plot suggests that there is significant spatial autocorrelation between counties within 25 km of one another, but as the distance between counties increases, the autocorrelation shifts from being positive to being negative, meaning that at greater distances counties tend to be more dissimilar. 13.3 Local Moran’s I We can decompose the global Moran’s I into a localized measure of autocorrelation–i.e. a map of “hot spots” and “cold spots”. A local Moran’s I analysis is best suited for relatively large datasets, especially if a hypothesis test is to be implemented. We’ll therefore switch to another dataset: Massachusetts household income data. Applying a contiguity based definition of a neighbor, we get the following scatter plot of spatially lagged income vs. income. Figure 13.7: Grey vertical and horizontal lines define the mean values for both axes. Red points highlight counties with relatively high income values (i.e.
greater than the mean) surrounded by counties whose average income value is relatively high. Likewise, dark blue points highlight counties with relatively low income values surrounded by counties whose average income value is relatively low. You’ll note that the mean value for Income, highlighted as light grey vertical and horizontal lines in the above plot, carves up the plot into the low-low, high-low, high-high and low-high quadrants when starting from the bottom-left quadrant and working counterclockwise. Note that other measures of centrality, such as the median, could be used to delineate these quadrants. The values in the above scatter plot can be mapped to each polygon in the dataset as shown in the following figure. Figure 13.8: A map view of the low-low (blue), high-low (light-blue), high-high (red) and low-high (orange) counties. Each observation that contributes to the global Moran’s I can be assigned a localized version of that statistic, \\(I_i\\), where the subscript \\(i\\) references the individual geometric unit. The calculation of \\(I_i\\) is shown later in the chapter. At this point, we have identified the counties that are surrounded by similar values. However, we have yet to assess which polygons are “significantly” similar or dissimilar to their neighbors. As with the global Moran’s I, there is both an analytical and a Monte Carlo approach to computing the significance of \\(I_i\\). In the case of a Monte Carlo approach, one shuffles all values in the dataset except for the value, \\(y_i\\), of the geometric unit \\(i\\) whose \\(I_i\\) we are assessing for significance. For each permutation, we compare the value at \\(y_i\\) to the average value of its neighboring values. From the permutations, we generate a distribution of \\(I_i\\) values (for each \\(y_i\\) feature) we would expect to get if the values were randomly distributed across all features. We can use the following polygon in eastern Massachusetts as an example. Figure 13.9: Polygon whose significance value we are assessing in this example. Its local Moran’s I statistic is 0.85. A permutation test shuffles the income values around it, all the while keeping its value constant. An example of the outcome of a few permutations follows: Figure 13.10: Local Moran’s I outcomes of a few permutations of income values. You’ll note that even though the income value remains the same in the polygon of interest, its local Moran’s I statistic will change because of the changing income values in its surrounding polygons. If we perform many more permutations, we come up with a distribution of \\(I_i\\) values under the null that the income values are randomly distributed across the state of Massachusetts. The distribution of \\(I_i\\) for the above example is plotted using a histogram. Figure 13.11: Distribution of \\(I_i\\) values under the null hypothesis that income values are randomly distributed across the study extent. The red vertical line shows the observed \\(I_i\\) for comparison. About 9.3% of the simulated values are more extreme than our observed \\(I_i\\), giving us a pseudo p-value of 0.09. If we perform this permutation for all polygons in our dataset, we can map the pseudo p-values for each polygon. Note that here, we are mapping the probability that the observed \\(I_i\\) value is more extreme than expected (equivalent to a one-tailed test). Figure 13.12: Map of the pseudo p-values for each polygon’s \\(I_i\\) statistic.
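In R, the local Moran’s I values (and an associated test for each polygon) can be obtained with spdep; a minimal sketch, reusing the hypothetical me layer and lw weights from the earlier Moran’s I sketch:

library(spdep)

# me and lw are the (hypothetical) polygon layer and weights from before

# Local Moran's I: one Ii value per polygon, plus test statistics
li <- localmoran(me$Income, lw)
head(li)

# The "Ii" column holds the local statistic; attach it to the layer for mapping
me$Ii <- li[, "Ii"]

# A permutation-based alternative, if available in your spdep version:
# li_perm <- localmoran_perm(me$Income, lw, nsim = 999)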
One can use the computed p-values to filter the \\(I_i\\) values based on a desired level of significance. For example, the following scatter plot and map show the high/low “hotspots” for which a pseudo p-value of 0.05 or less was computed from the above simulation. Figure 13.13: Local Moran’s I values having a significance level of 0.05 or less. You’ll note that the levels of significance do not apply to just the high-high and low-low regions; they can apply to all combinations of highs and lows. Here’s another example where the \\(I_i\\) values are filtered based on a more stringent significance level of 0.01. Figure 13.14: Local Moran’s I values having a significance level of 0.01 or less. 13.4 Moran’s I equation explained The Moran’s I equation can take on many forms. One form of the equation can be presented as: \\[ I = \\frac{N}{\\sum\\limits_i (X_i-\\bar X)^2} \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}(X_i-\\bar X)(X_j-\\bar X)}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{1} \\] \\(N\\) is the total number of features in the dataset and \\(X\\) is the quantity of interest. Subscripts \\(i\\) and \\(j\\) reference two different features in the dataset, and \\(w_{ij}\\) is a weight that defines the relationship between features \\(i\\) and \\(j\\) (i.e. the weights determine whether feature \\(j\\) is a neighbor of \\(i\\) and how much weight feature \\(j\\) should be given when computing some overall neighboring \\(X\\) value). There are a few key components of this equation worth highlighting. First, you’ll note the standardization of both sets of values by the subtraction of each value in \\(X_i\\) or \\(X_j\\) by the mean of \\(X\\). This highlights the fact that we are seeking to compare the deviation of each value from an overall mean and not the values themselves. Second, you’ll note an inverted variance term on the left-hand side of equation (1)–this is a measure of spread. You might recall from an introductory statistics course that the variance can be computed as: \\[ s^2 = \\frac{\\sum\\limits_i (X_i-\\bar X)^2}{N}\\tag{2} \\] Note that a more common measure of variance, the sample variance, where one divides the above numerator by \\((n-1)\\), can also be adopted in the Moran’s I calculation. Equation (1) is thus dividing the large fraction on the right-hand side by the variance. This has the effect of limiting the range of possible Moran’s I values to between -1 and 1 (note that in some extreme cases, \\(I\\) can take on a value more extreme than [-1; 1]). We can re-write the Moran’s I equation by plugging in \\(s^2\\) as follows: \\[ I = \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}\\frac{(X_i-\\bar X)}{s}\\frac{(X_j-\\bar X)}{s}}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{3} \\] Note that here \\(s\\times s = s^2\\). You might recognize the numerator as a sum of the products of standardized z-values between neighboring features. If we let \\(z_i = \\frac{(X_i-\\bar X)}{s}\\) and \\(z_j = \\frac{(X_j-\\bar X)}{s}\\), the Moran’s I equation can be reduced to: \\[ I = \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}(z_i\\ z_j)}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{4} \\] Recall that we are comparing a variable \\(X\\) at \\(i\\) to all of its neighboring values at \\(j\\). More specifically, we are computing a summary value (such as the mean) of the neighboring values at \\(j\\) and multiplying that by \\(X_i\\).
So, if we let \\(y_i = \\sum\\limits_j w_{ij} z_j\\), the Moran’s I coefficient can be rewritten as: \\[ I = \\frac{\\sum\\limits_i z_i y_i}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{5} \\] So, \\(y_i\\) is a weighted summary of the neighboring z-values (their mean, when row-standardized weights are used), thus making the product \\(z_i y_i\\) akin to a correlation term between a value and its neighborhood. The product \\(z_iy_i\\) is a local measure of spatial autocorrelation, \\(I_i\\). If we don’t summarize across all locations \\(i\\), we get our local I statistic, \\(I_i\\): \\[ I_i = z_iy_i \\tag{6} \\] The global Moran’s I statistic, \\(I\\), is thus the sum of all \\(I_i\\) values divided by the sum of the weights: \\[ I = \\frac{\\sum\\limits_i I_i}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{7} \\] Let’s explore elements of the Moran’s I equation using the following sample dataset. Figure 13.15: Simulated spatial layer. The figure on the left shows each cell’s ID value. The figure in the middle shows the values for each cell. The figure on the right shows the standardized values, \\((X_i-\\bar X)/s\\). The first step in the computation of a Moran’s I index is the generation of weights. The weights can take on many different values. For example, one could assign a value of 1 to a neighboring cell, as shown in the following matrix (rows and columns are labeled with the cell IDs):

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1   0 1 0 0 1 1 0 0 0  0  0  0  0  0  0  0
2   1 0 1 0 1 1 1 0 0  0  0  0  0  0  0  0
3   0 1 0 1 0 1 1 1 0  0  0  0  0  0  0  0
4   0 0 1 0 0 0 1 1 0  0  0  0  0  0  0  0
5   1 1 0 0 0 1 0 0 1  1  0  0  0  0  0  0
6   1 1 1 0 1 0 1 0 1  1  1  0  0  0  0  0
7   0 1 1 1 0 1 0 1 0  1  1  1  0  0  0  0
8   0 0 1 1 0 0 1 0 0  0  1  1  0  0  0  0
9   0 0 0 0 1 1 0 0 0  1  0  0  1  1  0  0
10  0 0 0 0 1 1 1 0 1  0  1  0  1  1  1  0
11  0 0 0 0 0 1 1 1 0  1  0  1  0  1  1  1
12  0 0 0 0 0 0 1 1 0  0  1  0  0  0  1  1
13  0 0 0 0 0 0 0 0 1  1  0  0  0  1  0  0
14  0 0 0 0 0 0 0 0 1  1  1  0  1  0  1  0
15  0 0 0 0 0 0 0 0 0  1  1  1  0  1  0  1
16  0 0 0 0 0 0 0 0 0  0  1  1  0  0  1  0

For example, cell ID 1 (whose value is 25 and whose standardized value is 0.21) has for neighbors cells 2, 5 and 6. Computationally (working with the standardized values), this gives us a summarized neighboring value (aka lagged value), \\(y_1\\), of: \\[ \\begin{align*} y_1 = \\sum\\limits_j w_{1j} z_j {}={} & (0)(0.21)+(1)(1.17)+(0)(1.5)+ ... + \\\\ & (1)(0.69)+(1)(0.93)+(0)(-0.36)+...+ \\\\ & (0)(-0.76) = 2.79 \\end{align*} \\] Computing the spatially lagged values for the other 15 cells generates the following scatterplot: Figure 13.16: Moran’s I scatterplot using a binary weight. The red point is the (\\(z_1\\), \\(y_1\\)) pair computed for cell 1. You’ll note that the range of neighboring values along the \\(y\\)-axis is much greater than that of the original values on the \\(x\\)-axis. This is not necessarily an issue given that the Moran’s \\(I\\) correlation coefficient standardizes the values by recentering them on the overall mean \\((X - \\bar{X})/s\\). This is simply to re-emphasize that we are interested in how a neighboring value varies relative to a feature’s value, regardless of the scale of values in either batch. If there is a downside to adopting a binary weight, it’s the bias that the different number of neighbors can introduce in the calculation of the spatially lagged values. In other words, a feature with 5 neighbors (such as feature ID 12) will tend to have a larger lagged value than a feature with 3 neighbors (such as feature ID 1), even when there is no spatial autocorrelation in the dataset.
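The mechanics of these calculations are easy to reproduce in base R. A small sketch with made-up values (a 5-feature example, not the 16-cell layer shown in the figures):

# Hypothetical raw values for 5 features
X <- c(25, 43, 19, 33, 12)
z <- (X - mean(X)) / sd(X)          # standardized values (sample sd used here)

# Binary weights matrix: w_ij = 1 if features i and j are neighbors
W <- matrix(c(0, 1, 0, 1, 0,
              1, 0, 1, 1, 0,
              0, 1, 0, 0, 1,
              1, 1, 0, 0, 1,
              0, 0, 1, 1, 0), nrow = 5, byrow = TRUE)

# Spatially lagged value for each feature: y_i = sum_j w_ij * z_j
y <- as.vector(W %*% z)

# Local values I_i = z_i * y_i, and the global I from equation (5)
Ii       <- z * y
I_global <- sum(Ii) / sum(W)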
A more natural weight is one where the values are standardized across each row of the weights matrix such that the weights in each row sum to one. For example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 0 0.333 0 0 0.333 0.333 0 0 0 0 0 0 0 0 0 0 2 0.2 0 0.2 0 0.2 0.2 0.2 0 0 0 0 0 0 0 0 0 3 0 0.2 0 0.2 0 0.2 0.2 0.2 0 0 0 0 0 0 0 0 4 0 0 0.333 0 0 0 0.333 0.333 0 0 0 0 0 0 0 0 5 0.2 0.2 0 0 0 0.2 0 0 0.2 0.2 0 0 0 0 0 0 6 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0 0 0 0 0 7 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0 0 0 0 8 0 0 0.2 0.2 0 0 0.2 0 0 0 0.2 0.2 0 0 0 0 9 0 0 0 0 0.2 0.2 0 0 0 0.2 0 0 0.2 0.2 0 0 10 0 0 0 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0 11 0 0 0 0 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 12 0 0 0 0 0 0 0.2 0.2 0 0 0.2 0 0 0 0.2 0.2 13 0 0 0 0 0 0 0 0 0.333 0.333 0 0 0 0.333 0 0 14 0 0 0 0 0 0 0 0 0.2 0.2 0.2 0 0.2 0 0.2 0 15 0 0 0 0 0 0 0 0 0 0.2 0.2 0.2 0 0.2 0 0.2 16 0 0 0 0 0 0 0 0 0 0 0.333 0.333 0 0 0.333 0 The spatially lagged value for cell ID 1 is thus computed as: \\[ \\begin{align*} y_1 = \\sum\\limits_j w_{1j} z_j {}={} & (0)(0.21)+(0.333)(1.17)+(0)(1.5)+...+ \\\\ & (0.333)(0.69)+(0.333)(0.93)+(0)(-0.36)+...+ \\\\ & (0)(-0.76) = 0.93 \\end{align*} \\] Multiplying each neighbor by its standardized weight, then summing these values, is simply computing the mean of the neighboring values. Using the standardized weights generates the following scatter plot. The plot on the left shows the raw values on the x and y axes; the plot on the right shows the standardized values \\(z_i\\) and \\(y_i = \\sum\\limits_j w_{ij} z_j\\). You’ll note that the shape of the point cloud is the same in both plots, given that the left plot’s axes are simply rescaled versions of the standardized axes. Figure 13.17: Moran’s scatter plot with original values on the left and the same Moran’s I scatter plot on the right using the standardized values \\(z_i\\) and \\(y_i\\). Note the difference in the point cloud pattern in the above plot from the one generated using the binary weights. Other weights can be used, such as inverse distance and k-nearest neighbors, to name just a few. However, most software implementations of the Moran’s I statistic adopt row-standardized weights. 13.4.1 Local Moran’s I Once a spatial weight is chosen and both \\(z_i\\) and \\(y_i\\) are computed, we can compute the \\(z_iy_i\\) product for all locations \\(i\\), giving us the local Moran’s I statistic. Taking feature ID 1 in our example, we compute \\(I_1 = z_1 y_1 = 0.21 \\times 0.93 = 0.19\\). Computing \\(I_i\\) for all cells gives us the following plot. Figure 13.18: The left plot shows the Moran’s I scatter plot with the point colors symbolizing the \\(I_i\\) values. The figure on the right shows the matching \\(I_i\\) values mapped to each respective cell. Here, we are adopting a different color scheme from that used earlier. Green colors highlight features whose values are surrounded by similar values. These can be either positive values surrounded by standardized values that tend to be positive or negative values surrounded by values that tend to be negative. In both cases, the calculated \\(I_i\\) will be positive. Red colors highlight features whose values are surrounded by dissimilar values. These can be either negative values surrounded by values that tend to be positive or positive values surrounded by values that tend to be negative. In both cases, the calculated \\(I_i\\) will be negative.
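Continuing the sketch begun above (same assumed W and z objects), the row-standardized weights and the local Moran's I values can be obtained as follows; in practice, a function such as spdep::localmoran() wraps these steps (and the significance tests) for you.
Wr <- W / rowSums(W)       # row-standardize: each row now sums to one
y  <- as.vector(Wr %*% z)  # lagged value = mean of the neighboring z-values
Ii <- z * y                # local Moran's I value for each cell
Ii[1]                      # roughly 0.19 for cell 1, as in the example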
In our example, two features have a negative Moran’s I coefficient: cell IDs 7 and 12. 13.4.2 Global Moran’s I The global Moran’s I coefficient, \\(I\\), is nothing more than a summary of the local Moran’s I coefficients. Using row-standardized weights, \\(I\\) is the average of all \\(I_i\\) values: \\[ I = \\frac{0.19+0.7+1.15+0.68+0.18+0.15-0.24+0.44+0.25+0.12+0.14-0.29+1.18+1.39+0.71+0.39}{\\sum\\limits_i\\sum\\limits_j w_{ij}} = 0.446 \\] In this example, \\(\\sum\\limits_i\\sum\\limits_j w_{ij}\\) is the sum of all 256 values in the row-standardized weights matrix shown earlier which, with standardized weights, equals 16. \\(I\\) is also the slope of the line that best fits the Moran’s I scatter plot. This slope can be plotted using either the standardized values or the raw values. Figure 13.19: Moran’s scatter plot with the fitted Moran’s I slope (red line). The left plot uses the raw values \\((X_i,X_i(lag))\\) for its axes. The right plot uses the standardized values \\((z_i,y_i)\\) for its axes. References "],["spatial-interpolation.html", "Chapter 14 Spatial Interpolation 14.1 Deterministic Approach to Interpolation 14.2 Statistical Approach to Interpolation", " Chapter 14 Spatial Interpolation Given a distribution of point meteorological stations showing precipitation values, how can I estimate the precipitation values where data were not observed? Figure 14.1: Average yearly precipitation (reported in inches) for several meteorological sites in Texas. To help answer this question, we need to clearly define the nature of our point dataset. We’ve already encountered point data earlier in the course where our interest was in creating point density maps using different kernel windows. However, the point data used represented a complete enumeration of discrete events or observations, i.e. the entity of interest occurred only at discrete locations within the study area and therefore could only be measured at those locations. Here, our point data represent sampled observations of an entity that can be measured anywhere within our study area. So creating a point density raster from these data would only make sense if we were addressing questions like “where are the meteorological stations concentrated within the state of Texas?”. Another class of techniques, used with points that represent samples of a continuous field, is that of interpolation methods. There are many interpolation tools available, but these tools can usually be grouped into two categories: deterministic and statistical interpolation methods. 14.1 Deterministic Approach to Interpolation We will explore two deterministic methods: proximity (aka Thiessen) techniques and inverse distance weighted techniques (IDW for short). 14.1.1 Proximity interpolation This is probably the simplest, and possibly one of the oldest, interpolation methods. It was introduced by Alfred H. Thiessen more than a century ago. The goal is simple: assign to all unsampled locations the value of the closest sampled location. This generates a tessellated surface whereby the lines splitting the distance midway between neighboring sampled locations are connected, thus enclosing an area. Each area ends up enclosing a sample point whose value it inherits. Figure 14.2: Tessellated surface generated from discrete point samples. This is also known as a Thiessen interpolation. One problem with this approach is that the surface values change abruptly across the tessellated boundaries. This is not representative of most surfaces in nature. Thiessen’s method was very practical in his day, when computers did not exist.
But today, computers afford us more advanced methods of interpolation, as we will see next. 14.1.2 Inverse Distance Weighted (IDW) The IDW technique computes an average value for unsampled locations using values from nearby weighted locations. The weights are proportional to the proximity of the sampled points to the unsampled location, and their influence is controlled by the IDW power coefficient. The larger the power coefficient, the stronger the weight given to nearby points, as can be gleaned from the following equation that estimates the value \\(z\\) at an unsampled location \\(j\\): \\[ \\hat{Z_j} = \\frac{\\sum_i{Z_i/d^n_{ij}}}{\\sum_i{1/d^n_{ij}}} \\] The caret \\(\\hat{}\\) above the variable \\(z\\) reminds us that we are estimating the value at \\(j\\). The parameter \\(n\\) is the power parameter applied as an exponent to the distance, thus diminishing the influence of a point at location \\(i\\) on location \\(j\\) as the distance between them increases. So a large \\(n\\) results in nearby points wielding a much greater influence on the unsampled location than points farther away, producing an interpolated output that resembles a Thiessen interpolation. On the other hand, a very small value of \\(n\\) will give all points within the search radius nearly equal weight, such that unsampled locations represent little more than the mean value of all sampled points within the search radius. In the following figure, the sampled points and values are superimposed on top of an IDW interpolated raster generated with an \\(n\\) value of 2. Figure 14.3: An IDW interpolation of the average yearly precipitation (reported in inches) for several meteorological sites in Texas. An IDW power coefficient of 2 was used in this example. In the following example, an \\(n\\) value of 15 is used to interpolate precipitation. This results in nearby points having greater influence on the unsampled locations. Note the similarity in output to the proximity (Thiessen) interpolation. Figure 14.4: An IDW interpolation of the average yearly precipitation (reported in inches) for several meteorological sites in Texas. An IDW power coefficient of 15 was used in this example. 14.1.3 Fine tuning the interpolation parameters Finding the best set of input parameters to create an interpolated surface can be a subjective proposition. Other than eyeballing the results, how can you quantify the accuracy of the estimated values? One option is to split the points into two sets: the points used in the interpolation operation and the points used to validate the results. While this method is easily implemented (even via a pen-and-paper approach), it does suffer from a significant loss in power, i.e. we are using just half of the information to estimate the unsampled locations. A better approach (and one easily implemented in a computing environment) is to remove one data point from the dataset, interpolate its value using all other points in the dataset, then repeat this process for each point in the dataset (while making sure that the interpolator parameters remain constant across each interpolation). The interpolated values are then compared with the actual values from the omitted points. This method is sometimes referred to as jackknifing or leave-one-out cross-validation.
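To make the estimator and the leave-one-out procedure concrete, here is a minimal base R sketch (assumed object names, not the book's implementation). It presumes a hypothetical data frame pts with columns x, y and precip holding the station coordinates and precipitation values.
# IDW estimate at location (x0, y0) from sampled coordinates xy and values z,
# using power coefficient n
idw_est <- function(x0, y0, xy, z, n = 2) {
  d <- sqrt((xy[, 1] - x0)^2 + (xy[, 2] - y0)^2)  # distances to the sampled points
  w <- 1 / d^n                                    # inverse distance weights
  sum(w * z) / sum(w)                             # weighted average estimate
}
# Leave-one-out predictions: drop each station in turn and estimate its value
# from the remaining stations
xy <- as.matrix(pts[, c("x", "y")])
pred <- sapply(seq_len(nrow(pts)), function(i) {
  idw_est(xy[i, 1], xy[i, 2], xy[-i, , drop = FALSE], pts$precip[-i], n = 2)
})
The vector pred of leave-one-out predictions can then be compared to the observed pts$precip values to summarize the interpolator's performance, as described next.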
The performance of the interpolator can be summarized by computing the root-mean-square error (RMSE) of the residuals as follows: \\[ RMSE = \\sqrt{\\frac{\\sum_{i=1}^n (\\hat {Z_{i}} - Z_i)^2}{n}} \\] where \\(\\hat {Z_{i}}\\) is the interpolated value at the unsampled location \\(i\\) (i.e. the location where the sample point was removed), \\(Z_i\\) is the true value at location \\(i\\), and \\(n\\) is the number of points in the dataset. We can create a scatterplot of the predicted vs. observed precipitation values from our dataset. The solid diagonal line represents the one-to-one slope (i.e. if the predicted values matched the true values exactly, the points would fall on this line). The red dashed line is a linear fit to the points, included to help guide the eye along the pattern generated by these points. Figure 14.5: Scatter plot of predicted vs. observed values at each sampled location following a leave-one-out cross-validation analysis. The computed RMSE from the above working example is 6.989 inches. We can extend our exploration of the interpolator’s accuracy by creating a map of the confidence intervals. This involves layering all \\(n\\) interpolated surfaces from the aforementioned jackknife technique, then computing the confidence interval for each location (pixel) in the output map (raster). If the range of interpolated values from the jackknife technique for an unsampled location \\(i\\) is high, this implies that the location is highly sensitive to the presence or absence of a single point from the sample point locations, thus producing a large confidence interval (i.e. we can’t be very confident of the predicted value). Conversely, if the range of values estimated for location \\(i\\) is low, then a small confidence interval is computed (providing us with greater confidence in the interpolated value). The following map shows the 95% confidence interval for each unsampled location (pixel) in the study extent. Figure 14.6: In this example an IDW power coefficient of 2 was used and the search parameters were confined to a minimum of 10 points and a maximum of 15 points. The search window was isotropic. Each pixel represents the range of precipitation values (in inches) around the expected value given a 95% confidence interval. IDW interpolation is probably one of the most widely used interpolators because of its simplicity. In many cases, it can do an adequate job. However, the choice of power coefficient remains subjective. There is another class of interpolators that makes use of the information provided by the sample points, more specifically, information pertaining to their 1st and 2nd order behavior. These interpolators are covered next. 14.2 Statistical Approach to Interpolation The statistical interpolation methods include trend surfaces and kriging. 14.2.1 Trend Surfaces It may help to think of trend surface modeling as a regression on the spatial coordinates, where the coefficients apply to those coordinate values and (for more complicated surface trends) to the interplay of the coordinate values. We will explore 0th order, 1st order and 2nd order surface trends in the following sub-sections. 14.2.1.1 0th Order Trend Surface The first (and simplest) model is the 0th order model, which takes on the following expression: Z = a, where the intercept a is the mean precipitation value of all sample points (27.1 in our working example). This is simply a level (horizontal) surface whose cell values all equal 27.1.
Figure 14.7: The simplest model where all interpolated surface values are equal to the mean precipitation. This makes for an uninformative map. A more interesting surface trend map is one where the surface trend has a slope other than 0, as highlighted in the next subsection. 14.2.1.2 1st Order Trend Surface The first order surface polynomial is a slanted flat plane whose formula is given by: Z = a + bX + cY, where X and Y are the location’s coordinate values. Figure 14.8: Result of a first order interpolation. The 1st order surface trend does a good job of highlighting the prominent east-west trend. But is the trend truly uniform along the X axis? Let’s explore a more complicated surface: the quadratic polynomial. 14.2.1.3 2nd Order Trend Surface The second order surface polynomial (aka quadratic polynomial) is a parabolic surface whose formula is given by: \\(Z = a + bX + cY + dX^2 + eY^2 + fXY\\) Figure 14.9: Result of a second order interpolation. This interpolation picks up a slight curvature in the east-west trend. But it’s not a significant improvement over the 1st order trend. 14.2.2 Ordinary Kriging Several forms of kriging interpolators exist: ordinary, universal and simple, just to name a few. This section will focus on ordinary kriging (OK) interpolation. This form of kriging usually involves the following steps: Removing any spatial trend in the data (if present). Computing the experimental variogram, \\(\\gamma\\), which is a measure of spatial autocorrelation. Defining an experimental variogram model that best characterizes the spatial autocorrelation in the data. Interpolating the surface using the fitted variogram model. Adding the kriged interpolated surface to the trend interpolated surface to produce the final output. These steps are outlined in the following subsections. 14.2.2.1 De-trending the data One assumption that needs to be met in ordinary kriging is that the mean and the variation in the entity being studied are constant across the study area. In other words, there should be no global trend in the data (the term drift is sometimes used to describe the trend in other texts). This assumption is clearly not met with our Texas precipitation dataset, where a prominent east-west gradient is observed. This requires that we remove the trend from the data before proceeding with the kriging operations. Many pieces of software will accept a trend model (usually a first, second or third order polynomial). In the steps that follow, we will use the first order fit computed earlier to de-trend our point values (recall that the second order fit provided very little improvement over the first order fit). Removing the trend leaves us with the residuals that will be used in the kriging interpolation. Note that the modeled trend will be added back to the kriged interpolated surface at the end of the workflow. Figure 14.10: Map showing de-trended precipitation values (aka residuals). These de-trended values are then passed to the ordinary kriging interpolation operations. You can think of these residuals as representing the variability in the data not explained by the global trend. If variability is present in the residuals, it is best characterized as a distance-based measure of variability (as opposed to a location-based one). 14.2.2.2 Experimental Variogram In kriging interpolation, we focus on the spatial relationship between location attribute values.
More specifically, we are interested in how these attribute values (precipitation residuals in our working example) vary as the distance between location point pairs increases. We can compute a measure of the difference, \\(\\gamma\\), between two precipitation values by squaring their difference, then dividing by 2. For example, if we take two meteorological stations (one whose de-trended precipitation value is -1.2 and the other whose value is 1.6), Figure 14.11: Locations of two sample sites used to demonstrate the calculation of gamma. we can compute their difference (\\(\\gamma\\)) as follows: \\[ \\gamma = \\frac{(Z_2 - Z_1)^2}{2} = \\frac{(-1.2 - (1.6))^2}{2} = 3.92 \\] We can compute \\(\\gamma\\) for all point pairs, then plot these values as a function of the distances that separate these points: Figure 14.12: Experimental variogram plot of precipitation residual values. The red point in the plot is the value computed in the above example. The distance separating those two points is about 209 km. This value is shown in Figure 14.12 as a red dot. The above plot is called an experimental semivariogram cloud plot (also referred to as an experimental variogram cloud plot). The terms semivariogram and variogram are often used interchangeably in geostatistics (we’ll use the term variogram henceforth since this seems to be the term of choice in the current literature). Also note that the word experimental is sometimes dropped when describing these plots, but its use in our terminology is an important reminder that the points we are working with are just samples of some continuous field whose spatial variation we are attempting to model. 14.2.2.3 Sample Experimental Variogram Cloud points can be difficult to interpret due to the sheer number of point pairs (we have 465 point pairs from just 50 sample points, and this just for 1/3 of the maximum distance lag!). A common approach to resolving this issue is to “bin” the cloud points into distance intervals called lags and to summarize the points within each interval. In the following plot, we split the data into 15 bins, then compute the average point value for each bin (displayed as red points in the plot). The red points that summarize the cloud are the sample experimental variogram estimates for each of the 15 distance bands, and the plot is referred to as the sample experimental variogram plot. Figure 14.13: Sample experimental variogram plot of precipitation residual values. 14.2.2.4 Experimental Variogram Model The next step is to fit a mathematical model to our sample experimental variogram. Different mathematical models can be used; their availability is software dependent. Examples of mathematical models are shown below: Figure 14.14: A subset of variogram models available in R’s gstat package. The goal is to apply the model that best fits our sample experimental variogram. This requires picking the proper model, then tweaking the partial sill, range, and nugget parameters (where appropriate). The following figure illustrates a nonzero intercept, where the nugget is the distance between zero variance on the \\(y\\) axis and the variogram model’s intercept with the \\(y\\) axis. The partial sill is the vertical distance between the nugget and the part of the curve that levels off. If the variogram approaches \\(0\\) on the \\(y\\)-axis, then the nugget is \\(0\\) and the partial sill is simply referred to as the sill. The distance along the \\(x\\) axis where the curve levels off is referred to as the range.
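For readers who want to try this in R, the following is a minimal sketch (assumed object names, not the book's code) of how a sample variogram and a spherical model fit might be computed with the gstat package. Here res.sf stands in for an sf point layer holding the de-trended residuals in a column named res, and the starting values passed to vgm() are rough guesses that fit.variogram() refines.
library(sf)
library(gstat)
# Sample experimental variogram of the residuals (gamma averaged within distance bins)
v <- variogram(res ~ 1, data = res.sf)
# Fit a spherical model; psill, range and nugget are rough starting values
v.fit <- fit.variogram(v, vgm(psill = 40, model = "Sph", range = 400000, nugget = 0))
plot(v, v.fit)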
Figure 14.15: Graphical description of the range, sill and nugget parameters in a variogram model. In our working example, we will try to fit the Spherical function to our sample experimental variogram. This is one of three popular models (the other two being the linear and Gaussian models). Figure 14.16: A spherical model fit to our residual variogram. 14.2.2.5 Kriging Interpolation The variogram model is used by the kriging interpolator to provide localized weighting parameters. Recall that with IDW, the interpolated value at an unsampled site is determined by summarizing weighted neighboring points, where the weighting parameter (the power coefficient) is defined by the user and is applied uniformly to the entire study extent. Kriging uses the variogram model to compute the weights of neighboring points based on the distribution of those values; in essence, kriging lets the localized pattern produced by the sample points define the weights (in a systematic way). The exact mathematical implementation will not be covered here (it’s quite involved), but the resulting output is shown in the following figure: Figure 14.17: Kriged interpolation of the residual (de-trended) precipitation values. Recall that the kriging interpolation was performed on the de-trended data. In essence, we predicted the precipitation values based on localized factors. We now need to combine this interpolated surface with the trend surface computed earlier to produce the following output: Figure 14.18: The final kriged surface. A valuable by-product of the kriging operation is the variance map, which gives us a measure of uncertainty in the interpolated values. The smaller the variance, the better (note that the variance values are in squared units). Figure 14.19: Variance map resulting from the kriging analysis. "],["references.html", "Chapter 15 References", " Chapter 15 References "],["reading-and-writing-spatial-data-in-r.html", "A Reading and writing spatial data in R Sample files for this exercise Introduction Creating spatial objects Converting from an sf object Converting to an sf object Dissecting the sf file object Exporting to different data file formats", " A Reading and writing spatial data in R R 4.3.1, sf 1.0.14, terra 1.7.55, tidygeocoder 1.0.5, spatstat 3.0.7 Sample files for this exercise First, you will need to download some sample files from the github repository. Make sure to set your R session folder to the directory where you will want to save the sample files before running the following code chunks. download.file("https://github.com/mgimond/Spatial/raw/main/Data/Income_schooling.zip", destfile = "Income_schooling.zip" , mode='wb') unzip("Income_schooling.zip", exdir = ".") file.remove("Income_schooling.zip") download.file("https://github.com/mgimond/Spatial/raw/main/Data/rail_inters.gpkg", destfile = "./rail_inters.gpkg", mode='wb') download.file("https://github.com/mgimond/Spatial/raw/main/Data/elev.img", destfile = "./elev.img", mode='wb') Introduction There are several different R spatial formats to choose from. Your choice of format will largely be dictated by the package(s) and/or function(s) used in your workflow. A breakdown of formats and intended uses is listed below. Data format Used with… Used in package… Used for… Comment sf vector sf, others visualizing, manipulating, querying This is the new spatial standard in R. Will also read from spatially enabled databases such as PostgreSQL.
raster raster raster, others visualizing, manipulating, spatial statistics This has been the most popular raster format for many years. But it is gradually being supplanted by terra SpatRaster terra terra, others visualizing, manipulating, spatial statistics This is gradually replacing raster SpatialPoints* SpatialPolygons* SpatialLines* SpatialGrid* vector and raster sp, spdep Visualizing, spatial statistics These are legacy formats. spdep now accepts sf objects ppp owin vector spatstat Point pattern analysis/statistics NA im raster spatstat Point pattern analysis/statistics NA 1 The spatial* format includes SpatialPointsDataFrame, SpatialPolygonsDataFrame, SpatialLinesDataFrame, etc… There is an attempt at standardizing the spatial format in the R ecosystem by adopting a well-established set of spatial standards known as simple features. This effort resulted in a recently developed package called sf (Pebesma 2018). It is therefore recommended that you work in an sf framework when possible. As of this writing, most of the basic data manipulation and visualization operations can be successfully conducted using sf spatial objects. Some packages such as spdep and spatstat require specialized data object types. This tutorial will highlight some useful conversion functions for this purpose. Creating spatial objects The following sections demonstrate different spatial data object creation strategies. Reading a shapefile Shapefiles consist of many files sharing the same core filename and different suffixes (i.e. file extensions). For example, the sample shapefile used in this exercise consists of the following files: [1] "Income_schooling.dbf" "Income_schooling.prj" "Income_schooling.sbn" "Income_schooling.sbx" [5] "Income_schooling.shp" "Income_schooling.shx" Note that the number of files associated with a shapefile can vary. sf only needs to be given the *.shp name. It will then know which other files to read into R, such as the projection information and the attribute table. library(sf) s.sf <- st_read("Income_schooling.shp") Let’s view the first few records in the spatial data object. head(s.sf, n=4) # List spatial object and the first 4 attribute records Simple feature collection with 4 features and 5 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 379071.8 ymin: 4936182 xmax: 596500.1 ymax: 5255569 Projected CRS: NAD83 / UTM zone 19N NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.00140696 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.00115002 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.00212896 724.242 MULTIPOLYGON (((445039.5 51... 4 Penobscot 23307 0.00684534 0.00102545 242.424 MULTIPOLYGON (((472271.3 49... Note that the sf object stores not only the geometry but the coordinate system information and attribute data as well. These will be explored later in this exercise. Reading a GeoPackage A geopackage can store more than one layer. To list the layers available in the geopackage, type: st_layers("rail_inters.gpkg") Driver: GPKG Available layers: layer_name geometry_type features fields crs_name 1 Interstate Multi Line String 35 1 NAD83 2 Rail Multi Line String 730 3 NAD83 / UTM zone 19N In this example, we have two separate layers: Interstate and Rail. We can extract each layer separately via the layer= parameter.
inter.sf <- st_read("rail_inters.gpkg", layer="Interstate") rail.sf <- st_read("rail_inters.gpkg", layer="Rail") Reading a raster In earlier versions of this tutorial, the raster package was used to read raster files. This is being supplanted by terra which will be the package used in this and in subsequent exercises. terra will read many different raster file formats such as geoTiff, Imagine and HDF5 just to name a few. To see a list of supported raster file formats on your computer simply run: terra::gdal(drivers = TRUE) |> subset(type == "raster") In the following example, an Imagine raster file is read into R using the rast function. library(terra) elev.r <- rast("elev.img") The object class is of type SpatRaster. class(elev.r) [1] "SpatRaster" attr(,"package") [1] "terra" What sets a SpatRaster object apart from other R data file objects is its storage. By default, data files are loaded into memory, but SpatRaster objects are not. This can be convenient when working with raster files too large for memory. But this comes at a performance cost. If your RAM is large enough to handle your raster file, it’s best to load the entire dataset into memory. To check if the elev.r object is loaded into memory, run: inMemory(elev.r) [1] FALSE An output of FALSE indicates that it is not. To force the raster into memory use set.values: set.values(elev.r) Let’s check that the raster is indeed loaded into memory: inMemory(elev.r) [1] TRUE Now let’s look at the raster’s properties: elev.r class : SpatRaster dimensions : 994, 652, 1 (nrow, ncol, nlyr) resolution : 500, 500 (x, y) extent : 336630.3, 662630.3, 4759303, 5256303 (xmin, xmax, ymin, ymax) coord. ref. : NAD_1983_UTM_Zone_19N (EPSG:26919) source(s) : memory varname : elev name : Layer_1 min value : 0 max value : 1546 The raster object returns its grid dimensions (number of rows and columns), pixel size/resolution (in the layer’s coordinate system units), geographic extent, native coordinate system (UTM NAD83 Zone 19 with units of meters) and min/max raster values. Creating a spatial object from a data frame Geographic point data locations recorded in a spreadsheet can be converted to a spatial point object. Note that it’s important that you specify the coordinate system used to record the coordinate pairs since such information is not stored in a data frame. In the following example, the coordinate values are recorded in a WGS 1984 geographic coordinate system (crs = 4326). # Create a simple dataframe with lat/long values df <- data.frame(lon = c(-68.783, -69.6458, -69.7653), lat = c(44.8109, 44.5521, 44.3235), Name= c("Bangor", "Waterville", "Augusta")) # Convert the dataframe to a spatial object. Note that the # crs= 4326 parameter assigns a WGS84 coordinate system to the # spatial object p.sf <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326) p.sf Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -69.7653 ymin: 44.3235 xmax: -68.783 ymax: 44.8109 Geodetic CRS: WGS 84 Name geometry 1 Bangor POINT (-68.783 44.8109) 2 Waterville POINT (-69.6458 44.5521) 3 Augusta POINT (-69.7653 44.3235) Geocoding street addresses The tidygeocoder package will convert street addresses to latitude/longitude coordinate pairs using a wide range of geocoding services such as the US census and Google. Some of these geocoding services will require an API key, others will not. Click here to see the list of geocoding services supported by tidygeocoder and their geocoding limitations. 
In the example that follows, the osm geocoding service is used by default. library(tidygeocoder) options(pillar.sigfig = 7) # Increase significant digits in displayed output dat <- data.frame( name = c("Colby College", "Bates College", "Bowdoin College"), address = c("4000 Mayflower drive, Waterville, ME , 04901", "275 College st, Lewiston, ME 04240", "255 Maine St, Brunswick, ME 04011")) geocode(.tbl = dat, address = address, method = "osm") # A tibble: 3 × 4 name address lat long <chr> <chr> <dbl> <dbl> 1 Colby College 4000 Mayflower drive, Waterville, ME , 04901 44.56119 -69.65845 2 Bates College 275 College st, Lewiston, ME 04240 44.10638 -70.20636 3 Bowdoin College 255 Maine St, Brunswick, ME 04011 43.90870 -69.96142 Another free (but manual) alternative is to use the US Census Bureau’s web geocoding service for creating lat/lon values from a file of US street addresses. This needs to be completed via their web interface, and the resulting data table (a CSV file) would then need to be loaded into R as a data frame. Converting from an sf object Packages such as spdep (older versions only) and spatstat do not support sf objects. The following sections demonstrate methods to convert from sf to other formats. Converting an sf object to a Spatial* object (spdep/sp) The following code will convert point, polyline or polygon features to a spatial* object. While the current version of spdep will now accept sf objects, converting to spatial* objects will be necessary with legacy versions of spdep. In this example, an sf polygon feature is converted to a SpatialPolygonsDataFrame object. s.sp <- as_Spatial(s.sf) class(s.sp) [1] "SpatialPolygonsDataFrame" attr(,"package") [1] "sp" Converting an sf polygon object to an owin object The spatstat package is used to analyze point patterns; however, in most cases, the study extent needs to be explicitly defined by a polygon object. The polygon should be of class owin. library(spatstat) s.owin <- as.owin(s.sf) class(s.owin) [1] "owin" Note the loading of the package spatstat. This is required to access the as.owin.sf method for sf. Note too that the attribute table gets stripped from the polygon data. This is usually fine given that the only reason for converting a polygon to an owin format is for delineating the study boundary. Converting an sf point object to a ppp object The spatstat package is currently designed to work with projected (planar) coordinate systems. If you attempt to convert a point object that is in a geographic coordinate system, you will get the following error message: p.ppp <- as.ppp(p.sf) Error: Only projected coordinates may be converted to spatstat class objects The error message reminds us that a geographic coordinate system (i.e. one that uses angular measurements such as latitude/longitude) cannot be used with this package. If you encounter this error, you will need to project the point object to a projected coordinate system. In this example, we’ll project the p.sf object to a UTM coordinate system (epsg=32619). Coordinate systems in R are treated in a later appendix. p.sf.utm <- st_transform(p.sf, 32619) # project from geographic to UTM p.ppp <- as.ppp(p.sf.utm) # Create ppp object class(p.ppp) [1] "ppp" Note that if the point layer has an attribute table, its attributes will be converted to ppp marks. These attribute values can be accessed via marks(p.ppp).
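For example (a small sketch, not a chunk from the book), the marks can be inspected or dropped with spatstat's marks() and unmark() functions:
head(marks(p.ppp))       # peek at the attribute values carried over as marks
p.ppp <- unmark(p.ppp)   # strip the marks if they are not needed for the analysis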
Converting a SpatRaster object to an im object To create a spatstat im raster object from a SpatRaster object, you will need to first create a three column dataframe from the SpatRaster objects with the first two columns defining the X and Y coordinate values of each cell, and the third column defining the cell values df <- as.data.frame(elev.r,xy=TRUE) elev.im <- as.im(df) class(elev.im) [1] "im" Converting to an sf object All aforementioned spatial formats, except owin, can be coerced to an sf object via the st_as_sf function. for example: st_as_sf(p.ppp) # For converting a ppp object to an sf object st_as_sf(s.sp) # For converting a Spatial* object to an sf object Dissecting the sf file object head(s.sf,3) Simple feature collection with 3 features and 5 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 379071.8 ymin: 4936182 xmax: 596500.1 ymax: 5255569 Projected CRS: NAD83 / UTM zone 19N NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.00140696 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.00115002 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.00212896 724.242 MULTIPOLYGON (((445039.5 51... The first line of output gives us the geometry type, MULTIPOLYGON, a multi-polygon data type. This is also referred to as a multipart polygon. A single-part sf polygon object will adopt the POLYGON geometry. The next few lines of output give us the layer’s bounding extent in the layer’s native coordinate system units. You can extract the extent via the st_bbox() function as in st_bbox(s.sf). The following code chunk can be used to extract addition coordinate information from the data. st_crs(s.sf) Depending on the version of the PROJ library used by sf, you can get two different outputs. If your version of sf is built with a version of PROJ older than 6.0, the output will consist of an epsg code (when available) and a proj4 string as follows: Coordinate Reference System: EPSG: 26919 proj4string: "+proj=utm +zone=19 +datum=NAD83 +units=m +no_defs" If your version of sf is built with a version of PROJ 6.0 or greater, the output will consist of a user defined CS definition (e.g. an epsg code), if available, and a Well Known Text (WKT) formatted coordinate definition that consists of a series of [ ] tags as follows: Coordinate Reference System: User input: NAD83 / UTM zone 19N wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["Degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["Degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], ID["EPSG",26919]] The WKT format will usually start with a PROJCRS[...] tag for a projected coordinate system, or a GEOGCRS[...] tag for a geographic coordinate system. More information on coordinate systems in R can be found in the coordinate systems appendix. 
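As a small aside (a sketch using the s.sf object from above), individual pieces of this coordinate system information can be pulled from the object returned by st_crs():
st_crs(s.sf)$epsg   # the EPSG code, if one is associated with the layer (NA otherwise)
st_crs(s.sf)$wkt    # the full WKT definition as a character string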
What remains of the sf summary output is the first few records of the attribute table. You can extract the object’s table to a dedicated data frame via: s.df <- data.frame(s.sf) class(s.df) [1] "data.frame" head(s.df, 5) NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.001406960 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.001150020 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.002128960 724.242 MULTIPOLYGON (((445039.5 51... 4 Penobscot 23307 0.00684534 0.001025450 242.424 MULTIPOLYGON (((472271.3 49... 5 Washington 20015 0.00478188 0.000966036 327.273 MULTIPOLYGON (((645446.5 49... The above chunk will also create a geometry column. This column is somewhat unique in that it stores its contents as a list of geometry coordinate pairs (polygon vertex coordinate values in this example). str(s.df) 'data.frame': 16 obs. of 6 variables: $ NAME : chr "Aroostook" "Somerset" "Piscataquis" "Penobscot" ... $ Income : int 21024 21025 21292 23307 20015 21744 21885 23020 25652 24268 ... $ NoSchool : num 0.01339 0.00521 0.00634 0.00685 0.00478 ... $ NoSchoolSE: num 0.001407 0.00115 0.002129 0.001025 0.000966 ... $ IncomeSE : num 251 391 724 242 327 ... $ geometry :sfc_MULTIPOLYGON of length 16; first list element: List of 1 ..$ :List of 1 .. ..$ : num [1:32, 1:2] 513821 513806 445039 422284 424687 ... ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg" You can also opt to remove this column prior to creating the dataframe as follows: s.nogeom.df <- st_set_geometry(s.sf, NULL) class(s.nogeom.df) [1] "data.frame" head(s.nogeom.df, 5) NAME Income NoSchool NoSchoolSE IncomeSE 1 Aroostook 21024 0.01338720 0.001406960 250.909 2 Somerset 21025 0.00521153 0.001150020 390.909 3 Piscataquis 21292 0.00633830 0.002128960 724.242 4 Penobscot 23307 0.00684534 0.001025450 242.424 5 Washington 20015 0.00478188 0.000966036 327.273 Exporting to different data file formats You can export an sf object to many different spatial file formats such as a shapefile or a geopackage. st_write(s.sf, "shapefile_out.shp", driver = "ESRI Shapefile") # create to a shapefile st_write(s.sf, "s.gpkg", driver = "GPKG") # Create a geopackage file If the file you are writing to already exists, the above will throw an error. To force an overwrite, simply add the delete_layer = TRUE argument to the st_write function. You can see a list of writable vector formats via: gdal(drivers = TRUE) |> subset(can %in% c("write", "read/write" ) & type == "vector") The value in the name column is the driver name to pass to the driver = argument in the st_write() function. To export a raster to a data file, use writeRaster() function. writeRaster(elev.r, "elev_out.tif", gdal = "GTiff" ) # Create a geoTiff file writeRaster(elev.r, "elev_out.img", gdal = "HFA" ) # Create an Imagine raster file You can see a list of writable raster formats via: gdal(drivers = TRUE) |> subset(can %in% c("write", "read/write" ) & type == "raster") The value in the name column is the driver name to pass to the gdal = argument in the writeRaster() function. References "],["mapping-data-in-r.html", "B Mapping data in R Sample files for this exercise tmap ggplot2 plot_sf", " B Mapping data in R R sf terra tmap ggplot2 4.3.1 1.0.14 1.7.55 3.3.3 3.4.3 There are many mapping environments that can be adopted in R. Three are presented in this tutorial: tmap, ggplot2 and plot_sf. 
Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. library(sf) library(terra) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev.RDS")) elev.r <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/inter_sf.RDS")) inter.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/rail_sf.RDS")) rail.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/p_sf.RDS")) p.sf <- readRDS(z) The data objects consist of five layers: an elevation raster (elev.r), an interstate polyline layer (inter.sf), a point cities layer (p.sf), a railroad polyline layer (rail.sf) and a Maine counties polygon layer (s.sf). All vector layers are sf objects. All layers are in a UTM/NAD83 projection (Zone 19N) except p.sf which is in a WGS 1984 geographic coordinate system. tmap The tmap package is specifically developed for mapping spatial data. As such, it offers the greatest mapping options. The package recognizes sf, raster and Spatial* objects. The basics To map the counties polygon layer using a grey color scheme, type: library(tmap) tm_shape(s.sf) + tm_polygons(col="grey", border.col="white") The tm_shape function loads the spatial object (vector or raster) into the mapping session. The tm_polygons function is one of many tmap functions that dictates how the spatial object is to be mapped. The col parameter defines either the polygon fill color or the spatial object’s attribute column to be used to define the polygons’ color scheme. For example, to use the Income attribute value to define the color scheme, type: tm_shape(s.sf) + tm_polygons(col="Income", border.col = "white") Note the + symbol used to piece together the functions (this is similar to the ggplot2 syntax). You can customize the map by piecing together various map element functions. For example, to move the legend box outside of the main map body add the tm_legend(outside = TRUE) function to the mapping operation. tm_shape(s.sf) + tm_polygons("Income", border.col = "white") + tm_legend(outside = TRUE) You can also choose to omit the legend box (via the legend.show = FALSE parameter) and the data frame border (via the tm_layout(frame = FALSE) function): tm_shape(s.sf) + tm_polygons("Income", border.col = "white", legend.show=FALSE) + tm_layout(frame = FALSE) If you want to omit the polygon border lines from the plot, simply add the border.col = NULL parameter to the tm_polygons function. tm_shape(s.sf) + tm_polygons("Income", border.col = NULL) + tm_legend(outside = TRUE) Note that the tm_fill function is nearly identical to the tm_polygons function with the difference being that the tm_fill function does not draw polygon borders. Combining layers You can easily stack layers by piecing together additional tm_shapefunctions. In the following example, the railroad layer and the point layer are added to the income map. The railroad layer is mapped using the tm_lines function and the cities point layer is mapped using the tm_dots function. Note that layers are pieced together using the + symbol. tm_shape(s.sf) + tm_polygons("Income", border.col = NULL) + tm_legend(outside = TRUE) + tm_shape(rail.sf) + tm_lines(col="grey70") + tm_shape(p.sf) + tm_dots(size=0.3, col="black") Layers are stacked in the order in which they are listed. 
In the above example, the point layer is the last layer called therefore it is drawn on top of the previously drawn layers. Note that if a layer’s coordinate system is properly defined, tmap will reproject, on-the-fly, any layer whose coordinate system does not match that of the first layer in the stack. In this example, s.sf defines the map’s coordinate system (UTM/NAD83). p.sf is in a geographic coordinate system and is thus reprojected on-the-fly to properly overlap the other layers in the map. Tweaking classification schemes You can control the classification type, color scheme, and bin numbers via the tm_polygons function. For example, to apply a quantile scheme with 6 bins and varying shades of green, type: tm_shape(s.sf) + tm_polygons("Income", style = "quantile", n = 6, palette = "Greens") + tm_legend(outside = TRUE) Other style classification schemes include fixed, equal, jenks, kmeans and sd. If you want to control the breaks manually set style=fixed and specify the classification breaks using the breaks parameter. For example, tm_shape(s.sf) + tm_polygons("Income", style = "fixed",palette = "Greens", breaks = c(0, 23000, 27000, 100000 )) + tm_legend(outside = TRUE) If you want a bit more control over the legend elements, you can tweak the labels parameter as in, tm_shape(s.sf) + tm_polygons("Income", style = "fixed",palette = "Greens", breaks = c(0, 23000, 27000, 100000 ), labels = c("under $23,000", "$23,000 to $27,000", "above $27,000"), text.size = 1) + tm_legend(outside = TRUE) Tweaking colors There are many color schemes to choose from, but you will probably want to stick to color swatches established by Cynthia Brewer. These palettes are available in tmap and their names are listed below. For sequential color schemes, you can choose from the following palettes. For divergent color schemes, you can choose from the following palettes. For categorical color schemes, you can choose from the following palettes. For example, to map the county names using the Pastel1 categorical color scheme, type: tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) To map the percentage of the population not having attained a high school degree (column labeled NoSchool in s.sf) using a YlOrBr palette with 8 bins while modifying the legend title to read “Fraction without a HS degree”, type: tm_shape(s.sf) + tm_polygons("NoSchool", style="quantile", palette = "YlOrBr", n=8, title="Fraction without \\na HS degree") + tm_legend(outside = TRUE) The character \\n in the “Fraction without \\na HS degree” string is interpreted by R as a new line (carriage return). If you want to reverse the color scheme simply add the minus symbol - in front of the palette name as in palette = \"-YlOrBr\" Adding labels You can add text and labels using the tm_text function. In the following example, point labels are added to the right of the points with the text left justified (just = \"left\") and with an x offset of 0.5 units for added buffer between the point and the text. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1", border.col = "white") + tm_legend(outside = TRUE) + tm_shape(p.sf) + tm_dots(size= .3, col = "red") + tm_text("Name", just = "left", xmod = 0.5, size = 0.8) The tm_text function accepts an auto placement option via the parameter auto.placement = TRUE. This uses a simulated annealing algorithm. Note that this automated approach may not generate the same text placement after each run. 
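A minimal sketch of that auto-placement option (reusing the layers from above; not a chunk from the book):
tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1", border.col = "white") +
  tm_legend(outside = TRUE) +
  tm_shape(p.sf) + tm_dots(size = 0.3, col = "red") +
  tm_text("Name", auto.placement = TRUE, size = 0.8)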
Adding a grid or graticule You can add a grid or graticule to the map using the tm_grid function. You will need to modify the map’s default viewport setting via the tm_layout function to provide space for the grid labels. In the following example, the grid is generated using the layer’s UTM coordinate system and is divided into roughly four segments along the x-axis and five segments along the y-axis. The function will adjust the grid placement so as to generate “pretty” label values. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, n.x = 4, n.y = 5) To generate a graticule (lines of latitude and longitude), simply modify the grid’s coordinate system to a geographic one using either an EPSG defined coordinate system, or a PROJ4 formatted string. But note that the PROJ string syntax is falling out of favor in current and future R spatial environments so, if possible, adopt an EPSG (or OGC) code. Here, we’ll use EPSG:4326 which defines the WGS 1984 geographic coordinate system. We will also modify the grid placement by explicitly specifying the lat/long grid values. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5), y = c(44, 45, 46, 47), projection = "EPSG:4326") Adding the ° symbol to the lat/long values requires a bit more code: tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5) , y = c(44, 45, 46, 47), projection = "+proj=longlat", labels.format = list(fun=function(x) {paste0(x,intToUtf8(176))} ) ) Here, we use the unicode decimal representation of the ° symbol (unicode 176) and pass it to the intToUtf8 function. A list of unicode characters and their decimal representation can be found on this Wikipedia page. Adding statistical plots A histogram of the variables being mapped can be added to the legend element. By default, the histogram will inherit the colors used in the classification scheme. tm_shape(s.sf) + tm_polygons("NoSchool", palette = "YlOrBr", n = 6, legend.hist = TRUE, title = "% no school") + tm_legend(outside = TRUE, hist.width = 2) Mapping raster files Raster objects can be mapped by specifying the tm_raster function. For example to plot the elevation raster and assign 64 continuous shades of the built-in terrain color ramp, type: tm_shape(elev.r) + tm_raster(style = "cont", title = "Elevation (m)", palette = terrain.colors(64))+ tm_legend(outside = TRUE) Note the use of another style parameter option: cont for continuous color scheme. You can choose to symbolize the raster using classification breaks instead of continuous colors. For example, to manually set the breaks to 50, 100, 500, 750, 1000, and 15000 meters, type: tm_shape(elev.r) + tm_raster(style = "fixed", title = "Elevation (m)", breaks = c(0, 50, 100, 500, 750, 1000, 15000), palette = terrain.colors(5))+ tm_legend(outside = TRUE) Other color gradients that R offers include, heat.colors, rainbow, and topo.colors. You can also create your own color ramp via the colorRampPalette function. 
For example, to generate a 12 bin quantile classification scheme using a color ramp that changes from darkolivegreen4 to yellow to brown (these are built-in R colors), and adding a histogram to view the distribution of colors across pixels, type: tm_shape(elev.r) + tm_raster(style = "quantile", n = 12, title = "Elevation (m)", palette = colorRampPalette( c("darkolivegreen4","yellow", "brown"))(12), legend.hist = TRUE)+ tm_legend(outside = TRUE, hist.width = 2) Note that the Brewer palette names can also be used with rasters. Changing coordinate systems tmap can change the output’s coordinate system without needing to reproject the data layers. In the following example, the elevation raster, railroad layer and point city layer are mapped onto a USA Contiguous Albers Equal Area Conic projection. A lat/long grid is added as a reference. # Define the Albers coordinate system aea <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +ellps=GRS80 +datum=NAD83" # Map the data tm_shape(elev.r, projection = aea) + tm_raster(style = "quantile", n = 12, palette = colorRampPalette( c("darkolivegreen4","yellow", "brown"))(12), legend.show = FALSE) + tm_shape(rail.sf) + tm_lines(col = "grey70")+ tm_shape(p.sf) +tm_dots(size=0.5) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5), y = c(44, 45, 46, 47), projection = "+proj=longlat") The first data layer’s projection= parameter will define the map’s coordinate system. Note that this parameter does not need to be specified in the other layers taking part in the output map. If a projection is not explicitly defined in the first call to tm_shape, then the output map will default to the first layer’s reference system. Side-by-side maps You can piece maps together side-by-side using the tmap_arrange function. You first need to save each map to a separate object before combining them. For example: inc.map <- tm_shape(s.sf) + tm_polygons(col="Income")+ tm_legend(outside=TRUE) school.map <- tm_shape(s.sf) + tm_polygons(col="NoSchool")+ tm_legend(outside=TRUE) name.map <- tm_shape(s.sf) + tm_polygons(col="NAME")+ tm_legend(outside=TRUE) tmap_arrange(inc.map, school.map, name.map) Splitting data by polygons or group of polygons You can split the output into groups of features based on a column attribute. For example, to split the income map into individual polygons via the NAME attribute, type: tm_shape(s.sf) + tm_polygons(col = "Income") + tm_legend(outside = TRUE) + tm_facets( by = "NAME", nrow = 2) The order of the faceted plot follows the alphanumeric order of the faceting attribute values. If you want to change the faceted order, you will need to change the attribute’s level order. ggplot2 If you are already familiar with ggplot2, you will find it easy to transition to spatial data visualization. The key geom used when mapping spatial data is geom_sf(). The basics If you wish to simply plot the geometric elements of a layer, type: library(ggplot2) ggplot(data = s.sf) + geom_sf() As with any ggplot operation, you can also pass the object’s name to the geom_sf() instead of the ggplot function as in: ggplot() + geom_sf(data = s.sf) This will prove practical later in this exercise when multiple layers are plotted on the map. By default, ggplot will add a graticule to the plot, even if the coordinate system associated with the layer is in a projected coordinate system. You can adopt any one of ggplot2’s gridline removal strategies to eliminate the grid from the plot. 
Here, we’ll make use of the theme_void() function. ggplot(data = s.sf) + geom_sf() + theme_void() If you want to have ggplot adopt the layer’s native coordinate system (UTM NAD 1983 in this example) instead of the default geographic coordinate system, type: ggplot(data = s.sf) + geom_sf() + coord_sf(datum = NULL) Or, you can explicitly assign the data layer’s datum via a call to st_crs as in ... + coord_sf(datum = st_crs(s.sf)) By setting datum to NULL, you prevent ggplot from figuring out how to convert the layer’s native coordinate system to a geographic one. You can control grid/graticule intervals using ggplot’s scale_..._continuous functions. For example: ggplot(data = s.sf) + geom_sf() + scale_x_continuous(breaks = c(-70, -69, -68)) + scale_y_continuous(breaks = 44:47) If you wish to apply a grid native to the layer’s coordinate system, type: ggplot(data = s.sf) + geom_sf() + coord_sf(datum = NULL) + scale_x_continuous(breaks = c(400000, 500000, 600000)) + scale_y_continuous(breaks = c(4900000, 5100000)) To symbolize a layer’s geometries using one of the layer’s attributes, add the aes() function. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() Note that the data and aesthetics can be defined in the geom_sf function as well: ggplot() + geom_sf(data = s.sf, aes(fill = Income)) To change the border color, type: ggplot(data = s.sf, aes(fill = Income)) + geom_sf(col = "white") To remove outlines, simply pass NA to col (e.g. col = NA) in the geom_sf function. Tweaking classification schemes To bin the color scheme by assigning ranges of income values to a unique set of color swatches defined by hex values, use one of the scale_fill_steps* family of functions. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_stepsn(colors = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60") , breaks = c(22000, 25000, 27000, 30000)) You can adopt Brewer’s color schemes by applying one of the scale_..._fermenter() functions and specifying the classification type (sequential, seq; divergent, div; or categorical, qual) and the palette name. For example, to adopt a divergent color scheme using the \"PRGn\" colors, type: ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_fermenter(type = "div", palette = "PRGn", n.breaks = 4) To flip the color scheme, set direction to 1. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_fermenter(type = "div", palette = "PRGn", n.breaks = 4, direction = 1) ggplot offers many advanced options. For example, we can modify the bin intervals by generating a non-uniform classification scheme and scale the legend bar so as to reflect the non-uniform intervals using the guide_coloursteps() function and its even.steps = FALSE argument. We’ll also modify the legend bar dimensions and title in this code chunk. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_stepsn(colors = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60", "#1A9850") , breaks = c(22000, 25000, 26000, 27000, 30000), values = scales::rescale(c(22000, 25000, 26000, 27000, 30000), c(0,1)), guide = guide_coloursteps(even.steps = FALSE, show.limits = TRUE, title = "Per capita Income \\n(US dollars)", barheight = unit(2.2, "in"), barwidth = unit(0.15, "in"))) Combining layers You can overlap layers in the map by adding calls to geom_sf. In such a scenario, it might be best for readability’s sake to specify the layer name in the geom_sf function instead of the ggplot function.
ggplot() + geom_sf(data = s.sf, aes(fill = Income)) + geom_sf(data = rail.sf, col = "white") + geom_sf(data = p.sf, col = "green") Note that ggplot will convert coordinate systems on-the-fly as needed. Here, p.sf is in a coordinate system different from the other layers. You can also add raster layers to the map. However, the raster layer must be in a dataframe format with x, y and z columns. The elev.r raster is in a SpatRaster format and will need to be converted to a dataframe using the as.data.frame function from the terra package. This function has a special method for raster layers and, as such, adds parameters unique to this method. These include xy = TRUE, which instructs the function to create x and y coordinate columns from the data, and na.rm = TRUE, which removes blank cells (this will help reduce the size of our dataframe given that elev.r does not fill its extent’s rectangular outline). Since the layers are drawn in the order listed, we will move the rail.sf vector layer to the bottom of the stack. ggplot() + geom_raster(data = as.data.frame(elev.r, xy=TRUE, na.rm = TRUE), aes(x = x, y = y, fill = elev)) + scale_fill_gradientn(colours = terrain.colors(7)) + geom_sf(data = rail.sf, col = "white") + geom_sf(data = p.sf, col = "black") + theme(axis.title = element_blank()) # Removes axes labels plot_sf The sf package has its own plot method. This is a convenient way to generate simple plots without needing additional plotting packages. The basics By default, when passing an sf object to plot, the function will generate as many plots as there are attribute columns. For example: plot(s.sf) To limit the plot to just one of the attribute columns, limit the dataset using basic R indexing techniques. For example, to plot the Income column, type: plot(s.sf["Income"]) To limit the output to just the layer’s geometry, wrap the object name with the st_geometry function. plot(st_geometry(s.sf)) You can control the fill and border colors using the col and border parameters respectively. plot(st_geometry(s.sf), col ="grey", border = "white") Adding a graticule You can add a graticule by setting the graticule parameter to TRUE. To add graticule labels, set axes to TRUE. plot(st_geometry(s.sf), col ="grey", border = "white", graticule = TRUE, axes= TRUE) Combining layers To add layers, generate a new call to plot with the add parameter set to TRUE. For example, to add rail.sf and p.sf to the map, type: plot(st_geometry(s.sf), col ="grey", border = "white", graticule = TRUE, axes= TRUE) plot(rail.sf, col = "grey20", add = TRUE) Note that plot_sf requires that the layers be in the same coordinate system. For example, adding p.sf will not show the points on the map given that it’s in a different coordinate system. sf layers can be combined with raster layers. The order in which layers are listed will matter. You will usually want to map the raster layer first, then add the vector layer(s). plot(elev.r, col = terrain.colors(30)) plot(st_geometry(rail.sf), col ="grey", border = "white", add = TRUE) Tweaking colors You can tweak the color schemes as well as the legend display. The latter will require the use of R’s built-in par function whereby the las = 1 parameter will render the key labels horizontal, and the omi parameter will prevent the legend labels from being cropped.
OP <- par(las = 1, omi=c(0,0,0,0.6)) p1 <- plot(s.sf["Income"], breaks = c(20000, 22000, 25000, 26000, 27000, 30000, 33000), pal = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60", "#1A9850"), key.width = 0.2, at = c(20000, 22000, 25000, 26000, 27000, 30000, 33000)) par(OP) While plot_sf offers succinct plotting commands and independence from other mapping packages, it is limited in its customization options. C Anatomy of simple feature objects R sf ggplot2 4.3.1 1.0.14 3.4.3 This tutorial exposes you to the building blocks of simple feature objects via the creation of point, polyline and polygon features from scratch. Creating point ‘sf’ objects We will start off by exploring the creation of a singlepart point feature object. There are three phases in creating a point simple feature (sf) object: Defining the coordinate pairs via a point geometry object, sfg; Creating a simple feature column object, sfc, from the point geometries; Creating the simple feature object, sf. Step 1: Create the point geometry: sfg Here, we’ll create three separate point objects. We’ll adopt a geographic coordinate system, but note that we do not specify the coordinate system just yet. library(sf) p1.sfg <- st_point(c(-70, 45)) p2.sfg <- st_point(c(-69, 44)) p3.sfg <- st_point(c(-69, 45)) Let’s check the class of one of these point geometries. class(p1.sfg) [1] "XY" "POINT" "sfg" What we are looking for is an sfg class. You’ll note other classes associated with this object such as POINT, which defines the geometric primitive. You’ll see examples of other geometric primitives later in this tutorial. Note that if a multipart point feature object is desired, the st_multipoint() function needs to be used instead of st_point() with the coordinate pairs defined in a matrix as in st_multipoint(matrix( c(-70, 45, -69, 44, -69, 45), ncol = 2, byrow = TRUE ) ). Step 2: Create a column of simple feature geometries: sfc Next, we’ll combine the point geometries into a single object. Note that if you are to define a coordinate system for the features, you can do so here via the crs= parameter. We use the WGS 1984 reference system (EPSG code of 4326). p.sfc <- st_sfc( list(p1.sfg, p2.sfg, p3.sfg), crs = 4326 ) class(p.sfc) [1] "sfc_POINT" "sfc" The object is a simple feature column, sfc. More specifically, we’ve combined the point geometries into a single object whereby each geometry is assigned its own row or, to be technical, each point was assigned its own component via the list function. You can confirm that each point geometry is assigned its own row in the following output. p.sfc Geometry set for 3 features Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 You can access each point using list operations. For example, to access the second point, type: p.sfc[[2]] Step 3: Create the simple feature object sf The final step is to create the simple feature object.
p.sf <- st_sf(p.sfc) p.sf Simple feature collection with 3 features and 0 fields Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 p.sfc 1 POINT (-70 45) 2 POINT (-69 44) 3 POINT (-69 45) Renaming the geometry column The above step generated a geometry column named after the input sfc object name (p.sfc in our example). This is perfectly functional since the sf object knows that this is the geometry column. We can confirm this by checking out p.sf’s attributes. attributes(p.sf) $names [1] "p.sfc" $row.names [1] 1 2 3 $class [1] "sf" "data.frame" $sf_column [1] "p.sfc" $agr factor() Levels: constant aggregate identity What we are looking for is the $sf_column attribute which is, in our example, pointing to the p.sfc column. This attribute is critical for any spatial operation that makes use of the dataframe’s spatial objects. Functions that recognize sf objects will look for this attribute to identify the geometry column. You might choose to rename the column to something more meaningful such as coords (note that some spatially enabled databases adopt the name geom). You can use the names() function to rename that column, but note that you will need to re-define the geometry column in the attributes using the st_geometry() function. names(p.sf) <- "coords" st_geometry(p.sf) <- "coords" p.sf Simple feature collection with 3 features and 0 fields Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 coords 1 POINT (-70 45) 2 POINT (-69 44) 3 POINT (-69 45) Adding attributes to an sf object The p.sf object is nothing more than a dataframe with a geometry column of list data type. typeof(p.sf$coords) [1] "list" Storing spatial features in a dataframe has many benefits, one of which is the ability to operate on the features’ attribute values. For example, we can add a new column with attribute values for each geometry entry. Here, we’ll assign letters to each point. Note that the order in which the attribute values are passed to the dataframe must match that of the geometry elements. p.sf$val1 <- c("A", "B", "C") p.sf Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 coords val1 1 POINT (-70 45) A 2 POINT (-69 44) B 3 POINT (-69 45) C We can use sf’s plot function to view the points. plot(p.sf, pch = 16, axes = TRUE, main = NULL) Adding a geometry column to an existing non-spatial dataframe A nifty property of the sfc object created in step 2 above is the ability to append it to an existing dataframe using the st_geometry() function. In the following example, we’ll create a dataframe, then append the geometry column to that dataframe. df <- data.frame(col1 = c("A", "B","C")) st_geometry(df) <- p.sfc Note that once we’ve added the geometry column, df becomes a spatial feature object and the geometry column is assigned the name geometry. df Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 col1 geometry 1 A POINT (-70 45) 2 B POINT (-69 44) 3 C POINT (-69 45) Creating polyline ‘sf’ objects The steps are similar to creating a point object. You first create the geometry(ies), you then combine the geometry(ies) into a spatial feature column before creating the simple feature object. First, we need to define the vertices that will define each line segment of the polyline.
The order in which the vertices are defined matters: it determines the endpoints of each connecting line segment. The coordinate pairs of each vertex are stored in a matrix. l <- rbind( c(-70, 45), c(-69, 44), c(-69, 45) ) Next, we create a polyline geometry object. l.sfg <- st_linestring(l) Next, we create the simple feature column. We also add the reference system definition (crs = 4326). l.sfc <- st_sfc(list(l.sfg), crs = 4326) Finally, we create the simple feature object. l.sf <- st_sf(l.sfc) l.sf Simple feature collection with 1 feature and 0 fields Geometry type: LINESTRING Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 l.sfc 1 LINESTRING (-70 45, -69 44,... Even though we have multiple line segments, they are all associated with a single polyline feature, hence they each share the same attribute. plot(l.sf, type = "b", pch = 16, main = NULL, axes = TRUE) Creating branching polyline features You can also create polyline features with branching segments (i.e. where at least one vertex is associated with more than two line segments). You simply need to make sure that the overlapping vertices share the exact same coordinate values. # Define coordinate pairs l1 <- rbind( c(-70, 45), c(-69, 44), c(-69, 45) ) l2 <- rbind( c(-69, 44), c(-70, 44) ) l3 <- rbind( c(-69, 44), c(-68, 43) ) # Create simple feature geometry object l.sfg <- st_multilinestring(list(l1, l2, l3)) # Create simple feature column object l.sfc <- st_sfc(list(l.sfg), crs = 4326) # Create simple feature object l.sf <- st_sf(l.sfc) # Plot the data plot(l.sf, type = "b", pch = 16, axes = TRUE) Creating polygon ‘sf’ objects General steps in creating a polygon sf spatial object from scratch include: Defining the vertices of each polygon in a matrix; Creating a list object from each matrix object (the list structure will differ between POLYGON and MULTIPOLYGON geometries); Creating an sfg polygon geometry object from the list; Creating an sf spatial object. Defining a polygon’s geometry is a bit more involved than a polyline in that a polygon defines an enclosed area. By convention, simple features record vertex coordinate pairs in a counterclockwise direction such that the area to the left of a polygon’s perimeter when traveling in the direction of the recorded vertices is the polygon’s “inside”. This is counter to the order in which vertices are recorded in a shapefile whereby the area to the right of the traveled path along the polygon’s perimeter is deemed “inside”. A polygon hole has its ring defined in the opposite direction: clockwise for a simple feature object and counterclockwise for a shapefile. For many applications in R, the ring direction will not matter, but for a few it might. So when possible, adopt the simple feature paradigm when defining the coordinate pairs. Note that importing a shapefile into an R session will usually automatically reverse the polygons’ ring direction. There are two types of polygon geometries that can be adopted depending on your needs: POLYGON and MULTIPOLYGON. POLYGON simple feature A plain polygon We’ll first create a simple polygon shaped like a triangle. The sf output structure will be similar to that for the POINT and POLYLINE objects with the coordinate pairs defining the polygon vertices stored in a geometry column. The polygon coordinate values are defined in a matrix. The last coordinate pair must match the first coordinate pair.
The coordinate values will be recorded in a geographic coordinate system (latitude, longitude) but the reference system won’t be defined until the creation of the sfc object. poly1.crd <- rbind( c(-66, 43), c(-70, 47), c(-70,43), c(-66, 43) ) Next, we create the POLYGON geometries. The polygon matrix needs to be wrapped in a list object. poly1.geom <- st_polygon( list(poly1.crd ) ) We now have a polygon geometry. poly1.geom Next, we create a simple feature column from the polygon geometry. We’ll also define the coordinate system used to report the coordinate values. poly.sfc <- st_sfc( list(poly1.geom), crs = 4326 ) poly.sfc Geometry set for 1 feature Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 poly.sfc 1 POLYGON ((-66 43, -70 47, -... The coordinates column is assigned the name poly.sfc by default. If you wish to change the column name to coords, for example, type the following: names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... plot(poly.sf, col = "bisque", axes = TRUE) A polygon with a hole In this example, we’ll add a hole to the polygon. Recall that its outer ring will need to be recorded in a counterclockwise direction and its hole in a clockwise direction. The resulting data object will have the following structure. # Polygon 1 poly1.outer.crd <- rbind( c(-66, 43),c(-70, 47), c(-70,43), c(-66, 43) ) # Outer ring poly1.inner.crd <- rbind( c(-68, 44), c(-69,44), c(-69, 45), c(-68, 44) ) # Inner ring Next, we combine the ring coordinates into a single geometric element. Note that this is done by combining the two coordinate matrices into a single list object. poly1.geom <- st_polygon( list(poly1.outer.crd, poly1.inner.crd)) We now create the simple feature column object. poly.sfc <- st_sfc( list(poly1.geom), crs = 4326 ) Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) We’ll take the opportunity to rename the coordinate column (even though this is not necessary). names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... Let’s now plot the sf object. plot(poly.sf, col = "bisque", axes = TRUE) Combining polygons: singlepart features In this example, we’ll create two distinct polygons by adding a second polygon to the one created in the last step. The output will be a singlepart polygon feature (i.e. each polygon can be assigned its own unique attribute value). We’ll create the second polygon (the first polygon having already been created in the previous section). # Define coordinate matrix poly2.crd <- rbind( c(-67, 45),c(-67, 47), c(-69,47), c(-67, 45) ) # Create polygon geometry poly2.geom <- st_polygon( list(poly2.crd)) Next, we combine the geometries into a simple feature column, sfc. poly.sfc <- st_sfc( list(poly1.geom , poly2.geom), crs = 4326 ) Each polygon has its own row in the sfc object. 
poly.sfc Geometry set for 2 features Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) poly.sf Simple feature collection with 2 features and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 poly.sfc 1 POLYGON ((-66 43, -70 47, -... 2 POLYGON ((-67 45, -67 47, -... We’ll go ahead and rename the geometry column to coords. names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 2 features and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... 2 POLYGON ((-67 45, -67 47, -... plot(poly.sf, col = "bisque", axes = TRUE) Adding attributes As with the point sf object created earlier in this exercise, we can append columns to the polygon sf object. But make sure that the order of the attribute values matches the order in which the polygons are stored in the sf object. poly.sf$id <- c("A", "B") poly.sf Simple feature collection with 2 features and 1 field Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords id 1 POLYGON ((-66 43, -70 47, -... A 2 POLYGON ((-67 45, -67 47, -... B plot(poly.sf["id"], axes = TRUE, main = NULL) MULTIPOLYGON simple feature: multipart features If multiple polygons are to share the same attribute record (a scenario referred to as multipart geometry in some GIS applications), you need to use the st_multipolygon() function when creating the sfg object. In this example, we’ll combine the two polygons created in the last example into a single geometry element. The multipolygon function groups polygons into a single list. If one of the polygons is made up of more than one ring (e.g. a polygon with a hole), its geometry is combined into a single sub-list object. # Create multipolygon geometry mpoly1.sfg <- st_multipolygon( list( list( poly1.outer.crd, # Outer loop poly1.inner.crd), # Inner loop list( poly2.crd)) ) # Separate polygon # Create simple feature column object mpoly.sfc <- st_sfc( list(mpoly1.sfg), crs = 4326) # Create simple feature object mpoly.sf <- st_sf(mpoly.sfc) mpoly.sf Simple feature collection with 1 feature and 0 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 mpoly.sfc 1 MULTIPOLYGON (((-66 43, -70... Note the single geometric entry in the table. Mixing singlepart and multipart elements A MULTIPOLYGON geometry can be used to store a single polygon as well. In this example, we’ll create a MULTIPOLYGON sf object that will combine multipart and singlepart polygons. To make this example more interesting, we’ll have one of the elements (poly4.coords) overlapping several polygons. Note that any overlapping polygon needs to be in its own MULTIPOLYGON or POLYGON entry–if it’s added to an existing entry (i.e. combined with another polygon geometry), it may be treated as a hole, even if the coordinate values are recorded in a counterclockwise direction. poly3.coords <- rbind( c(-66, 44), c(-64, 44), c(-66,47), c(-66, 44) ) poly4.coords <- rbind( c(-67, 43), c(-64, 46), c(-66.5,46), c(-67, 43) ) Note the embedded list() functions in the following code chunk.
mpoly1.sfg <- st_multipolygon( list( list( poly1.outer.crd, # Outer loop poly1.inner.crd), # Inner loop list( poly2.crd)) ) # Separate poly mpoly2.sfg <- st_multipolygon( list( list(poly3.coords))) # Unique polygon mpoly3.sfg <- st_multipolygon( list( list(poly4.coords)) ) # Unique polygon Finally, we’ll generate the simple feature object, sf, via the creation of the simple feature column object, sfc. We’ll also assign the WGS 1984 geographic coordinate system (epsg = 4326). mpoly.sfc <- st_sfc( list(mpoly1.sfg, mpoly2.sfg, mpoly3.sfg), crs = 4326) mpoly.sf <- st_sf(mpoly.sfc) Next, we’ll add attribute values to each geometric object before generating a plot. We’ll apply a transparency to the polygons to reveal the overlapping geometries. mpoly.sf$ids <- c("A", "B", "C") plot(mpoly.sf["ids"], axes = TRUE, main = NULL, pal = sf.colors(alpha = 0.5, categorical = TRUE)) Note how polygon C overlaps the other polygon elements. We can check that this does not violate simple feature rules via the st_is_valid() function. st_is_valid(mpoly.sf) [1] TRUE TRUE TRUE This returns three Boolean values, one for each element. A value of TRUE indicates that the geometry does not violate any rule. Avoid storing overlapping polygons in the same MULTIPOLYGON geometry. Doing so will create an “invalid” sf object which may pose problems with certain functions. Extracting geometry from an sf object You can extract the geometry from an sf object via the st_geometry function. For example, # Create sfc from sf st_geometry(mpoly.sf) Geometry set for 3 features Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -64 ymax: 47 Geodetic CRS: WGS 84 To extract coordinates from a single record in a WKT (well known text) format, type: st_geometry(mpoly.sf)[[1]] If you want to extract the coordinate pairs of the first element in a list format, type: st_geometry(mpoly.sf)[[1]][] [[1]] [[1]][[1]] [,1] [,2] [1,] -66 43 [2,] -70 47 [3,] -70 43 [4,] -66 43 [[1]][[2]] [,1] [,2] [1,] -68 44 [2,] -69 44 [3,] -69 45 [4,] -68 44 [[2]] [[2]][[1]] [,1] [,2] [1,] -67 45 [2,] -67 47 [3,] -69 47 [4,] -67 45 Alternative syntax In this tutorial, you were instructed to define the coordinate pairs in matrices. This is probably the simplest way to enter coordinate values manually. You can, however, bypass the creation of a matrix and simply define the coordinate pairs using the WKT syntax. For example, to generate the POLYGON geometry object from above, you could simply type: st_as_sfc( "POLYGON ((-66 43, -70 47, -70 43, -66 43), (-68 44, -69 44, -69 45, -68 44))" ) Geometry set for 1 feature Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 CRS: NA Note that the WKT syntax is the same as that listed in the sfc and sf geometry columns. Also note that the function st_as_sfc is used as opposed to the st_sfc function used with matrices in earlier steps. Additional resources Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data”, The R Journal, pages 439-446. Pebesma, Edzer and Bivand, Roger.
“Spatial Data Science: with applications in R”, https://keen-swartz-3146c4.netlify.app/ D Vector operations in R R sf ggplot2 4.3.1 1.0.14 3.4.3 Earlier versions of this tutorial made use of a combination of packages including raster and rgeos to perform most vector operations highlighted in this exercise. Many of these vector operations can now be performed using the sf package. As such, all code chunks in this tutorial make use of sf for most vector operations. We’ll first load spatial objects used in this exercise. These include: A polygon layer that delineates Maine counties (USA), s1.sf; A polygon layer that delineates distances to Augusta (Maine) as concentric circles, s2.sf; A polyline layer of the interstate highway system that runs through Maine, l1.sf. These data are stored as sf objects. library(sf) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Income_schooling_sf.rds")) s1.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Dist_sf.rds")) s2.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Highway_sf.rds")) l1.sf <- readRDS(z) A map of the above layers is shown below. We’ll use the ggplot2 package to generate this and subsequent maps in this tutorial. library(ggplot2) ggplot() + geom_sf(data = s1.sf) + geom_sf(data = s2.sf, alpha = 0.5, col = "red") + geom_sf(data = l1.sf, col = "blue") The attributes tables for both polygon objects (s1.sf and s2.sf) are shown next. Note that each shape object has a unique set of attributes as well as a unique number of records. Figure 2.6: Attribute tables for the Maine spatial object, s1.sf, (left table) and the distance to Augusta spatial object, s2.sf (right table). Dissolving geometries Dissolving by contiguous shape There are two different ways to dissolve geometries that share a common boundary. Both are presented next. Option 1 To dissolve all polygons that share at least one line segment, simply pass the object name to sf’s st_union function while making sure that the by_feature option is set to FALSE. In this example, we dissolve all polygons to create a single outline of the state of Maine. ME <- st_union(s1.sf, by_feature = FALSE) ggplot(ME) + geom_sf(fill = "grey") Note that the dissolving process removed all attributes from the original spatial object. You’ll also note that st_union returns an sfc object even though the input object is sf. You can convert the output to an sf object using the st_sf() function as in st_sf(ME). Option 2 Another approach is to make use of the dplyr package and its group_by/summarise functions. library(dplyr) ME <- s1.sf %>% group_by() %>% summarise() ggplot(ME) + geom_sf(fill = "grey") Note that this option will also remove any attributes associated with the input spatial object; however, the output remains an sf object (this differs from the st_union output). Dissolving by attribute You can also choose to dissolve based on an attribute’s values. First, we’ll create a new column whose value will be binary (TRUE/FALSE) depending on whether or not the county income is below the counties’ median income value.
s1.sf$med <- s1.sf$Income > median(s1.sf$Income) ggplot(s1.sf) + geom_sf(aes(fill = med)) Next, we’ll dissolve all polygons by the med attribute. Any polygons sharing at least one line segment that have the same med value will be dissolved into a single polygon. Two approaches are presented here: one using sf’s aggregate function, the other using the dplyr approach adopted in the previous section. Option 1 ME.inc <- aggregate(s1.sf["med"], by = list(diss = s1.sf$med), FUN = function(x)x[1], do_union = TRUE) This option will create a new field defined in the by = parameter (diss in this working example). st_drop_geometry(ME.inc) # Print the layer's attributes table diss med 1 FALSE FALSE 2 TRUE TRUE Option 2 ME.inc <- s1.sf %>% group_by(med) %>% summarise() This option will limit the attributes to that/those listed in the group_by function. st_drop_geometry(ME.inc) # A tibble: 2 × 1 med * <lgl> 1 FALSE 2 TRUE A map of the resulting layer follows. ggplot(ME.inc) + geom_sf(aes(fill = med)) The dissolving (aggregating) operation will, by default, eliminate all other attribute values. If you wish to summarize other attribute values along with the attribute used for dissolving, use the dplyr piping operation option. For example, to compute the median Income value for each of the below/above median income groups type the following: ME.inc <- s1.sf %>% group_by(med) %>% summarize(medinc = median(Income)) ggplot(ME.inc) + geom_sf(aes(fill = medinc)) To view the attributes table with both the aggregate variable, med, and the median income variable, Income, type: st_drop_geometry(ME.inc) # A tibble: 2 × 2 med medinc * <lgl> <dbl> 1 FALSE 21518 2 TRUE 27955 Subsetting by attribute You can use conventional R dataframe manipulation operations to subset by attribute values. For example, to subset by county name (e.g. Kennebec county), type: ME.ken <- s1.sf[s1.sf$NAME == "Kennebec",] You can, of course, use piping operations to perform the same task as follows: ME.ken <- s1.sf %>% filter(NAME == "Kennebec") ggplot(ME.ken) + geom_sf() To subset by a range of attribute values (e.g. subset by income values that are less than the median value), type: ME.inc2 <- s1.sf %>% filter(Income < median(Income)) ggplot(ME.inc2) + geom_sf() Intersecting layers To intersect two polygon objects, use sf’s st_intersection function. clp1 <- st_intersection(s1.sf, s2.sf) ggplot(clp1) + geom_sf() st_intersection keeps all features that overlap along with their combined attributes. Note that new polygons are created which will increase the size of the attributes table beyond the size of the combined input attributes table. 
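As a quick check of this growth in record count (a sketch, not part of the original figures), you can compare the number of records in the inputs and in the intersected output before inspecting the full table:
nrow(s1.sf)  # Number of records in the Maine counties layer
nrow(s2.sf)  # Number of records in the distance bands layer
nrow(clp1)   # Number of records in the intersected output (exceeds the two inputs combined)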
st_drop_geometry(clp1) NAME Income NoSchool NoSchoolSE IncomeSE med distance 8 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 20 12 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 20 14 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 20 1 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 50 5 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 50 6 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 50 7 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 50 8.1 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 50 9 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 50 11 Knox 27141 0.00652269 0.001863920 684.849 TRUE 50 12.1 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 50 13 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 50 14.1 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 50 1.1 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 80 2 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 80 3 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 80 5.1 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 80 6.1 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 80 7.1 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 80 9.1 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 80 10 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 80 11.1 Knox 27141 0.00652269 0.001863920 684.849 TRUE 80 12.2 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 80 13.1 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 80 14.2 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 80 1.2 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 120 2.1 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 120 3.1 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 120 5.2 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 120 6.2 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 120 7.2 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 120 10.1 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 120 13.2 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 120 15 York 28496 0.00529228 0.000737195 332.121 TRUE 120 Clipping spatial objects using other spatial objects The st_intersection can also be used to clip an input layer using another layer’s outer geometry boundaries as the “cookie cutter”. But note that the latter must be limited to its outer boundaries which may require that it be run through a dissolving operation (shown earlier in this tutorial) to dissolve internal boundaries. To clip s2.sf using the outline of s1.sf, type: clp2 <- st_intersection(s2.sf, st_union(s1.sf)) ggplot(clp2) + geom_sf() The order the layers are passed to the st_intersection function matters. Flipping the input layer in the last example will clip s1.sf to s2.sf’s bounding polygon(s). clp2 <- st_intersection(s1.sf, st_union(s2.sf)) ggplot(clp2) + geom_sf() Line geometries can also be clipped to polygon features. The output will be a line object that falls within the polygons of the input polygon object. For example, to output all line segments that fall within the concentric distance circles of s2.sf, type: clp3 <- st_intersection(l1.sf, st_union(s2.sf)) A plot of the clipped line features is shown with the outline of the clipping feature. ggplot(clp3) + geom_sf(data = clp3) + geom_sf(data = st_union(s2.sf), col = "red", fill = NA ) Unioning layers To union two polygon objects, use sf’s st_union function. For example, un1 <- st_union(s2.sf,s1.sf) ggplot(un1) + geom_sf(aes(fill = NAME), alpha = 0.4) This produces the following attributes table. 
distance NAME Income NoSchool NoSchoolSE IncomeSE med 1 20 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 2 50 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 3 80 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 4 120 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 1.1 20 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 2.1 50 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 3.1 80 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 4.1 120 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 1.2 20 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 2.2 50 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 3.2 80 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 4.2 120 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 1.3 20 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 2.3 50 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 3.3 80 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 4.3 120 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 1.4 20 Washington 20015 0.00478188 0.000966036 327.273 FALSE 2.4 50 Washington 20015 0.00478188 0.000966036 327.273 FALSE 3.4 80 Washington 20015 0.00478188 0.000966036 327.273 FALSE 4.4 120 Washington 20015 0.00478188 0.000966036 327.273 FALSE 1.5 20 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 2.5 50 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 3.5 80 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 4.5 120 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 1.6 20 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 2.6 50 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 3.6 80 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 4.6 120 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 1.7 20 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 2.7 50 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 3.7 80 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 4.7 120 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 1.8 20 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 2.8 50 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 3.8 80 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 4.8 120 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 1.9 20 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 2.9 50 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 3.9 80 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 4.9 120 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 1.10 20 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 2.10 50 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 3.10 80 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 4.10 120 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 1.11 20 Knox 27141 0.00652269 0.001863920 684.849 TRUE 2.11 50 Knox 27141 0.00652269 0.001863920 684.849 TRUE 3.11 80 Knox 27141 0.00652269 0.001863920 684.849 TRUE 4.11 120 Knox 27141 0.00652269 0.001863920 684.849 TRUE 1.12 20 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 2.12 50 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 3.12 80 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 4.12 120 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 1.13 20 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 2.13 50 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 3.13 80 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 4.13 120 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 1.14 20 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 2.14 50 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 3.14 80 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 4.14 120 
Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 1.15 20 York 28496 0.00529228 0.000737195 332.121 TRUE 2.15 50 York 28496 0.00529228 0.000737195 332.121 TRUE 3.15 80 York 28496 0.00529228 0.000737195 332.121 TRUE 4.15 120 York 28496 0.00529228 0.000737195 332.121 TRUE Note that the union operation can generate many overlapping geometries. This is because the geometries of the layers being unioned are paired up with one another, creating unique combinations of each layer’s geometries. For example, the Aroostook County polygon from s1.sf is paired with each annulus of the s2.sf layer creating four new geometries. un1 %>% filter(NAME == "Aroostook") Simple feature collection with 4 features and 7 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 318980.1 ymin: 4788093 xmax: 596500.1 ymax: 5255569 Projected CRS: +proj=utm +zone=19 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0 distance NAME Income NoSchool NoSchoolSE IncomeSE med 1 20 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 2 50 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 3 80 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 4 120 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE geometry 1 MULTIPOLYGON (((438980 4928... 2 MULTIPOLYGON (((438980 4958... 3 MULTIPOLYGON (((438980 4988... 4 MULTIPOLYGON (((438980 5028... The union operation creates all possible pairs of geometries between both input objects (i.e. 4 circle geometries from s2.sf times 16 county geometries from s1.sf for a total of 64 geometries). Buffering geometries To buffer point, line or polygon geometries, use sf’s st_buffer function. For example, the following code chunk generates a 10 km (10,000 m) buffer around the polyline segments. l1.sf.buf <- st_buffer(l1.sf, dist = 10000) ggplot(l1.sf.buf) + geom_sf() + coord_sf(ndiscr = 1000) To create a continuous polygon geometry (i.e. to eliminate overlapping buffers), we’ll follow up with one of the dissolving techniques introduced earlier in this tutorial. l1.sf.buf.dis <- l1.sf.buf %>% group_by() %>% summarise() ggplot(l1.sf.buf.dis) + geom_sf() If you want to preserve an attribute value (such as highway number), modify the above code as follows: l1.sf.buf.dis <- l1.sf.buf %>% group_by(Number) %>% summarise() ggplot(l1.sf.buf.dis, aes(fill=Number) ) + geom_sf(alpha = 0.5) E Mapping rates in R R spdep classInt RColorBrewer sf sp 4.3.1 1.2.8 0.4.10 1.1.3 1.0.14 2.0.0 In this exercise, we’ll make use of sf’s plot method instead of tmap to take advantage of sf’s scaled keys, which will prove insightful when exploring rate mapping techniques that adopt non-uniform classification schemes. The following libraries are used in the examples that follow. library(spdep) library(classInt) library(RColorBrewer) library(sf) library(sp) Next, we’ll initialize some color palettes. pal1 <- brewer.pal(6,"Greys") pal2 <- brewer.pal(8,"RdYlGn") pal3 <- c(brewer.pal(9,"Greys"), "#FF0000") The Auckland dataset from the spdep package will be used throughout this exercise. Some of the graphics that follow are R reproductions of Bailey and Gatrell’s book, Interactive Spatial Data Analysis (Bailey and Gatrell 1995).
auckland <- st_read(system.file("shapes/auckland.shp", package="spData")[1]) Reading layer `auckland' from data source `C:\Users\mgimond\AppData\Local\R\win-library\4.3\spData\shapes\auckland.shp' using driver `ESRI Shapefile' Simple feature collection with 167 features and 4 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 7.6 ymin: -4.3 xmax: 91.5 ymax: 99.3 CRS: NA The Auckland data represents total infant deaths (under the age of five) for Auckland, New Zealand, spanning the years 1977 through 1985 for different census area units. The following block of code maps these counts by area. Both equal interval and quantile classification schemes of the same data are mapped. brks1 <- classIntervals(auckland$M77_85, n = 6, style = "equal") brks2 <- classIntervals(auckland$M77_85, n = 6, style = "quantile") plot(auckland["M77_85"], breaks = brks1$brks, pal = pal1, at = round(brks1$brks,2), main = "Equal interval breaks", key.pos = 4, las = 1) plot(auckland["M77_85"], breaks = brks2$brks, pal = pal1, at = brks2$brks, main = "Quantile breaks", key.pos = 4, las = 1) These are examples of choropleth maps (choro = area and pleth = value) where some attribute (an enumeration of child deaths in this working example) is aggregated over a defined area (e.g. census area units) and displayed using two different classification schemes. Since the area units used to map death counts are not uniform in shape and area across Auckland, there is a tendency to assign more “visual weight” to polygons having larger areas than those having smaller areas. In our example, census units in the southern end of Auckland appear to have an “abnormally” large infant death count. Another perceptual interpretation of the map is one that flags those southern units as being “problematic” or of “great concern”. However, as we shall see in the following sections, this perception may not reflect reality. We therefore seek to produce perceptually tenable maps. Dykes and Unwin (Dykes and Unwin 2001) define a similar concept called map stability which seeks to produce maps that convey real effects. Raw Rates A popular approach for correcting for biased visual weights (due, for instance, to different unit area sizes) is to normalize the count data by area, thus giving a count per unit area. Though this may make sense for population count data, it does not make a whole lot of sense when applied to mortality counts; we are usually interested in the number of deaths per population count and not in the number of deaths per unit area. In the next chunk of code, we extract the population count under the age of 5 from the Auckland data set and assign this value to the variable pop. Likewise, we extract the under 5 mortality count and assign this value to the variable mor. Bear in mind that the mortality count spans a 9 year period. Since mortality rates are usually presented in rates per year, we need to multiply the population value (which is for the year 1981) by nine. This will be important in the subsequent code when we compute mortality rates. pop <- auckland$Und5_81 * 9 mor <- auckland$M77_85 Next, we will compute the raw rates (infant deaths per 1000 individuals per year) and map this rate by census unit area. Both quantile and equal interval classification schemes of the same data are mapped.
auckland$raw.rate <- mor / pop * 1000 brks1 <- classIntervals(auckland$raw.rate, n = 6, style = "equal") brks2 <- classIntervals(auckland$raw.rate, n = 6, style = "quantile") plot(auckland["raw.rate"], breaks = brks1$brks, pal = pal1, at = round(brks1$brks,2), main = "Equal interval breaks", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, pal = pal1, at = round(brks2$brks,2), main = "Quantile breaks", key.pos = 4, las = 1) Note how our perception of the distribution of infant deaths changes when looking at mapped raw rates vs. counts. A north-south trend in perceived “abnormal” infant deaths is no longer apparent in this map. Standardized mortality ratios (relative risk) Another way to re-express the data is to map the Standardized Mortality Ratios (SMR)-a very popular form of representation in the field of epidemiology. Such maps show the ratios of the number of deaths to an expected death count. There are many ways to define an expected death count, many of which can be externally specified. In the following example, the expected death count \(E_i\) is estimated by multiplying the under 5 population count for each area by the overall death rate for Auckland: \[E_i = n_i \times mortality_{Auckland}\] where \(n_i\) is the population count within census unit area \(i\) and \(mortality_{Auckland}\) is the overall death rate computed from \(mortality_{Auckland} = \sum_{i=1}^j O_i / \sum_{i=1}^j n_i\) where \(O_i\) is the observed death count for census unit \(i\). This chunk of code replicates Bailey and Gatrell’s figure 8.1 with the one exception that the color scheme is reversed (Bailey and Gatrell assign lighter hues to higher numbers). auck.rate <- sum(mor) / sum(pop) mor.exp <- pop * auck.rate # Expected count over a nine year period auckland$rel.rate <- 100 * mor / mor.exp brks <- classIntervals(auckland$rel.rate, n = 6, style = "fixed", fixedBreaks = c(0,47, 83, 118, 154, 190, 704)) plot(auckland["rel.rate"], breaks = brks$brks, at = brks$brks, pal = pal1, key.pos = 4, las = 1) Dykes and Unwin’s chi-square statistic Dykes and Unwin (Dykes and Unwin 2001) propose a similar technique whereby the rates are standardized following: \[\frac{O_i - E_i}{\sqrt{E_i}}\] This has the effect of creating a distribution of values closer to normal (as opposed to a Poisson distribution of rates and counts encountered thus far). We can therefore apply a diverging color scheme where green hues represent less than expected rates and red hues represent greater than expected rates. auckland$chi.squ = (mor - mor.exp) / sqrt(mor.exp) brks <- classIntervals(auckland$chi.squ, n = 6, style = "fixed", fixedBreaks = c(-5, -3, -2, -1, 0, 1, 2, 3, 5)) plot(auckland["chi.squ"], breaks = brks$brks, at = brks$brks, pal=rev(pal2), key.pos = 4, las = 1) Unstable ratios One problem with the various techniques used thus far is their sensitivity (hence instability) to small underlying population counts (i.e. unstable ratios). This next chunk of code maps the under 5 population count by census area unit. brks <- classIntervals(auckland$Und5_81, n = 6, style = "equal") plot(auckland["Und5_81"], breaks = brks$brks, at = brks$brks, pal = pal1, key.pos = 4, las = 1) Note the variability in population count with some areas encompassing fewer than 50 infants. If there is just one death in that census unit, the death rate would be reported as \(1/50 * 1000\) or 20 per thousand infants–far more than the 2.63 per thousand rate for our Auckland data set.
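The overall rate cited here can be recomputed from the objects created earlier in this exercise (a quick sketch):
sum(mor) / sum(pop) * 1000   # Auckland-wide infant death rate per 1000 per year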
Interestingly, the three highest raw rates in Auckland (14.2450142, 18.5185185, 10.5820106 deaths per 1000) are associated with some of the smallest underlying population counts (39, 6, 21 infants under 5). One approach to circumventing this issue is to generate a probability map of the data. The next section highlights such an example. Global Empirical Bayes (EB) rate estimate The idea behind the Bayesian approach is to compare the value in some area \(i\) to some a priori estimate of the value and to “stabilize” the values affected by unstable ratios (e.g. where area populations are small). The a priori estimate can be based on some global mean. An example of a global EB infant mortality rate map (aka the global moment estimator of infant mortality per 1000 per year) is shown below. The EB map is shown side-by-side with the raw rates map for comparison. EB.est <- EBest(auckland$M77_85, auckland$Und5_81 * 9 ) auckland$EBest <- EB.est$estmm * 1000 brks1 <- classIntervals(auckland$EBest, n = 10, style = "quantile") brks2 <- classIntervals(auckland$raw.rate, n = 10, style = "quantile") plot(auckland["EBest"], breaks = brks1$brks, at = round(brks1$brks, 2), pal = pal3, main="EB rates", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, at = round(brks2$brks, 2), pal = pal3, main="Raw Rates", key.pos = 4, las = 1) The census units with the top 10% rates are highlighted in red. Unstable rates (i.e. those associated with smaller population counts) are assigned lower weights to reduce their “prominence” in the mapped data. Notice how the three high raw rates highlighted in the last section are reduced from 14.2450142, 18.5185185, 10.5820106 counts per thousand to 3.6610133, 2.8672132, 3.0283279 counts per thousand. The “remapping” of these values, along with others, is shown in the following plot: Local Empirical Bayes (EB) rate estimate The a priori mean and variance need not be aspatial (i.e. the prior distribution being the same for the entire Auckland study area). The adjusted estimated rates can be shrunk towards a local mean instead. Such a technique is referred to as a local empirical Bayes rate estimate. In the following example, we define local as consisting of all first order adjacent census unit areas. nb <- poly2nb(auckland) EBL.est <- EBlocal(auckland$M77_85, 9*auckland$Und5_81, nb) auckland$EBLest <- EBL.est$est * 1000 brks1 <- classIntervals(auckland$EBLest, n = 10, style = "quantile") brks2 <- classIntervals(auckland$raw.rate, n = 10, style = "quantile") plot(auckland["EBLest"], breaks = brks1$brks, at = round(brks1$brks,2), pal = pal3, main = "Local EB rates", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, at = round(brks2$brks,2), pal = pal3, main = "Raw Rates", key.pos = 4, las = 1) The census units with the top 10% rates are highlighted in red.
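To visualize the “shrinkage” of the unstable rates toward their local means, one option (a sketch using the columns created above, not a figure from the original text) is to plot the raw rates against the local EB estimates:
plot(EBLest ~ raw.rate, data = st_drop_geometry(auckland),
     xlab = "Raw rate (per 1000 per year)",
     ylab = "Local EB rate (per 1000 per year)")
abline(a = 0, b = 1, lty = 2)   # Reference line where no adjustment occurs
Points that fall well below the dashed line correspond to the small-population units whose rates were pulled down the most.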
References F Raster operations in R R terra sf tmap gdistance ggplot2 rasterVis 4.3.1 1.7.55 1.0.14 3.3.3 1.6.4 3.4.3 0.51.5 Sample files for this exercise We’ll first load spatial objects used in this exercise from a remote website: an elevation SpatRaster object, a bathymetry SpatRaster object and a continents sf vector object. library(terra) library(sf) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev_world.RDS")) elev <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/bath_world.RDS")) bath <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/continent_global.RDS")) cont <- readRDS(z) Both rasters cover the entire globe. Elevations below mean sea level are encoded as 0 in the elevation raster. Likewise, bathymetry values above mean sea level are encoded as 0. Note that most of the map algebra operations and functions covered in this tutorial are implemented using the terra package. See chapter 10 for a theoretical discussion of map algebra operations. Local operations and functions Unary operations and functions (applied to single rasters) Most algebraic operations can be applied to rasters as they would be with any vector element. For example, to convert all bathymetric values in bath (currently recorded as positive values) to negative values simply multiply the raster by -1. bath2 <- bath * (-1) Another unary operation that can be applied to a raster is reclassification. In the following example, we will assign a value of 1 to all bath2 values that are less than zero; all zero values will remain unchanged. A simple way to do this is to apply a conditional statement. bath3 <- bath2 < 0 Let’s look at the output. Note that all 0 pixels are coded as FALSE and all 1 pixels are coded as TRUE. library(tmap) tm_shape(bath3) + tm_raster(palette = "Greys") + tm_legend(outside = TRUE, text.size = .8) If a more elaborate form of reclassification is desired, you can use the classify function. In the following example, the raster object bath is reclassified to 4 unique values: 100, 500, 1000 and 11000 as follows: Original depth values Reclassified values 0 - 100 100 101 - 500 500 501 - 1000 1000 1001 - 11000 11000 The first step is to create a plain matrix where the first and second columns list the starting and ending values of the range of input values that are to be reclassified, and where the third column lists the new raster cell values. m <- c(0, 100, 100, 100, 500, 500, 500, 1000, 1000, 1000, 11000, 11000) m <- matrix(m, ncol=3, byrow = T) m [,1] [,2] [,3] [1,] 0 100 100 [2,] 100 500 500 [3,] 500 1000 1000 [4,] 1000 11000 11000 bath3 <- classify(bath, m, right = T) The right=T parameter indicates that the intervals should be closed to the right (i.e. the second column of the reclassification matrix is inclusive). tm_shape(bath3) + tm_raster(style="cat") + tm_legend(outside = TRUE, text.size = .8) You can also assign NA (missing) values to pixels. For example, to assign NA values to cells that are equal to 100, type bath3[bath3 == 100] <- NA The following chunk of code highlights all NA pixels in grey and labels them as missing.
tm_shape(bath3) + tm_raster(showNA=TRUE, colorNA="grey") + tm_legend(outside = TRUE, text.size = .8) Binary operations and functions (where two rasters are used) In the following example, elev (elevation raster) is added to bath (bathymetry raster) to create a single elevation raster for the globe. Note that the bathymetric raster will need to be multiplied by -1 to differentiate above mean sea level elevation from below mean sea level depth. elevation <- elev - bath tm_shape(elevation) + tm_raster(palette="-RdBu") + tm_legend(outside = TRUE, text.size = .8) Focal operations and functions Operations or functions applied focally to rasters involve user-defined neighboring cells. Focal operations can be performed using the focal function. For example, to smooth out the elevation raster by computing the mean cell values over an 11 by 11 cell window, type: f1 <- focal(elevation, w = 11 , fun = mean) The w argument defines the focal window. If it’s given a single number (as is the case in the above code chunk), that number will define the width and height (in cell counts) of the focal window with each cell assigned equal weight. w can also be passed a matrix with each element in that matrix defining the weight for each cell. The following code chunk will generate the same output as the previous code chunk: f1 <- focal(elevation, w = matrix(1, nrow = 11, ncol = 11), fun = mean) tm_shape(f1) + tm_raster(palette="-RdBu") + tm_legend(outside = TRUE, text.size = .8) By default, edge cells are assigned a value of NA. This is because cells outside of the input raster extent have no value, so when the average value is computed for a cell along the raster’s edge, the kernel will include the NA values outside the raster’s extent. To see an example of this, we will first smooth the raster using a 3 by 3 focal window, then we’ll zoom in on a 3 by 3 portion of the elevation raster in the upper left-hand corner of its extent. # Run a 3x3 smooth on the raster f_mean <- focal(elevation, w = 3 , fun = mean) Figure F.1: Upper left-hand corner of elevation raster. Note the NA values in the upper row (shown in bisque color). You might have noticed the lack of edge effect issues along the western edge of the raster outputs. This is because the focal function will wrap the eastern edge of the raster to the western edge of that same raster if the input raster layer spans the entire globe (i.e. from -180° to +180°). To have the focal function ignore missing values, simply add the na.rm = TRUE option. # Run a 3x3 smooth on the raster f_mean_no_na <- focal(elevation, w = 3 , fun = mean, na.rm = TRUE) Figure F.2: Upper left-hand corner of elevation raster. Border edge ignored. In essence, the values in the above row are each computed using just 6 values instead of 9 (the corner values still make use of the across-180° values). Another option is to expand the row edge beyond its extent by replicating the edge values. This can be done by setting expand to TRUE. For example: # Run a 3x3 smooth on the raster f_mean_expand <- focal(elevation, w = 3, fun = mean, expand = TRUE) Figure F.3: Upper left-hand corner of elevation raster. Note that if expand is set to TRUE, the na.rm argument is ignored. But you must be careful in making use of na.rm = TRUE if you are using a matrix to define the weights as opposed to using the fun functions.
For example, the mean function can be replicated using the matrix operation as follows: f_mean <- focal(elevation, w = 3, fun = mean) f_mat <- focal(elevation, w = matrix(1/9, nrow = 3, ncol = 3)) Note that if fun is not defined, it will default to summing the weighted pixel values. Figure F.4: Upper left-hand corner of elevation raster. Note the similar output. Now, if we set na.rm to TRUE in both functions, we get: f_mean <- focal(elevation, w = 3, fun = mean, na.rm = TRUE) f_mat <- focal(elevation, w = matrix(1/9, nrow = 3, ncol = 3), na.rm = TRUE) Figure F.5: Upper left-hand corner of elevation raster. Note the smaller edge values from the matrix-defined weights raster. This is because the matrix is assigning 1/9th the weight for each pixel regardless of the number of pixels used to compute the output pixel values. So the upper edge pixels are summing values from just 6 weighted pixels as opposed to nine. For example, the middle top pixel is computed from 1/9(-4113 -4113 -4112 -4107 -4104 -4103), which results in dividing the sum of six values by nine–hence the unbalanced weight effect. Note that we do not have that problem using the mean function. The neighbors matrix (or kernel) that defines the moving window can be customized. For example, if we wanted to compute the average of all 8 neighboring cells excluding the central cell, we could define the matrix as follows: m <- matrix(c(1,1,1,1,0,1,1,1,1)/8,nrow = 3) f2 <- focal(elevation, w=m, fun=sum) More complicated kernels can be defined. In the following example, a Sobel filter (used for edge detection in image processing) is defined then applied to the raster layer elevation. Sobel <- matrix(c(-1,0,1,-2,0,2,-1,0,1) / 4, nrow=3) f3 <- focal(elevation, w=Sobel, fun=sum) tm_shape(f3) + tm_raster(palette="Greys") + tm_legend(legend.show = FALSE) Zonal operations and functions A common zonal operation is the aggregation of cells. In the following example, raster layer elevation is aggregated by a factor of 2 (i.e. each output cell spans 2 by 2 cells of the original raster). z1 <- aggregate(elevation, fact=2, fun=mean, expand=TRUE) tm_shape(z1) + tm_raster(palette="-RdBu",n=6) + tm_legend(outside = TRUE, text.size = .8) The image may not look much different from the original, but a look at the image properties will show a difference in pixel sizes. res(elevation) [1] 0.3333333 0.3333333 res(z1) [1] 0.6666667 0.6666667 z1’s pixel dimensions are twice those of elevation. You can reverse the process by using the disagg function (disaggregate in the raster package), which will split a cell into the desired number of subcells while assigning each one the same parent cell value. Zonal operations can often involve two layers, one with the values to be aggregated, the other with the defined zones. In the next example, elevation’s cell values are averaged by zones defined by the cont polygon layer. The following chunk computes the mean elevation value for each unique polygon in cont: cont.elev <- extract(elevation, cont, fun=mean, bind = TRUE) The output is a SpatVector. If you want to output a dataframe, set bind to FALSE. cont.elev can be converted back to an sf object as follows: cont.elev.sf <- st_as_sf(cont.elev) The column of interest is automatically named band1. We can now map the average elevation by continent. tm_shape(cont.elev.sf) + tm_polygons(col="band1") + tm_legend(outside = TRUE, text.size = .8) Many custom functions can be applied to extract, as sketched next.
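For instance (a sketch; this particular summary is an assumption and not one of the tutorial’s original examples), the spread of elevation values within each continent could be extracted by passing an anonymous function that computes the standard deviation, mirroring the function(x, ...) pattern used below:
cont.sd <- extract(elevation, cont, fun = function(x, ...) sd(x, na.rm = TRUE), bind = TRUE)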
Built-in summary functions can also be passed directly. To extract the maximum elevation value by continent, type: cont.elev <- extract(elevation, cont, fun=max, bind = TRUE) As another example, we may wish to extract the number of pixels in each polygon using a customized function. cont.elev <- extract(elevation, cont, fun=function(x,...){length(x)}, bind = TRUE) Global operations and functions Global operations and functions may make use of all input cells of a grid in the computation of an output cell value. An example of a global function is the Euclidean distance function, distance, which computes the shortest distance between a pixel and a source (or destination) location. To demonstrate the distance function, we’ll first create a new raster layer with two non-NA pixels. r1 <- rast(ncols=100, nrows=100, xmin=0, xmax=100, ymin=0, ymax=100) r1[] <- NA # Assign NoData values to all pixels r1[c(850, 5650)] <- 1 # Change the pixels #850 and #5650 to 1 crs(r1) <- "+proj=ortho" # Assign an arbitrary coordinate system (needed for mapping with tmap) tm_shape(r1) + tm_raster(palette="red") + tm_legend(outside = TRUE, text.size = .8) Next, we’ll compute a Euclidean distance raster from these two cells. The output extent will default to the input raster extent. r1.d <- distance(r1) tm_shape(r1.d) + tm_raster(palette = "Greens", style="order", title="Distance") + tm_legend(outside = TRUE, text.size = .8) + tm_shape(r1) + tm_raster(palette="red", title="Points") You can also compute a distance raster using sf point objects. In the following example, distances to points (25,30) and (87,80) are computed for each output cell. However, since we are working off of point objects (and not an existing raster as was the case in the previous example), we will need to create a blank raster layer which will define the extent of the Euclidean distance raster output. r2 <- rast(ncols=100, nrows=100, xmin=0, xmax=100, ymin=0, ymax=100) crs(r2) <- "+proj=ortho" # Assign an arbitrary coordinate system # Create a point layer p1 <- st_as_sf(st_as_sfc("MULTIPOINT(25 30, 87 80)", crs = "+proj=ortho")) Now let’s compute the Euclidean distance to these points using the distance function. r2.d <- distance(r2, p1) Let’s plot the resulting output. tm_shape(r2.d) + tm_raster(palette = "Greens", style="order") + tm_legend(outside = TRUE, text.size = .8) + tm_shape(p1) + tm_bubbles(col="red") Computing cumulative distances This exercise demonstrates how to use functions from the gdistance package to generate a cumulative distance raster. One objective will be to demonstrate the influence that the definition of adjacent cells has on the final results. Load the gdistance package. library(gdistance) First, we’ll create a 100x100 raster and assign a value of 1 to each cell. The pixel value defines the cost (other than distance) in traversing that pixel. In this example, we’ll assume that the cost is uniform across the entire extent. r <- rast(nrows=100,ncols=100,xmin=0,ymin=0,xmax=100,ymax=100) r[] <- rep(1, ncell(r)) If you were to include traveling costs other than distance (such as elevation) you would assign those values to each cell instead of the constant value of 1. A transition matrix allows one to define a ‘traversing’ cost going from one cell to an adjacent cell. Since we are assuming there are no ‘costs’ (other than distance) in traversing from one cell to any adjacent cell we’ll assign a value of 1, function(x){1}, to the transition between a cell and its adjacent cells (i.e.
transition cost is uniform in all directions). There are four different ways in which ‘adjacency’ can be defined using the transition function. These are showcased in the next four blocks of code. In this example, adjacency is defined as a four node (vertical and horizontal) connection (i.e. a “rook” move). h4 <- transition(raster(r), transitionFunction = function(x){1}, directions = 4) In this example, adjacency is defined as an eight node connection (i.e. a single cell “queen” move). h8 <- transition(raster(r), transitionFunction = function(x){1}, directions = 8) In this example, adjacency is defined as a sixteen node connection (i.e. a single cell “queen” move combined with a “knight” move). h16 <- transition(raster(r), transitionFunction=function(x){1},16,symm=FALSE) In this example, adjacency is defined as a four node diagonal connection (i.e. a single cell “bishop” move). hb <- transition(raster(r), transitionFunction=function(x){1},"bishop",symm=FALSE) The transition function treats all adjacent cells as being at an equal distance from the source cell across the entire raster. geoCorrection corrects for ‘true’ local distance. In essence, it’s adding an additional cost to traversing from one cell to an adjacent cell (the original cost being defined using the transition function). The importance of applying this correction will be shown later. Note: geoCorrection also corrects for distance distortions associated with data in a geographic coordinate system. To take advantage of this correction, make sure to define the raster layer’s coordinate system using the projection function. h4 <- geoCorrection(h4, scl=FALSE) h8 <- geoCorrection(h8, scl=FALSE) h16 <- geoCorrection(h16, scl=FALSE) hb <- geoCorrection(hb, scl=FALSE) In the “queen’s” case, the diagonal neighbors are \\(\\sqrt{2 \\times CellWidth^2}\\) (i.e. \\(\\sqrt{2}\\) times the cell width) from the source cell. Next, we will map the cumulative distance (accCost) from a central point (A) to all cells in the raster using the four different adjacency definitions. A <- c(50,50) # Location of source cell h4.acc <- accCost(h4,A) h8.acc <- accCost(h8,A) h16.acc <- accCost(h16,A) hb.acc <- accCost(hb,A) If the geoCorrection function had not been applied in the previous steps, the cumulative distance between point location A and its neighboring adjacent cells would have been different. Note the difference in cumulative distance for the 16-direction case as shown in the next two figures. Uncorrected (i.e. geoCorrection not applied to h16): Corrected (i.e. geoCorrection applied to h16): The “bishop” case offers a unique problem: only cells in the diagonal direction are identified as being adjacent. This leaves many undefined cells (labeled as Inf). We will change the Inf cells to NA cells. hb.acc[hb.acc == Inf] <- NA Now let’s compare a 7x7 subset (centered on point A) between the four different cumulative distance rasters. To highlight the differences between all four rasters, we will assign a red color to all cells that are within 20 cell units of point A. It’s obvious that the accuracy of the cumulative distance raster can be greatly influenced by how we define adjacent nodes. The number of red cells (i.e. area identified as being within a 20 units cumulative distance) ranges from 925 to 2749 cells. Working example In the following example, we will generate a raster layer with barriers (defined as NA cell values).
The goal will be to identify all cells that fall within a 290 km traveling distance from the upper left-hand corner of the raster layer (the green point in the maps). Results between an 8-node and 16-node adjacency definition will be compared. # create an empty raster r <- rast(nrows=300,ncols=150,xmin=0,ymin=0,xmax=150000, ymax=300000) # Define a UTM projection (this sets map units to meters) crs(r) = "+proj=utm +zone=19 +datum=NAD83" # Each cell is assigned a value of 1 r[] <- rep(1, ncell(r)) # Generate 'baffles' by assigning NA to cells. Cells are identified by # their index and not their coordinates. # Baffles need to be 2 cells thick to prevent the 16-node # case from "jumping" a one pixel thick NA cell. a <- c(seq(3001,3100,1),seq(3151,3250,1)) a <- c(a, a+6000, a+12000, a+18000, a+24000, a+30000, a+36000) a <- c(a , a+3050) r[a] <- NA # Let's check that the baffles are properly placed tm_shape(r) + tm_raster(colorNA="red") + tm_legend(legend.show=FALSE) # Next, generate a transition matrix for the 8-node case and the 16-node case h8 <- transition(raster(r), transitionFunction = function(x){1}, directions = 8) h16 <- transition(raster(r), transitionFunction = function(x){1}, directions = 16) # Now assign distance cost to the matrices. h8 <- geoCorrection(h8) h16 <- geoCorrection(h16) # Define a point source and assign a projection A <- SpatialPoints(cbind(50,290000)) crs(A) <- "+proj=utm +zone=19 +datum=NAD83 +units=m +no_defs" # Compute the cumulative cost raster h8.acc <- accCost(h8, A) h16.acc <- accCost(h16,A) # Replace Inf with NA h8.acc[h8.acc == Inf] <- NA h16.acc[h16.acc == Inf] <- NA Let’s plot the results. Yellow cells will identify cumulative distances within 290 km. tm_shape(h8.acc) + tm_raster(n=2, style="fixed", breaks=c(0,290000,Inf)) + tm_facets() + tm_shape(A) + tm_bubbles(col="green", size = .5) + tm_legend(outside = TRUE, text.size = .8) tm_shape(h16.acc) + tm_raster(n=2, style="fixed", breaks=c(0,290000,Inf)) + tm_facets() + tm_shape(A) + tm_bubbles(col="green", size = .5) + tm_legend(outside = TRUE, text.size = .8) We can compute the difference between the 8-node and 16-node cumulative distance rasters: table(h8.acc[] <= 290000) FALSE TRUE 31458 10742 table(h16.acc[] <= 290000) FALSE TRUE 30842 11358 The number of cells identified as being within a 290 km cumulative distance of point A for the 8-node case is 10742 whereas it’s 11358 for the 16-node case, a difference of 5.4%. "],["coordinate-systems-in-r.html", "G Coordinate Systems in R A note about the changes to the PROJ environment Sample files for this exercise Loading the sf package Checking for a coordinate system Understanding the Proj4 coordinate syntax Assigning a coordinate system Transforming coordinate systems A note about containment Creating Tissot indicatrix circles", " G Coordinate Systems in R R terra sf tmap geosphere 4.3.1 1.7.55 1.0.14 3.3.3 1.5.18 A note about the changes to the PROJ environment Newer versions of sf make use of the PROJ 6.0 C library or greater. Note that the version of PROJ is not to be confused with the version of the proj4 R package–the proj4 and sf packages make use of the PROJ C library that is developed independent of R. You can learn more about the PROJ development at proj.org. There has been a significant change in the PROJ library since the introduction of version 6.0. This has had serious implications in the development of the R spatial ecosystem. 
As such, if you are using an older version of sf or proj4 that was developed with a version of PROJ older than 6.0, some of the input/output presented in this appendix may differ from yours. Sample files for this exercise Data used in this exercise can be loaded into your current R session by running the following chunk of code. library(terra) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev.RDS")) elev.r <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s.sf <- readRDS(z) We’ll make use of two data layers in this exercise: a Maine counties polygon layer (s.sf) and an elevation raster layer (elev.r). The former is in an sf format and the latter is in a SpatRaster format. Loading the sf package library(sf) Note the versions of GEOS, GDAL and PROJ the package sf is linked to. Different versions of these libraries may result in different outcomes than those shown in this appendix. You can check the linked library versions as follows: sf_extSoftVersion()[1:3] GEOS GDAL proj.4 "3.11.2" "3.7.2" "9.3.0" Checking for a coordinate system To extract coordinate system (CS) information from an sf object, use the st_crs function. st_crs(s.sf) Coordinate Reference System: User input: EPSG:26919 wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Engineering survey, topographic mapping."], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]], ID["EPSG",26919]] With the newer version of the PROJ C library, the coordinate system is defined using the Well Known Text (WKT/WKT2) format which consists of a series of [...] tags. The WKT format will usually start with a PROJCRS[...] tag for a projected coordinate system, or a GEOGCRS[...] tag for a geographic coordinate system. The CRS output will also consist of a user-defined CS definition which can be an EPSG code (as is the case in this example), or a string defining the datum and projection type. You can also extract CS information from a SpatRaster object using the st_crs function.
st_crs(elev.r) Coordinate Reference System: User input: BOUNDCRS[ SOURCECRS[ PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]]], TARGETCRS[ GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4326]]], ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84", METHOD["Geocentric translations (geog2D domain)", ID["EPSG",9603]], PARAMETER["X-axis translation",0, ID["EPSG",8605]], PARAMETER["Y-axis translation",0, ID["EPSG",8606]], PARAMETER["Z-axis translation",0, ID["EPSG",8607]]]] wkt: BOUNDCRS[ SOURCECRS[ PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]]], TARGETCRS[ GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4326]]], ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84", METHOD["Geocentric translations (geog2D domain)", ID["EPSG",9603]], PARAMETER["X-axis translation",0, ID["EPSG",8605]], PARAMETER["Y-axis translation",0, ID["EPSG",8606]], PARAMETER["Z-axis translation",0, ID["EPSG",8607]]]] Up until recently, there has been two ways of defining a coordinate system: via the EPSG numeric code or via the PROJ4 formatted string. 
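A quick sketch of these two (pre-WKT) ways of pointing to the same coordinate system, here the UTM NAD83 Zone 19N system used above (st_crs accepts either form and returns the corresponding CRS object):
st_crs(26919)                                # via the EPSG code
st_crs("+proj=utm +zone=19 +datum=NAD83")    # via a PROJ4 string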
Both can be used with the sf and SpatRaster objects. With the newer version of the PROJ C library, you can also define an sf object’s coordinate system using the Well Known Text (WKT/WKT2) format. This format has a more elaborate syntax (as can be seen in the previous outputs) and may not necessarily be the easiest way to manually define a CS. When possible, adopt an EPSG code, which comes from a well-established authority. However, if customizing a CS, it may be easiest to adopt a PROJ4 syntax. Understanding the Proj4 coordinate syntax The PROJ4 syntax consists of a list of parameters, each prefixed with the + character. For example, elev.r’s CS is in a UTM projection (+proj=utm) for zone 19 (+zone=19) and in an NAD 1983 datum (+datum=NAD83). A list of a few of the PROJ4 parameters used in defining a coordinate system follows. Click here for a full list of parameters. +a Semimajor radius of the ellipsoid axis +b Semiminor radius of the ellipsoid axis +datum Datum name +ellps Ellipsoid name +lat_0 Latitude of origin +lat_1 Latitude of first standard parallel +lat_2 Latitude of second standard parallel +lat_ts Latitude of true scale +lon_0 Central meridian +over Allow longitude output outside -180 to 180 range, disables wrapping +proj Projection name +south Denotes southern hemisphere UTM zone +units meters, US survey feet, etc. +x_0 False easting +y_0 False northing +zone UTM zone You can view the list of available projections +proj= here. Assigning a coordinate system A coordinate system definition can be passed to a spatial object. It can either fill a spatial object’s empty CS definition or it can overwrite its existing CS definition (the latter should only be executed if there is good reason to believe that the original definition is erroneous). Note that this step does not change an object’s underlying coordinate values (this process will be discussed in the next section). We’ll pretend that a CS definition was not assigned to s.sf and assign one manually using the st_set_crs() function. In the following example, we will define the CS using the proj4 syntax. s.sf <- st_set_crs(s.sf, "+proj=utm +zone=19 +ellps=GRS80 +datum=NAD83") Let’s now check the object’s CS. st_crs(s.sf) Coordinate Reference System: User input: +proj=utm +zone=19 +ellps=GRS80 +datum=NAD83 wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] You’ll note that the User input: field now shows the proj4 string as defined in our call to the st_set_crs() function. But you’ll also note that some of the parameters in the WKT string such as the PROJCRS[...] and BASEGEOGCRS[...] tags are not defined (unknown).
This is not necessarily a problem given that key datum and projection information are present in that WKT string (make sure to scroll down in the output box to see the other WKT parameters). Nonetheless, it’s not a bad idea to define the CS using EPSG code when one is available. We’ll do this next. The UTM NAD83 Zone 19N EPSG code equivalent is 26919. s.sf <- st_set_crs(s.sf, 26919) Let’s now check the object’s CS. st_crs(s.sf) Coordinate Reference System: User input: EPSG:26919 wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Engineering survey, topographic mapping."], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]], ID["EPSG",26919]] Key projection parameters remain the same. But additional information is added to the WKT header. You can use the PROJ4 string defined earlier for s.sf to define a raster’s CRS using the crs() function as follows (here too we’ll assume that the spatial object had a missing reference system or an incorrectly defined one). crs(elev.r) <- "+proj=utm +zone=19 +ellps=GRS80 +datum=NAD83" Note that we do not need to define all of the parameters so long as we know that the default values for these undefined parameters are correct. Also note that we do not need to designate a hemisphere since the NAD83 datum applies only to North America. 
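As an aside, more recent releases of terra and sf also accept a simple authority:code string, which can be a less error-prone alternative to a full PROJ4 string when an EPSG code is known. A sketch (shown for reference only; the steps that follow keep the PROJ4 definition assigned above):
crs(elev.r) <- "EPSG:26919"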
Let’s check the raster’s CS: st_crs(elev.r) Coordinate Reference System: User input: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] To define a raster’s CS using an EPSG code, use the following PROJ4 syntax: crs(elev.r) <- "+init=EPSG:26919" st_crs(elev.r) Coordinate Reference System: User input: NAD83 / UTM zone 19N wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]], USAGE[ SCOPE["unknown"], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. 
United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]]] To recreate a CS defined in software such as ArcGIS, it is best to extract the CS’s WKID/EPSG code, then use that number to look up the PROJ4 syntax on http://spatialreference.org/ref/. For example, in ArcGIS, the WKID number can be extracted from the coordinate system properties output. Figure G.1: An ArcGIS dataframe coordinate system properties window. Note the WKID/EPSG code of 26919 (highlighted in red) associated with the NAD 1983 UTM Zone 19 N CS. That number can then be entered in the search box at http://spatialreference.org/ref/ to pull up the Proj4 parameters (note that you must select Proj4 from the list of syntax options). Figure G.2: Example of a search result for EPSG 26919 at http://spatialreference.org/ref/. Note that after clicking the EPSG:26919 link, you must then select the Proj4 syntax from a list of available syntaxes to view the projection parameters. Here are examples of a few common projections: Projection WKID Authority Syntax UTM NAD 83 Zone 19N 26919 EPSG +proj=utm +zone=19 +ellps=GRS80 +datum=NAD83 +units=m +no_defs USA Contiguous Albers equal area 102003 ESRI +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs Alaska Albers equal area 3338 EPSG +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs World Robinson 54030 ESRI +proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs Transforming coordinate systems The last section showed you how to define or modify the coordinate system definition. This section shows you how to transform the coordinate values associated with the spatial object to a different coordinate system. This process calculates new coordinate values for the points or vertices defining the spatial object. For example, to transform the s.sf vector object to a WGS 1984 geographic (long/lat) coordinate system, we’ll use the st_transform function.
s.sf.gcs <- st_transform(s.sf, "+proj=longlat +datum=WGS84") st_crs(s.sf.gcs) Coordinate Reference System: User input: +proj=longlat +datum=WGS84 wkt: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] Using the EPSG code equivalent (4326) instead of the proj4 string yields: s.sf.gcs <- st_transform(s.sf, 4326) st_crs(s.sf.gcs) Coordinate Reference System: User input: EPSG:4326 wkt: GEOGCRS["WGS 84", ENSEMBLE["World Geodetic System 1984 ensemble", MEMBER["World Geodetic System 1984 (Transit)"], MEMBER["World Geodetic System 1984 (G730)"], MEMBER["World Geodetic System 1984 (G873)"], MEMBER["World Geodetic System 1984 (G1150)"], MEMBER["World Geodetic System 1984 (G1674)"], MEMBER["World Geodetic System 1984 (G1762)"], MEMBER["World Geodetic System 1984 (G2139)"], ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ENSEMBLEACCURACY[2.0]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], USAGE[ SCOPE["Horizontal component of 3D system."], AREA["World."], BBOX[-90,-180,90,180]], ID["EPSG",4326]] This approach may add a few more tags (These reflect changes in datum definitions in newer versions of the PROJ library) but, the coordinate values should be the same To transform a raster object, use the project() function. elev.r.gcs <- project(elev.r, y="+proj=longlat +datum=WGS84") st_crs(elev.r.gcs) Coordinate Reference System: User input: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] wkt: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] If an EPSG code is to be used, adopt the \"+init=EPSG: ...\" syntax used earlier in this tutorial. 
elev.r.gcs <- project(elev.r, y="+init=EPSG:4326") st_crs(elev.r.gcs) Coordinate Reference System: User input: WGS 84 wkt: GEOGCRS["WGS 84", ENSEMBLE["World Geodetic System 1984 ensemble", MEMBER["World Geodetic System 1984 (Transit)", ID["EPSG",1166]], MEMBER["World Geodetic System 1984 (G730)", ID["EPSG",1152]], MEMBER["World Geodetic System 1984 (G873)", ID["EPSG",1153]], MEMBER["World Geodetic System 1984 (G1150)", ID["EPSG",1154]], MEMBER["World Geodetic System 1984 (G1674)", ID["EPSG",1155]], MEMBER["World Geodetic System 1984 (G1762)", ID["EPSG",1156]], MEMBER["World Geodetic System 1984 (G2139)", ID["EPSG",1309]], ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1], ID["EPSG",7030]], ENSEMBLEACCURACY[2.0], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], USAGE[ SCOPE["unknown"], AREA["World."], BBOX[-90,-180,90,180]]] A geographic coordinate system is often desired when overlapping a layer with a web based mapping service such as Google, Bing or OpenStreetMap (even though these web based services end up projecting to a projected coordinate system–most likely a Web Mercator projection). To check that s.sf.gcs was properly transformed, we’ll overlay it on top of an OpenStreetMap using the leaflet package. library(leaflet) leaflet(s.sf.gcs) %>% addPolygons() %>% addTiles() Next, we’ll explore other transformations using a tmap dataset of the world library(tmap) data(World) # The dataset is stored as an sf object # Let's check its current coordinate system st_crs(World) Coordinate Reference System: User input: EPSG:4326 wkt: GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], USAGE[ SCOPE["unknown"], AREA["World"], BBOX[-90,-180,90,180]], ID["EPSG",4326]] The following chunk transforms the world map to a custom azimuthal equidistant projection centered on latitude 0 and longitude 0. Here, we’ll use the proj4 syntax. 
World.ae <- st_transform(World, "+proj=aeqd +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") Let’s check the CRS of the newly created vector layer st_crs(World.ae) Coordinate Reference System: User input: +proj=aeqd +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["unknown", METHOD["Modified Azimuthal Equidistant", ID["EPSG",9832]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["False easting",0, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] Here’s the mapped output: tm_shape(World.ae) + tm_fill() The following chunk transforms the world map to an Azimuthal equidistant projection centered on Maine, USA (69.8° West, 44.5° North) . World.aemaine <- st_transform(World, "+proj=aeqd +lat_0=44.5 +lon_0=-69.8 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.aemaine) + tm_fill() The following chunk transforms the world map to a World Robinson projection. World.robin <- st_transform(World,"+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.robin) + tm_fill() The following chunk transforms the world map to a World sinusoidal projection. World.sin <- st_transform(World,"+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.sin) + tm_fill() The following chunk transforms the world map to a World Mercator projection. World.mercator <- st_transform(World,"+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.mercator) + tm_fill() Reprojecting to a new meridian center An issue that can come up when transforming spatial data is when the location of the tangent line(s) or points in the CS definition forces polygon features to be split across the 180° meridian. For example, re-centering the Mercator projection to -69° will create the following output. World.mercator2 <- st_transform(World, "+proj=merc +lon_0=-69 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.mercator2) + tm_borders() The polygons are split and R does not know how to piece them together. One solution is to split the polygons at the new meridian using the st_break_antimeridian function before projecting to a new re-centered coordinate system. # Define new meridian meridian2 <- -69 # Split world at new meridian wld.new <- st_break_antimeridian(World, lon_0 = meridian2) # Now reproject to Mercator using new meridian center wld.merc2 <- st_transform(wld.new, paste("+proj=merc +lon_0=", meridian2 , "+k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") ) tm_shape(wld.merc2) + tm_borders() This technique can be applied to any other projections. Here’s an example of a Robinson projection. 
wld.rob.sf <- st_transform(wld.new, paste("+proj=robin +lon_0=", meridian2 , "+k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") ) tm_shape(wld.rob.sf) + tm_borders() A note about containment While in theory, a point completely enclosed by a bounded area should always remain bounded by that area in any projection, this is not always the case in practice. This is because the transformation applies to the vertices that define the line segments and not the lines themselves. So if a point is inside of a polygon and very close to one of its boundaries in its native projection, it may find itself on the other side of that line segment in another projection hence outside of that polygon. In the following example, a polygon layer and point layer are created in a Miller coordinate system where the points are enclosed in the polygons. # Define a few projections miller <- "+proj=mill +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" lambert <- "+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs" # Subset the World data layer and reproject to Miller wld.mil <- subset(World, iso_a3 == "CAN" | iso_a3 == "USA") |> st_transform(miller) # Create polygon and point layers in the Miller projection sf1 <- st_sfc( st_polygon(list(cbind(c(-13340256,-13340256,-6661069, -6661069, -13340256), c(7713751, 5326023, 5326023,7713751, 7713751 )))), crs = miller) pt1 <- st_sfc( st_multipoint(rbind(c(-11688500,7633570), c(-11688500,5375780), c(-10018800,7633570), c(-10018800,5375780), c(-8348960,7633570), c(-8348960,5375780))), crs = miller) pt1 <- st_cast(pt1, "POINT") # Create single part points # Plot the data layers in their native projection tm_shape(wld.mil) +tm_fill(col="grey") + tm_graticules(x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf1) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1) + tm_dots(size=0.2) The points are close to the boundaries, but they are inside of the polygon nonetheless. To confirm, we can run st_contains on the dataset: st_contains(sf1, pt1) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 2, 3, 4, 5, 6 All six points are selected, as expected. Now, let’s reproject the data into a Lambert conformal projection. # Transform the data wld.lam <- st_transform(wld.mil, lambert) pt1.lam <- st_transform(pt1, lambert) sf1.lam <- st_transform(sf1, lambert) # Plot the data in the Lambert coordinate system tm_shape(wld.lam) +tm_fill(col="grey") + tm_graticules( x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf1.lam) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1.lam) + tm_dots(size=0.2) Only three of the points are contained. We can confirm this using the st_contains function: st_contains(sf1.lam, pt1.lam) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 3, 5 To resolve this problem, one should densify the polygon by adding more vertices along the line segment. The vertices density will be dictated by the resolution needed to preserve the map’s containment properties and is best determined experimentally. We’ll use the st_segmentize function to create vertices at 1 km (1000 m) intervals. 
# Add vertices every 1000 meters along the polygon's line segments sf2 <- st_segmentize(sf1, 1000) # Transform the newly densified polygon layer sf2.lam <- st_transform(sf2, lambert) # Plot the data tm_shape(wld.lam) + tm_fill(col="grey") + tm_graticules( x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf2.lam) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1.lam) + tm_dots(size=0.2) Now all points remain contained by the polygon. We can check via: st_contains(sf2.lam, pt1.lam) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 2, 3, 4, 5, 6 Creating Tissot indicatrix circles Most projections will distort some aspect of a spatial property, especially area and shape. A nice way to visualize the distortion afforded by a projection is to create geodesic circles. First, create a point layer that will define the circle centers in a lat/long coordinate system. tissot.pt <- st_sfc( st_multipoint(rbind(c(-60,30), c(-60,45), c(-60,60), c(-80,30), c(-80,45), c(-80,60), c(-100,30), c(-100,45), c(-100,60), c(-120,30), c(-120,45), c(-120,60) )), crs = "+proj=longlat") tissot.pt <- st_cast(tissot.pt, "POINT") # Create single part points Next, we’ll construct geodesic circles from these points using the geosphere package. library(geosphere) cr.pt <- list() # Create an empty list # Loop through each point in tissot.pt and generate 360 vertices at 300 km # from each point in all directions at 1 degree increments. These vertices # will be used to approximate the Tissot circles for (i in 1:length(tissot.pt)){ cr.pt[[i]] <- list( destPoint( as(tissot.pt[i], "Spatial"), b=seq(0,360,1), d=300000) ) } # Create a closed polygon from the previously generated vertices tissot.sfc <- st_cast( st_sfc(st_multipolygon(cr.pt ),crs = "+proj=longlat"), "POLYGON" ) We’ll check that these are indeed geodesic circles by computing the geodesic area of each polygon. We’ll use the st_area function from sf which will revert to geodesic area calculation if a lat/long coordinate system is present. tissot.sf <- st_sf( geoArea = st_area(tissot.sfc), tissot.sfc ) The true area of the circles should be \\(\\pi r^2\\), or \\(2.8274334 \\times 10^{11}\\) square meters in our example. Let’s compute the error in the Tissot output. The values will be reported as fractions. ( (pi * 300000^2) - as.vector(tissot.sf$geoArea) ) / (pi * 300000^2) [1] -0.0008937164 0.0024530577 0.0057943110 -0.0008937164 [5] 0.0024530577 0.0057943110 -0.0008937164 0.0024530577 [9] 0.0057943110 -0.0008937164 0.0024530577 0.0057943110 In all cases, the error is less than 0.6%. The error is primarily due to the discretization of the circle’s perimeter (it is approximated here with 360 vertices). Let’s now take a look at the distortions associated with a few popular coordinate systems. We’ll start by exploring the Mercator projection.
# Transform geodesic circles and compute area error as a percentage tissot.merc <- st_transform(tissot.sf, "+proj=merc +ellps=WGS84") tissot.merc$area_err <- round( (st_area(tissot.merc) - tissot.merc$geoArea) / tissot.merc$geoArea * 100 , 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.merc), projection = st_crs(tissot.merc)) + tm_borders() + tm_shape(tissot.merc) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") The Mercator projection does a good job at preserving shape, but the area’s distortion increases dramatically poleward. Next, we’ll explore the Lambert azimuthal equal area projection centered at 45 degrees north and 100 degrees west. # Transform geodesic circles and compute area error as a percentage tissot.laea <- st_transform(tissot.sf, "+proj=laea +lat_0=45 +lon_0=-100 +ellps=WGS84") tissot.laea$area_err <- round( (st_area(tissot.laea ) - tissot.laea$geoArea) / tissot.laea$geoArea * 100, 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.laea), projection = st_crs(tissot.laea)) + tm_borders() + tm_shape(tissot.laea) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x=c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") The area error across the 48 states is near 0. But note that the shape does become slightly distorted as we move away from the center of projection. Next, we’ll explore the Robinson projection. # Transform geodesic circles and compute area error as a percentage tissot.robin <- st_transform(tissot.sf, "+proj=robin +ellps=WGS84") tissot.robin$area_err <- round( (st_area(tissot.robin ) - tissot.robin$geoArea) / tissot.robin$geoArea * 100, 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.robin), projection = st_crs(tissot.robin)) + tm_borders() + tm_shape(tissot.robin) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x=c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") Both shape and area are measurably distorted for the North American continent. "],["point-pattern-analysis-in-r.html", "H Point pattern analysis in R Sample files for this exercise Prepping the data Density based analysis Distance based analysis Hypothesis tests", " H Point pattern analysis in R R spatstat 4.3.1 3.0.7 For a basic theoretical treatise on point pattern analysis (PPA) the reader is encouraged to review the point pattern analysis lecture notes. This section is intended to supplement the lecture notes by implementing PPA techniques in the R programming environment. Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. load(url("https://github.com/mgimond/Spatial/raw/main/Data/ppa.RData")) The data objects consist of three spatial data layers: starbucks: A ppp point layer of Starbucks stores in Massachusetts; ma: An owin polygon layer of Massachusetts boundaries; pop: An im raster layer of population density distribution. All layers are in a format supported by the spatstat (Baddeley, Rubak, and Turner 2016) package. Note that these layers are not authoritative and are to be used for instructional purposes only.
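As a quick sanity check (a sketch; none of this is required for the analyses that follow), you can confirm the class of each loaded layer before proceeding:
class(starbucks)  # a "ppp" point pattern
class(ma)         # an "owin" window
class(pop)        # an "im" pixel image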
Prepping the data All point pattern analysis tools used in this tutorial are available in the spatstat package. These tools are designed to work with points stored as ppp objects and not SpatialPointsDataFrame or sf objects. Note that a ppp object may or may not have attribute information (also referred to as marks). Knowing whether or not a function requires that an attribute table be present in the ppp object matters if the operation is to complete successfully. In this tutorial we will only concern ourselves with the pattern generated by the points and not their attributes. We’ll therefore remove all marks from the point object. library(spatstat) marks(starbucks) <- NULL Many point pattern analyses such as the average nearest neighbor analysis should have their study boundaries explicitly defined. This can be done in spatstat by “binding” the Massachusetts boundary polygon to the Starbucks point feature object using the Window() function. Note that the function name starts with an upper case W. Window(starbucks) <- ma We can plot the point layer to ensure that the boundary is properly defined for that layer. plot(starbucks, main=NULL, cols=rgb(0,0,0,.2), pch=20) We’ll make another change to the dataset. Population density values for an administrative layer are usually quite skewed. The population density for Massachusetts is no exception. The following code chunk generates a histogram from the pop raster layer. hist(pop, main=NULL, las=1) Transforming the skewed distribution in the population density covariate may help reveal relationships between point distributions and the covariate in some of the point pattern analyses covered later in this tutorial. We’ll therefore create a log-transformed version of pop. pop.lg <- log(pop) hist(pop.lg, main=NULL, las=1) We’ll be making use of both expressions of the population density distribution in the following exercises. Density based analysis Quadrat density You can compute the quadrat count and intensity using spatstat’s quadratcount() and intensity() functions. The following code chunk divides the state of Massachusetts into a grid of 3 rows and 6 columns then tallies the number of points falling in each quadrat. Q <- quadratcount(starbucks, nx= 6, ny=3) The object Q stores the number of points inside each quadrat. You can plot the quadrats along with the counts as follows: plot(starbucks, pch=20, cols="grey70", main=NULL) # Plot points plot(Q, add=TRUE) # Add quadrat grid You can compute the density of points within each quadrat as follows: # Compute the density for each quadrat Q.d <- intensity(Q) # Plot the density plot(intensity(Q, image=TRUE), main=NULL, las=1) # Plot density raster plot(starbucks, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) # Add points The density values are reported as the number of points (stores) per square meters, per quadrat. The Length dimension unit is extracted from the coordinate system associated with the point layer. In this example, the length unit is in meters, so the density is reported as points per square meter. Such a small length unit is not practical at this scale of analysis. It’s therefore desirable to rescale the spatial objects to a larger length unit such as the kilometer. starbucks.km <- rescale(starbucks, 1000, "km") ma.km <- rescale(ma, 1000, "km") pop.km <- rescale(pop, 1000, "km") pop.lg.km <- rescale(pop.lg, 1000, "km") The second argument to the rescale function divides the current unit (meter) to get the new unit (kilometer). This gives us more sensible density values to work with. 
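To confirm that the rescaling behaved as expected, here is a quick sketch (assuming the objects created above) that compares the study window width of the original and rescaled point patterns:
diff(Window(starbucks)$xrange)     # window width, in meters
diff(Window(starbucks.km)$xrange)  # the same width, now in kilometers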
# Compute the density for each quadrat (in counts per km2) Q <- quadratcount(starbucks.km, nx= 6, ny=3) Q.d <- intensity(Q) # Plot the density plot(intensity(Q, image=TRUE), main=NULL, las=1) # Plot density raster plot(starbucks.km, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) # Add points Quadrat density on a tessellated surface We can use a covariate such as the population density raster to define non-uniform quadrats. We’ll first divide the population density covariate into four regions (aka tessellated surfaces) following an equal interval classification scheme. Recall that we are working with the log transformed population density values. The breaks will be defined as follows: Break Logged population density value 1 ] -Inf; 4 ] 2 ] 4 ; 6 ] 3 ] 6 ; 8 ] 4 ] 8 ; Inf ] brk <- c( -Inf, 4, 6, 8 , Inf) # Define the breaks Zcut <- cut(pop.lg.km, breaks=brk, labels=1:4) # Classify the raster E <- tess(image=Zcut) # Create a tessellated surface The tessellated object can be mapped to view the spatial distribution of quadrats. plot(E, main="", las=1) Next, we’ll tally the quadrat counts within each tessellated area then compute their density values (number of points per quadrat area). Q <- quadratcount(starbucks.km, tess = E) # Tally counts Q.d <- intensity(Q) # Compute density Q.d tile 1 2 3 4 0.0000000000 0.0003706106 0.0103132964 0.0889370933 Recall that the length unit is kilometer so the above density values are number of points per square kilometer within each quadrat unit. Plot the density values across each tessellated region. plot(intensity(Q, image=TRUE), las=1, main=NULL) plot(starbucks.km, pch=20, cex=0.6, col=rgb(1,1,1,.5), add=TRUE) Let’s modify the color scheme. cl <- interp.colours(c("lightyellow", "orange" ,"red"), E$n) plot( intensity(Q, image=TRUE), las=1, col=cl, main=NULL) plot(starbucks.km, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) Kernel density raster The spatstat package has a function called density which computes an isotropic kernel intensity estimate of the point pattern. Its bandwidth defines the kernel’s window extent. This next code chunk uses the default bandwidth. K1 <- density(starbucks.km) # Using the default bandwidth plot(K1, main=NULL, las=1) contour(K1, add=TRUE) In this next chunk, a 50 km bandwidth (sigma = 50) is used. Note that the length unit is extracted from the point layer’s mapping units (which was rescaled to kilometers earlier in this exercise). K2 <- density(starbucks.km, sigma=50) # Using a 50km bandwidth plot(K2, main=NULL, las=1) contour(K2, add=TRUE) The kernel defaults to a Gaussian smoothing function. The smoothing function can be changed to a quartic, disc or Epanechnikov function. For example, to change the kernel to a disc function type: K3 <- density(starbucks.km, kernel = "disc", sigma=50) # Using a 50km bandwidth plot(K3, main=NULL, las=1) contour(K3, add=TRUE) Kernel density adjusted for covariate In the following example, a Starbucks store point process’ intensity is estimated following the population density raster covariate. The outputs include a plot of \\(\\rho\\) vs. population density and a raster map of \\(\\rho\\) controlled for population density. # Compute rho using the ratio method rho <- rhohat(starbucks.km, pop.lg.km, method="ratio") # Generate rho vs covariate plot plot(rho, las=1, main=NULL, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) It’s important to note that we are not fitting a parametric model to the data. Instead, a non-parametric curve is fit to the data.
Its purpose is to describe/explore the shape of the relationship between point density and covariate. Note the exponentially increasing intensity of Starbucks stores with increasing population density values when the population density is expressed as a log. The grey envelope represents the 95% confidence interval. The following code chunk generates the map of the predicted Starbucks density if population density were the sole driving process. (Note the use of the gamma parameter to “stretch” the color scheme in the map). pred <- predict(rho) cl <- interp.colours(c("lightyellow", "orange" ,"red"), 100) # Create color scheme plot(pred, col=cl, las=1, main=NULL, gamma = 0.25) The predicted intensity’s spatial pattern mirrors the covariate’s population distribution pattern. The predicted intensity values range from 0 to about 5 stores per square kilometer. You’ll note that this maximum value does not match the maximum value of ~3 shown in the rho vs population density plot. This is because the plot did not show the full range of population density values (the max density value shown was 10). The population raster layer has a maximum pixel value of 11.03 (this value can be extracted via max(pop.lg.km)). We can compare the output of the predicted Starbucks stores intensity function to that of the observed Starbucks stores intensity function. We’ll use the variable K1 computed earlier to represent the observed intensity function. K1_vs_pred <- pairs(K1, pred, plot = FALSE) plot(K1_vs_pred$pred ~ K1_vs_pred$K1, pch=20, xlab = "Observed intensity", ylab = "Predicted intensity", col = rgb(0,0,0,0.1)) If the modeled intensity were comparable to the observed intensity, we would expect the points to cluster along a one-to-one diagonal. An extreme example is to compare the observed intensity with itself which offers a perfect match of intensity values. K1_vs_K1 <- pairs(K1, K1, labels = c("K1a", "K1b"), plot = FALSE) plot(K1_vs_K1$K1a ~ K1_vs_K1$K1b, pch=20, xlab = "Observed intensity", ylab = "Observed intensity") So going back to our predicted vs observed intensity plot, we note a strong skew in the predicted intensity values. We also note an overestimation of intensity around higher values. summary(as.data.frame(K1_vs_pred)) K1 pred Min. :8.846e-05 Min. :0.000000 1st Qu.:1.207e-03 1st Qu.:0.000282 Median :3.377e-03 Median :0.001541 Mean :8.473e-03 Mean :0.007821 3rd Qu.:1.078e-02 3rd Qu.:0.005904 Max. :5.693e-02 Max. :5.103985 The predicted maximum intensity value is two orders of magnitude greater than that observed. The overestimation of intensity values can also be observed at lower values. The following plot limits the data to observed intensities less than 0.04. A red one-to-one line is added for reference. If intensities were similar, they would aggregate around this line. plot(K1_vs_pred$pred ~ K1_vs_pred$K1, pch=20, xlab = "Observed intensity", ylab = "Predicted intensity", col = rgb(0,0,0,0.1), xlim = c(0, 0.04), ylim = c(0, 0.1)) abline(a=0, b = 1, col = "red") Modeling intensity as a function of a covariate The relationship between the predicted Starbucks store point pattern intensity and the population density distribution can be modeled following a Poisson point process model. We’ll generate the Poisson point process model then plot the results.
# Create the Poisson point process model PPM1 <- ppm(starbucks.km ~ pop.lg.km) # Plot the relationship plot(effectfun(PPM1, "pop.lg.km", se.fit=TRUE), main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) Note that this is not the same relationship as \\(\\rho\\) vs. population density shown in the previous section. Here, we're fitting a well-defined model to the data whose parameters can be extracted from the PPM1 object. PPM1 Nonstationary Poisson process Fitted to point pattern dataset 'starbucks.km' Log intensity: ~pop.lg.km Fitted trend coefficients: (Intercept) pop.lg.km -13.710551 1.279928 Estimate S.E. CI95.lo CI95.hi Ztest Zval (Intercept) -13.710551 0.46745489 -14.626746 -12.794356 *** -29.33021 pop.lg.km 1.279928 0.05626785 1.169645 1.390211 *** 22.74705 Problem: Values of the covariate 'pop.lg.km' were NA or undefined at 0.57% (4 out of 699) of the quadrature points The model takes on the form: \\[ \\lambda(i) = e^{-13.71 + 1.27(logged\\ population\\ density)} \\] Here, the base intensity is close to zero (\\(e^{-13.71}\\)) when the logged population density is zero, and for every increase of one unit of the logged population density, the Starbucks point density increases by a factor of \\(e^{1.27}\\). Distance based analysis Next, we'll explore three different distance based analyses: the average nearest neighbor, the \\(K\\) and \\(L\\) functions, and the pair correlation function \\(g\\). Average nearest neighbor analysis Next, we'll compute the average nearest neighbor (ANN) distances between Starbucks stores. To compute the average first nearest neighbor distance (in kilometers) set k=1: mean(nndist(starbucks.km, k=1)) [1] 3.275492 To compute the average second nearest neighbor distance set k=2: mean(nndist(starbucks.km, k=2)) [1] 5.81173 The parameter k can take on any order neighbor (up to n-1 where n is the total number of points). The average nearest neighbor function can be expanded to generate an ANN vs neighbor order plot. In the following example, we'll plot ANN as a function of neighbor order for the first 100 closest neighbors: ANN <- apply(nndist(starbucks.km, k=1:100),2,FUN=mean) plot(ANN ~ eval(1:100), type="b", main=NULL, las=1) The bottom axis shows the neighbor order number and the left axis shows the average distance in kilometers. K and L functions To compute the K function, type: K <- Kest(starbucks.km) plot(K, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) The plot returns different estimates of \\(K\\) depending on the edge correction chosen. By default, the isotropic, translate and border corrections are implemented. To learn more about these edge correction methods type ?Kest at the command line. The estimated \\(K\\) functions are listed with a hat ^. The black line (\\(K_{pois}\\)) represents the theoretical \\(K\\) function under the null hypothesis that the points are completely randomly distributed (CSR/IRP). Where \\(K\\) falls under the theoretical \\(K_{pois}\\) line, the points are deemed more dispersed than expected at distance \\(r\\). Where \\(K\\) falls above the theoretical \\(K_{pois}\\) line, the points are deemed more clustered than expected at distance \\(r\\). To compute the L function, type: L <- Lest(starbucks.km, main=NULL) plot(L, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) To plot the L function with the expected \\(L\\) line set horizontal: plot(L, .
-r ~ r, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) Pair correlation function g To compute the pair correlation function type: g <- pcf(starbucks.km) plot(g, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) As with the Kest and Lest functions, the pcf function outputs different estimates of \\(g\\) using different edge correction methods (Ripley and Translate). The theoretical \\(g\\)-function \\(g_{Pois}\\) under a CSR process (green dashed line) is also displayed for comparison. Where the observed \\(g\\) is greater than \\(g_{Pois}\\) we can expect more clustering than expected and where the observed \\(g\\) is less than \\(g_{Pois}\\) we can expect more dispersion than expected. Hypothesis tests Test for clustering/dispersion First, we’ll run an ANN analysis for Starbucks locations assuming a uniform point density across the state (i.e. a completely spatially random process). ann.p <- mean(nndist(starbucks.km, k=1)) ann.p [1] 3.275492 The observed average nearest neighbor distance is 3.28 km. Next, we will generate the distribution of expected ANN values given a homogeneous (CSR/IRP) point process using Monte Carlo methods. This is our null model. n <- 599L # Number of simulations ann.r <- vector(length = n) # Create an empty object to be used to store simulated ANN values for (i in 1:n){ rand.p <- rpoint(n=starbucks.km$n, win=ma.km) # Generate random point locations ann.r[i] <- mean(nndist(rand.p, k=1)) # Tally the ANN values } In the above loop, the function rpoint is passed two parameters: n=starbucks.km$n and win=ma.km. The first tells the function how many points to randomly generate (starbucks.km$n extracts the number of points from object starbucks.km). The second tells the function to confine the points to the extent defined by ma.km. Note that the latter parameter is not necessary if the ma boundary was already defined as the starbucks window extent. You can plot the last realization of the homogeneous point process to see what a completely random placement of Starbucks stores could look like. plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5)) Our observed distribution of Starbucks stores certainly does not look like the outcome of a completely independent random process. Next, let’s plot the histogram of expected values under the null and add a blue vertical line showing where our observed ANN value lies relative to this distribution. hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r)) abline(v=ann.p, col="blue") It’s obvious from the test that the observed ANN value is far smaller than the expected ANN values one could expect under the null hypothesis. A smaller observed value indicates that the stores are far more clustered than expected under the null. Next, we’ll run the same test but control for the influence due to population density distribution. Recall that the ANN analysis explores the 2nd order process underlying a point pattern thus requiring that we control for the first order process (e.g. population density distribution). This is a non-homogeneous test. Here, we pass the parameter f=pop.km to the function rpoint telling it that the population density raster pop.km should be used to define where a point should be most likely placed (high population density) and least likely placed (low population density) under this new null model. Here, we’ll use the non-transformed representation of the population density raster, pop.km. 
n <- 599L ann.r <- vector(length=n) for (i in 1:n){ rand.p <- rpoint(n=starbucks.km$n, f=pop.km) ann.r[i] <- mean(nndist(rand.p, k=1)) } You can plot the last realization of the non-homogeneous point process to convince yourself that the simulation correctly incorporated the covariate raster in its random point function. Window(rand.p) <- ma.km # Replace raster mask with ma.km window plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5)) Note the cluster of points near the highly populated areas. This pattern is different from the one generated from a completely random process. Next, let's plot the histogram and add a blue line showing where our observed ANN value lies. hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r)) abline(v=ann.p, col="blue") Even though the distribution of ANN values we would expect when controlled for the population density nudges closer to our observed ANN value, we still cannot say that the clustering of Starbucks stores can be explained by a completely random process when controlled for population density. Computing a pseudo p-value from the simulation A (pseudo) p-value can be extracted from a Monte Carlo simulation. We'll work off of the last simulation. First, we need to find the number of simulated ANN values greater than our observed ANN value. N.greater <- sum(ann.r > ann.p) To compute the p-value, take the smaller of the two tail counts (the number of simulated values above or below the observed value), add one, then divide by the number of simulations plus one. Note that this is a so-called one-sided p-value. See lecture notes for more information. p <- min(N.greater + 1, n + 1 - N.greater) / (n +1) p [1] 0.001666667 In our working example, you'll note that our observed ANN value was nowhere near the range of ANN values computed under the null, yet we don't have a p-value of zero. This is by design: the smallest p-value we can report is dictated by the number of simulations–it reflects the chance that, given an infinite number of simulations, at least one realization of a point pattern could produce an ANN value more extreme than ours. Test for a Poisson point process model with a covariate effect The ANN analysis addresses the 2nd order effect of a point process. Here, we'll address the 1st order process using the Poisson point process model. We'll first fit a model that assumes that the point process' intensity is a function of the logged population density (this will be our alternate hypothesis). PPM1 <- ppm(starbucks.km ~ pop.lg.km) PPM1 Nonstationary Poisson process Fitted to point pattern dataset 'starbucks.km' Log intensity: ~pop.lg.km Fitted trend coefficients: (Intercept) pop.lg.km -13.710551 1.279928 Estimate S.E. CI95.lo CI95.hi Ztest Zval (Intercept) -13.710551 0.46745489 -14.626746 -12.794356 *** -29.33021 pop.lg.km 1.279928 0.05626785 1.169645 1.390211 *** 22.74705 Problem: Values of the covariate 'pop.lg.km' were NA or undefined at 0.57% (4 out of 699) of the quadrature points Next, we'll fit the model that assumes that the process' intensity is not a function of population density (the null hypothesis). PPM0 <- ppm(starbucks.km ~ 1) PPM0 Stationary Poisson process Fitted to point pattern dataset 'starbucks.km' Intensity: 0.008268627 Estimate S.E.
CI95.lo CI95.hi Ztest Zval log(lambda) -4.795287 0.07647191 -4.945169 -4.645405 *** -62.70651 In our working example, the null model (homogeneous intensity) takes on the form: \\[ \\lambda(i) = e^{-4.795} \\] \\(\\lambda(i)\\) under the null is nothing more than the observed density of Starbucks stores within the State of Massachusetts, or: starbucks.km$n / area(ma.km) [1] 0.008268627 The alternate model takes on the form: \\[ \\lambda(i) = e^{-13.71 + 1.27\\ (logged\\ population\\ density)} \\] The models are then compared using the likelihood ratio test, which produces the following output: anova(PPM0, PPM1, test="LRT") Npar Df Deviance Pr(>Chi) 5 NA NA NA 6 1 537.218 0 The value under the heading Pr(>Chi) is the p-value which gives us the probability that we would be wrong in rejecting the null. Here p~0 suggests that there is close to a 0% chance that we would be wrong in rejecting the base model in favor of the alternate model–put another way, the alternate model (that the logged population density can help explain the distribution of Starbucks stores) is a significant improvement over the null. Note that if you were to compare two competing non-homogeneous models such as population density and income distributions, you would need to compare the model with one of the covariates with an augmented version of that model using the other covariate. In other words, you would need to compare PPM1 <- ppm(starbucks.km ~ pop.lg.km) with something like PPM2 <- ppm(starbucks.km ~ pop.lg.km + income.km). References "],["spatial-autocorrelation-in-r.html", "I Spatial autocorrelation in R Sample files for this exercise Introduction Define neighboring polygons Computing the Moran's I statistic: the hard way Computing the Moran's I statistic: the easy way Moran's I as a function of a distance band", " I Spatial autocorrelation in R R tmap spdep 4.3.1 3.3.3 1.2.8 For a basic theoretical treatise on spatial autocorrelation the reader is encouraged to review the lecture notes. This section is intended to supplement the lecture notes by implementing spatial autocorrelation techniques in the R programming environment. Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s1 <- readRDS(z) The data object consists of an sf vector layer representing income and education data aggregated at the county level for the state of Maine. Introduction The spatial object s1 has five attributes. The one of interest for this exercise is Income (per capita, in units of dollars). Let's map the income distribution using a quantile classification scheme. We'll make use of the tmap package. library(tmap) tm_shape(s1) + tm_polygons(style="quantile", col = "Income") + tm_legend(outside = TRUE, text.size = .8) Define neighboring polygons The first step requires that we define "neighboring" polygons. This could refer to contiguous polygons, polygons within a certain distance band, or it could be non-spatial in nature and defined by social, political or cultural "neighbors". Here, we'll adopt a contiguous neighbor definition where we'll accept any contiguous polygon that shares at least one vertex (this is the "queen" case and is defined by setting the parameter queen=TRUE). If we required that at least one edge be shared between polygons then we would set queen=FALSE.
library(spdep) nb <- poly2nb(s1, queen=TRUE) For each polygon in our polygon object, nb lists all neighboring polygons. For example, to see the neighbors for the first polygon in the object, type: nb[[1]] [1] 2 3 4 5 Polygon 1 has 4 neighbors. The numbers represent the polygon IDs as stored in the spatial object s1. Polygon 1 is associated with the County attribute name Aroostook: s1$NAME[1] [1] Aroostook 16 Levels: Androscoggin Aroostook Cumberland Franklin Hancock Kennebec ... York Its four neighboring polygons are associated with the counties: s1$NAME[c(2,3,4,5)] [1] Somerset Piscataquis Penobscot Washington 16 Levels: Androscoggin Aroostook Cumberland Franklin Hancock Kennebec ... York Next, we need to assign weights to each neighboring polygon. In our case, each neighboring polygon will be assigned equal weight (style=\"W\"). This is accomplished by assigning the fraction \\(1/ (\\# of neighbors)\\) to each neighboring county then summing the weighted income values. While this is the most intuitive way to summarize the neighbors' values, it has one drawback: polygons along the edges of the study area will base their lagged values on fewer polygons, thus potentially over- or under-estimating the true nature of the spatial autocorrelation in the data. For this example, we'll stick with the style=\"W\" option for simplicity's sake but note that other more robust options are available, notably style=\"B\". lw <- nb2listw(nb, style="W", zero.policy=TRUE) The zero.policy=TRUE option allows for lists of non-neighbors. This should be used with caution since the user may not be aware of missing neighbors in their dataset; however, a zero.policy of FALSE would return an error. To see the weight of the first polygon's four neighbors type: lw$weights[1] [[1]] [1] 0.25 0.25 0.25 0.25 Each neighbor is assigned a quarter of the total weight. This means that when R computes the average neighboring income values, each neighbor's income will be multiplied by 0.25 before being tallied. Finally, we'll compute the average neighbor income value for each polygon. These values are often referred to as spatially lagged values. Inc.lag <- lag.listw(lw, s1$Income) The following table shows the average neighboring income values (stored in the Inc.lag object) for each county. Computing the Moran's I statistic: the hard way We can plot lagged income vs. income and fit a linear regression model to the data. # Create a regression model M <- lm(Inc.lag ~ s1$Income) # Plot the data plot( Inc.lag ~ s1$Income, pch=20, asp=1, las=1) The slope of the regression line is the Moran's I coefficient. coef(M)[2] s1$Income 0.2828111 To assess if the slope is significantly different from zero, we can randomly permute the income values across all counties (i.e. we are not imposing any spatial autocorrelation structure), then fit a regression model to each permuted set of values. The slope values from the regression give us the distribution of Moran's I values we could expect to get under the null hypothesis that the income values are randomly distributed across the counties. We then compare the observed Moran's I value to this distribution.
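Before running the full set of permutations, it can help to look at what a single permutation produces. The following is a minimal sketch that reuses the s1 and lw objects created above; the x.perm object names are ours and not part of the original exercise.
x.perm <- sample(s1$Income, replace=FALSE) # shuffle the income values once
x.perm.lag <- lag.listw(lw, x.perm) # compute the lagged values for this permutation
coef(lm(x.perm.lag ~ x.perm))[2] # the regression slope, i.e. one simulated Moran's I value
Repeating these steps many times, as done next, builds up the null distribution.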
n <- 599L # Define the number of simulations I.r <- vector(length=n) # Create an empty vector for (i in 1:n){ # Randomly shuffle income values x <- sample(s1$Income, replace=FALSE) # Compute new set of lagged values x.lag <- lag.listw(lw, x) # Compute the regression slope and store its value M.r <- lm(x.lag ~ x) I.r[i] <- coef(M.r)[2] } # Plot the histogram of simulated Moran's I values # then add our observed Moran's I value to the plot hist(I.r, main=NULL, xlab="Moran's I", las=1) abline(v=coef(M)[2], col="red") The simulation suggests that our observed Moran's I value is not consistent with a Moran's I value one would expect to get if the income values were not spatially autocorrelated. In the next step, we'll compute a pseudo p-value from this simulation. Computing a pseudo p-value from an MC simulation First, we need to find the number of times the observed Moran's I value was greater than the simulated Moran's I values. N.greater <- sum(coef(M)[2] > I.r) To compute the p-value, take the smaller of the two tail counts (the number of simulated values above or below the observed value), add one, then divide by the number of simulations plus one. Note that this is a so-called one-sided p-value. See lecture notes for more information. p <- min(N.greater + 1, n + 1 - N.greater) / (n + 1) p [1] 0.01333333 In our working example, the p-value suggests that there is a small chance (about 1.3%) of being wrong in rejecting the null hypothesis and stating that the income values are clustered at the county level. Computing the Moran's I statistic: the easy way To get the Moran's I value, simply use the moran.test function. moran.test(s1$Income,lw) Moran I test under randomisation data: s1$Income weights: lw Moran I statistic standard deviate = 2.2472, p-value = 0.01231 alternative hypothesis: greater sample estimates: Moran I statistic Expectation Variance 0.28281108 -0.06666667 0.02418480 Note that the p-value computed from the moran.test function is not computed from an MC simulation but analytically instead. This may not always prove to be the most accurate measure of significance. To test for significance using the MC simulation method instead, use the moran.mc function. MC <- moran.mc(s1$Income, lw, nsim=599) # View results (including p-value) MC Monte-Carlo simulation of Moran I data: s1$Income weights: lw number of simulations + 1: 600 statistic = 0.28281, observed rank = 584, p-value = 0.02667 alternative hypothesis: greater # Plot the distribution (note that this is a density plot instead of a histogram) plot(MC, main="", las=1) Moran's I as a function of a distance band In this section, we will explore spatial autocorrelation as a function of distance bands. Instead of defining neighbors as contiguous polygons, we will define neighbors based on distances to polygon centers. We therefore need to extract the center of each polygon. coo <- st_centroid(s1) The object coo stores all sixteen pairs of coordinate values. Next, we will define the search radius to include all neighboring polygon centers within 50 km (or 50,000 meters). S.dist <- dnearneigh(coo, 0, 50000) The dnearneigh function takes on three parameters: the coordinate values coo, the inner radius of the annulus band, and the outer radius of the annulus band. In our example, the inner annulus radius is 0, which implies that all polygon centers up to 50 km away are considered neighbors. Note that if we chose to restrict the neighbors to all polygon centers between 50 km and 100 km, for example, then we would define a search annulus (instead of a circle) as dnearneigh(coo, 50000, 100000).
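As a quick check (not part of the original exercise), you can tally how many neighbors each polygon picks up within the chosen band using spdep's card function; S.dist2 is a name of our choosing for the 50 km to 100 km annulus variant mentioned above.
summary(card(S.dist)) # number of neighbors per polygon within 50 km
S.dist2 <- dnearneigh(coo, 50000, 100000) # the annulus variant described above
summary(card(S.dist2)) # neighbor counts for the 50 km to 100 km band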
Now that we have defined our search circle, we need to identify all neighboring polygons for each polygon in the dataset. lw <- nb2listw(S.dist, style="W",zero.policy=T) Run the MC simulation. MI <- moran.mc(s1$Income, lw, nsim=599,zero.policy=T) Plot the results. plot(MI, main="", las=1) Display p-value and other summary statistics. MI Monte-Carlo simulation of Moran I data: s1$Income weights: lw number of simulations + 1: 600 statistic = 0.31361, observed rank = 596, p-value = 0.006667 alternative hypothesis: greater "],["interpolation-in-r.html", "J Interpolation in R Thiessen polygons IDW 1st order polynomial fit 2nd order polynomial Kriging", " J Interpolation in R R sf tmap spatstat gstat terra sp 4.3.1 1.0.14 3.3.3 3.0.7 2.1.1 1.7.55 2.0.0 First, let's load the data from the website. The data are vector layers that will be converted to sf objects. library(sf) library(tmap) # Load precipitation data z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/precip.rds")) P <- readRDS(z) p <- st_as_sf(P) # Load Texas boundary map z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/texas.rds")) W <- readRDS(z) w <- st_as_sf(W) # # Replace point boundary extent with that of Texas tm_shape(w) + tm_polygons() + tm_shape(p) + tm_dots(col="Precip_in", palette = "RdBu", auto.palette.mapping = FALSE, title="Sampled precipitation \\n(in inches)", size=0.7) + tm_text("Precip_in", just="left", xmod=.5, size = 0.7) + tm_legend(legend.outside=TRUE) The p point layer defines the sampled precipitation values. These points will be used to predict values at unsampled locations. The w polygon layer defines the boundary of Texas. This will be the extent for which we will interpolate precipitation data. Thiessen polygons The Thiessen polygons (or proximity interpolation) can be created using spatstat's dirichlet function. Note that this function requires that the input point layer be converted to a spatstat ppp object–hence the use of the inline as.ppp(p) syntax in the following code chunk. library(spatstat) # Used for the dirichlet tessellation function # Create a tessellated surface th <- dirichlet(as.ppp(p)) |> st_as_sfc() |> st_as_sf() # The dirichlet function does not carry over projection information # requiring that this information be added manually st_crs(th) <- st_crs(p) # The tessellated surface does not store attribute information # from the point data layer. We'll join the point attributes to the polygons th2 <- st_join(th, p, fn=mean) # Finally, we'll clip the tessellated surface to the Texas boundaries th.clp <- st_intersection(th2, w) # Map the data tm_shape(th.clp) + tm_polygons(col="Precip_in", palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_legend(legend.outside=TRUE) IDW Unlike the Thiessen method shown in the previous section, the IDW interpolation will output a raster. This requires that we first create an empty raster grid, then interpolate the precipitation values to each unsampled grid cell. An IDW power value of 2 (idp=2.0) will be used in this example. Many packages share the same function names. This can be a problem when these packages are loaded in the same R session. For example, the idw function is available in both spatstat.explore and gstat. Here, we make use of gstat's idw function. This requires that we either detach the spatstat.explore package (this package was automatically installed when we installed spatstat) or that we explicitly identify the package by typing gstat::idw. Here, we opted for the former approach.
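For reference, the latter approach would simply prefix the call with its package name, leaving spatstat.explore loaded; a minimal sketch using the same arguments as the interpolation call shown further below (the grd object is created in the next code chunk):
P.idw <- gstat::idw(Precip_in ~ 1, P, newdata=grd, idp=2.0) # explicit namespacing; no detach() needed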
detach("package:spatstat.explore", unload = TRUE, force=TRUE) library(gstat) library(terra) library(sp) # Create an empty grid where n is the total number of cells grd <- as.data.frame(spsample(P, "regular", n=50000)) names(grd) <- c("X", "Y") coordinates(grd) <- c("X", "Y") gridded(grd) <- TRUE # Create SpatialPixel object fullgrid(grd) <- TRUE # Create SpatialGrid object # Add P's projection information to the empty grid proj4string(P) <- proj4string(P) # Temp fix until new proj env is adopted proj4string(grd) <- proj4string(P) # Interpolate the grid cells using a power value of 2 (idp=2.0) P.idw <- idw(Precip_in ~ 1, P, newdata=grd, idp = 2.0) # Convert to raster object then clip to Texas r <- rast(P.idw) r.m <- mask(r, st_as_sf(W)) # Plot tm_shape(r.m["var1.pred"]) + tm_raster(n=10,palette = "RdBu", auto.palette.mapping = FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Fine-tuning the interpolation The choice of power function can be subjective. To fine-tune the choice of the power parameter, you can perform a leave-one-out validation routine to measure the error in the interpolated values. # Leave-one-out validation routine IDW.out <- vector(length = length(P)) for (i in 1:length(P)) { IDW.out[i] <- idw(Precip_in ~ 1, P[-i,], P[i,], idp=2.0)$var1.pred } # Plot the differences OP <- par(pty="s", mar=c(4,3,0,0)) plot(IDW.out ~ P$Precip_in, asp=1, xlab="Observed", ylab="Predicted", pch=16, col=rgb(0,0,0,0.5)) abline(lm(IDW.out ~ P$Precip_in), col="red", lw=2,lty=2) abline(0,1) par(OP) The RMSE can be computed from IDW.out as follows: # Compute RMSE sqrt( sum((IDW.out - P$Precip_in)^2) / length(P)) [1] 6.989294 Cross-validation In addition to generating an interpolated surface, you can create a 95% confidence interval map of the interpolation model. Here we’ll create a 95% CI map from an IDW interpolation that uses a power parameter of 2 (idp=2.0). 
# Create the interpolated surface (using gstat's idw function) img <- idw(Precip_in~1, P, newdata=grd, idp=2.0) n <- length(P) Zi <- matrix(nrow = length(img$var1.pred), ncol = n) # Remove a point then interpolate (do this n times for each point) st <- rast() for (i in 1:n){ Z1 <- gstat::idw(Precip_in~1, P[-i,], newdata=grd, idp=2.0) st <- c(st,rast(Z1)) # Calculated pseudo-value Z at j Zi[,i] <- n * img$var1.pred - (n-1) * Z1$var1.pred } # Jackknife estimator of parameter Z at location j Zj <- as.matrix(apply(Zi, 1, sum, na.rm=T) / n ) # Compute (Zi* - Zj)^2 c1 <- apply(Zi,2,'-',Zj) # Compute the difference c1 <- apply(c1^2, 1, sum, na.rm=T ) # Sum the square of the difference # Compute the confidence interval CI <- sqrt( 1/(n*(n-1)) * c1) # Create (CI / interpolated value) raster img.sig <- img img.sig$v <- CI /img$var1.pred # Clip the confidence raster to Texas r <- rast(img.sig, layer="v") r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m["var1.pred"]) + tm_raster(n=7,title="95% confidence interval \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) 1st order polynomial fit To fit a first order polynomial model of the form \\(precip = intercept + aX + bY\\) to the data, # Define the 1st order polynomial equation f.1 <- as.formula(Precip_in ~ X + Y) # Add X and Y to P P$X <- coordinates(P)[,1] P$Y <- coordinates(P)[,2] # Run the regression model lm.1 <- lm( f.1, data=P) # Use the regression model output to interpolate the surface dat.1st <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.1, newdata=grd))) # Clip the interpolated raster to Texas r <- rast(dat.1st) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) 2nd order polynomial To fit a second order polynomial model of the form \\(precip = intercept + aX + bY + dX^2 + eY^2 +fXY\\) to the data, # Define the 2nd order polynomial equation f.2 <- as.formula(Precip_in ~ X + Y + I(X*X)+I(Y*Y) + I(X*Y)) # Add X and Y to P P$X <- coordinates(P)[,1] P$Y <- coordinates(P)[,2] # Run the regression model lm.2 <- lm( f.2, data=P) # Use the regression model output to interpolate the surface dat.2nd <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.2, newdata=grd))) # Clip the interpolated raster to Texas r <- rast(dat.2nd) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Kriging Fit the variogram model First, we need to create a variogram model. Note that the variogram model is computed on the de-trended data. This is implemented in the following chunk of code by passing the 1st order trend model (defined in an earlier code chunk as formula object f.1) to the variogram function. # Define the 1st order polynomial equation f.1 <- as.formula(Precip_in ~ X + Y) # Compute the sample variogram; note that the f.1 trend model is one of the # parameters passed to variogram(). This tells the function to create the # variogram on the de-trended data. var.smpl <- variogram(f.1, P, cloud = FALSE, cutoff=1000000, width=89900) # Compute the variogram model by passing the nugget, sill and range values # to fit.variogram() via the vgm() function. 
dat.fit <- fit.variogram(var.smpl, fit.ranges = FALSE, fit.sills = FALSE, vgm(psill=14, model="Sph", range=590000, nugget=0)) # The following plot allows us to assess the fit plot(var.smpl, dat.fit, xlim=c(0,1000000)) Generate Kriged surface Next, use the variogram model dat.fit to generate a kriged interpolated surface. The krige function allows us to include the trend model thus saving us from having to de-trend the data, krige the residuals, then combine the two rasters. Instead, all we need to do is pass krige the trend formula f.1. # Define the trend model f.1 <- as.formula(Precip_in ~ X + Y) # Perform the krige interpolation (note the use of the variogram model # created in the earlier step) dat.krg <- krige( f.1, P, grd, dat.fit) # Convert kriged surface to a raster object for clipping r <- rast(dat.krg) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m["var1.pred"]) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Generate the variance and confidence interval maps The dat.krg object stores not just the interpolated values, but the variance values as well. These are also passed to the raster object for mapping as follows: tm_shape(r.m["var1.var"]) + tm_raster(n=7, palette ="Reds", title="Variance map \\n(in squared inches)") +tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) A more readily interpretable map is the 95% confidence interval map which can be generated from the variance object as follows (the map values should be interpreted as the number of inches above and below the estimated rainfall amount). r <- rast(dat.krg) r.m <- mask(sqrt(r["var1.var"])* 1.96, st_as_sf(W)) tm_shape(r.m) + tm_raster(n=7, palette ="Reds", title="95% CI map \\n(in inches)") +tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) "]]
+[["index.html", "Intro to GIS and Spatial Analysis Preface", " Intro to GIS and Spatial Analysis Manuel Gimond Last edited on 2023-11-13 Preface 2023 UPDATE: Removed dependence on rgdal and maptools in Appendices Added Statistical Maps chapter (wrapped confidence maps into this chapter) 2021 UPDATE: This book has been updated for the 2021-2022 academic year. Most changes are in the Appendix and pertain to the sf ecosystem. This includes changes in the mapping appendix, and coordinate systems appendix. This also includes a new appendix that describes the simple feature anatomy and step-by-step instructions on creating new geometries from scratch. These pages are a compilation of lecture notes for my Introduction to GIS and Spatial Analysis course (ES214). They are ordered in such a way as to follow the course outline, but most pages can be read in any desired order. The course (and this book) is split into two parts: data manipulation & visualization and exploratory spatial data analysis. The first part of this book is usually conducted using ArcGIS Desktop whereas the latter part of the book is conducted in R. ArcGIS was chosen as the GIS data manipulation environment because of its "desirability" in job applications for undergraduates in the United States. But other GIS software environments, such as the open source software QGIS, could easily be adopted in lieu of ArcGIS–even R can be used to perform many spatial data manipulations such as clipping, buffering and projecting. Even though some of the chapters of this book make direct reference to ArcGIS techniques, most chapters can be studied without access to the software. The latter part of this book (and the course) makes heavy use of R because of a) its broad appeal in the world of data analysis, b) its rich (if not richest) array of spatial analysis and spatial statistics packages, c) its scripting environment (which facilitates reproducibility), and d) its very low cost (it's completely free and open source!). But R can be used for many traditional "GIS" applications that involve most data manipulation operations–the only benefit in using a full-blown GIS environment like ArcGIS or QGIS is in creating/editing spatial data, rendering complex maps and manipulating spatial data. The Appendix covers various aspects of spatial data manipulation and analysis using R. The course only focuses on point pattern analysis and spatial autocorrelation using R, but I've added other R resources for students wishing to expand their GIS skills using R. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. "],["introGIS.html", "Chapter 1 Introduction to GIS 1.1 What is a GIS? 1.2 What is Spatial Analysis? 1.3 What's in an Acronym?", " Chapter 1 Introduction to GIS 1.1 What is a GIS? A Geographic Information System is a multi-component environment used to create, manage, visualize and analyze data and its spatial counterpart. It's important to note that most datasets you will encounter in your lifetime can be assigned a spatial location whether on the earth's surface or within some arbitrary coordinate system (such as a soccer field or a gridded petri dish). So in essence, any dataset can be represented in a GIS: the question then becomes "does it need to be analyzed in a GIS environment?" The answer to this question depends on the purpose of the analysis.
If, for example, we are interested in identifying the ten African countries with the highest conflict index scores for the 1966-78 period, a simple table listing those scores by country is all that is needed. Table 1.1: Index of total African conflict for the 1966-78 period (Anselin and O'Loughlin 1992). Country Conflicts Country Conflicts EGYPT 5246 LIBERIA 980 SUDAN 4751 SENEGAL 933 UGANDA 3134 CHAD 895 ZAIRE 3087 TOGO 848 TANZANIA 2881 GABON 824 LIBYA 2355 MAURITANIA 811 KENYA 2273 ZIMBABWE 795 SOMALIA 2122 MOZAMBIQUE 792 ETHIOPIA 1878 IVORY COAST 758 SOUTH AFRICA 1875 MALAWI 629 MOROCCO 1861 CENTRAL AFRICAN REPUBLIC 618 ZAMBIA 1554 CAMEROON 604 ANGOLA 1528 BURUNDI 604 ALGERIA 1421 RWANDA 487 TUNISIA 1363 SIERRA LEONE 423 BOTSWANA 1266 LESOTHO 363 CONGO 1142 NIGER 358 NIGERIA 1130 BURKINA FASO 347 GHANA 1090 MALI 299 GUINEA 1015 THE GAMBIA 241 BENIN 998 SWAZILAND 147 Data source: Anselin, L. and John O'Loughlin. 1992. Geography of international conflict and cooperation: spatial dependence and regional context in Africa. In The New Geopolitics, ed. M. Ward, pp. 39-75. A simple sort on the Conflicts column reveals that EGYPT, SUDAN, UGANDA, ZAIRE, TANZANIA, LIBYA, KENYA, SOMALIA, ETHIOPIA, SOUTH AFRICA are the top ten countries. What if we are interested in knowing whether countries with a high conflict index score are geographically clustered? Does the above table provide us with enough information to help answer this question? The answer, of course, is no. We need additional data pertaining to the geographic location and shape of each country. A map of the countries would be helpful. Figure 1.1: Choropleth representation of African conflict index scores. Countries for which a score was not available are not mapped. Maps are ubiquitous: available online and in various print media. But we seldom ask how the boundaries of the map features are encoded in a computing environment. After all, if we expect software to assist us in the analysis, the spatial elements of our data should be readily accessible in a digital form. Spending a few minutes thinking through this question will make you realize that simple tables or spreadsheets are not up to this task. A more complex data storage mechanism is required. This is the core of a GIS environment: a spatial database that facilitates the storage and retrieval of data that define the spatial boundaries, lines or points of the entities we are studying. This may seem trivial, but without a spatial database, most spatial data exploration and analysis would not be possible! 1.1.1 GIS software Many GIS software applications are available–both commercial and open source. Two popular applications are ArcGIS and QGIS. 1.1.1.1 ArcGIS A popular commercial GIS software is ArcGIS, developed by ESRI (pronounced ez-ree). ESRI was once a small land-use consulting firm that did not start developing GIS software until the mid-1970s. The ArcGIS desktop environment encompasses a suite of applications which include ArcMap, ArcCatalog, ArcScene and ArcGlobe. ArcGIS comes in three different license levels (basic, standard and advanced) and can be purchased with additional add-on packages. As such, a single license can range from a few thousand dollars to well over ten thousand dollars. In addition to software licensing costs, ArcGIS is only available for Windows operating systems; so if your workplace is a Mac-only environment, the purchase of a Windows PC would add to the expense. 1.1.1.2 QGIS A very capable open source (free) GIS software is QGIS.
It encompasses most of the functionality included in ArcGIS. If you are looking for a GIS application for your Mac or Linux environment, QGIS is a wonderful choice given its multi-platform support. Built into the current versions of QGIS are functions from another open source software: GRASS. GRASS has been around since the 1980s and has many advanced GIS data manipulation functions; however, its use is not as intuitive as that of QGIS or ArcGIS (hence the preferred QGIS alternative). 1.2 What is Spatial Analysis? A distinction is made in this course between GIS and spatial analysis. In the context of mainstream GIS software, the term analysis refers to data manipulation and data querying. In the context of spatial analysis, the analysis focuses on the statistical analysis of patterns and underlying processes; more generally, spatial analysis addresses the question "what could have been the genesis of the observed spatial pattern?" It's an exploratory process whereby we attempt to quantify the observed pattern, then explore the processes that may have generated the pattern. For example, you record the location of each tree in a well-defined study area. You then map the location of each tree (a GIS task). At this point, you might be inclined to make inferences about the observed pattern. Are the trees clustered or dispersed? Is the tree density constant across the study area? Could soil type or slope have led to the observed pattern? Those are questions that are addressed in spatial analysis using quantitative and statistical techniques. Figure 1.2: Distribution of Maple trees in a 1,000 x 1,000 ft study area. What you will learn in this course is that popular GIS software packages like ArcGIS are great tools to create and manipulate spatial data, but if one wishes to go beyond the data manipulation and analyze patterns and processes that may have led to these patterns, other quantitative tools are needed. One such tool we will use in this class is R: an open source (freeware) data analysis environment. R has one of the richest, if not the richest, sets of spatial data analysis and statistics tools available today. Learning the R programming environment will prove to be quite beneficial given that many of the operations learnt are transferable across many other (non-spatial) quantitative analysis projects. R can be installed on both Windows and Mac operating systems. Another related piece of software that you might find useful is RStudio, which offers a nice interface to R. To learn more about data analysis in R, visit the ES218 course website. 1.3 What's in an Acronym? GIS is a ubiquitous technology. Many of you are taking this course in part because you have seen GIS listed as a "desirable" or "required" skill in job postings. Many of you will think of GIS as a "map making" environment, as do many ancillary users of GIS in the workforce. While "visualizing" data is an important feature of a GIS, one must not lose sight of what data is being visualized and for what purpose. O'Sullivan and Unwin (O'Sullivan and Unwin 2010) use the term accidental geographer to refer to those "whose understanding of geographic science is based on the operations made possible by GIS software". We can expand on this idea and define accidental data analyst as one whose understanding of data and its analysis is limited to the point-and-click environment of popular pieces of software such as spreadsheet environments, statistical packages and GIS software.
The aggressive marketing of GIS technology has the undesirable effect of placing the technology before purpose and theory. This is not unique to GIS, however. Such concerns were shared decades ago when personal computers made it easier for researchers and employees to graph non-spatial data as well as perform many statistical procedures. The different purposes of mapping spatial data have strong parallels to that of graphing (or plotting) non-spatial data. John Tukey (Tukey 1972) offers three broad classes of the latter: “Graphs from which numbers are to be read off- substitutes for tables. Graphs intended to show the reader what has already been learned (by some other technique)–these we shall sometimes impolitely call propaganda graphs. Graphs intended to let us see what may be happening over and above what we have already described- these are the analytical graphs that are our main topic.” A GIS world analogy is proposed here: Reference maps (USGS maps, hiking maps, road maps). Such maps are used to navigate landscapes or identify locations of points-of-interest. Presentation maps presented in the press such as the NY Times and the Wall Street Journal, but also maps presented in journals. Such maps are designed to convey a very specific narrative of the author’s choosing. (Here we’ll avoid Tukey’s harsh description of such visual displays, but the idea that maps can be used as propaganda is not farfetched). Statistical maps whose purpose it is to manipulate the raw data in such a way to tease out patterns otherwise not discernable in its original form. This usually requires multiple data manipulation operations and visualization and can sometimes benefit from being explored outside of a spatial context. This course will focus on the last two spatial data visualization purposes with a strong emphasis on the latter (Statistical maps). References "],["chp02_0.html", "Chapter 2 Feature Representation 2.1 Vector vs. Raster 2.2 Object vs. Field 2.3 Scale 2.4 Attribute Tables", " Chapter 2 Feature Representation 2.1 Vector vs. Raster To work in a GIS environment, real world observations (objects or events that can be recorded in 2D or 3D space) need to be reduced to spatial entities. These spatial entities can be represented in a GIS as a vector data model or a raster data model. Figure 2.1: Vector and raster representations of a river feature. 2.1.1 Vector Vector features can be decomposed into three different geometric primitives: points, polylines and polygons. 2.1.1.1 Point Figure 2.2: Three point objects defined by their X and Y coordinate values. A point is composed of one coordinate pair representing a specific location in a coordinate system. Points are the most basic geometric primitives having no length or area. By definition a point can’t be “seen” since it has no area; but this is not practical if such primitives are to be mapped. So points on a map are represented using symbols that have both area and shape (e.g. circle, square, plus signs). We seem capable of interpreting such symbols as points, but there may be instances when such interpretation may be ambiguous (e.g. is a round symbol delineating the area of a round feature on the ground such as a large oil storage tank or is it representing the point location of that tank?). 2.1.1.2 Polyline Figure 2.3: A simple polyline object defined by connected vertices. A polyline is composed of a sequence of two or more coordinate pairs called vertices. 
A vertex is defined by coordinate pairs, just like a point, but what differentiates a vertex from a point is its explicitly defined relationship with neighboring vertices. A vertex is connected to at least one other vertex. Like a point, a true line can’t be seen since it has no area. And like a point, a line is symbolized using shapes that have a color, width and style (e.g. solid, dashed, dotted, etc…). Roads and rivers are commonly stored as polylines in a GIS. 2.1.1.3 Polygon Figure 2.4: A simple polygon object defined by an area enclosed by connected vertices. A polygon is composed of three or more line segments whose starting and ending coordinate pairs are the same. Sometimes you will see the words lattice or area used in lieu of ‘polygon’. Polygons represent both length (i.e. the perimeter of the area) and area. They also embody the idea of an inside and an outside; in fact, the area that a polygon encloses is explicitly defined in a GIS environment. If it isn’t, then you are working with a polyline feature. If this does not seem intuitive, think of three connected lines defining a triangle: they can represent three connected road segments (thus polyline features), or they can represent the grassy strip enclosed by the connected roads (in which case an ‘inside’ is implied thus defining a polygon). 2.1.2 Raster Figure 2.5: A simple raster object defined by a 10x10 array of cells or pixels. A raster data model uses an array of cells, or pixels, to represent real-world objects. Raster datasets are commonly used for representing and managing imagery, surface temperatures, digital elevation models, and numerous other entities. A raster can be thought of as a special case of an area object where the area is divided into a regular grid of cells. But a regularly spaced array of marked points may be a better analogy since rasters are stored as an array of values where each cell is defined by a single coordinate pair inside of most GIS environments. Implicit in a raster data model is a value associated with each cell or pixel. This is in contrast to a vector model that may or may not have a value associated with the geometric primitive. Also note that a raster data structure is square or rectangular. So, if the features in a raster do not cover the full square or rectangular extent, their pixel values will be set to no data values (e.g. NULL or NoData). 2.2 Object vs. Field The traditional vector/raster perspective of our world is one that has been driven by software and data storage environments. But this perspective is not particularly helpful if one is interested in analyzing the pattern. In fact, it can mask some important properties of the entity being studied. An object vs. field view of the world proves to be more insightful even though it may seem more abstract. 2.2.1 Object View An object view of the world treats entities as discrete objects; they need not occur at every location within a study area. Point locations of cities would be an example of an object. So would be polygonal representations of urban areas which may be non-contiguous. 2.2.2 Field View A field view of the world treats entities as a scalar field. This is a mathematical concept in which a scalar is a quantity having a magnitude. It is measurable at every location within the study region. Two popular examples of a scalar field are surface elevation and surface temperature. Each represents a property that can be measured at any location. Another example of a scalar field is the presence and absence of a building. 
This is a binary scalar where a value of 0 is assigned to a location devoid of buildings and a value of 1 is assigned to locations having one or more buildings. A field representation of buildings may not seem intuitive, in fact, given the definition of an object view of the world in the last section, it would seem only fitting to view buildings as objects. In fact, buildings can be viewed as both field or objects. The context of the analysis is ultimately what will dictate which view to adopt. If we’re interested in studying the distribution of buildings over a study area, then an object view of the features makes sense. If, on the other hand, we are interested in identifying all locations where buildings don’t exist, then a binary field view of these entities would make sense. 2.3 Scale How one chooses to represent a real-world entity will be in large part dictated by the scale of the analysis. In a GIS, scale has a specific meaning: it’s the ratio of distance on the map to that in the real world. So a large scale map implies a relatively large ratio and thus a small extent. This is counter to the layperson’s interpretation of large scale which focuses on the scope or extent of a study; so a large scale analysis would imply one that covers a large area. The following two maps represent the same entity: the Boston region. At a small scale (e.g. 1:10,000,000), Boston and other cities may be best represented as points. At a large scale (e.g. 1:34,000), Boston may be best represented as a polygon. Note that at this large scale, roads may also be represented as polygon features instead of polylines. Figure 2.6: Map of the Boston area at a 1:10,000,000 scale. Note that in geography, this is considered small scale whereas in layperson terms, this extent is often referred to as a large scale (i.e. covering a large area). Figure 2.7: Map of the Boston area at a 1:34,000 scale. Note that in geography, this is considered large scale whereas in layperson terms, this extent is often referred to as a small scale (i.e. covering a small area). 2.4 Attribute Tables Non-spatial information associated with a spatial feature is referred to as an attribute. A feature on a GIS map is linked to its record in the attribute table by a unique numerical identifier (ID). Every feature in a layer has an identifier. It is important to understand the one-to-one or many-to-one relationship between feature, and attribute record. Because features on the map are linked to their records in the table, many GIS software will allow you to click on a map feature and see its related attributes in the table. Raster data can also have attributes only if pixels are represented using a small set of unique integer values. Raster datasets that contain attribute tables typically have cell values that represent or define a class, group, category, or membership. NOTE: not all GIS raster data formats can store attribute information; in fact most raster datasets you will work with in this course will not have attribute tables. 2.4.1 Measurement Levels Attribute data can be broken down into four measurement levels: Nominal data which have no implied order, size or quantitative information (e.g. paved and unpaved roads) Ordinal data have an implied order (e.g. ranked scores), however, we cannot quantify the difference since a linear scale is not implied. Interval data are numeric and have a linear scale, however they do not have a true zero and can therefore not be used to measure relative magnitudes. 
For example, one cannot say that 60°F is twice as warm as 30°F since, when presented in °C, the temperature values are 15.5°C and -1.1°C respectively (and 15.5 is clearly not twice as big as -1.1). Ratio scale data are interval data with a true zero such as monetary value (e.g. $1, $20, $100). 2.4.2 Data type Another way to categorize an attribute is by its data type. ArcGIS supports several data types such as integer, float, double and text. Knowing your data type and measurement level should dictate how the attribute values are stored in a GIS environment. The following table lists popular data types available in most GIS applications. Type Stored values Note Short integer -32,768 to 32,767 Whole numbers Long integer -2,147,483,648 to 2,147,483,647 Whole numbers Float approximately -3.4E38 to 1.2E38 Real numbers Double approximately -2.2E308 to 1.8E308 Real numbers Text Up to 64,000 characters Letters and words While whole numbers can be stored as a float or double (i.e. we can store the number 2 as 2.0), doing so comes at a cost: an increase in storage space. This may not be a big deal if the dataset is small, but if it consists of tens of thousands of records the increase in file size and processing time may become an issue. While storing an integer value as a float may not have dire consequences, the same cannot be said of storing a float as an integer. For example, if your values consist of 0.2, 0.01, 0.34, 0.1 and 0.876, their integer counterparts would be 0, 0, 0, 0 and 1 (i.e. values rounded to the nearest whole number). This can have a significant impact on a map as shown in the following example. Figure 2.8: Map of data represented as decimal (float) values. Figure 2.9: Map of same data represented as integers instead of float. "],["gis-data-management.html", "Chapter 3 GIS Data Management 3.1 GIS File Data Formats 3.2 Managing GIS Files in ArcGIS 3.3 Managing a Map Project in ArcGIS", " Chapter 3 GIS Data Management 3.1 GIS File Data Formats In the GIS world, you will encounter many different GIS file formats. Some file formats are unique to specific GIS applications; others are universal. For this course, we will focus on a subset of spatial data file formats: shapefiles for vector data, Imagine and GeoTiff files for rasters, and file geodatabases and geopackages for both vector and raster data.
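As an aside for readers who will be working in R later in this book, most of these formats can also be read directly in R with the sf and terra packages. A minimal sketch, using hypothetical file names:
library(sf) # vector formats
library(terra) # raster formats
roads <- st_read("Roads.shp") # shapefile
parcels <- st_read("Parcels.gpkg", layer = "parcels") # geopackage layer
cities <- st_read("Data.gdb", layer = "Cities") # vector layer in a file geodatabase
elev <- rast("Elevation.tif") # GeoTiff raster
lulc <- rast("Landcover.img") # Imagine raster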
Its complexity renders it more versatile allowing it to store multiple feature classes and enabling topological definitions (i.e. allowing the user to define rules that govern the way different feature classes relate to one another). An example of the contents of a geodatabase is shown in the following figure. Figure 3.1: Sample content of an ArcGIS file geodatabase. (src: esri) 3.1.1.3 GeoPackage This is a relatively new data format that follows open format standards (i.e. it is non-proprietary). It’s built on top of SQLite (a self-contained relational database). Its one big advantage over many other vector formats is its compactness–coordinate value, metadata, attribute table, projection information, etc…, are all stored in a single file which facilitates portability. Its filename usually ends in .gpkg. Applications such as QGIS (2.12 and up), R and ArcGIS will recognize this format (ArcGIS version 10.2.2 and above will read the file from ArcCatalog but requires a script to create a GeoPackage). 3.1.2 Raster Data File Formats Rasters are in part defined by their pixel depth. Pixel depth defines the range of distinct values the raster can store. For example, a 1-bit raster can only store 2 distinct values: 0 and 1. Figure 3.2: Examples of different pixel depths. There is a wide range of raster file formats used in the GIS world. Some of the most popular ones are listed below. 3.1.2.1 Imagine The Imagine file format was originally created by an image processing software company called ERDAS. This file format consists of a single .img file. This is a simpler file format than the vector shapefile. It is sometimes accompanied by an .xml file which usually stores metadata information about the raster layer. 3.1.2.2 GeoTiff A popular public domain raster data format is the GeoTIFF format. If maximum portability and platform independence is important, this file format may be a good choice. 3.1.2.3 File Geodatabase A raster file can also be stored in a file geodatabase alongside vector files. Geodatabases have the benefit of defining image mosaic structures thus allowing the user to create “stitched” images from multiple image files stored in the geodatabase. Also, processing very large raster files can be computationally more efficient when stored in a file geodatabase as opposed to an Imagine or GeoTiff file format. 3.2 Managing GIS Files in ArcGIS Unless you are intimately familiar with the file structure of a GIS file, it is best to copy/move/delete GIS files from within the software environment. Figure 3.3: Windows File Explorer view vs. ArcGIS Catalog view. Note, for example, how the many files that make up the Cities shapefile (as viewed in a Windows file manager environment) appears as a single entry in the Catalog view. This makes it easier to rename the shapefile since it needs to be done only for a single entry in the GIS software (as opposed to renaming the Cities files seven times in the Windows file manager environment). 3.3 Managing a Map Project in ArcGIS Unlike many other software environments such as word processors and spreadsheets, a GIS map project is not self-contained in a single file. A GIS map consists of many files: ArcGIS’ .aprx file and the various vector and/or raster files used in the map project. The .aprx file only stores information about how the different layers are to be symbolized and the GIS file locations these layers point to. 
Because of the complex data structure associated with GIS maps, it’s usually best to store the .aprx and all associated GIS files under a single project directory. Then, when you are ready to share your map project with someone else, just pass along that project folder as is or compressed in a zip or tar file. Because a map file reads data from GIS files, it must know where to find these files on your computer or across the network. There are two ways in which a map document can store the location to the GIS files: as a relative pathname or a full pathname. In older esri GIS applications, like ArcMap, the user had the choice to save a project using relative or full pathnames. Note that ArcMap is a legacy GIS application replaced by ArcGIS Pro. What follows pertains to the ArcMap software environment and not the ArcGIS Pro software environment. A relative pathname defines the location of the GIS files relative to the location of the map file on your computer. For example, let’s say that you created a project folder called HW05 under D:/Username/. In that folder, you have an ArcMap map document, Map.mxd (ArcMap map documents use an .mxd extension, not the .aprx extension used by ArcGIS Pro). The map document displays two layers stored in the GIS files Roads.shp and Cities.shp. In this scenario, the .mxd document and shapefiles are in the same project folder. If you set the Pathnames parameter to “Store relative pathnames to data sources” (accessed from ArcMap’s File >> Map Document Properties menu), ArcMap will not need to know the entire directory structure above the HW05/ folder to find the two shapefiles as illustrated below. If the “Store relative pathnames to data sources” option is not checked in the map’s document properties, then ArcMap will need to know the entire directory structure leading to the HW05/ folder as illustrated below. Your choice of full vs relative pathnames matters if you find yourself having to move or copy your project folder to another directory structure. For example, if you share your HW05/ project folder with another user and that user places the project folder under a different directory structure such as C:/User/Jdoe/GIS/, ArcMap will not find the shapefiles if the pathnames are set to full (i.e. the Store relative pathnames option is not checked). This will result in exclamation marks in your map document TOC. This problem can be avoided by making sure that the map document is set to use relative pathnames and by placing all GIS files (raster and vector) in a common project folder. NOTE: Exclamation marks in your map document indicate that the GIS files are missing or that the directory structure has changed. Figure 3.4: In ArcGIS, an exclamation mark next to a layer indicates that the GIS file the layer is pointing to cannot be found. "],["symbolizing-features.html", "Chapter 4 Symbolizing features 4.1 Color 4.2 Color Space 4.3 Classification 4.4 So how do I find a proper color scheme for my data? 4.5 Classification Intervals", " Chapter 4 Symbolizing features 4.1 Color Each color is a combination of three perceptual dimensions: hue, lightness and saturation. 4.1.1 Hue Hue is the perceptual dimension associated with color names. Typically, we use different hues to represent different categories of data. Figure 4.1: An example of eight different hues. Hues are associated with color names such as green, red or blue. Note that magentas and purples are not part of the natural visible light spectrum; instead they are a mix of reds and blues (or violets) from the spectrum’s tail ends.
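For readers following along in R, these perceptual dimensions map onto base R's hcl() color specification (hue, chroma and lightness). The short sketch below is only an illustration of the ideas discussed in this and the next two subsections; the specific hue, chroma and lightness values are arbitrary choices and are not the values used to build this book's figures.

```r
# Eight evenly spaced hues sharing the same chroma (saturation) and lightness.
hues <- hcl(h = seq(0, 315, by = 45), c = 60, l = 65)

# Holding hue and chroma constant while lowering lightness darkens each swatch.
darker <- hcl(h = seq(0, 315, by = 45), c = 60, l = 35)

# Setting chroma to 0 collapses every hue to the same shade of gray.
grays <- hcl(h = seq(0, 315, by = 45), c = 0, l = 65)

barplot(rep(1, 8), col = hues, border = NA, axes = FALSE)
```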
4.1.2 Lightness Lightness (sometimes referred to as value) describes how much light reflects (or is emitted) off of a surface. Lightness is an important dimension for representing ordinal/interval/ratio data. Figure 4.2: Eight different hues (across columns) with decreasing lightness values (across rows). 4.1.3 Saturation Saturation (sometimes referred to as chroma) is a measure of a color’s vividness. You can use saturated colors to help distinguish map symbols. But be careful when manipulating saturation, its property should be modified sparingly in most maps. Figure 4.3: Eight different hues (across columns) with decreasing saturation values (across rows). 4.2 Color Space The three perceptual dimensions of color can be used to construct a 3D color space. This 3D space need not be a cube (as one would expect given that we are combining three dimensions) but a cone where lightness, saturation and hue are the cone’s height, radius and circumference respectively. Figure 4.4: This is how the software defines the color space. But does this match our perception of color space? The cone shape reflects the fact that as one decreases saturation, the distinction between different hues disappears leading to a grayscale color (the central axis of the cone). So if one sets the saturation value of a color to 0, the hue ends up being some shade of grey. The color space implemented in most software is symmetrical about the value/lightness axis. However, this is not how we “perceive” color space: our perceptual view of the color space is not perfectly symmetrical. Let’s examine a slice of the symmetrical color space along the blue/yellow hue axis at a lightness value of about 90%. Figure 4.5: A cross section of the color space with constant hues and lightness values and decreasing saturation values where the two hues merge. Now, how many distinct yellows can you make out? How many distinct blues can you make out? Do the numbers match? Unless you have incredible color perception, you will probably observe that the number of distinct colors do not match when in fact they do! There are exactly 30 distinct blues and 30 distinct yellows. Let’s add a border to each color to convince ourselves that the software did indeed generate the same number of distinct colors. Figure 4.6: A cross section of the color space with each color distinctly outlined. It should be clear by now that a symmetrical color space does not reflect the way we “perceive” colors. There are more rigorously designed color spaces such as CIELAB and Munsell that depict the color space as a non-symmetrical object as perceived by humans. For example, in a Munsell color space, a vertical slice of the cone along the blue/yellow axis looks like this. Figure 4.7: A slice of the Munsell color space. Note that based on the Munsell color space, we can make out fewer yellows than blues across all lightness values. In fact, for these two hues, we can make out only 29 different shades of yellow (we do not include the gray levels where saturation = 0) vs 36 shades of blue. So how do we leverage our understanding of color spaces when choosing colors for our map features? The next section highlights three different color schemes: qualitative, sequential and divergent. 4.3 Classification 4.3.1 Qualitative color scheme Qualitative schemes are used to symbolize data having no inherent order (i.e. categorical data). Different hues with equal lightness and saturation values are normally used to distinguish different categorical values. 
Figure 4.8: Example of four different qualitative color schemes. Color hex numbers are superimposed on each palette. Election results are an example of a dataset that can be displayed using a qualitative color scheme. But be careful in your choice of hues if a cultural bias exists (i.e. it may not make sense to assign “blue” to Republican or “red” to Democratic regions). Figure 4.9: Map of 2012 election results shown in a qualitative color scheme. Note the use of three hues (red, blue and gray) of equal lightness and saturation. Most maps created in this course will be generated from polygon layers where continuous values will be assigned discrete color swatches. Such maps are referred to as choropleth maps. The choice of classification schemes for choropleth maps is shown next. 4.3.2 Sequential color scheme Sequential color schemes are used to highlight ordered data such as income, temperature, elevation or infection rates. A well-designed sequential color scheme ranges from a light color (representing low attribute values) to a dark color (representing high attribute values). Such color schemes are typically composed of a single hue, but may include two hues as shown in the last two color schemes of the following figure. Figure 4.10: Example of four different sequential color schemes. Color hex numbers are superimposed on each palette. Distribution of income is a good example of a sequential map. Income values are interval/ratio data which have an implied order. Figure 4.11: Map of household income shown in a sequential color scheme. Note the use of a single hue (green) and 7 different lightness levels. 4.3.3 Divergent color scheme Divergent color schemes apply to ordered data as well. However, there is an implied central value about which all values are compared. Typically, a divergent color scheme is composed of two hues–one for each side of the central value. Each hue’s lightness/saturation value is then adjusted symmetrically about the central value. Examples of such color schemes follow: Figure 4.12: Example of four different divergent color schemes. Color hex numbers are superimposed onto each palette. Continuing with the last example, we now focus on the divergence of income values about the median value of $36,641. We use a brown hue for income values below the median and a green/blue hue for values above the median. Figure 4.13: This map of household income uses a divergent color scheme where two different hues (brown and blue-green) are used for two sets of values separated by the median income of 36,641 dollars. Each hue is then split into three separate colors using decreasing lightness values away from the median. 4.4 So how do I find a proper color scheme for my data? Fortunately, there is a wonderful online resource that will guide you through the process of picking a proper set of color swatches given the nature of your data (i.e. sequential, diverging, and qualitative) and the number of intervals (aka classes). The website is http://colorbrewer2.org/ and was developed by Cynthia Brewer et al. at the Pennsylvania State University. You’ll note that the ColorBrewer website limits the total number of color swatches to 12 or fewer. There is a good reason for this in that our eyes can only associate so many different colors with value ranges/bins. Try matching 9 different shades of green in a map to the legend box swatches!
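The ColorBrewer palettes can also be pulled directly into R through the RColorBrewer package. The snippet below is a hedged illustration; the palette names ('Greens' and 'RdBu') are standard ColorBrewer names, but their selection here is arbitrary and not tied to the maps in this chapter.

```r
library(RColorBrewer)

# A 7-class sequential (single hue) palette.
brewer.pal(n = 7, name = "Greens")

# A 7-class divergent palette with an implied central value.
brewer.pal(n = 7, name = "RdBu")

# List only the palettes flagged as colorblind friendly.
display.brewer.all(colorblindFriendly = TRUE)
```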
Additional features available on that website include choosing colorblind safe colors and color schemes that translate well into grayscale colors (useful if your work is to be published in journals that do not offer color prints). 4.5 Classification Intervals You may have noticed the use of different classification breaks in the last two maps. For the sequential color scheme map, an equal interval classification scheme was used where the full range of values in the map are split equally into 7 intervals so that each color swatch covers an equal range of values. The divergent color scheme map adopts a quantile interval classification where each color swatch is represented an equal number of times across each polygon. Using different classification intervals will result in different looking maps. In the following figure, three maps of household income (aggregated at the census tract level) are presented using different classification intervals: quantile, equal and Jenks. Note the different range of values covered by each color swatch. Figure 4.14: Three different representations of the same spatial data using different classification intervals. The quantile interval scheme ensures that each color swatch is represented an equal number of times. If we have 20 polygons and 5 classes, the interval breaks will be such that each color is assigned to 4 different polygons. The equal interval scheme breaks up the range of values into equal interval widths. If the polygon values range from 10,000 to 25,000 and we have 5 classes, the intervals will be [10,000 ; 13,000], [13,000 ; 16,000], …, [22,000 ; 25,000]. The Jenks interval scheme (aka natural breaks) uses an algorithm that identifies clusters in the dataset. The number of clusters is defined by the desired number of intervals. It may help to view the breaks when superimposed on top of a distribution of the attribute data. In the following graphics the three classification intervals are superimposed on a histogram of the per-household income data. The histogram shows the distribution of values as “bins” where each bin represents a range of income values. The y-axis shows the frequency (or number of occurrences) for values in each bin. Figure 4.15: Three different classification intervals used in the three maps. Note how each interval scheme encompasses different ranges of values (hence the reason all three maps look so different). 4.5.1 An Interactive Example The following interactive frame demonstrates the different “looks” a map can take given different combinations of classification schemes and class numbers. "],["statistical-maps.html", "Chapter 5 Statistical maps 5.1 Statistical distribution maps 5.2 Mapping uncertainty", " Chapter 5 Statistical maps 5.1 Statistical distribution maps The previous chapter demonstrated how the choice of a classification scheme can generate different looking maps. Your choice of classification breaks should be driven by the data. This chapter will focus on statistical approaches to generating classification breaks. Many spatial datasets consist of continuous values. As such, one can have as many unique values as there are unique polygons in a data layer. For example, a Massachusetts median household income map where a unique color is assigned to each unique value will look like this: Figure 5.1: Example of a continuous color scheme applied to a choropleth map. Such a map may not be as informative as one would like it to be. 
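The classification breaks illustrated in the previous chapter, and those used throughout this chapter, can also be computed directly in R with the classInt package. The sketch below is illustrative only; inc is a hypothetical vector of income values, not a dataset distributed with this book.

```r
library(classInt)

# A hypothetical batch of household income values.
inc <- c(21000, 23500, 25800, 30100, 34800, 41200, 52000, 68700, 90100, 131900)

classIntervals(inc, n = 5, style = "quantile")  # equal number of observations per class
classIntervals(inc, n = 5, style = "equal")     # equal-width classes
classIntervals(inc, n = 5, style = "jenks")     # natural breaks (clusters in the data)
```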
In statistics, we seek to reduce large sets of continuous values to discrete entities to help us better “handle” the data. Discretization of values can take on the form of a histogram where values are assigned to one of several equal-width bins. A choropleth map classification equivalent is the equal interval classification scheme. Figure 5.2: An equal interval choropleth map using 10 bins. The histogram in the above figure is “flipped” so as to match the bins with the color swatches. The length of each grey bin reflects the number of polygons assigned their respective color swatches. An equal interval map benefits from having each color swatch cover an equal range of values. This makes it easier to compare differences between pairs of swatches. Note that a sequential color scheme is used since there is no implied central value in this classification scheme. 5.1.1 Quantile map While an equal interval map benefits from its intuitiveness, it may not be well suited for data that are not uniformly distributed across their range (note the disproportionate distribution of observations in each color bin in the above figure). Quantiles define ranges of values that have an equal number of observations. For example, the following plot groups the data into six quantiles with each quantile representing the same number of observations (exceptions exist when multiple observations share the exact same value). Figure 5.3: Example of a quantile map. You’ll note the differing color swatch lengths in the color bar reflecting the different ranges of values covered by each color swatch. For example, the darkest color swatch covers the largest range of values, [131944, 250001], yet it is applied to the same number of polygons as most other color swatches in this classification scheme. 5.1.2 Boxplot map The discretization of continuous values can also include measures of centrality (e.g. the mean and the median) and measures of spread (e.g. standard deviation units) with the goal of understanding the nature of the distribution such as its shape (e.g. symmetrical, skewed, etc…) and range. The boxplot is an example of a statistical plot that offers both. This plot reduces the data to five summary statistics: the median, the upper and lower quartiles (within which 50% of the data lie–also known as the interquartile range, IQR), and upper and lower “whiskers” that extend to no more than 1.5 times the interquartile range beyond the quartiles. The boxplot may also display “outliers”–data points that may be deemed unusual or not characteristic of the bulk of the data. Figure 5.4: Example of a boxplot map. Here, we make use of a divergent color scheme to take advantage of the implied measure of centrality (i.e. the median). 5.1.3 IQR map The IQR map is a reduction of the boxplot map whereby we reduce the classes to just three: the interquartile range (IQR) and the upper and lower extremes. The map’s purpose is to highlight the polygons covering the mid 50% range of values. This mid range usually benefits from a darker hue to help distinguish it from the upper and lower sets of values. Figure 5.5: Example of an IQR map. The IQR map differs from the preceding maps shown in this chapter in that upper and lower values are no longer emphasized–whether implicitly or explicitly.
While these maps consistently highlighted the prominent east-west gradient in income values with the higher values occurring in the east and the lower values occurring in the west, the IQR map reveals that the distribution of middle income households follows a pattern that is more dispersed across the state of Massachusetts. 5.1.4 Standard deviation map If the data distribution can be approximated by a Normal distribution (a theoretical distribution defined by a mathematical function), the classification scheme can be broken up into different standard deviation units. Figure 5.6: Example of a standard deviation map. You’ll note from the figure that the income data do not follow a Normal distribution exactly–they have a slight skew toward higher values. This results in more polygons being assigned higher class breaks than lower ones. 5.1.5 Outlier maps So far, emphasis has been placed on the distribution of values across their full range. However, there may be times when we want to place emphasis on the extreme values. For example, we may want to generate a map that identifies the regions with unusually high or unusually low values. What constitutes an outlier can be subjective. For this reason, we will rely on statistical techniques covered in the previous sections to help characterize regions with unusually high and/or low values. 5.1.5.1 Boxplot outlier map We can tweak the boxplot map from the last section by assigning darker hues to observations outside the whiskers (outliers) and a single light colored hue to all other values. By minimizing the range of color swatches, we place emphasis on the outliers. Figure 5.7: Example of a boxplot outlier choropleth map. You’ll note the asymmetrical distribution of outliers with a bit more than a dozen regions with unusually high income values and just one region with unusually low income values. 5.1.5.2 Standard deviation outliers In this next example, we use the +/- 2 standard deviation bounds from the Normal distribution to identify outliers in the income data. Hence, if the data were to follow a perfectly Normal distribution, this would translate to roughly the top 2.5% and bottom 2.5% of the distribution. Figure 5.8: Example of a standard deviation outlier choropleth map. 5.1.5.3 Quantile outliers In this last example, we’ll characterize the top and bottom 2.5% of values as outliers by splitting the data into 40 quantiles, then mapping the top and bottom quantiles to capture the 2.5% of values. Figure 5.9: Example of a quantile outlier choropleth map where the top and bottom 2.5% regions are characterized as outliers. 5.2 Mapping uncertainty Many census datasets such as the U.S. Census Bureau’s American Community Survey (ACS) data are based on surveys from small samples. This means that the variables provided by the Census Bureau are only estimates with a level of uncertainty often provided as a margin of error (MoE) or a standard error (SE). Note that the Bureau’s MoE encompasses a 90% confidence interval (i.e. there is a 90% chance that the MoE range covers the true value being estimated). This poses a challenge to both the visual exploration of the data and any statistical analysis of that data. One approach to mapping both estimates and SEs is to display both as side-by-side maps. Figure 5.10: Maps of income estimates (left) and associated standard errors (right).
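Because the ACS margins of error are reported at the 90% confidence level, they can be converted to standard errors by dividing by 1.645 (the 90% critical value of the Normal distribution). The R sketch below uses made-up numbers to illustrate the conversion; the estimate and MoE values are hypothetical and are not taken from the data mapped in this chapter.

```r
est <- 24000   # hypothetical county income estimate
moe <- 2300    # hypothetical 90% margin of error reported with that estimate

se <- moe / 1.645           # standard error implied by the MoE
ci <- est + c(-1, 1) * moe  # the 90% confidence interval covered by the MoE

se
ci
```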
While there is nothing inherently wrong in doing this, it can prove to be difficult to mentally process the two maps, particularly if the data consist of hundreds or thousands of small polygons. Another approach is to overlay the measure of uncertainty (SE or MoE) as a textured layer on top of the income layer. Figure 5.11: Map of estimated income (in shades of green) superimposed with different hash marks representing the ranges of income SE. Or, one could map both the upper and lower ends of the MoE range side by side. Figure 5.12: Maps of top end of 90 percent income estimate (left) and bottom end of 90 percent income estimate (right). 5.2.1 Problems in mapping uncertainty Attempting to convey uncertainty using the aforementioned maps fails to highlight the reason one chooses to map values in the first place: that is to compare values across a spatial domain. More specifically, we are interested in identifying spatial patterns of high or low values. What is implied in the above maps is that the estimates will always maintain their order across the polygons. In other words, if one polygon’s estimate is greater than all neighboring estimates, this order will always hold true if another sample was surveyed. But this assumption is incorrect. Each polygon (or county in the above example) can derive different estimates independently from its neighboring polygon. Let’s look at a bar plot of our estimates. Figure 5.13: Income estimates by county with 90 percent confidence interval. Note that many counties have overlapping estimate ranges. Note, for example, how Piscataquis county’s income estimate (grey point in the graphic) is lower than that of Oxford county. If another sample of the population was surveyed in each county, the new estimates could place Piscataquis above Oxford county in income rankings as shown in the following example: Figure 5.14: Example of income estimates one could expect to sample based on the 90 percent confidence interval shown in the previous plot. Note how, in this sample, Oxford’s income drops in ranking below that of Piscataquis and Franklin counties. A similar change in ranking is observed for Sagadahoc county which drops down two counties: Hancock and Lincoln. How does the estimated income map compare with the simulated income map? Figure 5.15: Original income estimate (left) and realization of a simulated sample (right). A few more simulated samples (using the 90% confidence interval) are shown below: Figure 5.16: Original income estimate (left) and realizations from simulated samples (R2 through R5). 5.2.2 Class comparison maps There is no single solution to effectively convey both estimates and associated uncertainty in a map. Sun and Wong (Sun and Wong 2010) offer several suggestions dependent on the context of the problem. One approach adopts a class comparison method whereby a map displays both the estimate and a measure of whether the MoE surrounding that estimate extends beyond the assigned class. For example, if we adopt the classification breaks [0 , 20600 , 22800 , 25000 , 27000 , 34000 ], we will find that many of the estimates’ MoE extend beyond the classification breaks assigned to them. Figure 5.17: Income estimates by county with 90 percent confidence interval. Note that many of the counties’ MoE have ranges that cross into an adjacent class. Take Piscataquis county, for example. 
Its estimate is assigned the second classification break (20600 to 22800), yet its lower confidence interval stretches into the first classification break, indicating that we cannot be 90% confident that the estimate is assigned the proper class (i.e. its true value could fall into the first class). Other counties such as Cumberland and Penobscot don’t have that problem since their 90% confidence intervals fall inside the classification breaks. This information can be mapped as a hatch mark overlay. For example, income could be plotted using varying shades of green with hatch symbols indicating if the lower interval crosses into a lower class (135° hatch), if the upper interval crosses into an upper class (45° hatch), if both interval ends cross into a different class (90°-vertical-hatch) or if both interval ends remain inside the estimate’s class (no hatch). Figure 5.18: Plot of income with class comparison hatches. 5.2.3 Problem when performing bivariate analysis Data uncertainty issues do not only affect choropleth map presentations but also affect bivariate or multivariate analyses where two or more variables are statistically compared. One popular method in comparing variables is the regression analysis where a line is best fit to a bivariate scatterplot. For example, one can regress “percent not schooled” against “income” as follows: Figure 5.19: Regression between percent not having completed any school grade and median per capita income for each county. The \\(R^2\\) value associated with this regression analysis is 0.2 and the p-value is 0.081. But another realization of the survey could produce the following output: Figure 5.20: Example of what a regression line could look like had another sample been surveyed for each county. With this new (simulated) sample, the \\(R^2\\) value dropped to 0.07 and the p-value is now 0.322–a much less significant relationship than the one computed with the original estimate! In fact, if we were to survey 1000 different samples within each county, we would get the following range of regression lines: Figure 5.21: A range of regression lines computed from different samples from each county. These overlapping lines define a type of confidence interval (aka confidence envelope). In other words, the true regression line between both variables lies somewhere within the dark region delineated by this interval. References "],["pitfalls-to-avoid.html", "Chapter 6 Pitfalls to avoid 6.1 Representing Count 6.2 MAUP 6.3 Ecological Fallacy 6.4 Mapping rates 6.5 Coping with Unstable Rates", " Chapter 6 Pitfalls to avoid 6.1 Representing Count Let’s define a 5km x 5km area and map the location of each individual inside the study area. Let’s assume, for the sake of argument, that individuals are laid out in a perfect grid pattern. Now let’s define two different zoning schemes: one which follows a uniform grid pattern and another that does not. The layout of individuals relative to both zonal schemes is shown in Figure 6.1. Figure 6.1: Figure shows the layout of individuals inside two different zonal unit configurations. If we sum the number of individuals in each polygon, we get two maps that appear to be giving us two completely different population distribution patterns: Figure 6.2: Count of individuals in each zonal unit. Note how an underlying point distribution can generate vastly different looking choropleth maps given different aggregation schemes.
The maps highlight how non-uniform areal units can fool us into thinking a pattern exists when in fact this is just an artifact of the aggregation scheme. A solution to this problem is to represent counts as ratios such as number of deaths per number of people or number of people per square kilometer. In Figure 6.3, we opt for the latter ratio (number of people per square kilometer). Figure 6.3: Point density choropleth maps. The sample study extent is 20x20 units which generates a uniform point density of 1. The slight discrepancy in values for the map on the right is to be expected given that the zonal boundaries do not split the distance between points exactly. 6.2 MAUP Continuing with the uniform point distribution from the last section, let’s assume that as part of the survey, two variables (v1 and v2) were recorded for each point (symbolized as varying shades of greens and reds in the two left-hand maps of Figure 6.4). We might be interested in assessing if the variables v1 and v2 are correlated (i.e. as variable v1 increases in value, does this trigger a monotonic increase or decrease in variable v2?). One way to visualize the relationship between two variables is to generate a bivariate scatter plot (right plot of Figure 6.4). Figure 6.4: Plots of variables v1 and v2 for each individual in the survey. The color scheme is sequential with darker colors depicting higher values and lighter colors depicting lower values. It’s obvious from the adjoining scatter plot that there is little to no correlation between variables v1 and v2 at the individual level; both the slope and coefficient of determination, \\(R^2\\), are close to \\(0\\). But many datasets (such as the US census data) are provided to us not at the individual level but at various levels of aggregation units such as the census tract, county or state level. When aggregated, the relationship between the variables under investigation may change. For example, if we aggregate v1 and v2 using the uniform aggregation scheme highlighted earlier, we get the following relationship. Figure 6.5: Data summarized using a uniform aggregation scheme. The resulting regression analysis is shown in the right-hand plot. Note the slight increase in slope and \\(R^2\\) values. If we aggregate the same point data using the non-homogeneous aggregation scheme, we get yet another characterization of the relationship between v1 and v2. Figure 6.6: Data summarized using a non-uniform aggregation scheme. The resulting regression analysis is shown in the right-hand plot. Note the high \\(R^2\\) value, yet the underlying v1 and v2 variables from which the aggregated values were computed were not at all correlated! It should be clear by now that different aggregation schemes can result in completely different analysis outcomes. In fact, it is quite possible to come up with an aggregation scheme that would produce a near perfect correlation between variables v1 and v2. This problem is often referred to as the modifiable areal unit problem (MAUP) and has, as you can well imagine by now, some serious implications. Unfortunately, this problem is often overlooked in many analyses that involve aggregated data. 6.3 Ecological Fallacy But, as is often the case, our analysis is constrained by the data at hand. So when analyzing aggregated data, you must be careful in how you frame the results.
For example, if your analysis was conducted with the data summarized using the non-uniform aggregation scheme shown in Figure 6.6, you might be tempted to state that there is a strong relationship between variables v1 and v2 at the individual level. But doing so leads to the ecological fallacy where the statistical relationship at one level of aggregation is (wrongly) assumed to hold at any other level of aggregation (including at the individual level). In fact, all you can really say is that “at this level of aggregation, we observe a strong relationship between v1 and v2” and nothing more! 6.4 Mapping rates One of the first pitfalls you’ve been taught to avoid is the mapping of counts when the areal units associated with these values are not uniform in size and shape. Two options in resolving this problem are: normalizing counts to area or normalizing counts to some underlying population count. An example of the latter is the mapping of infection rates or mortality rates. For example, the following map displays the distribution of kidney cancer death rates (by county) for the period 1980 to 1984. Figure 6.7: Kidney cancer death rates for the period spanning 1980-1984. Now let’s look at the top 10% of counties with the highest death rates. Figure 6.8: Top 10% of counties with the highest kidney cancer death rates. And now let’s look at the bottom 10% of counties with the lowest death rates. Figure 6.9: Bottom 10% of counties with the lowest kidney cancer death rates. A quick glance at these maps suggests clustering of high and low rates around the same parts of the country. In fact, if you were to explore these maps in a GIS, you would note that many of the bottom 10% counties are adjacent to the top 10% counties! If local environmental factors are to blame for kidney cancer deaths, why would they be present in one county and not in an adjacent county? Could differences in regulations between counties be the reason? These are hypotheses that one would probably want to explore, but before pursuing them, it would behoove us to look a bit more closely at the batch of numbers we are working with. Let’s first look at a population count map (note that we are purposely not normalizing the count data). Figure 6.10: Population count for each county. Note that a quantile classification scheme is adopted, forcing a large range of values to be assigned a single color swatch. The central part of the country where we are observing both very high and very low cancer death rates seems to have low population counts. Could population count have something to do with this odd congruence of high and low cancer rates? Let’s explore the relationship between death rates and population counts outside of a GIS environment and focus solely on the two batches of numbers. The following plot is a scatterplot of death rates and population counts. Figure 6.11: Plot of rates vs population counts. Note the skewed nature of both data batches. Transforming both variables reveals much more about the relationship between them. Figure 6.12: Plot of rates vs population counts on log scales. One quickly notices a steady decrease in death rate variability about some central value of ~0.000045 (or 4.5e-5) as the population count increases. This is because lower population counts tend to generate the very high and very low rates observed in our data. This raises the question: does low population count cause very high and low cancer death rates, or is this simply a numerical artifact?
To answer this question, let’s simulate some data. Let’s assume that the real death rate is 5 per 100,000 people. If a county has a population of 1000, then \\(1000 \\times 5e-5 = 0.05\\) persons would die of kidney cancer; when rounded to the nearest whole person, that translates to \\(0\\) deaths in that county. Now, there is still the possibility that a county of 1000 could have one person succumb to the disease, in which case the death rate for that county would be \\(1/1000=0.001\\) or 1 in 1000, a rate much greater than the expected rate of 5 in 100,000! This little exercise reveals that you could never calculate a rate of 5 in 100,000 with a population count of just 1000. You either compute a rate of \\(0\\) or a rate of \\(0.001\\) (or more). In fact, you would need a large population count to accurately estimate the real death rate. Turning our attention back to our map, you will notice that a large majority of the counties have a small population count (about a quarter have a population count of 22,000 or less). This explains the wide range of rates observed for these smaller counties; the larger counties don’t have such a wide swing in values because they have a larger sample size which can more accurately reflect the true death rate. Rates that are computed using relatively small “at risk” population counts are deemed unstable. 6.5 Coping with Unstable Rates To compensate for the small population counts, we can minimize the influence those counties have on the representation of the spatial distribution of rates. One such technique, the empirical Bayes (EB) method, does just that. Where county population counts are small, the “rates” are modified to match the overall expected rate (which is an average value of all rates in the map). This minimizes the counties’ influence on the range of rate values. EB techniques for rate smoothing aren’t available in ArcGIS but are available in a couple of free and open source applications such as GeoDa and R. An example implementation in R is shown in the Appendix section. An EB smoothed representation of kidney cancer deaths gives us the following rate vs population plot: Figure 6.13: Plot of EB smoothed rates vs population counts on log scales. The variability in rates for smaller counties has decreased. The range of rate values has dropped from 0.00045 to 0.00023. Variability is still greater for smaller counties than larger ones, but not as pronounced as it was with the raw rates. Maps of the top 10% and bottom 10% EB smoothed rates are shown in the next two figures. Figure 6.14: Top 10% of counties with the highest kidney cancer death rates using EB smoothing techniques. Figure 6.15: Bottom 10% of counties with the lowest kidney cancer death rates using EB smoothing technique. Note the differences in rate distribution. For example, higher rates now show up in Florida, which would be expected given the large retirement population, and clusters are now contiguous, which could suggest local effects. But it’s important to remember that EB smoothing does not reveal the true underlying rate; it only masks those that are unreliable. Also, EB smoothing does not completely eliminate unstable rates–note the slightly higher rates for low population counts in Figure 6.15. Other solutions to the unstable rate problem include: Grouping small counties into larger ones–thus increasing population sample size. Increasing the study’s time interval.
In this example, data were aggregated over the course of 5 years (1980-1984) but the interval could be increased by adding 5 more years, thus increasing sample sizes in each county. Grouping small counties AND increasing the study’s time interval. These solutions do have their downside in that they decrease the spatial and/or temporal resolution. It should be clear by now that there is no single one-size-fits-all solution to the unstable rate problem. A sound analysis will usually require that one or more of the aforementioned solutions be explored. "],["good-map-making-tips.html", "Chapter 7 Good Map Making Tips 7.1 Elements of a map 7.2 How to create a good map 7.3 Typefaces and Fonts", " Chapter 7 Good Map Making Tips 7.1 Elements of a map A map can be composed of many different map elements. They may include: main map body, legend, title, scale indicator, orientation indicator, inset map, and source and ancillary information. Not all elements need to be present in a map. In fact, in some cases they may not be appropriate at all. A scale bar, for instance, may not be appropriate if the coordinate system used does not preserve distance across the map’s extent. Knowing why and for whom a map is being made will dictate its layout. If it’s to be included in a paper as a figure, then parsimony should be the guiding principle. If it’s intended to be a standalone map, then additional map elements may be required. Knowing the intended audience should also dictate what you will convey and how. If it’s a general audience with little technical expertise, then a simpler presentation may be in order. If the audience is well versed in the topic, then the map may be more complex. Figure 7.1: Map elements. Note that not all elements are needed, nor are they appropriate in some cases. Can you identify at least one element that does not belong in the map (hint: note the orientation of the longitudinal lines; are they parallel to one another? What implication does this have on the North direction and the placement of the North arrow?) 7.2 How to create a good map Here’s an example of a map layout that showcases several bad practices. Figure 7.2: Example of a bad map. Can you identify the problematic elements in this map? A good map establishes a visual hierarchy that ensures that the most important elements are at the top of this hierarchy and the least important are at the bottom. Typically, the top elements should consist of the main map body, the title (if this is a standalone map) and a legend (when appropriate). When showcasing choropleth maps, it’s best to limit the color swatches to fewer than a dozen–it becomes difficult for the viewer to tie too many different colors in a map to a color swatch element in the legend. Also, classification breaks should not be chosen at random but should be chosen carefully; for example, adopting a quantile classification scheme to maximize the inclusion of the different color swatches in the map, or a classification system based on logical (or easy to interpret) breaks when dictated by theory or cultural predisposition. Scale bars and north arrows should be used judiciously and need not be present in every map. These elements are used to measure orientation and distances. Such elements are critical in reference maps such as USGS Topo maps and navigation maps but serve little purpose in a thematic map where the goal is to highlight differences between areal units.
If, however, these elements are to be placed in a thematic map, reduce their visual prominence (see Figure 7.3 for examples of scale bars). The same principle applies to the selection of an orientation indicator (north arrow) element. Use a small north arrow design if it is to be placed low in the hierarchy, larger if it is to be used as a reference (such as a nautical chart). Figure 7.3: Scale bar designs from simplest (top) to more complex (bottom). Use the simpler design if it’s to be placed low in the visual hierarchy. Title and other text elements should be concise and to the point. If the map is to be embedded in a write-up such as a journal article, book or web page, title and text(s) elements should be omitted in favor of figure captions and written description in the accompanying text. Following the aforementioned guidelines can go a long way in producing a good map. Here, a divergent color scheme is chosen whereby the two hues converge to the median income value. A coordinate system that minimizes distance error measurements and that preserves “north” orientation across the main map’s extent is chosen since a scale bar and north arrow are present in the map. The inset map (lower left map body) is placed lower in the visual hierarchy and could be omitted if the intended audience was familiar with the New England area. A unique (and unconventional) legend orders the color swatches in the order in which they appear in the map (i.e. following a strong north-south income gradient). Figure 7.4: Example of an improved map. 7.3 Typefaces and Fonts Maps may include text elements such as labels and ancillary text blocks. The choice of typeface (font family) and font (size, weight and style of a typeface) can impact the legibility of the map. A rule of thumb is to limit the number of fonts to two: a serif and a sans serif font. Figure 7.5: Serif fonts are characterized by brush strokes at the letter tips (circled in red in the figure). Sans Serif fonts are devoid of brush strokes. Serif fonts are generally used to label natural features such as mountain ridges and water body names. Sans serif fonts are usually used to label anthropogenic features such as roads, cities and countries. Varying the typeset size across the map should be avoided unless a visual hierarchy of labels is desired. You also may want to stick with a single font color across the map unless the differences in categories need to be emphasized. In the following example, a snapshot of a map before (left) and after (right) highlight how manipulating typeset colors and styles (i.e. italic, bold) can have a desirable effect if done properly. Figure 7.6: The lack of typeset differences makes the map on the left difficult to differentiate county names from lake/river names. The judicious use of font colors and style on the right facilitate the separation of features. "],["spatial-operations-and-vector-overlays.html", "Chapter 8 Spatial Operations and Vector Overlays 8.1 Selection by Attribute 8.2 Selection by location 8.3 Vector Overlay", " Chapter 8 Spatial Operations and Vector Overlays 8.1 Selection by Attribute Features in a GIS layer can be selected graphically or by querying attribute values. For example, if a GIS layer represents land parcels, one could use the Area field to select all parcels having an area greater than 2.0 acres. Set algebra is used to define conditions that are to be satisfied while Boolean algebra is used to combine a set of conditions. 
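For readers who prefer an open source route, the same kind of attribute query can be expressed in R, where a vector layer read with the sf package behaves like a data frame and the set and Boolean algebra described in the next two subsections reduce to ordinary logical subsetting. The sketch below is hypothetical: parcels.shp, its Area field (in acres) and its Zoning field are stand-ins mirroring the example above, not files provided with this course.

```r
library(sf)

# Read a hypothetical land parcels layer.
parcels <- st_read("parcels.shp")

# Set algebra: parcels with an area greater than 2 acres
# (note that R uses == rather than = to test for equality).
big <- parcels[parcels$Area > 2.0, ]

# Boolean algebra: combine conditions with & (and) or | (or).
big_res <- parcels[parcels$Area > 2.0 & parcels$Zoning == "Residential", ]
```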
8.1.1 Set Algebra Set algebra consists of four basic operators: less than (<), greater than (>), equal to (=) and not equal to (<>). In some programming environments (such as R and Python), the equality condition is presented as two equal signs, ==, and not one. In such an environment x = 3 is interpreted as “pass the value 3 to x” and x == 3 is interpreted as “is x equal to 3?”. If you have a GIS layer of major cities and you want to identify all cities having a population count greater than 50000, you would write the expression as \"POP\" > 50000 (of course, this assumes that the attribute field name for population count is POP). Figure 8.1: An example of the Select Layer by Attributes tool in ArcGIS Pro where the pull-down menu is used to define the selection. Figure 8.2: An example of the Select Layer by Attributes tool in ArcGIS Pro where the SQL syntax is used to define the selection. Figure 8.3: Selected cities meeting the criterion are shown in cyan color in ArcGIS Pro. The result of this operation is a selected subset of the Cities point layer. Note that in most GIS applications the selection process does not create a new layer. 8.1.2 Boolean Algebra You can combine conditions from set algebra operations using the following Boolean algebra operators: or (at least one of the conditions must be met), and (both conditions must be met), not (the condition must not be met). Following up on the last example, let’s now select cities having a population greater than 50000 that are in the US (and not in Canada or Mexico). Assuming that the country field is labeled FIPS_CNTRY, we could set up the expression as \"POP\" > 50000 AND \"FIPS_CNTRY\" = 'US'. Note that a value need not be numeric. In this example we are asking that an attribute value equal a specific string value (i.e. that it equal the string 'US'). Figure 8.4: Selected cities meeting the POP > 50000 AND FIPS_CNTRY = 'US' criteria are shown in cyan color. 8.2 Selection by location We can also select features from one GIS layer based on their spatial association with another GIS layer. This type of spatial association can be broken down into four categories: adjacency (whether features from one layer share a boundary with features of another), containment (whether features from one layer are inside features of another), intersection (whether features of one layer intersect features of another), and distance (whether one feature is within a certain distance from another). Continuing with our working example, we might be interested in Cities that are within 100 miles of earthquakes. The earthquake points are from another GIS layer called Earthquakes. Figure 8.5: An example of a Select Layer by Location tool in ArcGIS Pro. The spatial association chosen is distance. Figure 8.6: Selected cities meeting the criterion are shown in cyan color. 8.3 Vector Overlay The concept of vector overlay is not new and goes back many decades–even before GIS became ubiquitous. It was once referred to as sieve mapping by land use planners who combined different layers–each mapped onto separate transparencies–to isolate or eliminate areas that did or did not meet a set of criteria. Map overlay refers to a group of procedures and techniques used in combining information from different data layers. This is an important capability of most GIS environments. Map overlays involve at least two input layers and result in at least one new output layer. A basic set of overlay tools includes clipping, intersecting and unioning.
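The three overlay tools described in the following subsections also have open source counterparts. Below is a minimal, hedged sketch using R's sf package; the file names and object names (maine_counties.shp, circle.shp, counties, circle) are hypothetical stand-ins echoing the Maine counties and circle polygon used in the figures that follow.

```r
library(sf)

counties <- st_read("maine_counties.shp")  # hypothetical county polygons
circle   <- st_read("circle.shp")          # hypothetical overlay polygon

# Clip-like overlay: intersect with a bare geometry so only the county attributes survive.
clipped <- st_intersection(counties, st_geometry(circle))

# Intersect overlay: intersect two attribute-bearing layers so attributes from both carry through.
intersected <- st_intersection(counties, circle)

# A union-style overlay can be assembled from the intersection plus the pieces unique to
# each input, e.g. st_difference(counties, st_union(circle)) for the county-only parts.
```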
8.3.1 Clip Clipping takes one GIS layer (the clip feature) and another GIS layer (the to-be-clipped input feature). The output is a clipped version of the original input layer. The output attributes table is a subset of the original attributes table where only records for the clipped polygons are preserved. Figure 8.7: The Maine counties polygon layer is clipped to the circle polygon. Note that the output layer is limited to the county polygon geometry and its attributes (and does not include the clipping circle polygon). 8.3.2 Intersect Intersecting takes both layers as inputs, then outputs the features from both layers that share the same spatial extent. Note that the output attribute table inherits attributes from both input layers (this differs from clipping where attributes from just one layer are carried through). Figure 8.8: The Maine counties polygon layer is intersected with the circle polygon. The output layer combines both intersecting geometries and attributes. 8.3.3 Union Unioning overlays both input layers and outputs all features from the two layers. Features that overlap are intersected, creating new polygons. This overlay usually produces more polygons than are present in both input layers combined. The output attributes table contains attribute values from both input features (note that only a subset of the output attributes table is shown in the following figure). Figure 8.9: The Maine counties polygon layer is unioned with the circle polygon. The output layer combines both (complete) geometries and attributes. Where spatial overlaps do not occur, most software will either assign a NULL value or a 0. "],["chp09_0.html", "Chapter 9 Coordinate Systems 9.1 Geographic Coordinate Systems 9.2 Projected Coordinate Systems 9.3 Spatial Properties 9.4 Geodesic geometries", " Chapter 9 Coordinate Systems Implicit with any GIS data is a spatial reference system. It can consist of a simple arbitrary reference system such as a 10 m x 10 m sampling grid in a wood lot or the boundaries of a soccer field, or it can consist of a geographic reference system, i.e. one where the spatial features are mapped to an earth based reference system. The focus of this topic is on earth reference systems which can be based on a Geographic Coordinate System (GCS) or a Projected Coordinate System (PCS). 9.1 Geographic Coordinate Systems A geographic coordinate system is a reference system for identifying locations on the curved surface of the earth. Locations on the earth’s surface are measured in angular units from the center of the earth relative to two planes: the plane defined by the equator and the plane defined by the prime meridian (which crosses Greenwich, England). A location is therefore defined by two values: a latitudinal value and a longitudinal value. Figure 9.1: Examples of latitudinal lines are shown on the left and examples of longitudinal lines are shown on the right. The 0° reference lines for each are shown in red (equator for latitudinal measurements and prime meridian for longitudinal measurements). A latitude measures the angle from the equatorial plane to the location on the earth’s surface. A longitude measures the angle between the prime meridian plane and the north-south plane that intersects the location of interest. For example, Colby College is located at around 45.56° North and 69.66° West. In a GIS, the North-South and East-West directions are encoded as signs. North and East are assigned a positive (+) sign and South and West are assigned a negative (-) sign.
Colby College’s location is therefore encoded as +45.56° and -69.66°. Figure 9.2: A slice of earth showing the latitude and longitude measurements. A GCS is defined by an ellipsoid, geoid and datum. These elements are presented next. 9.1.1 Sphere and Ellipsoid Assuming that the earth is a perfect sphere greatly simplifies mathematical calculations and works well for small-scale maps (maps that show a large area of the earth). However, when working at larger scales, an ellipsoid representation of earth may be desired if accurate measurements are needed. An ellipsoid is defined by two radii: the semi-major axis (the equatorial radius) and the semi-minor axis (the polar radius). The reason the earth has a slightly ellipsoidal shape has to do with its rotation, which induces a centrifugal force along the equator. This results in an equatorial axis that is roughly 21 km longer than the polar axis. Figure 9.3: The earth can be mathematically modeled as a simple sphere (left) or an ellipsoid (right). Our estimate of these radii is quite precise thanks to satellite and computational capabilities. The semi-major axis is 6,378,137 meters and the semi-minor axis is 6,356,752 meters. Differences in distance measurements along the surfaces of an ellipsoid vs. a sphere are small but measurable (the difference can be as high as 20 km) as illustrated in the following lattice plots. Figure 9.4: Differences in distance measurements between the surface of a sphere and an ellipsoid. Each graphic plots the differences in distance measurements made from a single point location along the 0° meridian identified by the green colored box (latitude value) to various latitudinal locations along a longitude (whose value is listed in the bisque colored box). For example, the second plot from the top-left corner shows the differences in distance measurements made from a location at 90° north (along the prime meridian) to a range of latitudinal locations along the 45° meridian. 9.1.2 Geoid Representing the earth’s true shape, the geoid, as a mathematical model is crucial for a GIS environment. However, the earth’s shape is not a perfectly smooth surface. It has undulations resulting from changes in gravitational pull across its surface. These undulations may not be visible with the naked eye, but they are measurable and can influence locational measurements. Note that we are not including mountains and ocean bottoms in our discussion; instead, we are focusing solely on the earth’s gravitational potential which can be best visualized by imagining the earth’s surface completely immersed in water and measuring the distance from the earth’s center to the water surface over the entire earth surface. Figure 9.5: Earth’s EGM 2008 geoid. The undulations depicted in the graphic are exaggerated x4000. The earth’s gravitational field is dynamic and is tied to the flow of the earth’s hot and fluid core. Hence its geoid is constantly changing, albeit at a large temporal scale. The measurement and representation of the earth’s shape is at the heart of geodesy–a branch of applied mathematics. 9.1.3 Datum So how are we to reconcile our need to work with a (simple) mathematical model of the earth’s shape with the undulating nature of the earth’s surface (i.e. its geoid)? The solution is to align the geoid with the ellipsoid (or sphere) representation of the earth and to map the earth’s surface features onto this ellipsoid/sphere.
The alignment can be local, where the ellipsoid surface is closely fit to the geoid at a particular location on the earth’s surface (such as the state of Kansas), or geocentric, where the ellipsoid is aligned with the center of the earth. How one chooses to align the ellipsoid to the geoid defines a datum. Figure 9.6: Alignment of a geoid with a spheroid or ellipsoid helps define a datum. 9.1.3.1 Local Datum Figure 9.7: A local datum couples a geoid with the ellipsoid at a location on each element’s surface. There are many local datums to choose from; some are old while others are more recently defined. The choice of datum is largely driven by the location of interest. For example, when working in the US, a popular local datum to choose from is the North American Datum of 1927 (or NAD27 for short). NAD27 works well for the US but it’s not well suited for other parts of the world. For example, a far better local datum for Europe is the European Datum of 1950 (ED50 for short). Examples of common local datums are shown in the following table: Local datum Acronym Best for Comment North American Datum of 1927 NAD27 Continental US This is an old datum but still prevalent because of the wide use of older maps. European Datum of 1950 ED50 Western Europe Developed after World War II and still quite popular today. Not used in the UK. World Geodetic System 1972 WGS72 Global Developed by the Department of Defense. 9.1.3.2 Geocentric Datum Figure 9.8: A geocentric datum couples a geoid with the ellipsoid at each element’s center of mass. Many of the modern datums use a geocentric alignment. These include the popular World Geodetic System of 1984 (WGS84) and the North American Datum of 1983 (NAD83). Most of the popular geocentric datums use the WGS84 ellipsoid or the GRS80 ellipsoid. These two ellipsoids share nearly identical semi-major and semi-minor axes: 6,378,137 meters and 6,356,752 meters respectively. Examples of popular geocentric datums are shown in the following table: Geocentric datum Acronym Best for Comment North American Datum of 1983 NAD83 Continental US This is one of the most popular modern datums for the contiguous US. European Terrestrial Reference System 1989 ETRS89 Western Europe This is the most popular modern datum for much of Europe. World Geodetic System 1984 WGS84 Global Developed by the Department of Defense. 9.1.4 Building the Geographic Coordinate System A Geographic Coordinate System (GCS) is defined by the ellipsoid model and by the way this ellipsoid is aligned with the geoid (thus defining the datum). It is important to know which GCS is associated with a GIS file or a map document reference system. This is particularly true when the overlapping layers are tied to different datums (and therefore different GCSs). This is because a location on the earth’s surface can take on different coordinate values. For example, a location recorded in an NAD 1927 GCS having a coordinate pair of 44.56698° north and 69.65939° west will register a coordinate value of 44.56704° north and 69.65888° west in an NAD83 GCS and a coordinate value of 44.37465° north and 69.65888° west in a sphere-based WGS84 GCS. If the coordinate systems for these point coordinate values were not properly defined, then they could be misplaced on a map. This is analogous to recording temperature using different units of measure (degrees Celsius, Fahrenheit and Kelvin)–each unit of measure will produce a different numeric value.
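This is also the reason GIS software applies a datum transformation when layers tied to different GCSs are combined. The R sketch below is illustrative only; it simply re-expresses a single coordinate pair (similar to the one above) in two datums using the sf package and the standard EPSG codes 4267 (NAD27) and 4269 (NAD83). The exact shift reported will depend on the datum transformation selected by the underlying PROJ library.

```r
library(sf)

# A point defined in the NAD27 geographic coordinate system (EPSG:4267).
pt_nad27 <- st_sfc(st_point(c(-69.65939, 44.56698)), crs = 4267)

# The same location re-expressed in NAD83 (EPSG:4269); the coordinates shift slightly.
pt_nad83 <- st_transform(pt_nad27, crs = 4269)

st_coordinates(pt_nad27)
st_coordinates(pt_nad83)
```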
Figure 9.9: Map of the Colby flagpole in two different geographic coordinate systems (GCS NAD 1983 on the left and GCS NAD 1927 on the right). Note the offset in the 44.5639° line of latitude relative to the flagpole. Also note the 0.0005° longitudinal offset between both reference systems. 9.2 Projected Coordinate Systems The surface of the earth is curved but maps are flat. A projected coordinate system (PCS) is a reference system for identifying locations and measuring features on a flat (map) surface. It consists of lines that intersect at right angles, forming a grid. Projected coordinate systems (which are based on Cartesian coordinates) have an origin, an x axis, a y axis, and a linear unit of measure. Going from a GCS to a PCS requires mathematical transformations. The myriad projection types can be aggregated into three groups: planar, cylindrical and conical. 9.2.1 Planar Projections A planar projection (aka azimuthal projection) maps the earth’s surface features to a flat surface that touches the earth’s surface at a point (a tangent case) or intersects it along a circle (a secant case). This projection is often used in mapping polar regions but can be used for any location on the earth’s surface (in which case they are called oblique planar projections). Figure 9.10: Examples of three planar projections: orthographic (left), gnomonic (center) and equidistant (right). Each covers a different spatial range (with the latter covering both northern and southern hemispheres) and each preserves a unique set of spatial properties. 9.2.2 Cylindrical Projection A cylindrical map projection maps the earth’s surface onto a map rolled into a cylinder (which can then be flattened into a plane). The cylinder can touch the surface of the earth along a single line of tangency (a tangent case) or intersect it along two lines (a secant case). The cylinder can be tangent to the equator or it can be oblique. A special case is the transverse aspect, which is tangent to a meridian (a line of longitude). This is a popular projection used in defining the Universal Transverse Mercator (UTM) and State Plane coordinate systems. The UTM PCS covers the entire globe and is a popular coordinate system in the US. It’s important to note that the UTM PCS is broken down into zones, each 6° of longitude wide, which limits the extent of any one zone. For example, the State of Maine (USA) uses the UTM coordinate system (Zone 19 North) for most of its statewide GIS maps. Most USGS quad maps are also presented in a UTM coordinate system. Popular datums tied to the UTM coordinate system in the US include NAD27 and NAD83. There is also a WGS84 based UTM coordinate system. Distortion is minimized along the tangent or secant lines and increases as the distance from these lines increases. Figure 9.11: Examples of two cylindrical projections: Mercator (preserves shape but distorts area and distance) and equal-area (preserves area but distorts shape). 9.2.3 Conical Projection A conical map projection maps the earth’s surface onto a map rolled into a cone. Like the cylindrical projection, the cone can touch the surface of the earth along a single line of tangency (a tangent case) or intersect it along two lines (a secant case). Distortion is minimized along the tangent or secant lines and increases as the distance from these lines increases. When distance or area measurements are needed for the contiguous 48 states, use one of the conical projections such as Equidistant Conic (distance preserving) or Albers Equal Area Conic (area preserving).
Conical projections are also popular PCS’ in European maps such as Europe Albers Equal Area Conic and Europe Lambert Conformal Conic. Figure 9.12: Examples of three conical projections: Albers equal area (preserves area), equidistant (preserves distance) and conformal (preserves shape). 9.3 Spatial Properties All projections distort real-world geographic features to some degree. The four spatial properties that are subject to distortion are: shape, area, distance and direction. A map that preserves shape is called conformal; one that preserves area is called equal-area; one that preserves distance is called equidistant; and one that preserves direction is called azimuthal. For most GIS applications (e.g. ArcGIS and QGIS), many of the built-in projections are named after the spatial properties they preserve. Each map projection is good at preserving only one or two of the four spatial properties. So when working with small-scale (large area) maps and when multiple spatial properties are to be preserved, it is best to break the analyses across different projections to minimize errors associated with spatial distortion. If you want to assess a projection’s spatial distortion across your study region, you can generate Tissot indicatrix (TI) ellipses. The idea is to project a small circle (i.e. small enough so that the distortion remains relatively uniform across the circle’s extent) and to measure its distorted shape on the projected map. For example, in assessing the type of distortion one could expect with a Mollweide projection across the continental US, a grid of circles could be generated at regular latitudinal and longitudinal intervals. Note the varying levels of distortion type and magnitude across the region. Let’s explore a Tissot circle at 44.5°N and 69.5°W (near Waterville Maine): The plot shows a perfect circle (displayed in a filled bisque color) that one would expect to see if no distortion was at play. The blue distorted ellipse (the indicatrix) is the transformed circle for this particular projection and location. The green and red lines show the magnitude and direction of the ellipse’s major and minor axes respectively. These lines can also be used to assess scale distortion (note that scale distortion can vary as a function of bearing). The green line shows maximum scale distortion and the red line shows minimum scale distortion–these are sometimes referred to as the principal directions. In this working example, the principal directions are 1.1293 and 0.8856. A scale value of 1 indicates no distortion. A value less than 1 indicates a smaller-than-true scale and a value greater than 1 indicates a greater-than-true scale. Projections can distort scale, but this does not necessarily mean that area is distorted. In fact, for this particular projection, area is relatively well preserved despite distortion in principal directions. Area distortion can easily be computed by taking the product of the two aforementioned principal directions. In this working example, area distortion is 1.0001 (i.e. negligible). The north-south dashed line in the graphic shows the orientation of the meridian. The east-west dotted line shows the orientation of the parallel. It’s important to recall that these distortions occur at the point where the TI is centered and not necessarily across the region covered by the TI circle. 
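As a small numerical aside, the area distortion reported above is simply the product of the two principal directions, and a point can be projected for inspection with the sf package. This is a sketch only; the point below is a hypothetical location near Waterville, Maine.
library(sf)
# Hypothetical point near Waterville, Maine (longitude/latitude, WGS84)
pt <- st_sfc(st_point(c(-69.5, 44.5)), crs = 4326)
# Project the point to a Mollweide projected coordinate system
st_coordinates(st_transform(pt, "+proj=moll"))
# Area distortion at this location: product of the two principal directions
1.1293 * 0.8856   # ~1.0001, i.e. negligible area distortion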
9.4 Geodesic geometries Projected coordinate systems introduce errors in their geometric measurements because the distance between two points on a sphere or ellipsoid is difficult to replicate on a projected (flat) surface unless these points are relatively close to one another. In most cases, such errors can be tolerated if the expected level of precision is met; many other sources of error in the spatial representation of the features can often dwarf any measurement errors made in a projected coordinate system. However, if the scale of analysis is small (i.e. the spatial extent covers a large proportion of the earth’s surface such as the North American continent), then the measurement errors associated with a projected coordinate system may no longer be acceptable. A way to circumvent projected coordinate system limitations is to adopt a geodesic solution. A geodesic distance is the shortest distance between two points on an ellipsoid (or spheroid). Likewise, a geodesic area measurement is one that is measured on an ellipsoid. Such measurements are independent of the underlying projected coordinate system. The Tissot circles presented in figures from the last section were all generated using geodesic geometry. If you are not convinced of the benefits afforded by geodesic geometry, compare the distances measured between two points located on either side of the Atlantic in the following map. The blue solid line represents the shortest distance between the two points on a planar coordinate system. The red dashed line represents the shortest distance between the two points as measured on a spheroid. At first glance, the geodesic distance may seem nonsensical given its curved appearance on the projected map. However, this curvature is a byproduct of the current reference system’s increasing distance distortion as one progresses poleward. If you are still not convinced, you can display the geodesic and planar distance layers on a 3D globe (or a projection that mimics the view of the 3D earth as viewed from space, centered on the mid-point of the geodesic line segment). So if a geodesic measurement is more precise than a planar measurement, why not perform all spatial operations using geodesic geometry? In many cases, a geodesic approach to spatial operations can be perfectly acceptable and is even encouraged. The downside is its computational cost. It’s far more computationally efficient to compute area/distance on a plane than it is on a spheroid. This is because geodesic calculations have no simple algebraic solutions and involve approximations that may require iterative solutions. So this can be a computationally taxing approach when processing millions of line segments. Note that not all geodesic measurement implementations are equal. Some more efficient algorithms that minimize computation time may reduce precision in the process. Some of ArcGIS’s functions offer the option to compute geodesic distances and areas. The data analysis environment R has several packages that will compute geodesic measurements, including geosphere (which implements well defined geodesic measurement algorithms adopted from the authoritative GeographicLib libraries), lwgeom, and s2, an implementation of Google’s spherical measurement library.
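For instance, a quick comparison of geodesic and planar distance measurements might be sketched as follows. The two point locations are hypothetical, and the Mercator projection is chosen purely for illustration.
library(sf)
library(geosphere)
# Two hypothetical points on either side of the Atlantic (longitude/latitude)
p1 <- c(-70.0, 43.0)   # off the coast of Maine
p2 <- c( -1.5, 50.7)   # off the south coast of England
# Geodesic distance measured on the WGS84 ellipsoid (meters)
distGeo(p1, p2)
# Planar distance measured on a Mercator projected coordinate system (meters)
pts <- st_sfc(st_point(p1), st_point(p2), crs = 4326)
xy  <- st_coordinates(st_transform(pts, "+proj=merc"))
sqrt(sum((xy[1, ] - xy[2, ])^2))
# At these latitudes the planar value overestimates the true (geodesic) distance.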
"],["chp10_0.html", "Chapter 10 Map Algebra 10.1 Local operations and functions 10.2 Focal operations and functions 10.3 Zonal operations and functions 10.4 Global operations and functions 10.5 Operators and functions", " Chapter 10 Map Algebra Dana Tomlin (Tomlin 1990) is credited with defining a framework for the analysis of field data stored as gridded values (i.e. rasters). He coined this framework map algebra. Though gridded data can be stored in a vector format, map algebra is usually performed on raster data. Map algebra operations and functions are broken down into four groups: local, focal, zonal and global. Each is explored in the following sections. 10.1 Local operations and functions Local operations and functions are applied to each individual cell and only involve those cells sharing the same location. For example, if we start off with an original raster, then multiply it by 2 then add 1, we get a new raster whose cell values reflect the series of operations performed on the original raster cells. This is an example of a unary operation where just one single raster is involved in the operation. Figure 10.1: Example of a local operation where output=(2 * raster + 1). More than one raster can be involved in a local operation. For example, two rasters can be summed (i.e. each overlapping pixels are summed) to generate a new raster. Figure 10.2: Example of a local operation where output=(raster1+raster2). Note how each cell output only involves input raster cells that share the same exact location. Local operations also include reclassification of values. This is where a range of values from the input raster are assigned a new (common) value. For example, we might want to reclassify the input raster values as follows: Original values Reclassified values 0-25 25 26-50 50 51-75 75 76-100 100 Figure 10.3: Example of a local operation where the output results from the reclassification of input values. 10.2 Focal operations and functions Focal operations are also referred to as neighborhood operations. Focal operations assign to the output cells some summary value (such as the mean) of the neighboring cells from the input raster. For example, a cell output value can be the average of all 9 neighboring input cells (including the center cell); this acts as a smoothing function. Figure 10.4: Example of a focal operation where the output cell values take on the average value of neighboring cells from the input raster. Focal cells surrounded by non-existent cells are assigned an NA in this example. Notice how, in the above example, the edge cells from the output raster have been assigned a value of NA (No Data). This is because cells outside of the extent have no value. Some GIS applications will ignore the missing surrounding values and just compute the average of the available cells as demonstrated in the next example. Figure 10.5: Example of a focal operation where the output cell values take on the average value of neighboring cells from the input raster. Surrounding non-existent cells are ignored. Focal (or neighbor) operations require that a window region (a kernel) be defined. In the above examples, a simple 3 by 3 kernel (or window) was used in the focal operations. The kernel can take on different dimensions and shape such as a 3 by 3 square where the central pixel is ignored (thus reducing the number of neighbors to 8) or a circular neighbor defined by a radius. 
Figure 10.6: Example of a focal operation where the kernel is defined by a 3 by 3 cell without the center cell and whose output cell takes on the average value of those neighboring cells. In addition to defining the neighborhood shape and dimension, a kernel also defines the weight each neighboring cell contributes to the summary statistic. For example, all cells in a 3 by 3 neighbor could each contribute 1/9th of their value to the summarized value (i.e. equal weight). But the weight can take on a more complex form defined by a function; such weights are defined by a kernel function. One popular function is a Gaussian weighted function which assigns greater weight to nearby cells than those further away. Figure 10.7: Example of a focal operation where the kernel is defined by a Gaussian function whereby the closest cells are assigned a greater weight. 10.3 Zonal operations and functions A zonal operation computes a new summary value (such as the mean) from cells aggregated for some zonal unit. In the following example, the cell values from the raster layer are aggregated into three zones whose boundaries are delineated in red. Each output zone shows the average value of the cells within that zone. Figure 10.8: Example of a zonal operation where the cell values are averaged for each of the three zones delineated in red. This technique is often used with rasters derived from remotely sensed imagery (e.g. NDVI) where areal units (such as counties or states) are used to compute the average cell values from the raster. 10.4 Global operations and functions Global operations and functions may make use of some or all input cells when computing an output cell value. An example of a global function is the Euclidean Distance tool which computes the shortest distance between a pixel and a source (or destination) location. In the following example, a new raster assigns to each cell a distance value to the closest cell having a value of 1 (there are just two such cells in the input raster). Figure 10.9: Example of a global function: the Euclidean distance. Each pixel is assigned its closest distance to one of the two source locations (defined in the input layer). Global operations and functions can also generate single value outputs such as the overall pixel mean or standard deviation. Another popular use of global functions is in the mapping of least-cost paths where a cost surface raster is used to identify the shortest path between two locations which minimizes cost (in time or money). 10.5 Operators and functions Operations and functions applied to gridded data can be broken down into three groups: mathematical, logical comparison and Boolean. 10.5.1 Mathematical operators and functions Two mathematical operators have already been demonstrated in earlier sections: the multiplier and the addition operators. Other operators include division and the modulo (aka the modulus) which is the remainder of a division. Mathematical functions can also be applied to gridded data manipulation. Examples are square root and sine functions. The following table showcases a few examples with ArcGIS and R syntax. Operation ArcGIS Syntax R Syntax Example Addition + + R1 + R2 Subtraction - - R1 - R2 Division / / R1 / R2 Modulo Mod() %% Mod(R1, 100), R1 %% 10 Square root SquareRoot() sqrt() SquareRoot(R1), sqrt(R1) 10.5.2 Logical comparison The logical comparison operators evaluate a condition then output a value of 1 if the condition is true and 0 if the condition is false. 
Logical comparison operators consist of greater than, less than, equal and not equal. Logical comparison Syntax Greater than > Less than < Equal == Not equal != For example, the following figure shows the output of the comparison between two rasters where we are assessing if cells in R1 are greater than those in R2 (on a cell-by-cell basis). Figure 10.10: Output of the operation R1 > R2. A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. When assessing whether two cells are equal, some programming environments such as R and ArcMap’s Raster Calculator require the use of the double equality syntax, ==, as in R1 == R2. In these programming environments, the single equality syntax is usually interpreted as an assignment operator, so R1 = R2 would instruct the computer to assign the cell values in R2 to R1 (which is not what we want to do here). Some applications make use of special functions to test a condition. For example, ArcMap has a function called Con(condition, out1, out2) which assigns the value out1 if the condition is met and a value of out2 if it’s not. For example, ArcMap’s raster calculator expression Con( R1 > R2, 1, 0) outputs a value of 1 if R1 is greater than R2 and 0 if not. It generates the same output as the one shown in the above figure. Note that in most programming environments (including ArcMap), the expression R1 > R2 produces the same output because the value 1 is the numeric representation of TRUE and 0 that of FALSE. 10.5.3 Boolean (or Logical) operators In map algebra, Boolean operators are used to compare conditional states of a cell (i.e. TRUE or FALSE). The three Boolean operators are AND, OR and NOT. Boolean ArcGIS R Example AND & & R1 & R2 OR | | R1 | R2 NOT ~ ! ~R2, !R2 A “TRUE” state is usually encoded as a 1 or any non-zero integer while a “FALSE” state is usually encoded as a 0. For example, if cell1=0 and cell2=1, the Boolean operation cell1 AND cell2 results in a FALSE (or 0) output cell value. This Boolean operation can be translated into plain English as “are the cells 1 and 2 both TRUE?” to which we answer “No, they are not” (cell1 is FALSE). The OR operator can be interpreted as “is x or y TRUE?” so that cell1 OR cell2 would return TRUE. The NOT operator can be interpreted as “is x not TRUE?” so that NOT cell1 would return TRUE. Figure 10.11: Output of the operation R1 AND R2. A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. Note that many programming environments treat any non-zero value as TRUE, so that -3 AND -4 will return TRUE. Figure 10.12: Output of the operation NOT R2. A value of 1 in the output raster indicates that the input cell is NOT TRUE (i.e. has a value of 0). 10.5.4 Combining operations Both comparison and Boolean operations can be combined into a single expression. For example, we may wish to find locations (cells) that satisfy requirements from two different raster layers: e.g. 0<R1<4 AND R2>0. To satisfy the first requirement, we can write out the expression as (R1>0) & (R1<4). Both comparisons (delimited by parentheses) return a 0 (FALSE) or a 1 (TRUE). The ampersand, &, is a Boolean operator that checks that both conditions are met and returns a 1 if yes or a 0 if not. This expression is then combined with another comparison using another ampersand operator that assesses the criterion R2>0. The amalgamated expression is thus ((R1>0) & (R1<4)) & (R2>0).
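A sketch of how these comparison and Boolean operations might be written in R with the terra package (here R1 and R2 are small made-up rasters):
library(terra)
set.seed(1)
R1 <- rast(nrows = 4, ncols = 4, vals = sample(0:5, 16, replace = TRUE))
R2 <- rast(nrows = 4, ncols = 4, vals = sample(0:5, 16, replace = TRUE))
# Logical comparison: 1 where the condition is true, 0 where it is false
gt <- R1 > R2
# Boolean operators
and_r <- R1 & R2
not_r <- !R2
# Combining comparison and Boolean operations
out <- (R1 > 0) & (R1 < 4) & (R2 > 0)
# An analogue of ArcMap's Con(condition, out1, out2)
con_r <- ifel(R1 > R2, 1, 0)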
Figure 10.13: Output of the operation ((R1>0) & (R1<4)) & (R2>0). A value of 1 in the output raster indicates that the condition is true and a value of 0 indicates that the condition is false. Note that most software environments assign the ampersand character, &, to the AND Boolean operator. References "],["chp11_0.html", "Chapter 11 Point Pattern Analysis 11.1 Centrography 11.2 Density based analysis 11.3 Distance based analysis 11.4 First and second order effects", " Chapter 11 Point Pattern Analysis 11.1 Centrography A very basic form of point pattern analysis involves summary statistics such as the mean center, standard distance and standard deviational ellipse. These point pattern analysis techniques were popular before computers were ubiquitous since the hand calculations are not too involved, but such summary statistics are overly concise and hide much of the valuable information contained in the observed pattern. More powerful analysis methods can be used to explore point patterns. These methods can be classified into two groups: density based approaches and distance based approaches. 11.2 Density based analysis Density based techniques characterize the pattern in terms of its distribution vis-a-vis the study area–a first-order property of the pattern. A first order property of a pattern concerns itself with the variation of the observations’ density across a study area. For example, the distribution of oaks will vary across a landscape based on underlying soil characteristics (resulting in some areas having dense clusters of oaks and others having few or none). In these lecture notes, we’ll make a distinction between the intensity of a spatial process and the observed density of a pattern under study. A point pattern can be thought of as a “realization” of an underlying process whose intensity \\(\\lambda\\) is estimated from the observed point pattern’s density (which is sometimes denoted as \\(\\widehat{\\lambda}\\) where the caret \\(\\verb!^!\\) indicates that the observed density is an estimate of the underlying process’ intensity). Density measurements can be broken down into two categories: global and local. 11.2.1 Global density A basic measure of a pattern’s density \\(\\widehat{\\lambda}\\) is its overall, or global, density. This is simply the ratio of observed number of points, \\(n\\), to the study region’s surface area, \\(a\\), or: \\(\\begin{equation} \\widehat{\\lambda} = \\frac{n}{a} \\label{eq:global-density} \\end{equation}\\) Figure 11.1: An example of a point pattern where n = 31 and the study area (defined by a 10-by-10 unit square boundary) covers 100 square units. The point density is thus 31/100 = 0.31 points per unit area. 11.2.2 Local density A point pattern’s density can be measured at different locations within the study area. Such an approach helps us assess if the density–and, by extension, the underlying process’ local (modeled) intensity \\(\\widehat{\\lambda}_i\\)–is constant across the study area. This can be an important property of the data since it may need to be accounted for when using the distance based analysis tools covered later in this chapter. Several techniques for measuring local density are available; here we will focus on two such methods: quadrat density and kernel density. 11.2.2.1 Quadrat density This technique requires that the study area be divided into sub-regions (aka quadrats). Then, the point density is computed for each quadrat by dividing the number of points in each quadrat by the quadrat’s area.
Quadrats can take on many different shapes such as hexagons and triangles, here we use square shaped quadrats to demonstrate the procedure. Figure 11.2: An example of a quadrat count where the study area is divided into four equally sized quadrats whose area is 25 square units each. The density in each quadrat can be computed by dividing the number of points in each quadrat by that quadrat’s area. The choice of quadrat numbers and quadrat shape can influence the measure of local density and must be chosen with care. If very small quadrat sizes are used you risk having many quadrats with no points which may prove uninformative. If very large quadrat sizes are used, you risk missing subtle changes in spatial density distributions such as the east-west gradient in density values in the above example. Quadrat regions do not have to take on a uniform pattern across the study area, they can also be defined based on a covariate. For example, if it’s believed that the underlying point pattern process is driven by elevation, quadrats can be defined by sub-regions such as different ranges of elevation values (labeled 1 through 4 on the right-hand plot in the following example). This can result in quadrats having non-uniform shape and area. Converting a continuous field into discretized areas is sometimes referred to as tessellation. The end product is a tessellated surface. Figure 11.3: Example of a covariate. Figure on the left shows the elevation map. Figure on the right shows elevation broken down into four sub-regions (a tessellated surface) for which local density values will be computed. If the local intensity changes across the tessellated covariate, then there is evidence of a dependence between the process that generated the point pattern and the covariate. In our example, sub-regions 1 through 4 have surface areas of 23.54, 25.2, 25.21, 26.06 map units respectively. To compute these regions’ point densities, we simply divide the number of points by the respective area values. Figure 11.4: Figure on the left displays the number of points in each elevation sub-region (sub-regions are coded as values ranging from 1 to 4). Figure on the right shows the density of points (number of points divided by the area of the sub-region). We can plot the relationship between point density and elevation regions to help assess any dependence between the variables. Figure 11.5: Plot of point density vs elevation regions. It’s important to note that how one chooses to tessellate a surface can have an influence on the resulting density distribution. For example, dividing the elevation into seven sub-regions produces the following density values. Figure 11.6: Same analysis as last figure using different sub-regions. Note the difference in density distribution. While the high density in the western part of the study area remains, the density values to the east are no longer consistent across the other three regions. The quadrat analysis approach has its advantages in that it is easy to compute and interpret however, it does suffer from the modifiable areal unit problem (MAUP) as highlighted in the last two examples. Another density based approach that will be explored next (and that is less susceptible to the MAUP) is the kernel density. 
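A sketch of the quadrat count and tessellated-covariate approaches in R using the spatstat package follows; since the chapter’s own data are not shown here, spatstat’s built-in bei point pattern and its elevation covariate are used as stand-ins.
library(spatstat)
# Using spatstat's built-in 'bei' dataset (tree locations) and its elevation
# covariate as stand-ins for the chapter's point pattern and elevation layer
p    <- bei
elev <- bei.extra$elev
# Global density: number of points divided by the study area
summary(p)$intensity
# Quadrat counts and densities over a 2 x 2 grid of square quadrats
q <- quadratcount(p, nx = 2, ny = 2)
intensity(q)          # points per unit area in each quadrat
# Quadrats defined by a covariate: cut elevation into 4 classes, build a
# tessellated surface, then compute the point density within each zone
elev4 <- cut(elev, 4, labels = 1:4)
q_cov <- quadratcount(p, tess = tess(image = elev4))
intensity(q_cov)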
11.2.2.2 Kernel density The kernel density approach is an extension of the quadrat method: Like the quadrat density, the kernel approach computes a localized density for subsets of the study area, but unlike its quadrat density counterpart, the sub-regions overlap one another providing a moving sub-region window. This moving window is defined by a kernel. The kernel density approach generates a grid of density values whose cell size is smaller than that of the kernel window. Each cell is assigned the density value computed for the kernel window centered on that cell. A kernel not only defines the shape and size of the window, but it can also weight the points following a well defined kernel function. The simplest function is a basic kernel where each point in the kernel window is assigned equal weight. Figure 11.7: An example of a basic 3x3 kernel density map (ArcGIS calls this a point density map) where each point is assigned an equal weight. For example, the cell centered at location x=1.5 and y =7.5 has one point within a 3x3 unit (pixel) region and thus has a local density of 1/9 = 0.11. Some of the most popular kernel functions assign weights to points that are inversely proportional to their distances to the kernel window center. A few such kernel functions follow a gaussian or quartic like distribution function. These functions tend to produce a smoother density map. Figure 11.8: An example of a kernel function is the 3x3 quartic kernel function where each point in the kernel window is weighted based on its proximity to the kernel’s center cell (typically, closer points are weighted more heavily). Kernel functions, like the quartic, tend to generate smoother surfaces. 11.2.2.3 Kernel Density Adjusted for Covariate In the previous section, we learned that we could use a covariate, like elevation, to define the sub-regions (quadrats) within which densities were computed. Here, instead of dividing the study region into discrete sub-regions (as was done with quadrat analysis), we create an intensity function that is dependent on the underlying covariate. This function, which we’ll denote as \\(\\rho\\), can be estimated in one of three different ways– by ratio, re-weight and transform methods. We will not delve into the differences between these methods, but note that there is more than one way to estimate \\(\\rho\\) in the presence of a covariate. In the following example, the elevation raster is used as the covariate in the \\(\\rho\\) function using the ratio method. The right-most plot maps the modeled intensity as a function of elevation. Figure 11.9: An estimate of \\(\\rho\\) using the ratio method. The figure on the left shows the point distribution superimposed on the elevation layer. The middle figure plots the estimated \\(\\rho\\) as a function of elevation. The envelope shows the 95% confidence interval. The figure on the right shows the modeled density of \\(\\widehat{\\lambda}\\) which is a function of the elevation raster (i.e. \\(\\widehat{\\lambda}=\\widehat{\\rho}_{elevation}\\)). We can compare the modeled intensity function to the kernel density function of the observed point pattern via a scatter plot. A red one-to-one diagonal line is added to the plot. While an increase in predicted intensity is accompanied with increasing observed intensity, the relationship is not linear. 
This can be explained by the small area covered by these high elevation locations, which results in fewer observation opportunities and thus higher uncertainty for that corner of the study extent. This uncertainty is very apparent in the \\(\\rho\\) vs. elevation plot where the 95% confidence interval envelope widens at higher elevation values (indicating the greater uncertainty in our estimated \\(\\rho\\) value at those higher elevation values). 11.2.3 Modeling intensity as a function of a covariate So far, we have learned techniques that describe the distribution of points across a region of interest. But it is often more interesting to model the relationship between the distribution of points and some underlying covariate by defining that relationship mathematically. This can be done by exploring the changes in point density as a function of a covariate; however, unlike the techniques explored thus far, this approach makes use of a statistical model. One such model is a Poisson point process model which can take on the form of: \\[ \\begin{equation} \\lambda(i) = e^{\\alpha + \\beta Z(i)} \\label{eq:density-covariate} \\end{equation} \\] where \\(\\lambda(i)\\) is the modeled intensity at location \\(i\\), \\(e^{\\alpha}\\) (the exponential of \\(\\alpha\\)) is the base intensity when the covariate is zero and \\(e^{\\beta}\\) is the multiplier by which the intensity increases (or decreases) for each 1 unit increase in the covariate \\(Z(i)\\). This model is closely related to the logistic regression model popular in the field of statistics. This equation implies that the intensity of the process that led to the observed point pattern is a loglinear function of the underlying covariate (i.e. one where the process’ intensity is exponentially increasing or decreasing as a function of the covariate). Note that taking the log of both sides of the equation yields the more familiar linear regression model where \\(\\alpha + \\beta Z(i)\\) is the linear predictor. Note: The left-hand side of a logistic regression model is often presented as the probability, \\(P\\), of occurrence and is related to \\(\\lambda\\) as \\(\\lambda=P/(1-P)\\) which is the odds of occurrence. Solving for \\(P\\) gives us \\(P = \\lambda/(1 + \\lambda)\\) which yields the following equation: \\[ P(i) = \\frac{e^{\\alpha + \\beta Z(i)}}{1 + e^{\\alpha + \\beta Z(i)}} \\] Let’s work with the point distribution of Starbucks cafes in the state of Massachusetts. The point pattern clearly exhibits a non-random distribution. It might be helpful to compare this distribution to some underlying covariate such as the population density distribution. Figure 11.10: Location of Starbucks relative to population density. Note that the classification scheme follows a log scale to more easily differentiate population density values. We can fit a Poisson point process model to these data where the modeled intensity takes on the form: \\[ \\begin{equation} Starbucks\\ density(i) = e^{\\alpha + \\beta\\ population(i)} \\label{eq:walmart-model} \\end{equation} \\] The parameters \\(\\alpha\\) and \\(\\beta\\) are estimated using a method called maximum likelihood. Its implementation is not covered here but is widely covered in many statistics textbooks. The index \\((i)\\) serves as a reminder that the point density and the population distribution both can vary as a function of location \\(i\\). The estimated value for \\(\\alpha\\) in our example is -18.966.
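A sketch of how such a Poisson point process model might be fit in R with the spatstat package (P would be a ppp object of the cafe locations and pop an im object of population density; both names are hypothetical):
library(spatstat)
# P:   a ppp object of the cafe locations (hypothetical name)
# pop: an im object of population density covering P's window (hypothetical name)
fit <- ppm(P ~ pop)          # log-intensity modeled as alpha + beta * pop
coef(fit)                    # estimates of alpha (intercept) and beta (pop coefficient)
lambda_hat <- predict(fit)   # modeled intensity surface, lambda(i)
The fitted coefficients correspond to the \\(\\alpha\\) and \\(\\beta\\) estimates discussed next.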
This is interpreted as stating that given a population density of zero, the base intensity of the point process is \\(e^{-18.966}\\) or 5.79657e-09 cafes per square meter (the units are derived from the point’s reference system)–a number close to zero (as one would expect). The estimated value for \\(\\beta\\) is 0.00017. This is interpreted as stating that for every unit increase in the population density derived from the raster, the intensity of the point process increases by a factor of \\(e^{0.00017}\\) or 1.00017. If we are to plot the relationship between density and population, we get: Figure 11.11: Poisson point process model fitted to the relationship between Starbucks store locations and population density. The model assumes a loglinear relationship. Note that the density is reported in number of stores per map unit area (the map units are in meters). 11.3 Distance based analysis An alternative to the density based methods explored thus far is the set of distance based methods for pattern analysis, whereby the interest lies in how the points are distributed relative to one another (a second-order property of the point pattern) as opposed to how the points are distributed relative to the study extent. A second order property of a pattern concerns itself with the observations’ influence on one another. For example, the distribution of oaks will be influenced by the location of parent trees–where parent oaks are present we would expect dense clusters of oaks to emerge. Three distance based approaches are covered next: The average nearest neighbor (ANN), the K and L functions, and the pair correlation function. 11.3.1 Average Nearest Neighbor An average nearest neighbor (ANN) analysis measures the average distance from each point in the study area to its nearest point. In the following example, the average nearest neighbor for all points is 1.52 units. Figure 11.12: Distance between each point and its closest point. For example, the point closest to point 1 is point 9 which is 2.32 map units away. An extension of this idea is to plot the ANN values for different order neighbors, that is, for the first closest point, then the second closest point, and so forth. Figure 11.13: ANN values for different neighbor order numbers. For example, the ANN for the first closest neighbor is 1.52 units; the ANN for the 2nd closest neighbor is 2.14 map units; and so forth. The shape of the ANN curve as a function of neighbor order can provide insight into the spatial arrangement of points relative to one another. In the following example, three different point patterns of 20 points are presented. Figure 11.14: Three different point patterns: a single cluster, a dual cluster and a randomly scattered pattern. Each point pattern offers different ANN vs. neighbor order plots. Figure 11.15: Three different ANN vs. neighbor order plots. The black ANN line is for the first point pattern (single cluster); the blue line is for the second point pattern (double cluster) and the red line is for the third point pattern (randomly scattered). The bottom line (black dotted line) indicates that the cluster (left plot) is tight and that the distances between a point and all other points are very short. This is in stark contrast with the top line (red dotted line), which indicates that the distances between points are much greater. Note that the way we describe these patterns is heavily influenced by the size and shape of the study region.
If the region was defined as the smallest rectangle encompassing the cluster of points, the cluster of points would no longer look clustered. Figure 11.16: The same point pattern presented with two different study areas. How differently would you describe the point pattern in both cases? An important assumption that underlies our interpretation of the ANN results is that of stationarity of the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). If the point pattern is not stationary, then it will be difficult to assess if the results from the ANN analysis are due to interactions between the points or due to changes in some underlying factor that changes as a function of location. Correcting for lack of stationarity when performing hypothesis tests is described in the next chapter. 11.3.2 K and L functions 11.3.2.1 K function The average nearest neighbor (ANN) statistic is one of many distance based point pattern analysis statistics. Another statistic is the K-function which summarizes the distance between points for all distances. The calculation of K is fairly simple: it consists of dividing the mean of the sum of the number of points at different distance lags for each point by the area event density. For example, for point \\(S1\\) we draw circles, each of varying radius \\(d\\), centered on that point. We then count the number of points (events) inside each circle. We repeat this for point \\(S2\\) and all other points \\(Si\\). Next, we compute the average number of points in each circle then divide that number by the overall point density \\(\\hat{\\lambda}\\) (i.e. total number of events per study area). Distance band (km) # events from S1 # events from S2 # events from Si K 10 0 1 … 0.012 20 3 5 … 0.067 30 9 14 … 0.153 40 17 17 … 0.269 50 25 23 … 0.419 We can then plot K and compare that plot to a plot we would expect to get if an IRP/CSR process was at play (Kexpected). Figure 11.17: The K-function calculated from the Walmart stores point distribution in MA (shown in black) compared to\\(K_{expected}\\) under the IRP/CSR assumption (shown in red). \\(K\\) values greater than \\(K_{expected}\\) indicate clustering of points at a given distance band; K values less than \\(K_{expected}\\) indicate dispersion of points at a given distance band. In our example, the stores appear to be more clustered than expected at distances greater than 12 km. Note that like the ANN analysis, the \\(K\\)-function assumes stationarity in the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). 11.3.2.2 L function One problem with the \\(K\\) function is that the shape of the function tends to curve upward making it difficult to see small differences between \\(K\\) and \\(K_{expected}\\). A workaround is to transform the values in such a way that the expected values, \\(K_{expected}\\), lie horizontal. The transformation is calculated as follows: \\[ \\begin{equation} L=\\sqrt{\\dfrac{K(d)}{\\pi}}-d \\label{eq:L-function} \\end{equation} \\] The \\(\\hat{K}\\) computed earlier is transformed to the following plot (note how the \\(K_{expected}\\) red line is now perfectly horizontal): Figure 11.18: L-function (a simple transformation of the K-function). This graph makes it easier to compare \\(K\\) with \\(K_{expected}\\) at lower distance values. Values greater than \\(0\\) indicate clustering, while values less than \\(0\\) indicate dispersion. 
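In R, the distance based statistics described above might be computed with the spatstat package roughly as follows (walmart is a hypothetical ppp object of the store locations):
library(spatstat)
# walmart: a ppp object of store locations (hypothetical name)
mean(nndist(walmart))                                 # average nearest neighbor (ANN)
ann_by_order <- colMeans(nndist(walmart, k = 1:10))   # ANN by neighbor order
K <- Kest(walmart)   # K-function
L <- Lest(walmart)   # L-function; plot(L, . - r ~ r) makes the CSR line horizontal
g <- pcf(walmart)    # pair correlation function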
It appears that Walmart locations are more dispersed than expected under CSR/IRP up to a distance of 12 km but more clustered at distances greater than 12 km. 11.3.3 The Pair Correlation Function \\(g\\) A shortcoming of the \\(K\\) function (and by extension the \\(L\\) function) is its cumulative nature, which makes it difficult to know at exactly which distances a point pattern may stray from \\(K_{expected}\\), since all points up to distance \\(r\\) can contribute to \\(K(r)\\). The pair correlation function, \\(g\\), is a modified version of the \\(K\\) function where instead of summing all points within a distance \\(r\\), points falling within a narrow distance band are summed instead. Figure 11.19: Difference in how the \\(K\\) and \\(g\\) functions aggregate points at distance \\(r\\) (\\(r\\) = 30 km in this example). All points up to \\(r\\) contribute to \\(K\\) whereas just the points in the annulus band at \\(r\\) contribute to \\(g\\). The plot of the \\(g\\) function follows. Figure 11.20: \\(g\\)-function of the Massachusetts Walmart point data. Its interpretation is similar to that of the \\(K\\) and \\(L\\) functions. Here, we observe distances between stores greater than expected under CSR up to about 5 km. Note that this cutoff is less than the 12 km cutoff observed with the \\(K\\)/\\(L\\) functions. If \\(g(r)\\) = 1, then the inter-point distances (at and around distance \\(r\\)) are consistent with CSR. If \\(g(r)\\) > 1, then the points are more clustered than expected under CSR. If \\(g(r)\\) < 1, then the points are more dispersed than expected under CSR. Note that \\(g\\) can never be less than 0. Like its \\(K\\) and ANN counterparts, the \\(g\\)-function assumes stationarity in the underlying point process (i.e. that there is no overall drift or trend in the process’ intensity). 11.4 First and second order effects The concept of 1st order effects and 2nd order effects is an important one. It underlies the basic principles of spatial analysis. Figure 11.21: Tree distribution can be influenced by 1st order effects such as elevation gradient or spatial distribution of soil characteristics; this, in turn, changes the tree density distribution across the study area. Tree distribution can also be influenced by 2nd order effects such as seed dispersal processes where the process is independent of location and, instead, dependent on the presence of other trees. Density based measurements such as kernel density estimations look at the 1st order property of the underlying process. Distance based measurements such as ANN and K-functions focus on the 2nd order property of the underlying process. It’s important to note that it is seldom feasible to separate out the two effects when analyzing point patterns; hence the importance of relying on a priori knowledge of the phenomenon being investigated before drawing any conclusions from the analysis results. "],["hypothesis-testing.html", "Chapter 12 Hypothesis testing 12.1 IRP/CSR 12.2 Testing for CSR with the ANN tool 12.3 Alternatives to CSR/IRP 12.4 Monte Carlo test with K and L functions 12.5 Testing for a covariate effect", " Chapter 12 Hypothesis testing 12.1 IRP/CSR Figure 12.1: Could the distribution of Walmart stores in MA have been the result of a CSR/IRP process? Popular spatial analysis techniques compare observed point patterns to ones generated by an independent random process (IRP), also called complete spatial randomness (CSR).
A CSR/IRP process satisfies two conditions: Any event has equal probability of occurring in any location, a 1st order effect. The location of one event is independent of the location of another event, a 2nd order effect. In the next section, you will learn how to test for complete spatial randomness. In later sections, you will also learn how to test for other non-CSR processes. 12.2 Testing for CSR with the ANN tool 12.2.1 ArcGIS’ Average Nearest Neighbor Tool ArcMap offers a tool (ANN) that tests whether or not the observed first order nearest neighbor distance is consistent with a distribution of points one would expect to observe if the underlying process were completely random (i.e. IRP). But as we will learn very shortly, ArcMap’s ANN tool has its limitations. 12.2.1.1 A first attempt Figure 12.2: ArcGIS’ ANN tool. The size of the study area is not defined in this example. ArcGIS’ average nearest neighbor (ANN) tool computes the 1st nearest neighbor mean distance for all points. It also computes an expected mean distance (ANNexpected) under the assumption that the process that led to the observed pattern is completely random. ArcGIS’ ANN tool offers the option to specify the study surface area. If the area is not explicitly defined, ArcGIS will assume that the area is defined by the smallest rectangle encompassing the points (the point layer extent). ArcGIS’ ANN analysis outputs the nearest neighbor ratio computed as: \\[ ANN_{ratio}=\\dfrac{ANN}{ANN_{expected}} \\] Figure 12.3: ANN results indicating that the pattern is consistent with a random process. Note the size of the study area which defaults to the point layer extent. If ANNratio is 1, the pattern is consistent with a random process. If it’s greater than 1, the pattern is more dispersed than expected; if it’s less than 1, it’s more clustered than expected. In essence, ArcGIS is comparing the observed ANN value to the ANNexpected value one would compute if a complete spatial randomness (CSR) process were at play. ArcGIS’ tool also generates a p-value (telling us how confident we should be that our observed ANN value is consistent with a perfectly random process) along with a bell-shaped curve in the output graphics window. The curve serves as an infographic that tells us if our point distribution is from a random process (CSR), or is more clustered/dispersed than one would expect under CSR. For example, if we were to run the Massachusetts Walmart point location layer through ArcGIS’ ANN tool, an ANNexpected value of 12,249 m would be computed along with an ANNratio of 1.085. The software would also indicate that the observed distribution is consistent with a CSR process (p-value of 0.28). But is it prudent to let the software define the study area for us? How does it know that the area we are interested in is the state of Massachusetts, since this layer is not part of any input parameters? 12.2.1.2 A second attempt Figure 12.4: ArcGIS’ ANN tool. The size of the study area is defined in this example. Here, we explicitly tell ArcGIS that the study area (Massachusetts) covers 21,089,917,382 m² (note that this is the MA shapefile’s surface area and not necessarily representative of MA’s actual surface area). ArcGIS’ ANN tool now returns a different output with a completely different conclusion. This time, the analysis suggests that the points are strongly dispersed across the state of Massachusetts and the very small p-value (p = 0.006) tells us that there is only about a 0.6% chance that a CSR process could have generated our observed point pattern. (Note that the p-value displayed by ArcMap is for a two-sided test).
Figure 12.5: ArcGIS’ ANN tool output. Note the different output result with the study area size defined. The output indicates that the points are more dispersed than expected under IRP. So how does ArcGIS estimate the ANNexpected value under CSR? It does so by taking the inverse of the square root of the number of points divided by the area, and multiplying this quotient by 0.5. \\[ ANN_{Expected}=\\dfrac{0.5}{\\sqrt{n/A}} \\] In other words, the expected ANN value under a CSR process is solely dependent on the number of points and the study extent’s surface area. Do you see a problem here? Could different shapes encompassing the same point pattern have the same surface area? If so, shouldn’t the shape of our study area be a parameter in our ANN analysis? Unfortunately, ArcGIS’ ANN tool cannot take into account the shape of the study area. An alternative work flow is outlined in the next section. 12.2.2 A better approach: a Monte Carlo test The Monte Carlo technique involves three steps: First, we postulate a process–our null hypothesis, \\(Ho\\). For example, we hypothesize that the distribution of Walmart stores is consistent with a completely random process (CSR). Next, we simulate many realizations of our postulated process and compute a statistic (e.g. ANN) for each realization. Finally, we compare our observed data to the patterns generated by our simulated processes and assess (via a measure of probability) if our pattern is a likely realization of the hypothesized process. Following our working example, we randomly re-position the location of our Walmart points 1000 times (or as many times computationally practical) following a completely random process–our hypothesized process, \\(Ho\\)–while making sure to keep the points confined to the study extent (the state of Massachusetts). Figure 12.6: Three different outcomes from simulated patterns following a CSR point process. These maps help answer the question how would Walmart stores be distributed if their locations were not influenced by the location of other stores and by any local factors (such as population density, population income, road locations, etc…) For each realization of our process, we compute an ANN value. Each simulated pattern results in a different ANNexpected value. We plot all ANNexpected values using a histogram (this is our \\(Ho\\) sample distribution), then compare our observed ANN value of 13,294 m to this distribution. Figure 12.7: Histogram of simulated ANN values (from 1000 simulations). This is the sample distribution of the null hypothesis, ANNexpected (under CSR). The red line shows our observed (Walmart) ANN value. About 32% of the simulated values are greater (more extreme) than our observed ANN value. Note that by using the same study region (the state of Massachusetts) in the simulations we take care of problems like study area boundary and shape issues since each simulated point pattern is confined to the exact same study area each and every time. 12.2.2.1 Extracting a \\(p\\)-value from a Monte Carlo test The p-value can be computed from a Monte Carlo test. The procedure is quite simple. It consists of counting the number of simulated test statistic values more extreme than the one observed. 
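A sketch of this Monte Carlo procedure in R using spatstat (walmart is a hypothetical ppp object whose window is the Massachusetts boundary):
library(spatstat)
# walmart: a ppp object whose window is the Massachusetts boundary (hypothetical)
n.sim   <- 999
ann.sim <- numeric(n.sim)
for (i in 1:n.sim) {
  rand.p     <- runifpoint(walmart$n, win = Window(walmart))  # one CSR realization
  ann.sim[i] <- mean(nndist(rand.p))                          # its ANN value
}
ann.obs <- mean(nndist(walmart))
hist(ann.sim); abline(v = ann.obs, col = "red")
# Pseudo p-value: share of simulated values at least as extreme as the observed one
# (the formula is discussed below)
N.greater <- sum(ann.sim > ann.obs)
p <- min(N.greater + 1, n.sim + 1 - N.greater) / (n.sim + 1)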
If we are interested in knowing the probability of having simulated values more extreme than ours, we identify the side of the distribution of simulated values closest to our observed statistic, count the number of simulated values more extreme than the observed statistic then compute \\(p\\) as follows: \\[ \\dfrac{N_{extreme}+1}{N+1} \\] where Nextreme is the number of simulated values more extreme than our observed statistic and N is the total number of simulations. Note that this is for a one-sided test. A practical and more generalized form of the equation looks like this: \\[ \\dfrac{min(N_{greater}+1 , N + 1 - N_{greater})}{N+1} \\] where \\(min(N_{greater}+1 , N + 1 - N_{greater})\\) is the smallest of the two values \\(N_{greater}+1\\) and \\(N + 1 - N_{greater}\\), and \\(N_{greater}\\) is the number of simulated values greater than the observed value. It’s best to implement this form of the equation in a scripting program thus avoiding the need to visually seek the side of the distribution closest to our observed statistic. For example, if we ran 1000 simulations in our ANN analysis and found that 319 of those were more extreme (on the right side of the simulated ANN distribution) than our observed ANN value, our p-value would be (319 + 1) / (1000 + 1) or p = 0.32. This is interpreted as “there is a 32% probability that we would be wrong in rejecting the null hypothesis Ho.” This suggests that we would be remiss in rejecting the null hypothesis that a CSR process could have generated our observed Walmart point distribution. But this is not to say that the Walmart stores were in fact placed across the state of Massachusetts randomly (it’s doubtful that Walmart executives make such an important decision purely by chance), all we are saying is that a CSR process could have been one of many processes that generated the observed point pattern. If a two-sided test is desired, then the equation for the \\(p\\) value takes on the following form: \\[ 2 \\times \\dfrac{min(N_{greater}+1 , N + 1 - N_{greater})}{N+1} \\] where we are simply multiplying the one-sided p-value by two. 12.3 Alternatives to CSR/IRP Figure 12.8: Walmart store distribution shown on top of a population density layer. Could population density distribution explain the distribution of Walmart stores? The assumption of CSR is a good starting point, but it’s often unrealistic. Most real-world processes exhibit 1st and/or 2nd order effects. We therefore may need to correct for a non-stationary underlying process. We can simulate the placement of Walmart stores using the population density layer as our inhomogeneous point process. We can test this hypothesis by generating random points that follow the population density distribution. Figure 12.9: Examples of two randomly generated point patterns using population density as the underlying process. Note that even though we are not referring to a CSR/IRP point process, we are still treating this as a random point process since the points are randomly located following the underlying population density distribution. Using the same Monte Carlo (MC) techniques used with IRP/CSR processes, we can simulate thousands of point patterns (following the population density) and compare our observed ANN value to those computed from our MC simulations. Figure 12.10: Histogram showing the distribution of ANN values one would expect to get if population density distribution were to influence the placement of Walmart stores. 
In this example, our observed ANN value falls far to the right of our simulated ANN values, indicating that our points are more dispersed than would be expected had population density distribution been the sole driving process. The percentage of simulated values more extreme than our observed value is 0% (i.e. a p-value \\(\\backsimeq\\) 0.0). Another plausible hypothesis is that median household income could have been the sole factor in deciding where to place the Walmart stores. Figure 12.11: Walmart store distribution shown on top of a median income distribution layer. Running an MC simulation using median income distribution as the underlying density layer yields an ANN distribution where about 16% of the simulated values are more extreme than our observed ANN value (i.e. p-value = 0.16): Figure 12.12: Histogram showing the distribution of ANN values one would expect to get if income distribution were to influence the placement of Walmart stores. Note that we now have two competing hypotheses: a CSR/IRP process and a median income distribution process. Neither can be rejected. This serves as a reminder that a hypothesis test cannot tell us if a particular process is the process involved in the generation of our observed point pattern; instead, it tells us that the hypothesized process is one of many plausible processes. It’s important to remember that the ANN tool is a distance based approach to point pattern analysis. Even though we are randomly generating points following some underlying probability distribution map, we are still concerning ourselves with the repulsive/attractive forces that might dictate the placement of Walmarts relative to one another–i.e. we are not addressing the question “can some underlying process explain the X and Y placement of the stores” (addressed in section 12.5). Instead, we are controlling for the 1st order effect defined by population density and income distributions. 12.4 Monte Carlo test with K and L functions MC techniques are not unique to average nearest neighbor analysis. In fact, they can be implemented with many other statistical measures such as the K and L functions. However, unlike the ANN analysis, the K and L functions consist of multiple test statistics (one for each distance \\(r\\)). This results in not one simulated distribution but many (one for each distance \\(r\\)). Typically, these distributions are presented as envelopes superimposed on the estimated \\(K\\) or \\(L\\) functions. However, since we cannot easily display the full distribution at each \\(r\\) interval, we usually limit the envelope to a pre-defined acceptance interval. For example, if we choose a two-sided significance level of 0.05, then we eliminate the smallest and largest 2.5% of the simulated K values computed for each \\(r\\) interval (hence the reason you might sometimes see such envelopes referred to as pointwise envelopes). This tends to generate a saw-tooth-like envelope. Figure 12.13: Simulation results for the IRP/CSR hypothesized process. The gray envelope in the plot covers the 95% acceptance interval. If the observed L lies outside of this envelope at distance \\(r\\), then there is less than a 5% chance that our observed point pattern resulted from the simulated process at that distance.
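Such pointwise envelopes might be generated in R with spatstat’s envelope() function, as sketched below (walmart and pop are again hypothetical ppp and im objects; with 999 simulations, nrank = 25 gives an approximate two-sided 0.05 pointwise envelope):
library(spatstat)
# Pointwise envelope of the L function under the CSR (homogeneous) null
env.csr <- envelope(walmart, Lest, nsim = 999, nrank = 25)
plot(env.csr, . - r ~ r)   # subtract r so the CSR expectation plots as a horizontal line
# Envelope under an inhomogeneous null: simulate from a model whose intensity
# follows the population density covariate
fit.pop <- ppm(walmart ~ pop)
env.pop <- envelope(fit.pop, Lest, nsim = 999, nrank = 25)
plot(env.pop, . - r ~ r)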
The interpretation of these plots is straightforward: if \\(\\hat K\\) or \\(\\hat L\\) lies outside of the envelope at some distance \\(r\\), then this suggests that the point pattern may not be consistent with \\(H_o\\) (the hypothesized process) at distance \\(r\\) at the significance level defined for that envelope (0.05 in this example). One important assumption underlying the K and L functions is that the process is uniform across the region. If there is reason to believe this is not the case, then the K function analysis needs to be controlled for inhomogeneity in the process. For example, we might hypothesize that population density dictates the density distribution of the Walmart stores across the region. We therefore run an MC test by randomly re-assigning Walmart point locations using the population distribution map as the underlying point density distribution (in other words, we expect the MC simulation to locate a greater proportion of the points where population density is the greatest). Figure 12.14: Simulation results for an inhomogeneous hypothesized process. When controlled for population density, the significance test suggests that the inter-distance of Walmarts is more dispersed than expected under the null up to a distance of 30 km. It may be tempting to scan across the plot looking for distances \\(r\\) for which the deviation from the null is significant at a given significance level, then report these findings as such. For example, given the results in the last figure, we would not be justified in stating that the patterns between \\(r\\) distances of 5 and 30 km are more dispersed than expected at the 5% significance level; because the envelope is defined pointwise and we are scanning across many distances, the effective significance level of such a statement is higher than 5%. This problem is referred to as the multiple comparison problem–details of which are not covered here. 12.5 Testing for a covariate effect The last two sections covered distance based approaches to point pattern analysis. In this section, we explore hypothesis testing on a density based approach to point pattern analysis: the Poisson point process model. Any Poisson point process model can be fit to an observed point pattern, but just because we can fit a model does not imply that the model does a good job in explaining the observed pattern. To test how well a model can explain the observed point pattern, we need to compare it to a base model (such as one where we assume that the points are randomly distributed across the study area–i.e. IRP). The latter is defined as the null hypothesis and the former is defined as the alternate hypothesis. For example, we may want to assess if the Poisson point process model that treats the placement of Walmarts as a function of population distribution (the alternate hypothesis) does a better job than the null model that assumes homogeneous intensity (i.e. a Walmart has no preference as to where it is to be placed). This requires that we first derive estimates for both models. A Poisson point process model (of the Walmart point pattern) implemented in statistical software such as R produces the following output for the null model: Stationary Poisson process Fitted to point pattern dataset 'P' Intensity: 2.1276e-09 Estimate S.E. CI95.lo CI95.hi Ztest Zval log(lambda) -19.96827 0.1507557 -20.26375 -19.6728 *** -132.4545 and the following output for the alternate model. Nonstationary Poisson process Fitted to point pattern dataset 'P' Log intensity: ~pop Fitted trend coefficients: (Intercept) pop -2.007063e+01 1.043115e-04 Estimate S.E.
CI95.lo CI95.hi Ztest (Intercept) -2.007063e+01 1.611991e-01 -2.038657e+01 -1.975468e+01 *** pop 1.043115e-04 3.851572e-05 2.882207e-05 1.798009e-04 ** Zval (Intercept) -124.508332 pop 2.708284 Problem: Values of the covariate 'pop' were NA or undefined at 0.7% (4 out of 572) of the quadrature points Thus, the null model (homogeneous intensity) takes on the form: \\[ \\lambda(i) = e^{-19.96} \\] and the alternate model takes on the form: \\[ \\lambda(i) = e^{-20.1 + 1.04\\times 10^{-4}\\, population} \\] The models are then compared using the likelihood ratio test which produces the following output: Npar Df Deviance Pr(>Chi) 5 NA NA NA 6 1 4.253072 0.0391794 The value under the heading Pr(>Chi) is the p-value which gives us the probability we would be wrong in rejecting the null. Here p=0.039 suggests that there is a 3.9% chance that we would be wrong in rejecting the base model in favor of the alternate model–put another way, the alternate model may be an improvement over the null. "],["spatial-autocorrelation.html", "Chapter 13 Spatial Autocorrelation 13.1 Global Moran’s I 13.2 Moran’s I at different lags 13.3 Local Moran’s I 13.4 Moran’s I equation explained", " Chapter 13 Spatial Autocorrelation “The first law of geography: Everything is related to everything else, but near things are more related than distant things.” Waldo R. Tobler (Tobler 1970) Mapped events or entities can have non-spatial information attached to them (some GIS software call these attributes). When mapped, these values often exhibit some degree of spatial relatedness at some scale. This is what Tobler was getting at: the idea that values close to one another tend to be similar. In fact, you will be hard-pressed to find mapped features that do not exhibit some kind of non-random pattern. So how do we model spatial patterns? The approach taken will depend on how one chooses to characterize the underlying process–this can be either a spatial trend model or a spatial clustering/dispersion model. This chapter focuses on the latter. 13.1 Global Moran’s I Though our visual senses can, in some cases, discern clustered regions from non-clustered regions, the distinction may not always be so obvious. We must therefore come up with a quantitative and objective approach to quantifying the degree to which similar features cluster or disperse and where such clustering occurs. One popular measure of spatial autocorrelation is the Moran’s I coefficient. 13.1.1 Computing the Moran’s I Let’s start with a working example: 2020 median per capita income for the state of Maine. Figure 13.1: Map of 2020 median per capita income for Maine counties (USA). It may seem apparent that, when aggregated at the county level, the income distribution appears clustered with high counties surrounded by high counties and low counties surrounded by low counties. But a qualitative description may not be sufficient; we might want to quantify the degree to which similar (or dissimilar) counties are clustered. One measure of this type of relationship is the Moran’s I statistic. The Moran’s I statistic is the correlation coefficient for the relationship between a variable (like income) and its neighboring values. But before we go about computing this correlation, we need to come up with a way to define a neighbor. One approach is to define a neighbor as being any contiguous polygon. For example, the northernmost county (Aroostook) has four contiguous neighbors while the southernmost county (York) has just two contiguous neighbors.
Other neighborhood definitions can include distance bands (e.g. counties within 100 km) and k nearest neighbors (e.g. the 2 closest neighbors). Note that distance bands and k nearest neighbors are usually measured using the polygons’ centroids and not their boundaries. Figure 13.2: Maps show the links between each polygon and their respective neighbor(s) based on the neighborhood definition. A contiguous neighbor is defined as one that shares a boundary or a vertex with the polygon of interest. Orange numbers indicate the number of neighbors for each polygon. Note that the topmost county has no neighbors when a neighborhood definition of a 100 km distance band is used (i.e. no centroids are within a 100 km search radius). Once we’ve defined a neighborhood for our analysis, we identify the neighbors for each polygon in our dataset then summarize the values for each neighborhood cluster (by computing their mean values, for example). This summarized neighborhood value is sometimes referred to as a spatially lagged value (\\(X_{lag}\\)). In our working example, we adopt a contiguity neighborhood and compute the average neighboring income value (\\(Income_{lag}\\)) for each county in our dataset. We then plot \\(Income_{lag}\\) vs. Income for each county. The Moran’s I coefficient between \\(Income_{lag}\\) and Income is nothing more than the slope of the least squares regression line that best fits the points after having equalized the spread between both sets of data. Figure 13.3: Scatter plot of spatially lagged income (neighboring income) vs. each county’s income. If we equalize the spread between both axes (i.e. convert to z-values), the slope of the regression line represents the Moran’s I statistic. If there is no degree of association between Income and \\(Income_{lag}\\), the slope will be close to flat (resulting in a Moran’s I value near 0). In our working example, the slope is far from flat, with a Moran’s I value of 0.28. This raises the question: how significant is this Moran’s I value (i.e. is the computed slope significantly different from 0)? There are two approaches to estimating the significance: an analytical solution and a Monte Carlo solution. The analytical solution makes some restrictive assumptions about the data and thus may not always be reliable. Another approach (and the one favored here) is a Monte Carlo test, which makes no assumptions about the dataset, including the shape and layout of each polygon. 13.1.2 Monte Carlo approach to estimating significance In a Monte Carlo test (a permutation bootstrap test, to be exact), the attribute values are randomly assigned to polygons in the data set and, for each permutation of the attribute values, a Moran’s I value is computed. Figure 13.4: Results from 199 permutations. Plot shows Moran’s I slopes (in gray) computed from each random permutation of income values. The observed Moran’s I slope for the original dataset is shown in red. The output is a sampling distribution of Moran’s I values under the (null) hypothesis that attribute values are randomly distributed across the study area. We then compare our observed Moran’s I value to this sampling distribution. Figure 13.5: Histogram shows the distribution of Moran’s I values for all 199 permutations; red vertical line shows our observed Moran’s I value of 0.28. In our working example, 199 simulations indicate that our observed Moran’s I value of 0.28 is not a value we would expect to compute if the income values were randomly distributed across each county.
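The steps described above can be scripted with the spdep package. The following is a minimal sketch, assuming s is an sf polygon layer of Maine counties with an Income column (both hypothetical names):
library(sf)
library(spdep)
nb <- poly2nb(s, queen = TRUE)        # contiguity neighbors (shared boundary or vertex)
lw <- nb2listw(nb, style = "W")       # row-standardized spatial weights
inc.lag <- lag.listw(lw, s$Income)    # spatially lagged (average neighboring) income
# Moran's I as the slope of the regression of the scaled lag on the scaled values
coef(lm(scale(inc.lag) ~ scale(s$Income)))[2]
# Monte Carlo (permutation) test with 199 simulations
moran.mc(s$Income, lw, nsim = 199)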
A (pseudo) P-value can easily be computed from the simulation results: \\[ \\dfrac{N_{extreme}+1}{N+1} \\] where \\(N_{extreme}\\) is the number of simulated Moran’s I values more extreme than our observed statistic and \\(N\\) is the total number of simulations. Here, out of 199 simulations, just three simulated I values were more extreme than our observed statistic, \\(N_{extreme}\\) = 3, so \\(p\\) is equal to (3 + 1) / (199 + 1) = 0.02. This is interpreted as “there is a 2% probability that we would be wrong in rejecting the null hypothesis Ho.” Note that in this permutation example, we shuffled around the observed income values such that all values were present in each permutation outcome–this is sometimes referred to as a randomization option in a few software implementations of the Moran’s I hypothesis test. Note that here, randomization is not to be confused with the way the permutation technique “randomly” assigns values to features in the data layer. Alternatively, one can choose to randomly assign a set of values to each feature in a data layer from a theorized distribution (for example, a Normal distribution). This may result in a completely different set of values for each permutation outcome. Note that you would only adopt this approach if the theorized distribution underpinning the value of interest is known a priori. Another important consideration when computing a P-value from a permutation test is the number of simulations to perform. In the above example, we ran 199 permutations; thus, the smallest P-value we could possibly come up with is 1 / (199 + 1) or a P-value of 0.005. You should therefore choose a number of permutations, \\(N\\), large enough to ensure a reliable level of significance. 13.2 Moran’s I at different lags So far we have looked at spatial autocorrelation where we define neighbors as all polygons sharing a boundary with the polygon of interest. We may also be interested in studying the ranges of autocorrelation values as a function of distance. The steps for this type of analysis are straightforward: Compute lag values for a defined set of neighbors. Calculate the Moran’s I value for this set of neighbors. Repeat steps 1 and 2 for a different set of neighbors (at a greater distance, for example). For example, the Moran’s I values for income distribution in the state of Maine at distances of 75, 125, and so on up to 325 km are presented in the following plot: Figure 13.6: Moran’s I at different spatial lags defined by a 50 km width annulus at 50 km distance increments. Red dots indicate Moran’s I values for which a P-value was 0.05 or less. The plot suggests that there is significant spatial autocorrelation between counties within 25 km of one another, but as the distance between counties increases, autocorrelation shifts from being positive to being negative, meaning that at greater distances, counties tend to be more dissimilar. 13.3 Local Moran’s I We can decompose the global Moran’s I into a localized measure of autocorrelation–i.e. a map of “hot spots” and “cold spots”. A local Moran’s I analysis is best suited for relatively large datasets, especially if a hypothesis test is to be implemented. We’ll therefore switch to another dataset: Massachusetts household income data. Applying a contiguity based definition of a neighbor, we get the following scatter plot of spatially lagged income vs. income. Figure 13.7: Grey vertical and horizontal lines define the mean values for both axes. Red points highlight counties with relatively high income values (i.e.
greater than the mean) surrounded by counties whose average income value is relatively high. Likewise, dark blue points highlight counties with relatively low income values surrounded by counties whose average income value is relatively low. You’ll note that the mean values for Income and its spatial lag, highlighted as light grey vertical and horizontal lines in the above plot, carve the plot into four quadrants defining the low-low, high-low, high-high and low-high groupings when starting from the bottom-left quadrant and working counterclockwise. Note that other measures of centrality, such as the median, could be used to delineate these quadrants. The values in the above scatter plot can be mapped to each polygon in the dataset as shown in the following figure. Figure 13.8: A map view of the low-low (blue), high-low (light-blue), high-high (red) and low-high (orange) counties. Each observation that contributes to the global Moran’s I can be assigned a localized version of that statistic, \\(I_i\\), where the subscript \\(i\\) references the individual geometric unit. The calculation of \\(I_i\\) is shown later in the chapter. At this point, we have identified the counties that are surrounded by similar values. However, we have yet to assess which polygons are “significantly” similar or dissimilar to their neighbors. As with the global Moran’s I, there is both an analytical and a Monte Carlo approach to computing the significance of \\(I_i\\). In the case of a Monte Carlo approach, one shuffles all values in the dataset except for the value, \\(y_i\\), of the geometric unit \\(i\\) whose \\(I_i\\) we are assessing for significance. For each permutation, we compare the value at \\(y_i\\) to the average value of its neighboring values. From the permutations, we generate a distribution of \\(I_i\\) values (for each \\(y_i\\) feature) we would expect to get if the values were randomly distributed across all features. We can use the following polygon in eastern Massachusetts as an example. Figure 13.9: Polygon whose significance value we are assessing in this example. Its local Moran’s I statistic is 0.85. A permutation test shuffles the income values around it, all the while keeping its value constant. An example of an outcome of a few permutations follows: Figure 13.10: Local Moran’s I outcomes of a few permutations of income values. You’ll note that even though the income value remains the same in the polygon of interest, its local Moran’s I statistic will change because of the changing income values in its surrounding polygons. If we perform many more permutations, we come up with a distribution of \\(I_i\\) values under the null that the income values are randomly distributed across the state of Massachusetts. The distribution of \\(I_i\\) for the above example is plotted using a histogram. Figure 13.11: Distribution of \\(I_i\\) values under the null hypothesis that income values are randomly distributed across the study extent. The red vertical line shows the observed \\(I_i\\) for comparison. About 9.3% of the simulated values are more extreme than our observed \\(I_i\\), giving us a pseudo p-value of 0.09. If we perform this permutation for all polygons in our dataset, we can map the pseudo p-values for each polygon. Note that here, we are mapping the probability that the observed \\(I_i\\) value is more extreme than expected (equivalent to a one-tail test). Figure 13.12: Map of the pseudo p-values for each polygon’s \\(I_i\\) statistic.
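The conditional permutation just described takes only a few lines of code. Below is a minimal sketch for a single polygon, assuming x is the vector of income values, w is a row-standardized weights matrix, and i is the index of the polygon being tested (all hypothetical names); recent versions of spdep also offer a localmoran_perm() function that performs this type of test for every polygon at once.
z <- as.vector(scale(x))            # standardized income values
Ii.obs <- z[i] * sum(w[i, ] * z)    # observed local Moran's I for polygon i
nsim <- 999
Ii.sim <- numeric(nsim)
for (k in 1:nsim) {
  z.perm <- z
  z.perm[-i] <- sample(z[-i])       # shuffle all values except the one at polygon i
  Ii.sim[k] <- z[i] * sum(w[i, ] * z.perm)
}
# Pseudo p-value (one-tailed, using the generalized form from earlier in the chapter)
N.greater <- sum(Ii.sim > Ii.obs)
p <- min(N.greater + 1, nsim + 1 - N.greater) / (nsim + 1)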
One can use the computed p-values to filter the \\(I_i\\) values based on a desired level of significance. For example, the following scatter plot and map show the high/low “hotspots” for which a pseudo p-value of 0.05 or less was computed from the above simulation. Figure 13.13: Local Moran’s I values having a significance level of 0.05 or less. You’ll note that the levels of significance do not apply to just the high-high and low-low regions; they can apply to all combinations of highs and lows. Here’s another example where the \\(I_i\\) values are filtered based on a more stringent significance level of 0.01. Figure 13.14: Local Moran’s I values having a significance level of 0.01 or less. 13.4 Moran’s I equation explained The Moran’s I equation can take on many forms. One form of the equation can be presented as: \\[ I = \\frac{N}{\\sum\\limits_i (X_i-\\bar X)^2} \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}(X_i-\\bar X)(X_j-\\bar X)}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{1} \\] where \\(N\\) is the total number of features in the dataset and \\(X\\) is the quantity of interest. Subscripts \\(i\\) and \\(j\\) reference two different features in the dataset, and \\(w_{ij}\\) is a weight that defines the relationship between features \\(i\\) and \\(j\\) (i.e. the weights will determine if feature \\(j\\) is a neighbor of \\(i\\) and how much weight feature \\(j\\) should be given when computing some overall neighboring \\(X\\) value). There are a few key components of this equation worth highlighting. First, you’ll note the standardization of both sets of values by subtracting the mean of \\(X\\) from each \\(X_i\\) and \\(X_j\\) value. This highlights the fact that we are seeking to compare the deviation of each value from an overall mean and not the deviation of their absolute values. Second, you’ll note an inverted variance term on the left-hand side of equation (1)–this is a measure of spread. You might recall from an introductory statistics course that the variance can be computed as: \\[ s^2 = \\frac{\\sum\\limits_i (X_i-\\bar X)^2}{N}\\tag{2} \\] Note that a more common measure of variance, the sample variance, where one divides the above numerator by \\((n-1)\\), can also be adopted in the Moran’s I calculation. Equation (1) is thus dividing the large fraction on the right-hand side by the variance. This has the effect of limiting the range of possible Moran’s I values between -1 and 1 (note that in some extreme cases, \\(I\\) can take on a value more extreme than [-1; 1]). We can re-write the Moran’s I equation by plugging in \\(s^2\\) as follows: \\[ I = \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}\\frac{(X_i-\\bar X)}{s}\\frac{(X_j-\\bar X)}{s}}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{3} \\] Note that here \\(s\\times s = s^2\\). You might recognize the numerator as a sum of the product of standardized z-values between neighboring features. If we let \\(z_i = \\frac{(X_i-\\bar X)}{s}\\) and \\(z_j = \\frac{(X_j-\\bar X)}{s}\\), the Moran’s I equation can be reduced to: \\[ I = \\frac{\\sum\\limits_i \\sum\\limits_j w_{ij}(z_i\\ z_j)}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{4} \\] Recall that we are comparing a variable \\(X\\) at \\(i\\) to all of its neighboring values at \\(j\\). More specifically, we are computing a summary value (such as the mean) of the neighboring values at \\(j\\) and multiplying that by \\(X_i\\).
So, if we let \\(y_i = \\sum\\limits_j w_{ij} z_j\\), the Moran’s I coefficient can be rewritten as: \\[ I = \\frac{\\sum\\limits_i z_i y_i}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{5} \\] So, \\(y_i\\) is a summary of the neighboring z-values (their average, when row-standardized weights are used), thus making the product \\(z_i y_i\\) nothing more than a correlation coefficient. The product \\(z_iy_i\\) is a local measure of spatial autocorrelation, \\(I_i\\). If we don’t summarize across all locations \\(i\\), we get our local I statistic, \\(I_i\\): \\[ I_i = z_iy_i \\tag{6} \\] The global Moran’s I statistic, \\(I\\), is thus the average of all \\(I_i\\) values. \\[ I = \\frac{\\sum\\limits_i I_i}{\\sum\\limits_i \\sum\\limits_j w_{ij}} \\tag{7} \\] Let’s explore elements of the Moran’s I equation using the following sample dataset. Figure 13.15: Simulated spatial layer. The figure on the left shows each cell’s ID value. The figure in the middle shows the values for each cell. The figure on the right shows the standardized values using equation (2). The first step in the computation of a Moran’s I index is the generation of weights. The weights can take on many different values. For example, one could assign a value of 1 to a neighboring cell as shown in the following matrix (the first row and first column list the cell IDs):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0
2 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0
3 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0
4 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0
5 1 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0
6 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0 0
7 0 1 1 1 0 1 0 1 0 1 1 1 0 0 0 0
8 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0
9 0 0 0 0 1 1 0 0 0 1 0 0 1 1 0 0
10 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0
11 0 0 0 0 0 1 1 1 0 1 0 1 0 1 1 1
12 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1
13 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0
14 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1 0
15 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 1
16 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0
For example, cell ID 1 (whose value is 25 and whose standardized value, \\(z_1\\), is 0.21) has for neighbors cells 2, 5 and 6. Computationally (working with the standardized values), this gives us a summarized neighboring value (aka lagged value), \\(y_1(lag)\\), of: \\[ \\begin{align*} y_1 = \\sum\\limits_j w_{1j} z_j {}={} & (0)(0.21)+(1)(1.17)+(0)(1.5)+ ... + \\\\ & (1)(0.69)+(1)(0.93)+(0)(-0.36)+...+ \\\\ & (0)(-0.76) = 2.79 \\end{align*} \\] Computing the spatially lagged values for the other 15 cells generates the following scatterplot: Figure 13.16: Moran’s I scatterplot using a binary weight. The red point is the (\\(z_1\\), \\(y_1\\)) pair computed for cell 1. You’ll note that the range of neighboring values along the \\(y\\)-axis is much greater than that of the original values on the \\(x\\)-axis. This is not necessarily an issue given that the Moran’s \\(I\\) correlation coefficient standardizes the values by recentering and rescaling them on the overall mean, \\((X - \\bar{X})/s\\). This is simply to re-emphasize that we are interested in how a neighboring value varies relative to a feature’s value, regardless of the scale of values in either batch. If there is a downside to adopting a binary weight, it’s the bias that the different number of neighbors can introduce in the calculation of the spatially lagged values. In other words, a feature with 5 neighbors (such as feature ID 12) will tend to have a larger spatially lagged value than a feature with 3 neighbors (such as feature ID 1), even when there is no spatial autocorrelation in the dataset.
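The lagged values above amount to a simple matrix product. A minimal sketch in base R, assuming x holds the 16 cell values and W is the 16 × 16 binary weights matrix shown above (hypothetical names; the exact z-values depend on whether the population or sample standard deviation is used):
z <- (x - mean(x)) / sd(x)     # standardized cell values
y <- as.vector(W %*% z)        # spatially lagged values; y[1] should be close to 2.79
plot(z, y)                     # Moran's I scatterplot using the binary weight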
A more natural weight is one where the values are standardized across each row of the weights matrix such that the weights across each row sum to one. For example (the first row and first column list the cell IDs):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 0 0.333 0 0 0.333 0.333 0 0 0 0 0 0 0 0 0 0
2 0.2 0 0.2 0 0.2 0.2 0.2 0 0 0 0 0 0 0 0 0
3 0 0.2 0 0.2 0 0.2 0.2 0.2 0 0 0 0 0 0 0 0
4 0 0 0.333 0 0 0 0.333 0.333 0 0 0 0 0 0 0 0
5 0.2 0.2 0 0 0 0.2 0 0 0.2 0.2 0 0 0 0 0 0
6 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0 0 0 0 0
7 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0 0 0 0
8 0 0 0.2 0.2 0 0 0.2 0 0 0 0.2 0.2 0 0 0 0
9 0 0 0 0 0.2 0.2 0 0 0 0.2 0 0 0.2 0.2 0 0
10 0 0 0 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125 0
11 0 0 0 0 0 0.125 0.125 0.125 0 0.125 0 0.125 0 0.125 0.125 0.125
12 0 0 0 0 0 0 0.2 0.2 0 0 0.2 0 0 0 0.2 0.2
13 0 0 0 0 0 0 0 0 0.333 0.333 0 0 0 0.333 0 0
14 0 0 0 0 0 0 0 0 0.2 0.2 0.2 0 0.2 0 0.2 0
15 0 0 0 0 0 0 0 0 0 0.2 0.2 0.2 0 0.2 0 0.2
16 0 0 0 0 0 0 0 0 0 0 0.333 0.333 0 0 0.333 0
The spatially lagged value for cell ID 1 is thus computed as: \\[ \\begin{align*} y_1 = \\sum\\limits_j w_{1j} z_j {}={} & (0)(0.21)+(0.333)(1.17)+(0)(1.5)+...+ \\\\ & (0.333)(0.69)+(0.333)(0.93)+(0)(-0.36)+...+ \\\\ & (0)(-0.76) = 0.93 \\end{align*} \\] Multiplying each neighbor by the standardized weight, then summing these values, is simply computing the neighbors’ mean value. Using the standardized weights generates the following scatter plot. The plot on the left shows the raw values on the x and y axes; the plot on the right shows the standardized values \\(z_i\\) and \\(y_i = \\sum\\limits_j w_{ij} z_j\\). You’ll note that the shape of the point cloud is the same in both plots given that the axes on the left plot are scaled so as to match the standardized scales on both axes. Figure 13.17: Moran’s scatter plot with original values on the left and the same Moran’s I scatter plot on the right using the standardized values \\(z_i\\) and \\(y_i\\). Note the difference in the point cloud pattern in the above plot from the one generated using the binary weights. Other weights can be used, such as inverse distance and k-nearest neighbors, to name just a few. However, most software implementations of the Moran’s I statistic will adopt the row standardized weights. 13.4.1 Local Moran’s I Once a spatial weight is chosen and both \\(z_i\\) and \\(y_i\\) are computed, we can compute the \\(z_iy_i\\) product for all locations \\(i\\), thus giving us the local Moran’s I statistic. Taking feature ID 1 in our example, we compute \\(I_1(lag) = 0.21 \\times 0.93 = 0.19\\). Computing \\(I_i\\) for all cells gives us the following plot. Figure 13.18: The left plot shows the Moran’s I scatter plot with the point colors symbolizing the \\(I_i\\) values. The figure on the right shows the matching \\(I_i\\) values mapped to each respective cell. Here, we are adopting a different color scheme from that used earlier. Green colors highlight features whose values are surrounded by similar values. These can be either positive values surrounded by standardized values that tend to be positive or negative values surrounded by values that tend to be negative. In both cases, the calculated \\(I_i\\) will be positive. Red colors highlight features whose values are surrounded by dissimilar values. These can be either negative values surrounded by values that tend to be positive or positive values surrounded by values that tend to be negative. In both cases, the calculated \\(I_i\\) will be negative.
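Row standardization, the local \\(I_i\\) values and the global \\(I\\) can be computed with a few more lines, continuing with the hypothetical x and W objects from the previous sketch:
Wr <- W / rowSums(W)           # row-standardized weights (each row sums to 1)
z  <- (x - mean(x)) / sd(x)
y  <- as.vector(Wr %*% z)      # lagged values (mean of the neighboring z-values)
Ii <- z * y                    # local Moran's I for each cell (equation 6)
I  <- sum(Ii) / sum(Wr)        # global Moran's I; sum(Wr) equals the number of cells here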
In our example, two features have a negative Moran’s I coefficient: cell IDs 7 and 12. 13.4.2 Global Moran’s I The global Moran’s I coefficient, \\(I\\), is nothing more than a summary of the local Moran’s I coefficients. Using a standardized weight, \\(I\\) is the average of all \\(I_i\\) values. \\[ I = \\frac{0.19+0.7+1.15+0.68+0.18+0.15-0.24+0.44+0.25+0.12+0.14-0.29+1.18+1.39+0.71+0.39}{\\sum\\limits_i\\sum\\limits_j w_{ij}} = \\frac{7.14}{16} = 0.446 \\] In this example, \\(\\sum\\limits_i\\sum\\limits_j w_{ij}\\) is the sum of all 256 values in the weights matrix which, using standardized weights, sums to 16. \\(I\\) is thus the slope that best fits the data. This can be plotted using either the standardized values or the raw values. Figure 13.19: Moran’s scatter with fitted Moran’s I slope (red line). The left plot uses the raw values \\((X_i,X_i(lag))\\) for its axes labels. Right plot uses the standardized values \\((z_i,y_i)\\) for its axes labels. References "],["spatial-interpolation.html", "Chapter 14 Spatial Interpolation 14.1 Deterministic Approach to Interpolation 14.2 Statistical Approach to Interpolation", " Chapter 14 Spatial Interpolation Given a distribution of point meteorological stations showing precipitation values, how can I estimate the precipitation values where data were not observed? Figure 14.1: Average yearly precipitation (reported in inches) for several meteorological sites in Texas. To help answer this question, we need to clearly define the nature of our point dataset. We’ve already encountered point data earlier in the course where our interest was in creating point density maps using different kernel windows. However, the point data used represented a complete enumeration of discrete events or observations–i.e. the entity of interest only occurred at discrete locations within a study area and therefore could only be measured at those locations. Here, our point data represent sampled observations of an entity that can be measured anywhere within our study area. So creating a point density raster from this data would only make sense if we were addressing questions like “where are the meteorological stations concentrated within the state of Texas?”. Another class of techniques, used with points that represent samples of a continuous field, consists of interpolation methods. There are many interpolation tools available, but these tools can usually be grouped into two categories: deterministic and statistical interpolation methods. 14.1 Deterministic Approach to Interpolation We will explore two deterministic methods: proximity (aka Thiessen) techniques and inverse distance weighted techniques (IDW for short). 14.1.1 Proximity interpolation This is probably the simplest (and possibly one of the oldest) interpolation method. It was introduced by Alfred H. Thiessen more than a century ago. The goal is simple: assign to all unsampled locations the value of the closest sampled location. This generates a tessellated surface whereby lines that split the midpoint between each sampled location are connected, thus enclosing an area. Each area ends up enclosing a sample point whose value it inherits. Figure 14.2: Tessellated surface generated from discrete point samples. This is also known as a Thiessen interpolation. One problem with this approach is that the surface values change abruptly across the tessellated boundaries. This is not representative of most surfaces in nature. Thiessen’s method was very practical in his day, when computers did not exist (a short sketch of how such a surface can be generated in R follows).
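A tessellated (Thiessen) surface can be generated in R in several ways; one option is sf’s Voronoi function. The sketch below is one possible approach, assuming P is an sf point layer with a precip column and W is a polygon delineating the study extent (hypothetical names):
library(sf)
th <- st_voronoi(st_union(st_geometry(P)))            # Voronoi (Thiessen) polygons, returned as one geometry collection
th <- st_as_sf(st_collection_extract(th, "POLYGON"))  # split the collection into individual polygons
th <- st_join(th, P)                                  # attach each sample point's value to its enclosing polygon
th <- st_intersection(th, st_union(st_geometry(W)))   # clip the tessellation to the study extent
plot(th["precip"])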
But today, computers afford us more advanced methods of interpolation as we will see next. 14.1.2 Inverse Distance Weighted (IDW) The IDW technique computes an average value for unsampled locations using values from nearby weighted locations. The weights are proportional to the proximity of the sampled points to the unsampled location and can be specified by the IDW power coefficient. The larger the power coefficient, the stronger the weight of nearby points as can be gleaned from the following equation that estimates the value \\(z\\) at an unsampled location \\(j\\): \\[ \\hat{Z_j} = \\frac{\\sum_i{Z_i/d^n_{ij}}}{\\sum_i{1/d^n_{ij}}} \\] The carat \\(\\hat{}\\) above the variable \\(z\\) reminds us that we are estimating the value at \\(j\\). The parameter \\(n\\) is the weight parameter that is applied as an exponent to the distance thus amplifying the irrelevance of a point at location \\(i\\) as distance to \\(j\\) increases. So a large \\(n\\) results in nearby points wielding a much greater influence on the unsampled location than a point further away resulting in an interpolated output looking like a Thiessen interpolation. On the other hand, a very small value of \\(n\\) will give all points within the search radius equal weight such that all unsampled locations will represent nothing more than the mean values of all sampled points within the search radius. In the following figure, the sampled points and values are superimposed on top of an (IDW) interpolated raster generated with a \\(n\\) value of 2. Figure 14.3: An IDW interpolation of the average yearly precipitation (reported in inches) for several meteorological sites in Texas. An IDW power coefficient of 2 was used in this example. In the following example, an \\(n\\) value of 15 is used to interpolate precipitation. This results in nearby points having greater influence on the unsampled locations. Note the similarity in output to the proximity (Thiessen) interpolation. Figure 14.4: An IDW interpolation of the average yearly precipitation (reported in inches) for several meteorological sites in Texas. An IDW power coefficient of 15 was used in this example. 14.1.3 Fine tuning the interpolation parameters Finding the best set of input parameters to create an interpolated surface can be a subjective proposition. Other than eyeballing the results, how can you quantify the accuracy of the estimated values? One option is to split the points into two sets: the points used in the interpolation operation and the points used to validate the results. While this method is easily implemented (even via a pen and paper adoption) it does suffer from significant loss in power–i.e. we are using just half of the information to estimate the unsampled locations. A better approach (and one easily implemented in a computing environment) is to remove one data point from the dataset and interpolate its value using all other points in the dataset then repeating this process for each point in that dataset (while making sure that the interpolator parameters remain constant across each interpolation). The interpolated values are then compared with the actual values from the omitted point. This method is sometimes referred to as jackknifing or leave-one-out cross-validation. 
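Both the IDW interpolation and the leave-one-out procedure just described can be implemented with the gstat package. A minimal sketch follows, assuming P is an sf point layer with a precip column and grd is a layer of prediction locations (hypothetical names):
library(sf)
library(gstat)
# IDW surface with a power coefficient (idp) of 2
P.idw <- idw(precip ~ 1, locations = P, newdata = grd, idp = 2)
# Leave-one-out cross-validation: re-estimate each point from all of the others
n <- nrow(P)
pred <- numeric(n)
for (i in 1:n) {
  pred[i] <- idw(precip ~ 1, locations = P[-i, ], newdata = P[i, ], idp = 2)$var1.pred
}
res <- P$precip - pred    # residuals used to summarize performance (e.g. the RMSE below)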
The performance of the interpolator can be summarized by computing the root mean squared error (RMSE) from the residuals as follows: \\[ RMSE = \\sqrt{\\frac{\\sum_{i=1}^n (\\hat {Z_{i}} - Z_i)^2}{n}} \\] where \\(\\hat {Z_{i}}\\) is the interpolated value at the unsampled location \\(i\\) (i.e. the location where the sample point was removed), \\(Z_i\\) is the true value at location \\(i\\), and \\(n\\) is the number of points in the dataset. We can create a scatterplot of the predicted vs. observed precipitation values from our dataset. The solid diagonal line represents the one-to-one slope (i.e. if the predicted values matched the true values exactly, then the points would fall on this line). The red dashed line is a linear fit to the points; it is there to help guide our eyes along the pattern generated by these points. Figure 14.5: Scatter plot pitting predicted values vs. the observed values at each sampled location following a leave-one-out cross validation analysis. The computed RMSE from the above working example is 6.989 inches. We can extend our exploration of the interpolator’s accuracy by creating a map of the confidence intervals. This involves layering all \\(n\\) interpolated surfaces from the aforementioned jackknife technique, then computing the confidence interval for each location (pixel) in the output map (raster). If the range of interpolated values from the jackknife technique for an unsampled location \\(i\\) is high, then this implies that this location is highly sensitive to the presence or absence of a single point from the sample point locations, thus producing a large confidence interval (i.e. we can’t be very confident of the predicted value). Conversely, if the range of values estimated for location \\(i\\) is low, then a small confidence interval is computed (providing us with greater confidence in the interpolated value). The following map shows the 95% confidence interval for each unsampled location (pixel) in the study extent. Figure 14.6: In this example, an IDW power coefficient of 2 was used and the search parameters were confined to a minimum of 10 points and a maximum of 15 points. The search window was isotropic. Each pixel represents the range of precipitation values (in inches) around the expected value given a 95% confidence interval. IDW interpolation is probably one of the most widely used interpolators because of its simplicity. In many cases, it can do an adequate job. However, the choice of power remains subjective. There is another class of interpolators that makes use of the information provided to us by the sample points–more specifically, information pertaining to 1st and 2nd order behavior. These interpolators are covered next. 14.2 Statistical Approach to Interpolation The statistical interpolation methods include trend surfaces and kriging. 14.2.1 Trend Surfaces It may help to think of trend surface modeling as a regression on spatial coordinates where the coefficients apply to those coordinate values and (for more complicated surface trends) to the interplay of the coordinate values. We will explore a 0th order, 1st order and 2nd order surface trend in the following sub-sections. 14.2.1.1 0th Order Trend Surface The first (and simplest) model is the 0th order model, which takes on the following expression: Z = a where the intercept a is the mean precipitation value of all sample points (27.1 in our working example). This is simply a level (horizontal) surface whose cell values all equal 27.1.
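The trend surfaces described in this section (0th, 1st and 2nd order) can be fit by regressing the precipitation values on the point coordinates. A minimal sketch in base R, assuming P is an sf point layer with a precip column (hypothetical names):
library(sf)
xy  <- st_coordinates(P)      # X/Y coordinates of the sample points
dat <- data.frame(precip = P$precip, X = xy[, 1], Y = xy[, 2])
lm0 <- lm(precip ~ 1, data = dat)                                   # 0th order: Z = a (the overall mean)
lm1 <- lm(precip ~ X + Y, data = dat)                               # 1st order: Z = a + bX + cY
lm2 <- lm(precip ~ X + Y + I(X^2) + I(Y^2) + I(X * Y), data = dat)  # 2nd order (quadratic) surface
# Predicting any of these models over a grid of X/Y values produces the corresponding trend surface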
Figure 14.7: The simplest model where all interpolated surface values are equal to the mean precipitation. This makes for an uninformative map. A more interesting surface trend map is one where the surface trend has a slope other than 0, as highlighted in the next subsection. 14.2.1.2 1st Order Trend Surface The first order surface polynomial is a slanted flat plane whose formula is given by: Z = a + bX + cY where X and Y are the coordinate values. Figure 14.8: Result of a first order interpolation. The 1st order surface trend does a good job in highlighting the prominent east-west trend. But is the trend truly uniform along the X axis? Let’s explore a more complicated surface: the quadratic polynomial. 14.2.1.3 2nd Order Trend Surface The second order surface polynomial (aka quadratic polynomial) is a parabolic surface whose formula is given by: \\(Z = a + bX + cY + dX^2 + eY^2 + fXY\\) Figure 14.9: Result of a second order interpolation. This interpolation picks up a slight curvature in the east-west trend. But it’s not a significant improvement over the 1st order trend. 14.2.2 Ordinary Kriging Several forms of kriging interpolators exist: ordinary, universal and simple, just to name a few. This section will focus on ordinary kriging (OK) interpolation. This form of kriging usually involves the following steps: Removing any spatial trend in the data (if present). Computing the experimental variogram, \\(\\gamma\\), which is a measure of spatial autocorrelation. Defining an experimental variogram model that best characterizes the spatial autocorrelation in the data. Interpolating the surface using the experimental variogram model. Adding the kriged interpolated surface to the trend interpolated surface to produce the final output. These steps are outlined in the following subsections. 14.2.2.1 De-trending the data One assumption that needs to be met in ordinary kriging is that the mean and the variation in the entity being studied are constant across the study area. In other words, there should be no global trend in the data (the term drift is sometimes used to describe the trend in other texts). This assumption is clearly not met with our Texas precipitation dataset, where a prominent east-west gradient is observed. This requires that we remove the trend from the data before proceeding with the kriging operations. Many pieces of software will accept a trend model (usually a first, second or third order polynomial). In the steps that follow, we will use the first order fit computed earlier to de-trend our point values (recall that the second order fit provided very little improvement over the first order fit). Removing the trend leaves us with the residuals that will be used in the kriging interpolation. Note that the modeled trend will be added back to the kriged interpolated surface at the end of the workflow. Figure 14.10: Map showing de-trended precipitation values (aka residuals). These detrended values are then passed to the ordinary kriging interpolation operations. You can think of these residuals as representing the variability in the data not explained by the global trend. If variability is present in the residuals, then it is best characterized as a distance-based measure of variability (as opposed to a location-based measure). 14.2.2.2 Experimental Variogram In kriging interpolation, we focus on the spatial relationship between location attribute values.
More specifically, we are interested in how these attribute values (precipitation residuals in our working example) vary as the distance between location point pairs increases. We can compute the difference, \\(\\gamma\\), in precipitation values by squaring their differences then dividing by 2. For example, if we take two meteorological stations (one whose de-trended precipitation value is -1.2 and the other whose value is 1.6), Figure 14.11: Locations of two sample sites used to demonstrate the calculation of gamma. we can compute their difference (\\(\\gamma\\)) as follows: \\[ \\gamma = \\frac{(Z_2 - Z_1)^2}{2} = \\frac{(-1.2 - (1.6))^2}{2} = 3.92 \\] We can compute \\(\\gamma\\) for all point pairs then plot these values as a function of the distances that separate these points: Figure 14.12: Experimental variogram plot of precipitation residual values. The red point in the plot is the value computed in the above example. The distance separating those two points is about 209 km. This value is mapped in 14.12 as a red dot. The above plot is called an experimental semivariogram cloud plot (also referred to as an experimental variogram cloud plot). The terms semivariogram and variogram are often used interchangeably in geostatistics (we’ll use the term variogram henceforth since this seems to be the term of choice in current literature). Also note that the word experimental is sometimes dropped when describing these plots, but its use in our terminology is an important reminder that the points we are working with are just samples of some continuous field whose spatial variation we are attempting to model. 14.2.2.3 Sample Experimental Variogram Cloud points can be difficult to interpret due to the sheer number of point pairs (we have 465 point pairs from just 50 sample points, and this just for 1/3 of the maximum distance lag!). A common approach to resolving this issue is to “bin” the cloud points into intervals called lags and to summarize the points within each interval. In the following plot, we split the data into 15 bins then compute the average point value for each bin (displayed as red points in the plot). The red points that summarize the cloud are the sample experimental variogram estimates for each of the 15 distance bands and the plot is referred to as the sample experimental variogram plot. Figure 14.13: Sample experimental variogram plot of precipitation residual values. 14.2.2.4 Experimental Variogram Model The next step is to fit a mathematical model to our sample experimental variogram. Different mathematical models can be used; their availability is software dependent. Examples of mathematical models are shown below: Figure 14.14: A subset of variogram models available in R’s gstat package. The goal is to apply the model that best fits our sample experimental variogram. This requires picking the proper model, then tweaking the partial sill, range, and nugget parameters (where appropriate). The following figure illustrates a nonzero intercept where the nugget is the distance between the \\(0\\) variance on the \\(y\\) axis and the variogram’s model intercept with the \\(y\\) axis. The partial sill is the vertical distance between the nugget and the part of the curve that levels off. If the variogram approaches \\(0\\) on the \\(y\\)-axis, then the nugget is \\(0\\) and the partial sill is simply referred to as the sill. The distance along the \\(x\\) axis where the curve levels off is referred to as the range. 
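The ordinary kriging workflow outlined above (de-trend, compute and model the variogram, krige the residuals, then add the trend back) can be scripted with gstat. The sketch below is one possible implementation, assuming P is an sf point layer with precip, X and Y columns and grd is a prediction grid with matching X and Y columns (hypothetical names); the spherical model parameters are placeholder starting values to be tuned against the sample variogram.
library(sf)
library(gstat)
# 1. De-trend the data with the first order surface
lm1   <- lm(precip ~ X + Y, data = P)
P$res <- residuals(lm1)
# 2. Sample experimental variogram of the residuals
#    (gstat defaults to 15 lag bins out to about a third of the maximum distance)
v <- variogram(res ~ 1, P)
# 3. Fit a spherical variogram model (psill, range and nugget are starting values)
v.fit <- fit.variogram(v, model = vgm(psill = 15, model = "Sph", range = 150000, nugget = 1))
plot(v, model = v.fit)
# 4. Krige the residuals onto the prediction grid
k <- krige(res ~ 1, locations = P, newdata = grd, model = v.fit)
# 5. Add the modeled trend back to produce the final surface
k$final <- k$var1.pred + predict(lm1, newdata = grd)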
Figure 14.15: Graphical description of the range, sill and nugget parameters in a variogram model. In our working example, we will try to fit the Spherical function to our sample experimental variogram. This is one of three popular models (the other two being linear and gaussian models.) Figure 14.16: A spherical model fit to our residual variogram. 14.2.2.5 Kriging Interpolation The variogram model is used by the kriging interpolator to provide localized weighting parameters. Recall that with the IDW, the interpolated value at an unsampled site is determined by summarizing weighted neighboring points where the weighting parameter (the power parameter) is defined by the user and is applied uniformly to the entire study extent. Kriging uses the variogram model to compute the weights of neighboring points based on the distribution of those values–in essence, kriging is letting the localized pattern produced by the sample points define the weights (in a systematic way). The exact mathematical implementation will not be covered here (it’s quite involved), but the resulting output is shown in the following figure: Figure 14.17: Krige interpolation of the residual (detrended) precipitation values. Recall that the kriging interpolation was performed on the de-trended data. In essence, we predicted the precipitation values based on localized factors. We now need to combine this interpolated surface with that produced from the trend interpolated surface to produce the following output: Figure 14.18: The final kriged surface. A valuable by-product of the kriging operation is the variance map which gives us a measure of uncertainty in the interpolated values. The smaller the variance, the better (note that the variance values are in squared units). Figure 14.19: Variance map resulting from the Kriging analysis. "],["references.html", "Chapter 15 References", " Chapter 15 References "],["reading-and-writing-spatial-data-in-r.html", "A Reading and writing spatial data in R Sample files for this exercise Introduction Creating spatial objects Converting from an sf object Converting to an sf object Dissecting the sf file object Exporting to different data file formats", " A Reading and writing spatial data in R R sf terra tidygeocoder spatstat 4.3.1 1.0.14 1.7.55 1.0.5 3.0.7 Sample files for this exercise First, you will need to download some sample files from the github repository. Make sure to set your R session folder to the directory where you will want to save the sample files before running the following code chunks. download.file("https://github.com/mgimond/Spatial/raw/main/Data/Income_schooling.zip", destfile = "Income_schooling.zip" , mode='wb') unzip("Income_schooling.zip", exdir = ".") file.remove("Income_schooling.zip") download.file("https://github.com/mgimond/Spatial/raw/main/Data/rail_inters.gpkg", destfile = "./rail_inters.gpkg", mode='wb') download.file("https://github.com/mgimond/Spatial/raw/main/Data/elev.img", destfile = "./elev.img", mode='wb') Introduction There are several different R spatial formats to choose from. Your choice of format will largely be dictated by the package(s) and or function(s) used in your workflow. A breakdown of formats and intended use are listed below. Data format Used with… Used in package… Used for… Comment sf vector sf, others visualizing, manipulating, querying This is the new spatial standard in R. Will also read from spatially enabled databases such as postgresSQL. 
raster raster raster, others visualizing, manipulating, spatial statistics This has been the most popular raster format fo rmany years. But, it is gradually being supplanted by terra SpatRaster terra terra, others visualizing, manipulating, spatial statistics This is gradually replacing raster SpatialPoints* SpatialPolygons* SpatialLines* SpatialGrid* vector and raster sp, spdep Visualizing, spatial statistics These are legacy formats. spdep now accepts sf objects ppp owin vector spatstat Point pattern analysis/statistics NA im raster spatstat Point pattern analysis/statistics NA 1 The spatial* format includes SpatialPointsDataFrame, SpatialPolygonsDataFrame, SpatialLinesDataFrame, etc… There is an attempt at standardizing the spatial format in the R ecosystem by adopting a well established set of spatial standards known as simple features. This effort results in a recently developed package called sf (Pebesma 2018). It is therefore recommended that you work in an sf framework when possible. As of this writing, most of the basic data manipulation and visualization operations can be successfully conducted using sf spatial objects. Some packages such as spdep and spatstat require specialized data object types. This tutorial will highlight some useful conversion functions for this purpose. Creating spatial objects The following sections demonstrate different spatial data object creation strategies. Reading a shapefile Shapefiles consist of many files sharing the same core filename and different suffixes (i.e. file extensions). For example, the sample shapefile used in this exercise consists of the following files: [1] "Income_schooling.dbf" "Income_schooling.prj" "Income_schooling.sbn" "Income_schooling.sbx" [5] "Income_schooling.shp" "Income_schooling.shx" Note that the number of files associated with a shapefile can vary. sf only needs to be given the *.shp name. It will then know which other files to read into R such as projection information and attribute table. library(sf) s.sf <- st_read("Income_schooling.shp") Let’s view the first few records in the spatial data object. head(s.sf, n=4) # List spatial object and the first 4 attribute records Simple feature collection with 4 features and 5 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 379071.8 ymin: 4936182 xmax: 596500.1 ymax: 5255569 Projected CRS: NAD83 / UTM zone 19N NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.00140696 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.00115002 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.00212896 724.242 MULTIPOLYGON (((445039.5 51... 4 Penobscot 23307 0.00684534 0.00102545 242.424 MULTIPOLYGON (((472271.3 49... Note that the sf object stores not only the geometry but the coordinate system information and attribute data as well. These will be explored later in this exercise. Reading a GeoPackage A geopackage can store more than one layer. To list the layers available in the geopackage, type: st_layers("rail_inters.gpkg") Driver: GPKG Available layers: layer_name geometry_type features fields crs_name 1 Interstate Multi Line String 35 1 NAD83 2 Rail Multi Line String 730 3 NAD83 / UTM zone 19N In this example, we have two separate layers: Interstate and Rail. We can extract each layer separately via the layer= parameter. 
inter.sf <- st_read("rail_inters.gpkg", layer="Interstate") rail.sf <- st_read("rail_inters.gpkg", layer="Rail") Reading a raster In earlier versions of this tutorial, the raster package was used to read raster files. This is being supplanted by terra which will be the package used in this and in subsequent exercises. terra will read many different raster file formats such as geoTiff, Imagine and HDF5 just to name a few. To see a list of supported raster file formats on your computer simply run: terra::gdal(drivers = TRUE) |> subset(type == "raster") In the following example, an Imagine raster file is read into R using the rast function. library(terra) elev.r <- rast("elev.img") The object class is of type SpatRaster. class(elev.r) [1] "SpatRaster" attr(,"package") [1] "terra" What sets a SpatRaster object apart from other R data file objects is its storage. By default, data files are loaded into memory, but SpatRaster objects are not. This can be convenient when working with raster files too large for memory. But this comes at a performance cost. If your RAM is large enough to handle your raster file, it’s best to load the entire dataset into memory. To check if the elev.r object is loaded into memory, run: inMemory(elev.r) [1] FALSE An output of FALSE indicates that it is not. To force the raster into memory use set.values: set.values(elev.r) Let’s check that the raster is indeed loaded into memory: inMemory(elev.r) [1] TRUE Now let’s look at the raster’s properties: elev.r class : SpatRaster dimensions : 994, 652, 1 (nrow, ncol, nlyr) resolution : 500, 500 (x, y) extent : 336630.3, 662630.3, 4759303, 5256303 (xmin, xmax, ymin, ymax) coord. ref. : NAD_1983_UTM_Zone_19N (EPSG:26919) source(s) : memory varname : elev name : Layer_1 min value : 0 max value : 1546 The raster object returns its grid dimensions (number of rows and columns), pixel size/resolution (in the layer’s coordinate system units), geographic extent, native coordinate system (UTM NAD83 Zone 19 with units of meters) and min/max raster values. Creating a spatial object from a data frame Geographic point data locations recorded in a spreadsheet can be converted to a spatial point object. Note that it’s important that you specify the coordinate system used to record the coordinate pairs since such information is not stored in a data frame. In the following example, the coordinate values are recorded in a WGS 1984 geographic coordinate system (crs = 4326). # Create a simple dataframe with lat/long values df <- data.frame(lon = c(-68.783, -69.6458, -69.7653), lat = c(44.8109, 44.5521, 44.3235), Name= c("Bangor", "Waterville", "Augusta")) # Convert the dataframe to a spatial object. Note that the # crs= 4326 parameter assigns a WGS84 coordinate system to the # spatial object p.sf <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326) p.sf Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -69.7653 ymin: 44.3235 xmax: -68.783 ymax: 44.8109 Geodetic CRS: WGS 84 Name geometry 1 Bangor POINT (-68.783 44.8109) 2 Waterville POINT (-69.6458 44.5521) 3 Augusta POINT (-69.7653 44.3235) Geocoding street addresses The tidygeocoder package will convert street addresses to latitude/longitude coordinate pairs using a wide range of geocoding services such as the US census and Google. Some of these geocoding services will require an API key, others will not. Click here to see the list of geocoding services supported by tidygeocoder and their geocoding limitations. 
In the example that follows, the osm geocoding service is used by default. library(tidygeocoder) options(pillar.sigfig = 7) # Increase significant digits in displayed output dat <- data.frame( name = c("Colby College", "Bates College", "Bowdoin College"), address = c("4000 Mayflower drive, Waterville, ME , 04901", "275 College st, Lewiston, ME 04240", "255 Maine St, Brunswick, ME 04011")) geocode(.tbl = dat, address = address, method = "osm") # A tibble: 3 × 4 name address lat long <chr> <chr> <dbl> <dbl> 1 Colby College 4000 Mayflower drive, Waterville, ME , 04901 44.56615 -69.66232 2 Bates College 275 College st, Lewiston, ME 04240 44.10638 -70.20636 3 Bowdoin College 255 Maine St, Brunswick, ME 04011 43.90870 -69.96142 Another free (but manual) alternative, is to use the US Census Bureau’s web geocoding service for creating lat/lon values from a file of US street addresses. This needs to be completed via their web interface and the resulting data table (a CSV file) would then need to be loaded into R as a data frame. Converting from an sf object Packages such as spdep (older versions only) and spatsat do not support sf objects. The following sections demonstrate methods to convert from sf to other formats. Converting an sf object to a Spatial* object (spdep/sp) The following code will convert point, polyline or polygon features to a spatial* object. While the current version of spdep will now accept sf objects, converting to spatial* objects will be necessary with legacy spdep packages. In this example, an sf polygon feature is converted to a SpatialPolygonsDataFrame object. s.sp <- as_Spatial(s.sf) class(s.sp) [1] "SpatialPolygonsDataFrame" attr(,"package") [1] "sp" Converting an sf polygon object to an owin object The spatstat package is used to analyze point patterns however, in most cases, the study extent needs to be explicitly defined by a polygon object. The polygon should be of class owin. library(spatstat) s.owin <- as.owin(s.sf) class(s.owin) [1] "owin" Note the loading of the package spatstat. This is required to access the as.owin.sf method for sf. Note too that the attribute table gets stripped from the polygon data. This is usually fine given that the only reason for converting a polygon to an owin format is for delineating the study boundary. Converting an sf point object to a ppp object The spatstat package is currently designed to work with projected (planar) coordinate system. If you attempt to convert a point object that is in a geographic coordinate system, you will get the following error message: p.ppp <- as.ppp(p.sf) Error: Only projected coordinates may be converted to spatstat class objects The error message reminds us that a geographic coordinate system (i.e. one that uses angular measurements such as latitude/longitude) cannot be used with this package. If you encounter this error, you will need to project the point object to a projected coordinate system. In this example, we’ll project the p.sf object to a UTM coordinate system (epsg=32619). Coordinate systems in R are treated in a later appendix. p.sf.utm <- st_transform(p.sf, 32619) # project from geographic to UTM p.ppp <- as.ppp(p.sf.utm) # Create ppp object class(p.ppp) [1] "ppp" Note that if the point layer has an attribute table, its attributes will be converted to ppp marks. These attribute values can be accessed via marks(p.ppp). 
Converting a SpatRaster object to an im object To create a spatstat im raster object from a SpatRaster object, you will need to first create a three column dataframe from the SpatRaster objects with the first two columns defining the X and Y coordinate values of each cell, and the third column defining the cell values df <- as.data.frame(elev.r,xy=TRUE) elev.im <- as.im(df) class(elev.im) [1] "im" Converting to an sf object All aforementioned spatial formats, except owin, can be coerced to an sf object via the st_as_sf function. for example: st_as_sf(p.ppp) # For converting a ppp object to an sf object st_as_sf(s.sp) # For converting a Spatial* object to an sf object Dissecting the sf file object head(s.sf,3) Simple feature collection with 3 features and 5 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 379071.8 ymin: 4936182 xmax: 596500.1 ymax: 5255569 Projected CRS: NAD83 / UTM zone 19N NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.00140696 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.00115002 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.00212896 724.242 MULTIPOLYGON (((445039.5 51... The first line of output gives us the geometry type, MULTIPOLYGON, a multi-polygon data type. This is also referred to as a multipart polygon. A single-part sf polygon object will adopt the POLYGON geometry. The next few lines of output give us the layer’s bounding extent in the layer’s native coordinate system units. You can extract the extent via the st_bbox() function as in st_bbox(s.sf). The following code chunk can be used to extract addition coordinate information from the data. st_crs(s.sf) Depending on the version of the PROJ library used by sf, you can get two different outputs. If your version of sf is built with a version of PROJ older than 6.0, the output will consist of an epsg code (when available) and a proj4 string as follows: Coordinate Reference System: EPSG: 26919 proj4string: "+proj=utm +zone=19 +datum=NAD83 +units=m +no_defs" If your version of sf is built with a version of PROJ 6.0 or greater, the output will consist of a user defined CS definition (e.g. an epsg code), if available, and a Well Known Text (WKT) formatted coordinate definition that consists of a series of [ ] tags as follows: Coordinate Reference System: User input: NAD83 / UTM zone 19N wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["Degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["Degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], ID["EPSG",26919]] The WKT format will usually start with a PROJCRS[...] tag for a projected coordinate system, or a GEOGCRS[...] tag for a geographic coordinate system. More information on coordinate systems in R can be found in the coordinate systems appendix. 
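If you only need one component of the coordinate reference system rather than the full printout above, the crs object returned by st_crs() can be queried directly. A minimal sketch, assuming a reasonably recent version of sf:

st_crs(s.sf)$epsg    # the EPSG code (NA if none is associated with the layer)
st_crs(s.sf)$wkt     # the full WKT string
st_is_longlat(s.sf)  # TRUE if the layer uses a geographic (lat/long) CRS
st_bbox(s.sf)        # the layer's bounding extent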
What remains of the sf summary output is the first few records of the attribute table. You can extract the object’s table to a dedicated data frame via: s.df <- data.frame(s.sf) class(s.df) [1] "data.frame" head(s.df, 5) NAME Income NoSchool NoSchoolSE IncomeSE geometry 1 Aroostook 21024 0.01338720 0.001406960 250.909 MULTIPOLYGON (((513821.1 51... 2 Somerset 21025 0.00521153 0.001150020 390.909 MULTIPOLYGON (((379071.8 50... 3 Piscataquis 21292 0.00633830 0.002128960 724.242 MULTIPOLYGON (((445039.5 51... 4 Penobscot 23307 0.00684534 0.001025450 242.424 MULTIPOLYGON (((472271.3 49... 5 Washington 20015 0.00478188 0.000966036 327.273 MULTIPOLYGON (((645446.5 49... The above chunk will also create a geometry column. This column is somewhat unique in that it stores its contents as a list of geometry coordinate pairs (polygon vertex coordinate values in this example). str(s.df) 'data.frame': 16 obs. of 6 variables: $ NAME : chr "Aroostook" "Somerset" "Piscataquis" "Penobscot" ... $ Income : int 21024 21025 21292 23307 20015 21744 21885 23020 25652 24268 ... $ NoSchool : num 0.01339 0.00521 0.00634 0.00685 0.00478 ... $ NoSchoolSE: num 0.001407 0.00115 0.002129 0.001025 0.000966 ... $ IncomeSE : num 251 391 724 242 327 ... $ geometry :sfc_MULTIPOLYGON of length 16; first list element: List of 1 ..$ :List of 1 .. ..$ : num [1:32, 1:2] 513821 513806 445039 422284 424687 ... ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg" You can also opt to remove this column prior to creating the dataframe as follows: s.nogeom.df <- st_set_geometry(s.sf, NULL) class(s.nogeom.df) [1] "data.frame" head(s.nogeom.df, 5) NAME Income NoSchool NoSchoolSE IncomeSE 1 Aroostook 21024 0.01338720 0.001406960 250.909 2 Somerset 21025 0.00521153 0.001150020 390.909 3 Piscataquis 21292 0.00633830 0.002128960 724.242 4 Penobscot 23307 0.00684534 0.001025450 242.424 5 Washington 20015 0.00478188 0.000966036 327.273 Exporting to different data file formats You can export an sf object to many different spatial file formats such as a shapefile or a geopackage. st_write(s.sf, "shapefile_out.shp", driver = "ESRI Shapefile") # create to a shapefile st_write(s.sf, "s.gpkg", driver = "GPKG") # Create a geopackage file If the file you are writing to already exists, the above will throw an error. To force an overwrite, simply add the delete_layer = TRUE argument to the st_write function. You can see a list of writable vector formats via: gdal(drivers = TRUE) |> subset(can %in% c("write", "read/write" ) & type == "vector") The value in the name column is the driver name to pass to the driver = argument in the st_write() function. To export a raster to a data file, use writeRaster() function. writeRaster(elev.r, "elev_out.tif", gdal = "GTiff" ) # Create a geoTiff file writeRaster(elev.r, "elev_out.img", gdal = "HFA" ) # Create an Imagine raster file You can see a list of writable raster formats via: gdal(drivers = TRUE) |> subset(can %in% c("write", "read/write" ) & type == "raster") The value in the name column is the driver name to pass to the gdal = argument in the writeRaster() function. References "],["mapping-data-in-r.html", "B Mapping data in R Sample files for this exercise tmap ggplot2 plot_sf", " B Mapping data in R R sf terra tmap ggplot2 4.3.1 1.0.14 1.7.55 3.3.3 3.4.3 There are many mapping environments that can be adopted in R. Three are presented in this tutorial: tmap, ggplot2 and plot_sf. 
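Before loading the sample files in the next section, the packages used throughout this appendix should be installed and attached. A minimal setup sketch follows (the versions listed above are those used to build this material; any recent versions should behave similarly):

# One-time installation (uncomment if these packages are not yet installed)
# install.packages(c("sf", "terra", "tmap", "ggplot2"))

library(sf)      # vector data support
library(terra)   # raster data support
library(tmap)    # thematic mapping
library(ggplot2) # general purpose plotting with geom_sf() support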
Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. library(sf) library(terra) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev.RDS")) elev.r <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/inter_sf.RDS")) inter.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/rail_sf.RDS")) rail.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/p_sf.RDS")) p.sf <- readRDS(z) The data objects consist of five layers: an elevation raster (elev.r), an interstate polyline layer (inter.sf), a point cities layer (p.sf), a railroad polyline layer (rail.sf) and a Maine counties polygon layer (s.sf). All vector layers are sf objects. All layers are in a UTM/NAD83 projection (Zone 19N) except p.sf which is in a WGS 1984 geographic coordinate system. tmap The tmap package is specifically developed for mapping spatial data. As such, it offers the greatest mapping options. The package recognizes sf, raster and Spatial* objects. The basics To map the counties polygon layer using a grey color scheme, type: library(tmap) tm_shape(s.sf) + tm_polygons(col="grey", border.col="white") The tm_shape function loads the spatial object (vector or raster) into the mapping session. The tm_polygons function is one of many tmap functions that dictates how the spatial object is to be mapped. The col parameter defines either the polygon fill color or the spatial object’s attribute column to be used to define the polygons’ color scheme. For example, to use the Income attribute value to define the color scheme, type: tm_shape(s.sf) + tm_polygons(col="Income", border.col = "white") Note the + symbol used to piece together the functions (this is similar to the ggplot2 syntax). You can customize the map by piecing together various map element functions. For example, to move the legend box outside of the main map body add the tm_legend(outside = TRUE) function to the mapping operation. tm_shape(s.sf) + tm_polygons("Income", border.col = "white") + tm_legend(outside = TRUE) You can also choose to omit the legend box (via the legend.show = FALSE parameter) and the data frame border (via the tm_layout(frame = FALSE) function): tm_shape(s.sf) + tm_polygons("Income", border.col = "white", legend.show=FALSE) + tm_layout(frame = FALSE) If you want to omit the polygon border lines from the plot, simply add the border.col = NULL parameter to the tm_polygons function. tm_shape(s.sf) + tm_polygons("Income", border.col = NULL) + tm_legend(outside = TRUE) Note that the tm_fill function is nearly identical to the tm_polygons function with the difference being that the tm_fill function does not draw polygon borders. Combining layers You can easily stack layers by piecing together additional tm_shapefunctions. In the following example, the railroad layer and the point layer are added to the income map. The railroad layer is mapped using the tm_lines function and the cities point layer is mapped using the tm_dots function. Note that layers are pieced together using the + symbol. tm_shape(s.sf) + tm_polygons("Income", border.col = NULL) + tm_legend(outside = TRUE) + tm_shape(rail.sf) + tm_lines(col="grey70") + tm_shape(p.sf) + tm_dots(size=0.3, col="black") Layers are stacked in the order in which they are listed. 
In the above example, the point layer is the last layer called therefore it is drawn on top of the previously drawn layers. Note that if a layer’s coordinate system is properly defined, tmap will reproject, on-the-fly, any layer whose coordinate system does not match that of the first layer in the stack. In this example, s.sf defines the map’s coordinate system (UTM/NAD83). p.sf is in a geographic coordinate system and is thus reprojected on-the-fly to properly overlap the other layers in the map. Tweaking classification schemes You can control the classification type, color scheme, and bin numbers via the tm_polygons function. For example, to apply a quantile scheme with 6 bins and varying shades of green, type: tm_shape(s.sf) + tm_polygons("Income", style = "quantile", n = 6, palette = "Greens") + tm_legend(outside = TRUE) Other style classification schemes include fixed, equal, jenks, kmeans and sd. If you want to control the breaks manually set style=fixed and specify the classification breaks using the breaks parameter. For example, tm_shape(s.sf) + tm_polygons("Income", style = "fixed",palette = "Greens", breaks = c(0, 23000, 27000, 100000 )) + tm_legend(outside = TRUE) If you want a bit more control over the legend elements, you can tweak the labels parameter as in, tm_shape(s.sf) + tm_polygons("Income", style = "fixed",palette = "Greens", breaks = c(0, 23000, 27000, 100000 ), labels = c("under $23,000", "$23,000 to $27,000", "above $27,000"), text.size = 1) + tm_legend(outside = TRUE) Tweaking colors There are many color schemes to choose from, but you will probably want to stick to color swatches established by Cynthia Brewer. These palettes are available in tmap and their names are listed below. For sequential color schemes, you can choose from the following palettes. For divergent color schemes, you can choose from the following palettes. For categorical color schemes, you can choose from the following palettes. For example, to map the county names using the Pastel1 categorical color scheme, type: tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) To map the percentage of the population not having attained a high school degree (column labeled NoSchool in s.sf) using a YlOrBr palette with 8 bins while modifying the legend title to read “Fraction without a HS degree”, type: tm_shape(s.sf) + tm_polygons("NoSchool", style="quantile", palette = "YlOrBr", n=8, title="Fraction without \\na HS degree") + tm_legend(outside = TRUE) The character \\n in the “Fraction without \\na HS degree” string is interpreted by R as a new line (carriage return). If you want to reverse the color scheme simply add the minus symbol - in front of the palette name as in palette = \"-YlOrBr\" Adding labels You can add text and labels using the tm_text function. In the following example, point labels are added to the right of the points with the text left justified (just = \"left\") and with an x offset of 0.5 units for added buffer between the point and the text. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1", border.col = "white") + tm_legend(outside = TRUE) + tm_shape(p.sf) + tm_dots(size= .3, col = "red") + tm_text("Name", just = "left", xmod = 0.5, size = 0.8) The tm_text function accepts an auto placement option via the parameter auto.placement = TRUE. This uses a simulated annealing algorithm. Note that this automated approach may not generate the same text placement after each run. 
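One housekeeping step not covered in the text: a tmap map can be written to an image file with tmap_save(). The sketch below stores the labeled map from the previous chunk in an object before saving it; the file name and output dimensions are arbitrary choices.

# Store the labeled map in an object rather than printing it
lbl.map <- tm_shape(s.sf) + 
  tm_polygons("NAME", palette = "Pastel1", border.col = "white") + 
  tm_legend(outside = TRUE) +
  tm_shape(p.sf) + 
  tm_dots(size = 0.3, col = "red") +
  tm_text("Name", just = "left", xmod = 0.5, size = 0.8)

# Write the map to disk (file name and dimensions are arbitrary)
tmap_save(lbl.map, filename = "maine_labels.png", width = 7, height = 5,
          units = "in", dpi = 300)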
Adding a grid or graticule You can add a grid or graticule to the map using the tm_grid function. You will need to modify the map’s default viewport setting via the tm_layout function to provide space for the grid labels. In the following example, the grid is generated using the layer’s UTM coordinate system and is divided into roughly four segments along the x-axis and five segments along the y-axis. The function will adjust the grid placement so as to generate “pretty” label values. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, n.x = 4, n.y = 5) To generate a graticule (lines of latitude and longitude), simply modify the grid’s coordinate system to a geographic one using either an EPSG defined coordinate system, or a PROJ4 formatted string. But note that the PROJ string syntax is falling out of favor in current and future R spatial environments so, if possible, adopt an EPSG (or OGC) code. Here, we’ll use EPSG:4326 which defines the WGS 1984 geographic coordinate system. We will also modify the grid placement by explicitly specifying the lat/long grid values. tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5), y = c(44, 45, 46, 47), projection = "EPSG:4326") Adding the ° symbol to the lat/long values requires a bit more code: tm_shape(s.sf) + tm_polygons("NAME", palette = "Pastel1") + tm_legend(outside = TRUE) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5) , y = c(44, 45, 46, 47), projection = "+proj=longlat", labels.format = list(fun=function(x) {paste0(x,intToUtf8(176))} ) ) Here, we use the unicode decimal representation of the ° symbol (unicode 176) and pass it to the intToUtf8 function. A list of unicode characters and their decimal representation can be found on this Wikipedia page. Adding statistical plots A histogram of the variables being mapped can be added to the legend element. By default, the histogram will inherit the colors used in the classification scheme. tm_shape(s.sf) + tm_polygons("NoSchool", palette = "YlOrBr", n = 6, legend.hist = TRUE, title = "% no school") + tm_legend(outside = TRUE, hist.width = 2) Mapping raster files Raster objects can be mapped by specifying the tm_raster function. For example to plot the elevation raster and assign 64 continuous shades of the built-in terrain color ramp, type: tm_shape(elev.r) + tm_raster(style = "cont", title = "Elevation (m)", palette = terrain.colors(64))+ tm_legend(outside = TRUE) Note the use of another style parameter option: cont for continuous color scheme. You can choose to symbolize the raster using classification breaks instead of continuous colors. For example, to manually set the breaks to 50, 100, 500, 750, 1000, and 15000 meters, type: tm_shape(elev.r) + tm_raster(style = "fixed", title = "Elevation (m)", breaks = c(0, 50, 100, 500, 750, 1000, 15000), palette = terrain.colors(5))+ tm_legend(outside = TRUE) Other color gradients that R offers include, heat.colors, rainbow, and topo.colors. You can also create your own color ramp via the colorRampPalette function. 
For example, to generate a 12 bin quantile classification scheme using a color ramp that changes from darkolivegreen4 to yellow to brown (these are built-in R colors), and adding a histogram to view the distribution of colors across pixels, type: tm_shape(elev.r) + tm_raster(style = "quantile", n = 12, title = "Elevation (m)", palette = colorRampPalette( c("darkolivegreen4","yellow", "brown"))(12), legend.hist = TRUE)+ tm_legend(outside = TRUE, hist.width = 2) Note that the Brewer palette names can also be used with rasters. Changing coordinate systems tmap can change the output’s coordinate system without needing to reproject the data layers. In the following example, the elevation raster, railroad layer and point city layer are mapped onto a USA Contiguous Albers Equal Area Conic projection. A lat/long grid is added as a reference. # Define the Albers coordinate system aea <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +ellps=GRS80 +datum=NAD83" # Map the data tm_shape(elev.r, projection = aea) + tm_raster(style = "quantile", n = 12, palette = colorRampPalette( c("darkolivegreen4","yellow", "brown"))(12), legend.show = FALSE) + tm_shape(rail.sf) + tm_lines(col = "grey70")+ tm_shape(p.sf) +tm_dots(size=0.5) + tm_layout(outer.margins = c(.1,.1,.1,.1)) + tm_grid(labels.inside.frame = FALSE, x = c(-70.5, -69, -67.5), y = c(44, 45, 46, 47), projection = "+proj=longlat") The first data layer’s projection= parameter will define the map’s coordinate system. Note that this parameter does not need to be specified in the other layers taking part in the output map. If a projection is not explicitly defined in the first call to tm_shape, then the output map will default to the first layer’s reference system. Side-by-side maps You can piece maps together side-by-side using the tmap_arrange function. You first need to save each map to a separate object before combining them. For example: inc.map <- tm_shape(s.sf) + tm_polygons(col="Income")+ tm_legend(outside=TRUE) school.map <- tm_shape(s.sf) + tm_polygons(col="NoSchool")+ tm_legend(outside=TRUE) name.map <- tm_shape(s.sf) + tm_polygons(col="NAME")+ tm_legend(outside=TRUE) tmap_arrange(inc.map, school.map, name.map) Splitting data by polygons or group of polygons You can split the output into groups of features based on a column attribute. For example, to split the income map into individual polygons via the NAME attribute, type: tm_shape(s.sf) + tm_polygons(col = "Income") + tm_legend(outside = TRUE) + tm_facets( by = "NAME", nrow = 2) The order of the faceted plot follows the alphanumeric order of the faceting attribute values. If you want to change the faceted order, you will need to change the attribute’s level order. ggplot2 If you are already familiar with ggplot2, you will find it easy to transition to spatial data visualization. The key geom used when mapping spatial data is geom_sf(). The basics If you wish to simply plot the geometric elements of a layer, type: library(ggplot2) ggplot(data = s.sf) + geom_sf() As with any ggplot operation, you can also pass the object’s name to the geom_sf() instead of the ggplot function as in: ggplot() + geom_sf(data = s.sf) This will prove practical later in this exercise when multiple layers are plotted on the map. By default, ggplot will add a graticule to the plot, even if the coordinate system associated with the layer is in a projected coordinate system. You can adopt any one of ggplot2’s gridline removal strategies to eliminate the grid from the plot. 
Here, we’ll make use of the theme_void() function. ggplot(data = s.sf) + geom_sf() + theme_void() If you want to have ggplot adopt the layer’s native coordinate system (UTM NAD 1983 in this example) instead of the default geographic coordinate system, type: ggplot(data = s.sf) + geom_sf() + coord_sf(datum = NULL) Or, you can explicitly assign the data layer’s datum via a call to st_crs as in ... + coord_sf(datum = st_crs(s.sf)) By setting datum to NULL, you prevent ggplot from figuring out how to convert the layer’s native coordinate system to a geographic one. You can control grid/graticule intervals using ggplot’s scale_..._continuous functions. For example: ggplot(data = s.sf) + geom_sf() + scale_x_continuous(breaks = c(-70, -69, -68)) + scale_y_continuous(breaks = 44:47) If you wish to apply a grid native to the layer’s coordinate system, type: ggplot(data = s.sf) + geom_sf() + coord_sf(datum = NULL) + scale_x_continuous(breaks = c(400000, 500000, 600000)) + scale_y_continuous(breaks = c(4900000, 5100000)) To symbolize a layer’s geometries using one of the layer’s attributes, add the aes() function. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() Note that the data and aesthetics can be defined in the geom_sf function as well: ggplot() + geom_sf(data = s.sf, aes(fill = Income)) To change the border color, type: ggplot(data = s.sf, aes(fill = Income)) + geom_sf(col = "white") To remove outlines, simply pass NA to col (e.g. col = NA) in the geom_sf function. Tweaking classification schemes To bin the color scheme by assigning ranges of income values to a unique set of color swatches defined by hex values, use one of the scale_fill_steps* family of functions. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_stepsn(colors = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60") , breaks = c(22000, 25000, 27000, 30000)) You can adopt Brewer’s color schemes by applying one of the scale_..._fermenter() functions and specifying the classification type (sequential, seq; divergent, div; or categorical, qual) and the palette name. For example, to adopt a divergent color scheme using the \"PRGn\" colors, type: ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_fermenter(type = "div", palette = "PRGn", n.breaks = 4) The flip the color scheme set direction to 1. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_fermenter(type = "div", palette = "PRGn", n.breaks = 4, direction = 1) ggplot offers many advanced options. For example, we can modify the bin intervals by generating a non-uniform classification scheme and scale the legend bar so as to reflect the non-uniform intervals using the guide_coloursteps() function and its even.steps = FALSE argument. We’ll also modify the legend bar dimensions and title in this code chunk. ggplot(data = s.sf, aes(fill = Income)) + geom_sf() + scale_fill_stepsn(colors = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60", "#1A9850") , breaks = c(22000, 25000, 26000, 27000, 30000), values = scales::rescale(c(22000, 25000, 26000, 27000, 30000), c(0,1)), guide = guide_coloursteps(even.steps = FALSE, show.limits = TRUE, title = "Per capita Income \\n(US dollars)", barheight = unit(2.2, "in"), barwidth = unit(0.15, "in"))) Combining layers You can overlap layers in the map by adding calls to geom_sf. In such a scenario, it might be best for readability sake to specify the layer name in the geom_sf function instead of the ggplot function. 
ggplot() + geom_sf(data = s.sf, aes(fill = Income)) + geom_sf(data = rail.sf, col = "white") + geom_sf(data = p.sf, col = "green") Note that ggplot will convert coordinate systems on-the-fly as needed. Here, p.sf is in a coordinate system different from the other layers. You can also add raster layers to the map. However, the raster layer must be in a dataframe format with x, y and z columns. The elev.r raster is in a SpatRaster format and will need to be converted to a dataframe using the as.data.frame function from the terra package. This function has a dedicated method for SpatRaster objects; as such, it adds parameters unique to this method. These include xy = TRUE, which instructs the function to create x and y coordinate columns from the data, and na.rm = TRUE, which removes blank cells (this will help reduce the size of our dataframe given that elev.r does not fill its extent’s rectangular outline). Since the layers are drawn in the order listed, we will move the rail.sf vector layer to the bottom of the stack. ggplot() + geom_raster(data = as.data.frame(elev.r, xy=TRUE, na.rm = TRUE), aes(x = x, y = y, fill = elev)) + scale_fill_gradientn(colours = terrain.colors(7)) + geom_sf(data = rail.sf, col = "white") + geom_sf(data = p.sf, col = "black") + theme(axis.title = element_blank()) # Removes axes labels plot_sf The sf package has its own plot method. This is a convenient way to generate simple plots without needing additional plotting packages. The basics By default, when passing an sf object to plot(), the function will generate as many plots as there are attribute columns. For example: plot(s.sf) To limit the plot to just one of the attribute columns, limit the dataset using basic R indexing techniques. For example, to plot the Income column, type: plot(s.sf["Income"]) To limit the output to just the layer’s geometry, wrap the object name with the st_geometry function. plot(st_geometry(s.sf)) You can control the fill and border colors using the col and border parameters respectively. plot(st_geometry(s.sf), col ="grey", border = "white") Adding a graticule You can add a graticule by setting the graticule parameter to TRUE. To add graticule labels, set axes to TRUE. plot(st_geometry(s.sf), col ="grey", border = "white", graticule = TRUE, axes= TRUE) Combining layers To add layers, generate a new call to plot with the add parameter set to TRUE. For example, to add rail.sf and p.sf to the map, type: plot(st_geometry(s.sf), col ="grey", border = "white", graticule = TRUE, axes= TRUE) plot(rail.sf, col = "grey20", add = TRUE) Note that plot_sf requires that the layers be in the same coordinate system. For example, adding p.sf will not show the points on the map given that it’s in a different coordinate system. sf layers can be combined with raster layers. The order in which layers are listed will matter. You will usually want to map the raster layer first, then add the vector layer(s). plot(elev.r, col = terrain.colors(30)) plot(st_geometry(rail.sf), col ="grey", border = "white", add = TRUE) Tweaking colors You can tweak the color schemes as well as the legend display. The latter will require the use of R’s built-in par function, whereby the las = 1 parameter will render the key labels horizontally, and the omi parameter will prevent the legend labels from being cropped.
OP <- par(las = 1, omi=c(0,0,0,0.6)) p1 <- plot(s.sf["Income"], breaks = c(20000, 22000, 25000, 26000, 27000, 30000, 33000), pal = c("#D73027", "#FC8D59", "#FEE08B", "#D9EF8B", "#91CF60", "#1A9850"), key.width = 0.2, at = c(20000, 22000, 25000, 26000, 27000, 30000, 33000)) par(OP) While plot_sf offers succinct plotting commands and independence from other mapping packages, it is limited in its customization options. C Anatomy of simple feature objects R sf ggplot2 4.3.1 1.0.14 3.4.3 This tutorial exposes you to the building blocks of simple feature objects via the creation of point, polyline and polygon features from scratch. Creating point ‘sf’ objects We will start off by exploring the creation of a singlepart point feature object. There are three phases in creating a point simple feature (sf) object: Defining the coordinate pairs via a point geometry object, sfg; Creating a simple feature column object, sfc, from the point geometries; Creating the simple feature object, sf. Step 1: Create the point geometry: sfg Here, we’ll create three separate point objects. We’ll adopt a geographic coordinate system, but note that we do not specify the coordinate system just yet. library(sf) p1.sfg <- st_point(c(-70, 45)) p2.sfg <- st_point(c(-69, 44)) p3.sfg <- st_point(c(-69, 45)) Let’s check the class of one of these point geometries. class(p1.sfg) [1] "XY" "POINT" "sfg" What we are looking for is an sfg class. You’ll note other classes associated with this object such as POINT which defines the geometric primitive. You’ll see examples of other geometric primitives later in this tutorial. Note that if a multipart point feature object is desired, the st_multipoint() function needs to be used instead of st_point() with the coordinate pairs defined in a matrix as in st_multipoint(matrix( c(-70, 45, -69, 44, -69, 45), ncol = 2, byrow = TRUE ) ). Step 2: Create a column of simple feature geometries: sfc Next, we’ll combine the point geometries into a single object. Note that if you are to define a coordinate system for the features, you can do so here via the crs= parameter. We use the WGS 1984 reference system (EPSG code of 4326). p.sfc <- st_sfc( list(p1.sfg, p2.sfg, p3.sfg), crs = 4326 ) class(p.sfc) [1] "sfc_POINT" "sfc" The object is a simple feature column, sfc. More specifically, we’ve combined the point geometries into a single object whereby each geometry is assigned its own row or, to be technical, each point was assigned its own component via the list function. You can confirm that each point geometry is assigned its own row in the following output. p.sfc Geometry set for 3 features Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 You can access each point using list operations. For example, to access the second point, type: p.sfc[[2]] Step 3: Create the simple feature object sf The final step is to create the simple feature object.
p.sf <- st_sf(p.sfc) p.sf Simple feature collection with 3 features and 0 fields Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 p.sfc 1 POINT (-70 45) 2 POINT (-69 44) 3 POINT (-69 45) Renaming the geometry column The above step generated a geometry column named after the input sfc object name (p.sfc in our example). This is perfectly functional since the sf object knows that this is the geometry column. We can confirm this by checking out p.sf’s attributes. attributes(p.sf) $names [1] "p.sfc" $row.names [1] 1 2 3 $class [1] "sf" "data.frame" $sf_column [1] "p.sfc" $agr factor() Levels: constant aggregate identity What we are looking for is the $sf_column attribute which is , in our example, pointing to the p.sfc column. This attribute is critical in a spatial operation that makes use of the dataframe’s spatial objects. Functions that recognize sf objects will look for this attribute to identify the geometry column. You might chose to rename the column to something more meaningful such as coords (note that some spatially enabled databases adopt the name geom). You can use the names() function to rename that column, but note that you will need to re-define the geometry column in the attributes using the st_geometry() function. names(p.sf) <- "coords" st_geometry(p.sf) <- "coords" p.sf Simple feature collection with 3 features and 0 fields Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 coords 1 POINT (-70 45) 2 POINT (-69 44) 3 POINT (-69 45) Adding attributes to an sf object The p.sf object is nothing more than a dataframe with a geometry column of list data type. typeof(p.sf$coords) [1] "list" Storing spatial features in a dataframe has many benefits, one of which is operating on the features’ attribute values. For example, we can add a new column with attribute values for each geometry entry. Here, we’ll assign letters to each point. Note that the order in which the attribute values are passed to the dataframe must match that of the geometry elements. p.sf$val1 <- c("A", "B", "C") p.sf Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 coords val1 1 POINT (-70 45) A 2 POINT (-69 44) B 3 POINT (-69 45) C We can use sf’s plot function to view the points. plot(p.sf, pch = 16, axes = TRUE, main = NULL) Adding a geometry column to an existing non-spatial dataframe A nifty property of the sfc object created in step 2 above is the ability to append it to an existing dataframe using the st_geometry() function. In the following example, we’ll create a dataframe, then append the geometry column to that dataframe. df <- data.frame(col1 = c("A", "B","C")) st_geometry(df) <- p.sfc Note that once we’ve added the geometry column, df becomes a spatial feature object and the geometry column is assigned the name geometry. df Simple feature collection with 3 features and 1 field Geometry type: POINT Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 col1 geometry 1 A POINT (-70 45) 2 B POINT (-69 44) 3 C POINT (-69 45) Creating polyline ‘sf’ objects The steps are similar to creating a point object. You first create the geometry(ies), you then combine the geometry(ies) into a spatial feature column before creating the simple feature object. First, we need to define the vertices that will define each line segment of the polyline. 
The order in which the vertices are defined matters: The order defines each connecting line segment ends. The coordinate pairs of each vertex are stored in a matrix. l <- rbind( c(-70, 45), c(-69, 44), c(-69, 45) ) Next, we create a polyline geometry object. l.sfg <- st_linestring(l) Next, we create the simple feature column. We also add the reference system definition (crs = 4326). l.sfc <- st_sfc(list(l.sfg), crs = 4326) Finally, we create the simple feature object. l.sf <- st_sf(l.sfc) l.sf Simple feature collection with 1 feature and 0 fields Geometry type: LINESTRING Dimension: XY Bounding box: xmin: -70 ymin: 44 xmax: -69 ymax: 45 Geodetic CRS: WGS 84 l.sfc 1 LINESTRING (-70 45, -69 44,... Even though we have multiple line segments, they are all associated with a single polyline feature, hence they each share the same attribute. plot(l.sf, type = "b", pch = 16, main = NULL, axes = TRUE) Creating branching polyline features You can also create polyline features with branching segments (i.e. where at least one vertex is associated with more than two line segments). You simply need to make sure that the coordinate values for the overlapping vertices share the exact same values. # Define coordinate pairs l1 <- rbind( c(-70, 45), c(-69, 44), c(-69, 45) ) l2 <- rbind( c(-69, 44), c(-70, 44) ) l3 <- rbind( c(-69, 44), c(-68, 43) ) # Create simple feature geometry object l.sfg <- st_multilinestring(list(l1, l2, l3)) # Create simple feature column object l.sfc <- st_sfc(list(l.sfg), crs = 4326) # Create simple feature object l.sf <- st_sf(l.sfc) # Plot the data plot(l.sf, type = "b", pch = 16, axes = TRUE) Creating polygon ‘sf’ objects General steps in creating a polygon sf spatial object from scratch include: Defining the vertices of each polygon in a matrix; Creating a list object from each matrix object (the list structure will differ between POLYGON and MULTIPOLYGON geometries); Creating an sfg polygon geometry object from the list; Creating an sf spatial object. Defining a polygon’s geometry is a bit more involved than a polyline in that a polygon defines an enclosed area. By convention, simple features record vertices coordinate pairs in a counterclockwise direction such that the area to the left of a polygon’s perimeter when traveling in the direction of the recorded vertices is the polygon’s “inside”. This is counter to the order in which vertices are recorded in a shapefile whereby the area to the right of the traveled path along the polygon’s perimeter is deemed “inside”. A polygon hole has its ring defined in the opposite direction: clockwise for a simple feature object and counterclockwise for a shapefile. For many applications in R, the ring direction will not matter, but for a few they might. So when possible, adopt the simple feature paradigm when defining the coordinate pairs. Note that importing a shapefile into an R session will usually automatically reverse the polygons’ ring direction. There are two types of polygon geometries that can be adopted depending on your needs: POLYGON and MULTIPOLYGON. POLYGON simple feature A plain polygon We’ll first create a simple polygon shaped like a triangle. The sf output structure will be similar to that for the POINT and POLYLINE objects with the coordinate pairs defining the polygon vertices stored in a geometry column. The polygon coordinate values are defined in a matrix. The last coordinate pair must match the first coordinate pair. 
The coordinate values will be recorded in a geographic coordinate system (latitude, longitude) but the reference system won’t be defined until the creation of the sfc object. poly1.crd <- rbind( c(-66, 43), c(-70, 47), c(-70,43), c(-66, 43) ) Next, we create the POLYGON geometries. The polygon matrix needs to be wrapped in a list object. poly1.geom <- st_polygon( list(poly1.crd ) ) We now have a polygon geometry. poly1.geom Next, we create a simple feature column from the polygon geometry. We’ll also define the coordinate system used to report the coordinate values. poly.sfc <- st_sfc( list(poly1.geom), crs = 4326 ) poly.sfc Geometry set for 1 feature Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 poly.sfc 1 POLYGON ((-66 43, -70 47, -... The coordinates column is assigned the name poly.sfc by default. If you wish to change the column name to coords, for example, type the following: names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... plot(poly.sf, col = "bisque", axes = TRUE) A polygon with a hole In this example, we’ll add a hole to the polygon. Recall that its outer ring will need to be recorded in a counterclockwise direction and its hole in a clockwise direction. The resulting data object will have the following structure. # Polygon 1 poly1.outer.crd <- rbind( c(-66, 43),c(-70, 47), c(-70,43), c(-66, 43) ) # Outer ring poly1.inner.crd <- rbind( c(-68, 44), c(-69,44), c(-69, 45), c(-68, 44) ) # Inner ring Next, we combine the ring coordinates into a single geometric element. Note that this is done by combining the two coordinate matrices into a single list object. poly1.geom <- st_polygon( list(poly1.outer.crd, poly1.inner.crd)) We now create the simple feature column object. poly.sfc <- st_sfc( list(poly1.geom), crs = 4326 ) Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) We’ll take the opportunity to rename the coordinate column (even though this is not necessary). names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 1 feature and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... Let’s now plot the sf object. plot(poly.sf, col = "bisque", axes = TRUE) Combining polygons: singlepart features In this example, we’ll create two distinct polygons by adding a second polygon to the one created in the last step. The output will be a singlepart polygon feature (i.e. each polygon can be assigned its own unique attribute value). We’ll create the second polygon (the first polygon having already been created in the previous section). # Define coordinate matrix poly2.crd <- rbind( c(-67, 45),c(-67, 47), c(-69,47), c(-67, 45) ) # Create polygon geometry poly2.geom <- st_polygon( list(poly2.crd)) Next, we combine the geometries into a simple feature column, sfc. poly.sfc <- st_sfc( list(poly1.geom , poly2.geom), crs = 4326 ) Each polygon has its own row in the sfc object. 
poly.sfc Geometry set for 2 features Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 Finally, to create the sf object, run the st_sf() function. poly.sf <- st_sf(poly.sfc) poly.sf Simple feature collection with 2 features and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 poly.sfc 1 POLYGON ((-66 43, -70 47, -... 2 POLYGON ((-67 45, -67 47, -... We’ll go ahead and rename the geometry column to coords. names(poly.sf) <- "coords" st_geometry(poly.sf) <- "coords" poly.sf Simple feature collection with 2 features and 0 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords 1 POLYGON ((-66 43, -70 47, -... 2 POLYGON ((-67 45, -67 47, -... plot(poly.sf, col = "bisque", axes = TRUE) Adding attributes As with the point sf object created earlier in this exercise, we can append columns to the polygon sf object. But make sure that the order of the attribute values matches the order in which the polygons are stored in the sf object. poly.sf$id <- c("A", "B") poly.sf Simple feature collection with 2 features and 1 field Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 coords id 1 POLYGON ((-66 43, -70 47, -... A 2 POLYGON ((-67 45, -67 47, -... B plot(poly.sf["id"], axes = TRUE, main = NULL) MULTIPOLYGON simple feature: multipart features If multiple polygons are to share the same attribute record (a scenario referred to as multipart geometry in some GIS applications), you need to use the st_multipolygon() function when creating the sfg object. In this example, we’ll combine the two polygons created in the last example into a single geometry element. The st_multipolygon() function groups polygons into a single list. If one of the polygons is made up of more than one ring (e.g. a polygon with a hole), its geometry is combined into a single sub-list object. # Create multipolygon geometry mpoly1.sfg <- st_multipolygon( list( list( poly1.outer.crd, # Outer loop poly1.inner.crd), # Inner loop list( poly2.crd)) ) # Separate polygon # Create simple feature column object mpoly.sfc <- st_sfc( list(mpoly1.sfg), crs = 4326) # Create simple feature object mpoly.sf <- st_sf(mpoly.sfc) mpoly.sf Simple feature collection with 1 feature and 0 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 Geodetic CRS: WGS 84 mpoly.sfc 1 MULTIPOLYGON (((-66 43, -70... Note the single geometric entry in the table. Mixing singlepart and multipart elements A MULTIPOLYGON geometry can be used to store a single polygon as well. In this example, we’ll create a MULTIPOLYGON sf object that will combine multipart and singlepart polygons. To make this example more interesting, we’ll have one of the elements (poly4.coords) overlapping several polygons. Note that any overlapping polygon needs to be in its own MULTIPOLYGON or POLYGON entry; if it’s added to an existing entry (i.e. combined with another polygon geometry), it may be treated as a hole, even if the coordinate values are recorded in a counterclockwise direction. poly3.coords <- rbind( c(-66, 44), c(-64, 44), c(-66,47), c(-66, 44) ) poly4.coords <- rbind( c(-67, 43), c(-64, 46), c(-66.5,46), c(-67, 43) ) Note the embedded list() functions in the following code chunk.
mpoly1.sfg <- st_multipolygon( list( list( poly1.outer.crd, # Outer loop poly1.inner.crd), # Inner loop list( poly2.crd)) ) # Separate poly mpoly2.sfg <- st_multipolygon( list( list(poly3.coords))) # Unique polygon mpoly3.sfg <- st_multipolygon( list( list(poly4.coords)) ) # Unique polygon Finally, we’ll generate the simple feature object, sf, via the creation of the simple feature column object, sfc. We’ll also assign the WGS 1984 geographic coordinate system (epsg = 4326). mpoly.sfc <- st_sfc( list(mpoly1.sfg, mpoly2.sfg, mpoly3.sfg), crs = 4326) mpoly.sf <- st_sf(mpoly.sfc) Next, we’ll add attribute values to each geometric object before generating a plot. We’ll apply a transparency to the polygons to reveal the overlapping geometries. mpoly.sf$ids <- c("A", "B", "C") plot(mpoly.sf["ids"], axes = TRUE, main = NULL, pal = sf.colors(alpha = 0.5, categorical = TRUE)) Note how polygon C overlaps the other polygon elements. We can check that this does not violate simple feature rules via the st_is_valid() function. st_is_valid(mpoly.sf) [1] TRUE TRUE TRUE This returns three boolean values, one for each element. A value of TRUE indicates that the geometry does not violate any rule. Avoid storing overlapping polygons in a same MULTIPOLYGON geometry. Doing so will create an “invalid” sf object which may pose problems with certain functions. Extracting geometry from an sf object You can extract the geometry from an sf object via the st_geometry function. For example, # Create sfc from sf st_geometry(mpoly.sf) Geometry set for 3 features Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -64 ymax: 47 Geodetic CRS: WGS 84 To extract coordinates from a single record in a WKT (well known text) format, type: st_geometry(mpoly.sf)[[1]] If you want the extract the coordinate pairs of the first element in a list format type: st_geometry(mpoly.sf)[[1]][] [[1]] [[1]][[1]] [,1] [,2] [1,] -66 43 [2,] -70 47 [3,] -70 43 [4,] -66 43 [[1]][[2]] [,1] [,2] [1,] -68 44 [2,] -69 44 [3,] -69 45 [4,] -68 44 [[2]] [[2]][[1]] [,1] [,2] [1,] -67 45 [2,] -67 47 [3,] -69 47 [4,] -67 45 Alternative syntax In this tutorial, you were instructed to define the coordinate pairs in matrices. This is probably the simplest way to enter coordinate values manually. You can, however, bypass the creation of a matrix and simply define the coordinate pairs using the WKT syntax. For example, to generate the POLYGON geometry object from above, you could simply type: st_as_sfc( "POLYGON ((-66 43, -70 47, -70 43, -66 43), (-68 44, -69 44, -69 45, -68 44))" ) Geometry set for 1 feature Geometry type: POLYGON Dimension: XY Bounding box: xmin: -70 ymin: 43 xmax: -66 ymax: 47 CRS: NA Note that the WKT syntax is that listed in the sfc and sf geometry columns. Also note that the function st_as_sfc is used as opposed to the st_sfc function used with matrices in earlier steps. Additional resources Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data”, The R Journal, pages 439-446. Pebesma, Edzar and Bivand, Roger. 
“Spatial Data Science: with applications in R”, https://keen-swartz-3146c4.netlify.app/ D Vector operations in R R sf ggplot2 4.3.1 1.0.14 3.4.3 Earlier versions of this tutorial made use of a combination of packages including raster and rgeos to perform most vector operations highlighted in this exercise. Many of these vector operations can now be performed using the sf package. As such, all code chunks in this tutorial make use of sf for most vector operations. We’ll first load spatial objects used in this exercise. These include: A polygon layer that delineates Maine counties (USA), s1.sf; A polygon layer that delineates distances to Augusta (Maine) as concentric circles, s2.sf; A polyline layer of the interstate highway system that runs through Maine. These data are stored as sf objects. library(sf) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Income_schooling_sf.rds")) s1.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Dist_sf.rds")) s2.sf <- readRDS(z) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/Highway_sf.rds")) l1.sf <- readRDS(z) A map of the above layers is shown below. We’ll use the ggplot2 package to generate this and subsequent maps in this tutorial. library(ggplot2) ggplot() + geom_sf(data = s1.sf) + geom_sf(data = s2.sf, alpha = 0.5, col = "red") + geom_sf(data = l1.sf, col = "blue") The attribute tables for both polygon objects (s1.sf and s2.sf) are shown next. Note that each shape object has a unique set of attributes as well as a unique number of records. Figure 2.6: Attribute tables for the Maine spatial object, s1.sf, (left table) and the distance to Augusta spatial object, s2.sf (right table). Dissolving geometries Dissolving by contiguous shape There are two different ways to dissolve geometries that share a common boundary. Both are presented next. Option 1 To dissolve all polygons that share at least one line segment, simply pass the object name to sf’s st_union function while making sure that the by_feature option is set to FALSE. In this example, we dissolve all polygons to create a single outline of the state of Maine. ME <- st_union(s1.sf, by_feature = FALSE) ggplot(ME) + geom_sf(fill = "grey") Note that the dissolving process removed all attributes from the original spatial object. You’ll also note that st_union returns an sfc object even though the input object is sf. You can convert the output to an sf object using the st_sf() function as in st_sf(ME). Option 2 Another approach is to make use of the dplyr package and its group_by/summarise functions. library(dplyr) ME <- s1.sf %>% group_by() %>% summarise() ggplot(ME) + geom_sf(fill = "grey") Note that this option will also remove any attributes associated with the input spatial object; however, the output remains an sf object (this differs from the st_union output). Dissolving by attribute You can also choose to dissolve based on an attribute’s values. First, we’ll create a new column whose value will be binary (TRUE/FALSE) depending on whether or not the county income is above the counties’ median income value.
s1.sf$med <- s1.sf$Income > median(s1.sf$Income) ggplot(s1.sf) + geom_sf(aes(fill = med)) Next, we’ll dissolve all polygons by the med attribute. Any polygons sharing at least one line segment that have the same med value will be dissolved into a single polygon. Two approaches are presented here: one using sf’s aggregate function, the other using the dplyr approach adopted in the previous section. Option 1 ME.inc <- aggregate(s1.sf["med"], by = list(diss = s1.sf$med), FUN = function(x)x[1], do_union = TRUE) This option will create a new field defined in the by = parameter (diss in this working example). st_drop_geometry(ME.inc) # Print the layer's attributes table diss med 1 FALSE FALSE 2 TRUE TRUE Option 2 ME.inc <- s1.sf %>% group_by(med) %>% summarise() This option will limit the attributes to that/those listed in the group_by function. st_drop_geometry(ME.inc) # A tibble: 2 × 1 med * <lgl> 1 FALSE 2 TRUE A map of the resulting layer follows. ggplot(ME.inc) + geom_sf(aes(fill = med)) The dissolving (aggregating) operation will, by default, eliminate all other attribute values. If you wish to summarize other attribute values along with the attribute used for dissolving, use the dplyr piping operation option. For example, to compute the median Income value for each of the below/above median income groups type the following: ME.inc <- s1.sf %>% group_by(med) %>% summarize(medinc = median(Income)) ggplot(ME.inc) + geom_sf(aes(fill = medinc)) To view the attributes table with both the aggregate variable, med, and the median income variable, Income, type: st_drop_geometry(ME.inc) # A tibble: 2 × 2 med medinc * <lgl> <dbl> 1 FALSE 21518 2 TRUE 27955 Subsetting by attribute You can use conventional R dataframe manipulation operations to subset by attribute values. For example, to subset by county name (e.g. Kennebec county), type: ME.ken <- s1.sf[s1.sf$NAME == "Kennebec",] You can, of course, use piping operations to perform the same task as follows: ME.ken <- s1.sf %>% filter(NAME == "Kennebec") ggplot(ME.ken) + geom_sf() To subset by a range of attribute values (e.g. subset by income values that are less than the median value), type: ME.inc2 <- s1.sf %>% filter(Income < median(Income)) ggplot(ME.inc2) + geom_sf() Intersecting layers To intersect two polygon objects, use sf’s st_intersection function. clp1 <- st_intersection(s1.sf, s2.sf) ggplot(clp1) + geom_sf() st_intersection keeps all features that overlap along with their combined attributes. Note that new polygons are created which will increase the size of the attributes table beyond the size of the combined input attributes table. 
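A quick way to see this growth in the number of features (a small sketch using the objects created above) is to compare the row counts of the inputs and the output before examining the full attribute table printed next:

nrow(s1.sf)  # 16 county polygons
nrow(s2.sf)  # 4 distance bands
nrow(clp1)   # more features than either input: one per overlapping piece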
st_drop_geometry(clp1) NAME Income NoSchool NoSchoolSE IncomeSE med distance 8 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 20 12 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 20 14 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 20 1 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 50 5 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 50 6 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 50 7 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 50 8.1 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 50 9 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 50 11 Knox 27141 0.00652269 0.001863920 684.849 TRUE 50 12.1 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 50 13 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 50 14.1 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 50 1.1 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 80 2 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 80 3 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 80 5.1 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 80 6.1 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 80 7.1 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 80 9.1 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 80 10 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 80 11.1 Knox 27141 0.00652269 0.001863920 684.849 TRUE 80 12.2 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 80 13.1 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 80 14.2 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 80 1.2 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 120 2.1 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 120 3.1 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 120 5.2 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 120 6.2 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 120 7.2 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 120 10.1 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 120 13.2 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 120 15 York 28496 0.00529228 0.000737195 332.121 TRUE 120 Clipping spatial objects using other spatial objects The st_intersection can also be used to clip an input layer using another layer’s outer geometry boundaries as the “cookie cutter”. But note that the latter must be limited to its outer boundaries which may require that it be run through a dissolving operation (shown earlier in this tutorial) to dissolve internal boundaries. To clip s2.sf using the outline of s1.sf, type: clp2 <- st_intersection(s2.sf, st_union(s1.sf)) ggplot(clp2) + geom_sf() The order the layers are passed to the st_intersection function matters. Flipping the input layer in the last example will clip s1.sf to s2.sf’s bounding polygon(s). clp2 <- st_intersection(s1.sf, st_union(s2.sf)) ggplot(clp2) + geom_sf() Line geometries can also be clipped to polygon features. The output will be a line object that falls within the polygons of the input polygon object. For example, to output all line segments that fall within the concentric distance circles of s2.sf, type: clp3 <- st_intersection(l1.sf, st_union(s2.sf)) A plot of the clipped line features is shown with the outline of the clipping feature. ggplot(clp3) + geom_sf(data = clp3) + geom_sf(data = st_union(s2.sf), col = "red", fill = NA ) Unioning layers To union two polygon objects, use sf’s st_union function. For example, un1 <- st_union(s2.sf,s1.sf) ggplot(un1) + geom_sf(aes(fill = NAME), alpha = 0.4) This produces the following attributes table. 
distance NAME Income NoSchool NoSchoolSE IncomeSE med 1 20 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 2 50 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 3 80 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 4 120 Aroostook 21024 0.01338720 0.001406960 250.909 FALSE 1.1 20 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 2.1 50 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 3.1 80 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 4.1 120 Somerset 21025 0.00521153 0.001150020 390.909 FALSE 1.2 20 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 2.2 50 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 3.2 80 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 4.2 120 Piscataquis 21292 0.00633830 0.002128960 724.242 FALSE 1.3 20 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 2.3 50 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 3.3 80 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 4.3 120 Penobscot 23307 0.00684534 0.001025450 242.424 FALSE 1.4 20 Washington 20015 0.00478188 0.000966036 327.273 FALSE 2.4 50 Washington 20015 0.00478188 0.000966036 327.273 FALSE 3.4 80 Washington 20015 0.00478188 0.000966036 327.273 FALSE 4.4 120 Washington 20015 0.00478188 0.000966036 327.273 FALSE 1.5 20 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 2.5 50 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 3.5 80 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 4.5 120 Franklin 21744 0.00508507 0.001641740 530.909 FALSE 1.6 20 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 2.6 50 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 3.6 80 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 4.6 120 Oxford 21885 0.00700822 0.001318160 536.970 FALSE 1.7 20 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 2.7 50 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 3.7 80 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 4.7 120 Waldo 23020 0.00498141 0.000918837 450.909 FALSE 1.8 20 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 2.8 50 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 3.8 80 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 4.8 120 Kennebec 25652 0.00570358 0.000917087 360.000 TRUE 1.9 20 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 2.9 50 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 3.9 80 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 4.9 120 Androscoggin 24268 0.00830953 0.001178660 460.606 TRUE 1.10 20 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 2.10 50 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 3.10 80 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 4.10 120 Hancock 28071 0.00238996 0.000784584 585.455 TRUE 1.11 20 Knox 27141 0.00652269 0.001863920 684.849 TRUE 2.11 50 Knox 27141 0.00652269 0.001863920 684.849 TRUE 3.11 80 Knox 27141 0.00652269 0.001863920 684.849 TRUE 4.11 120 Knox 27141 0.00652269 0.001863920 684.849 TRUE 1.12 20 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 2.12 50 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 3.12 80 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 4.12 120 Lincoln 27839 0.00278315 0.001030800 571.515 TRUE 1.13 20 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 2.13 50 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 3.13 80 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 4.13 120 Cumberland 32549 0.00494917 0.000683236 346.061 TRUE 1.14 20 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 2.14 50 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 3.14 80 Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 4.14 120 
Sagadahoc 28122 0.00285524 0.000900782 544.849 TRUE 1.15 20 York 28496 0.00529228 0.000737195 332.121 TRUE 2.15 50 York 28496 0.00529228 0.000737195 332.121 TRUE 3.15 80 York 28496 0.00529228 0.000737195 332.121 TRUE 4.15 120 York 28496 0.00529228 0.000737195 332.121 TRUE Note that the union operation can generate many overlapping geometries. This is because each geometry of the layers being unioned are paired up with one another creating unique combinations of each layer’s geometries. For example, the Aroostook County polygon from s1.sf is paired with each annulus of the s2.sf layer creating four new geometries. un1 %>% filter(NAME == "Aroostook") Simple feature collection with 4 features and 7 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 318980.1 ymin: 4788093 xmax: 596500.1 ymax: 5255569 Projected CRS: +proj=utm +zone=19 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0 distance NAME Income NoSchool NoSchoolSE IncomeSE med 1 20 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 2 50 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 3 80 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE 4 120 Aroostook 21024 0.0133872 0.00140696 250.909 FALSE geometry 1 MULTIPOLYGON (((438980 4928... 2 MULTIPOLYGON (((438980 4958... 3 MULTIPOLYGON (((438980 4988... 4 MULTIPOLYGON (((438980 5028... The union operation creates all possible pairs of geometries between both input objects (i.e. 4 circle geometries from s2.sf times 16 county geometries from s1.sf for a total of 64 geometries). Buffering geometries To buffer point, line or polygon geometries, use sf’s st_buffer function. For example, the following code chunk generates a 10 km (10,000 m) buffer around the polyline segments. l1.sf.buf <- st_buffer(l1.sf, dist = 10000) ggplot(l1.sf.buf) + geom_sf() + coord_sf(ndiscr = 1000) To create a continuous polygon geometry (i.e. to eliminate overlapping buffers), we’ll follow up with one of the dissolving techniques introduced earlier in this tutorial. l1.sf.buf.dis <- l1.sf.buf %>% group_by() %>% summarise() ggplot(l1.sf.buf.dis) + geom_sf() If you want to preserve an attribute value (such as highway number), modify the above code as follows: l1.sf.buf.dis <- l1.sf.buf %>% group_by(Number) %>% summarise() ggplot(l1.sf.buf.dis, aes(fill=Number) ) + geom_sf(alpha = 0.5) "],["mapping-rates-in-r.html", "E Mapping rates in R Raw Rates Standardized mortality ratios (relative risk) Dykes and Unwin’s chi-square statistic Unstable ratios", " E Mapping rates in R R spdep classInt RColorBrewer sf sp 4.3.1 1.2.8 0.4.10 1.1.3 1.0.14 2.0.0 In this exercise, we’ll make use of sf’s plot method instead of tmap to take advantage of sf’s scaled keys which will prove insightful when exploring rate mapping techniques that adopt none uniform classification schemes. The following libraries are used in the examples that follow. library(spdep) library(classInt) library(RColorBrewer) library(sf) library(sp) Next, we’ll initialize some color palettes. pal1 <- brewer.pal(6,"Greys") pal2 <- brewer.pal(8,"RdYlGn") pal3 <- c(brewer.pal(9,"Greys"), "#FF0000") The Auckland dataset from the spdep package will be used throughout this exercise. Some of the graphics that follow are R reproductions of Bailey and Gatrell’s book, Interactive Spatial Data Analysis (Bailey and Gatrell 1995). 
auckland <- st_read(system.file("shapes/auckland.shp", package="spData")[1]) Reading layer `auckland' from data source `C:\\Users\\mgimond\\AppData\\Local\\R\\win-library\\4.3\\spData\\shapes\\auckland.shp' using driver `ESRI Shapefile' Simple feature collection with 167 features and 4 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 7.6 ymin: -4.3 xmax: 91.5 ymax: 99.3 CRS: NA The Auckland data represents total infant deaths (under the age of five) for Auckland, New Zealand, spanning the years 1977 through 1985 for different census area units. The following block of code maps these counts by area. Both equal interval and quantile classification schemes of the same data are mapped. brks1 <- classIntervals(auckland$M77_85, n = 6, style = "equal") brks2 <- classIntervals(auckland$M77_85, n = 6, style = "quantile") plot(auckland["M77_85"], breaks = brks1$brks, pal = pal1, at = round(brks1$brks,2), main = "Equal interval breaks", key.pos = 4, las = 1) plot(auckland["M77_85"], breaks = brks2$brks, pal = pal1, at = brks2$brks, main = "Quantile breaks", key.pos = 4, las = 1) These are examples of choropleth maps (choro = area and pleth = value) where some attribute (an enumeration of child deaths in this working example) is aggregated over a defined area (e.g. census area units) and displayed using two different classification schemes. Since the area units used to map death counts are not uniform in shape and area across Auckland, there is a tendency to assign more “visual weight” to polygons having larger areas than those having smaller areas. In our example, census units in the southern end of Auckland appear to have an “abnormally” large infant death count. Another perceptual interpretation of the map is one that flags those southern units as being “problematic” or of “great concern”. However, as we shall see in the following sections, this perception may not reflect reality. We therefore seek to produce perceptually tenable maps. Dykes and Unwin (Dykes and Unwin 2001) define a similar concept called map stability which seeks to produce maps that convey real effects. Raw Rates A popular approach for correcting for biased visual weights (due, for instance, to different unit area sizes) is to normalize the count data by area thus giving a count per unit area. Though this may make sense for population count data, it does not make a whole lot sense when applied to mortality counts; we are usually interested in the number of deaths per population count and not in the number of deaths per unit area. In the next chunk of code we extract population count under the age of 5 from the Auckland data set and assign this value to the variable pop. Likewise, we extract the under 5 mortality count and assign this value to the variable mor. Bear in mind that the mortality count spans a 9 year period. Since mortality rates are usually presented in rates per year, we need to multiply the population value (which is for the year 1981) by nine. This will be important in the subsequent code when we compute mortality rates. pop <- auckland$Und5_81 * 9 mor <- auckland$M77_85 Next, we will compute the raw rates (infant deaths per 1000 individuals per year) and map this rate by census unit area. Both quantile and equal interval classification schemes of the same data are mapped. 
auckland$raw.rate <- mor / pop * 1000 brks1 <- classIntervals(auckland$raw.rate, n = 6, style = "equal") brks2 <- classIntervals(auckland$raw.rate, n = 6, style = "quantile") plot(auckland["raw.rate"], breaks = brks1$brks, pal = pal1, at = round(brks1$brks,2), main = "Equal interval breaks", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, pal = pal1, at = round(brks2$brks,2), main = "Quantile breaks", key.pos = 4, las = 1) Note how our perception of the distribution of infant deaths changes when looking at mapped raw rates vs. counts. A north-south trend in perceived "abnormal" infant deaths is no longer apparent in this map. Standardized mortality ratios (relative risk) Another way to re-express the data is to map the Standardized Mortality Ratios (SMR), a very popular form of representation in the field of epidemiology. Such maps show the ratio of the observed death count to an expected death count. There are many ways to define an expected death count, many of which can be externally specified. In the following example, the expected death count \\(E_i\\) is estimated by multiplying the under 5 population count for each area by the overall death rate for Auckland: \\[E_i = {n_i}\\times{mortality_{Auckland} } \\] where \\(n_i\\) is the population count within census unit area \\(i\\) and \\(mortality_{Auckland}\\) is the overall death rate computed from \\(mortality_{Auckland} = \\sum_{i=1}^j O_i / \\sum_{i=1}^j n_i\\) where \\(O_i\\) is the observed death count for census unit \\(i\\). This chunk of code replicates Bailey and Gatrell's figure 8.1 with the one exception that the color scheme is reversed (Bailey and Gatrell assign lighter hues to higher numbers). auck.rate <- sum(mor) / sum(pop) mor.exp <- pop * auck.rate # Expected count over a nine year period auckland$rel.rate <- 100 * mor / mor.exp brks <- classIntervals(auckland$rel.rate, n = 6, style = "fixed", fixedBreaks = c(0, 47, 83, 118, 154, 190, 704)) plot(auckland["rel.rate"], breaks = brks$brks, at = brks$brks, pal = pal1, key.pos = 4, las = 1) Dykes and Unwin's chi-square statistic Dykes and Unwin (Dykes and Unwin 2001) propose a similar technique whereby the rates are standardized following: \\[\\frac{O_i - E_i}{\\sqrt{E_i}} \\] This has the effect of creating a distribution of values closer to normal (as opposed to the Poisson distribution of rates and counts encountered thus far). We can therefore apply a diverging color scheme where green hues represent less than expected rates and red hues represent greater than expected rates. auckland$chi.squ = (mor - mor.exp) / sqrt(mor.exp) brks <- classIntervals(auckland$chi.squ, n = 6, style = "fixed", fixedBreaks = c(-5, -3, -2, -1, 0, 1, 2, 3, 5)) plot(auckland["chi.squ"], breaks = brks$brks, at = brks$brks, pal=rev(pal2), key.pos = 4, las = 1) Unstable ratios One problem with the various techniques used thus far is their sensitivity (hence instability) to small underlying population counts (i.e. unstable ratios). This next chunk of code maps the under 5 population count by census area unit. brks <- classIntervals(auckland$Und5_81, n = 6, style = "equal") plot(auckland["Und5_81"], breaks = brks$brks, at = brks$brks, pal = pal1, key.pos = 4, las = 1) Note the variability in population count, with some areas encompassing fewer than 50 infants. If there is just one death in such a census unit, the death rate would be reported as \\(1/50 * 1000\\) or 20 per thousand infants, far more than the 2.63 per thousand rate for our Auckland data set.
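As a quick sanity check (an illustrative sketch, not part of the original exercise), you can recover the overall Auckland rate cited above and inspect the smallest under 5 population counts directly from the objects computed earlier:
sum(mor) / sum(pop) * 1000    # overall rate, about 2.63 deaths per 1000 per year
head(sort(auckland$Und5_81))  # census units with the smallest under 5 population counts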
Interestingly, the three highest raw rates in Auckland (14.2450142, 18.5185185, 10.5820106 deaths per 1000) are associated with some of the smallest underlying population counts (39, 6, 21 infants under 5). One approach to circumventing this issue is to generate a probability map of the data. Another, highlighted in the next section, is to stabilize the rates using empirical Bayes estimates. Global Empirical Bayes (EB) rate estimate The idea behind the Bayesian approach is to compare the value in some area \\(i\\) to some a priori estimate of the value and to "stabilize" values associated with unstable ratios (e.g. where area populations are small). The a priori estimate can be based on some global mean. An example of a global EB infant mortality rate map is shown below. The EB map is shown side-by-side with the raw rates map for comparison. # Global moment estimator of infant mortality per 1000 per year EB.est <- EBest(auckland$M77_85, auckland$Und5_81 * 9 ) auckland$EBest <- EB.est$estmm * 1000 brks1 <- classIntervals(auckland$EBest, n = 10, style = "quantile") brks2 <- classIntervals(auckland$raw.rate, n = 10, style = "quantile") plot(auckland["EBest"], breaks = brks1$brks, at = round(brks1$brks, 2), pal = pal3, main="EB rates", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, at = round(brks2$brks, 2), pal = pal3, main="Raw Rates", key.pos = 4, las = 1) The census units with the top 10% rates are highlighted in red. Unstable rates (i.e. those associated with smaller population counts) are assigned lower weights to reduce their "prominence" in the mapped data. Notice how the three high raw rates highlighted in the last section are reduced from 14.2450142, 18.5185185, 10.5820106 counts per thousand to 3.6610133, 2.8672132, 3.0283279 counts per thousand. The "remapping" of these values along with others can be seen in the following plot: Local Empirical Bayes (EB) rate estimate The a priori mean and variance need not be aspatial (i.e. the prior distribution need not be the same across the entire Auckland study area). The adjusted estimated rates can be shrunk towards a local mean instead. This technique is referred to as a local empirical Bayes rate estimate. In the following example, we define local as consisting of all first order adjacent census unit areas. nb <- poly2nb(auckland) EBL.est <- EBlocal(auckland$M77_85, 9*auckland$Und5_81, nb) auckland$EBLest <- EBL.est$est * 1000 brks1 <- classIntervals(auckland$EBLest, n = 10, style = "quantile") brks2 <- classIntervals(auckland$raw.rate, n = 10, style = "quantile") plot(auckland["EBLest"], breaks = brks1$brks, at = round(brks1$brks,2), pal = pal3, main = "Local EB rates", key.pos = 4, las = 1) plot(auckland["raw.rate"], breaks = brks2$brks, at = round(brks2$brks,2), pal = pal3, main = "Raw Rates", key.pos = 4, las = 1) The census units with the top 10% rates are highlighted in red.
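The comparison plot referred to above is not accompanied by code in this appendix. A minimal base R sketch of such a comparison (added here for illustration; it assumes the raw.rate and EBLest columns created in this section) could look like the following, where points pulled away from the dashed one-to-one line are rates that were adjusted by the smoothing, typically those computed from small population counts:
# Compare raw rates with locally smoothed EB rates
plot(EBLest ~ raw.rate, data = auckland,
     xlab = "Raw rate (per 1000 per year)",
     ylab = "Local EB rate (per 1000 per year)")
abline(a = 0, b = 1, lty = 2)  # one-to-one reference line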
References "],["raster-operations-in-r.html", "F Raster operations in R Sample files for this exercise Local operations and functions Focal operations and functions Zonal operations and functions Global operations and functions Computing cumulative distances", " F Raster operations in R R terra sf tmap gdistance ggplot2 rasterVis 4.3.1 1.7.55 1.0.14 3.3.3 1.6.4 3.4.3 0.51.5 Sample files for this exercise We’ll first load spatial objects used in this exercise from a remote website: an elevation SpatRaster object, a bathymetry SpatRaster object and a continents sf vector object library(terra) library(sf) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev_world.RDS")) elev <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/bath_world.RDS")) bath <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/continent_global.RDS")) cont <- readRDS(z) Both rasters cover the entire globe. Elevation below mean sea level are encoded as 0 in the elevation raster. Likewise, bathymetry values above mean sea level are encoded as 0. Note that most of the map algebra operations and functions covered in this tutorial are implemented using the terra package. See chapter 10 for a theoretical discussion of map algebra operations. Local operations and functions Unary operations and functions (applied to single rasters) Most algebraic operations can be applied to rasters as they would with any vector element. For example, to convert all bathymetric values in bath (currently recorded as positive values) to negative values simply multiply the raster by -1. bath2 <- bath * (-1) Another unary operation that can be applied to a raster is reclassification. In the following example, we will assign all bath2 values that are less than zero a 1 and all zero values will remain unchanged. A simple way to do this is to apply a conditional statement. bath3 <- bath2 < 0 Let’s look at the output. Note that all 0 pixels are coded as FALSE and all 1 pixels are coded as TRUE. library(tmap) tm_shape(bath3) + tm_raster(palette = "Greys") + tm_legend(outside = TRUE, text.size = .8) If a more elaborate form of reclassification is desired, you can use the classify function. In the following example, the raster object bath is reclassified to 4 unique values: 100, 500, 1000 and 11000 as follows: Original depth values Reclassified values 0 - 100 100 101 - 500 500 501 - 1000 1000 1001 - 11000 11000 The first step is to create a plain matrix where the first and second columns list the starting and ending values of the range of input values that are to be reclassified, and where the third column lists the new raster cell values. m <- c(0, 100, 100, 100, 500, 500, 500, 1000, 1000, 1000, 11000, 11000) m <- matrix(m, ncol=3, byrow = T) m [,1] [,2] [,3] [1,] 0 100 100 [2,] 100 500 500 [3,] 500 1000 1000 [4,] 1000 11000 11000 bath3 <- classify(bath, m, right = T) The right=T parameter indicates that the intervals should be closed to the right (i.e. the second column of the reclassification matrix is inclusive). tm_shape(bath3) + tm_raster(style="cat") + tm_legend(outside = TRUE, text.size = .8) You can also assign NA (missing) values to pixels. For example, to assign NA values to cells that are equal to 100, type bath3[bath3 == 100] <- NA The following chunk of code highlights all NA pixels in grey and labels them as missing. 
tm_shape(bath3) + tm_raster(showNA=TRUE, colorNA="grey") + tm_legend(outside = TRUE, text.size = .8) Binary operations and functions (where two rasters are used) In the following example, elev (elevation raster) is added to bath (bathymetry raster) to create a single elevation raster for the globe. Note that the bathymetric raster will need to be multiplied by -1 to differentiate above mean sea level elevation from below mean sea level depth. elevation <- elev - bath tm_shape(elevation) + tm_raster(palette="-RdBu") + tm_legend(outside = TRUE, text.size = .8) Focal operations and functions Operations or functions applied focally to rasters involve user-defined neighboring cells. Focal operations can be performed using the focal function. For example, to smooth out the elevation raster by computing the mean cell values over an 11 by 11 cell window, type: f1 <- focal(elevation, w = 11 , fun = mean) The w argument defines the focal window. If it's given a single number (as is the case in the above code chunk), that number will define the width and height (in cell counts) of the focal window, with each cell assigned equal weight. w can also be passed a matrix with each element in that matrix defining the weight for each cell. The following code chunk will generate the same output as the previous code chunk: f1 <- focal(elevation, w = matrix(1, nrow = 11, ncol = 11), fun = mean) tm_shape(f1) + tm_raster(palette="-RdBu") + tm_legend(outside = TRUE, text.size = .8) By default, edge cells are assigned a value of NA. This is because cells outside of the input raster extent have no value, so when the average value is computed for a cell along the raster's edge, the kernel will include the NA values outside the raster's extent. To see an example of this, we will first smooth the raster using a 3 by 3 focal window, then we'll zoom in on a 3 by 3 portion of the elevation raster in the upper left-hand corner of its extent. # Run a 3x3 smooth on the raster f_mean <- focal(elevation, w = 3 , fun = mean) Figure F.1: Upper left-hand corner of elevation raster Note the NA values in the upper row (shown in bisque color). You might have noticed the lack of edge effect issues along the western edge of the raster outputs. This is because the focal function will wrap the eastern edge of the raster to the western edge of that same raster if the input raster layer spans the entire globe (i.e. from -180° to +180°). To have the focal function ignore missing values, simply add the na.rm = TRUE option. # Run a 3x3 smooth on the raster f_mean_no_na <- focal(elevation, w = 3 , fun = mean, na.rm = TRUE) Figure F.2: Upper left-hand corner of elevation raster. Border edge ignored. In essence, the above row of values is computed using just 6 values instead of 9 values (the corner values still make use of the across-180° values). Another option is to expand the raster edge beyond its extent by replicating the edge values. This can be done by setting expand to TRUE. For example: # Run a 3x3 smooth on the raster f_mean_expand <- focal(elevation, w = 3, fun = mean, expand = TRUE) Figure F.3: Upper left-hand corner of elevation raster Note that if expand is set to TRUE, the na.rm argument is ignored. But you must be careful in making use of na.rm = TRUE if you are using a matrix to define the weights as opposed to using the fun function.
For example, the mean function can be replicated using the matrix operation as follows: f_mean <- focal(elevation, w = 3, fun = mean) f_mat <- focal(elevation, w = matrix(1/9, nrow = 3, ncol = 3)) Note that if fun is not defined, it will default to summing the weighted pixel values. Figure F.4: Upper left-hand corner of elevation raster Note the similar output. Now, if we set na.rm to TRUE in both functions, we get: f_mean <- focal(elevation, w = 3, fun = mean, na.rm = TRUE) f_mat <- focal(elevation, w = matrix(1/9, nrow = 3, ncol = 3), na.rm = TRUE) Figure F.5: Upper left-hand corner of elevation raster Note the smaller edge values from the matrix-defined weights raster. This is because the matrix assigns a 1/9 weight to each pixel regardless of the number of pixels used to compute the output pixel values. So the upper edge pixels are summing values from just 6 weighted pixels as opposed to nine. For example, the middle top pixel is computed from 1/9(-4113 -4113 -4112 -4107 -4104 -4103), which amounts to dividing the sum of six values by nine, hence the unbalanced weight effect. Note that we do not have that problem using the mean function. The neighbors matrix (or kernel) that defines the moving window can be customized. For example, if we wanted to compute the average of all 8 neighboring cells while excluding the central cell, we could define the matrix as follows: m <- matrix(c(1,1,1,1,0,1,1,1,1)/8, nrow = 3) f2 <- focal(elevation, w=m, fun=sum) More complicated kernels can be defined. In the following example, a Sobel filter (used for edge detection in image processing) is defined and then applied to the raster layer elevation. Sobel <- matrix(c(-1,0,1,-2,0,2,-1,0,1) / 4, nrow=3) f3 <- focal(elevation, w=Sobel, fun=sum) tm_shape(f3) + tm_raster(palette="Greys") + tm_legend(legend.show = FALSE) Zonal operations and functions A common zonal operation is the aggregation of cells. In the following example, raster layer elevation is aggregated by a factor of 2: each output cell spans a 2 by 2 block of input cells whose values are averaged. z1 <- aggregate(elevation, fact=2, fun=mean, expand=TRUE) tm_shape(z1) + tm_raster(palette="-RdBu",n=6) + tm_legend(outside = TRUE, text.size = .8) The image may not look much different from the original, but a look at the image properties will show a difference in pixel sizes. res(elevation) [1] 0.3333333 0.3333333 res(z1) [1] 0.6666667 0.6666667 z1's pixels are twice the size of elevation's pixels (i.e. z1 has half as many rows and columns). You can reverse the process by using the disagg function, which will split a cell into the desired number of subcells while assigning each one the same parent cell value. Zonal operations can often involve two layers, one with the values to be aggregated, the other with the defined zones. In the next example, elevation's cell values are averaged by zones defined by the cont polygon layer. The following chunk computes the mean elevation value for each unique polygon in cont: cont.elev <- extract(elevation, cont, fun=mean, bind = TRUE) The output is a SpatVector. If you want to output a dataframe, set bind to FALSE. cont.elev can be converted back to an sf object as follows: cont.elev.sf <- st_as_sf(cont.elev) The column of interest is automatically named band1. We can now map the average elevation by continent. tm_shape(cont.elev.sf) + tm_polygons(col="band1") + tm_legend(outside = TRUE, text.size = .8) Many custom functions can be applied to extract.
For example, to extract the maximum elevation value by continent, type: cont.elev <- extract(elevation, cont, fun=max, bind = TRUE) As another example, we may wish to extract the number of pixels in each polygon using a customized function. cont.elev <- extract(elevation, cont, fun=function(x,...){length(x)}, bind = TRUE) Global operations and functions Global operations and functions may make use of all input cells of a grid in the computation of an output cell value. An example of a global function is the Euclidean distance function, distance, which computes the shortest distance between a pixel and a source (or destination) location. To demonstrate the distance function, we’ll first create a new raster layer with two non-NA pixels. r1 <- rast(ncols=100, nrows=100, xmin=0, xmax=100, ymin=0, ymax=100) r1[] <- NA # Assign NoData values to all pixels r1[c(850, 5650)] <- 1 # Change the pixels #850 and #5650 to 1 crs(r1) <- "+proj=ortho" # Assign an arbitrary coordinate system (needed for mapping with tmap) tm_shape(r1) + tm_raster(palette="red") + tm_legend(outside = TRUE, text.size = .8) Next, we’ll compute a Euclidean distance raster from these two cells. The output extent will default to the input raster extent. r1.d <- distance(r1) tm_shape(r1.d) + tm_raster(palette = "Greens", style="order", title="Distance") + tm_legend(outside = TRUE, text.size = .8) + tm_shape(r1) + tm_raster(palette="red", title="Points") You can also compute a distance raster using sf point objects. In the following example, distances to points (25,30) and (87,80) are computed for each output cell. However, since we are working off of point objects (and not an existing raster as was the case in the previous example), we will need to create a blank raster layer which will define the extent of the Euclidean distance raster output. r2 <- rast(ncols=100, nrows=100, xmin=0, xmax=100, ymin=0, ymax=100) crs(r2) <- "+proj=ortho" # Assign an arbitrary coordinate system # Create a point layer p1 <- st_as_sf(st_as_sfc("MULTIPOINT(25 30, 87 80)", crs = "+proj=ortho")) Now let’s compute the Euclidean distance to these points using the distance function. r2.d <- distance(r2, p1) |---------|---------|---------|---------| ========================================= Let’s plot the resulting output. tm_shape(r2.d) + tm_raster(palette = "Greens", style="order") + tm_legend(outside = TRUE, text.size = .8) + tm_shape(p1) + tm_bubbles(col="red") Computing cumulative distances This exercise demonstrates how to use functions from the gdistance package to generate a cumulative distance raster. One objective will be to demonstrate the influence “adjacency cells” wields in the final results. Load the gdistance package. library(gdistance) First, we’ll create a 100x100 raster and assign a value of 1 to each cell. The pixel value defines the cost (other than distance) in traversing that pixel. In this example, we’ll assume that the cost is uniform across the entire extent. r <- rast(nrows=100,ncols=100,xmin=0,ymin=0,xmax=100,ymax=100) r[] <- rep(1, ncell(r)) If you were to include traveling costs other than distance (such as elevation) you would assign those values to each cell instead of the constant value of 1. A translation matrix allows one to define a ‘traversing’ cost going from one cell to an adjacent cell. Since we are assuming there are no ‘costs’ (other than distance) in traversing from one cell to any adjacent cell we’ll assign a value of 1, function(x){1}, to the translation between a cell and its adjacent cells (i.e. 
translation cost is uniform in all directions). There are four different ways in which ‘adjacency’ can be defined using the transition function. These are showcased in the next four blocks of code. In this example, adjacency is defined as a four node (vertical and horizontal) connection (i.e. a “rook” move). h4 <- transition(raster(r), transitionFunction = function(x){1}, directions = 4) In this example, adjacency is defined as an eight node connection (i.e. a single cell “queen” move). h8 <- transition(raster(r), transitionFunction = function(x){1}, directions = 8) In this example, adjacency is defined as a sixteen node connection (i.e. a single cell “queen” move combined with a “knight” move). h16 <- transition(raster(r), transitionFunction=function(x){1},16,symm=FALSE) In this example, adjacency is defined as a four node diagonal connection (i.e. a single cell “bishop” move). hb <- transition(raster(r), transitionFunction=function(x){1},"bishop",symm=FALSE) The transition function treats all adjacent cells as being at an equal distance from the source cell across the entire raster. geoCorrection corrects for ‘true’ local distance. In essence, it’s adding an additional cost to traversing from one cell to an adjacent cell (the original cost being defined using the transition function). The importance of applying this correction will be shown later. Note: geoCorrection also corrects for distance distortions associated with data in a geographic coordinate system. To take advantage of this correction, make sure to define the raster layer’s coordinate system using the projection function. h4 <- geoCorrection(h4, scl=FALSE) h8 <- geoCorrection(h8, scl=FALSE) h16 <- geoCorrection(h16, scl=FALSE) hb <- geoCorrection(hb, scl=FALSE) In the “queen’s” case, the diagonal neighbors are \\(\\sqrt{2 x (CellWidth)^{2}}\\) times the cell width distance from the source cell. Next we will map the cumulative distance (accCost) from a central point (A) to all cells in the raster using the four different adjacency definitions. A <- c(50,50) # Location of source cell h4.acc <- accCost(h4,A) h8.acc <- accCost(h8,A) h16.acc <- accCost(h16,A) hb.acc <- accCost(hb,A) If the geoCorrection function had not been applied in the previous steps, the cumulative distance between point location A and its neighboring adjacent cells would have been different. Note the difference in cumulative distance for the 16-direction case as shown in the next two figures. Uncorrected (i.e. geoCorrection not applied to h16): Corrected (i.e. geoCorrection applied to h16): The “bishop” case offers a unique problem: only cells in the diagonal direction are identified as being adjacent. This leaves many undefined cells (labeled as Inf). We will change the Inf cells to NA cells. hb.acc[hb.acc == Inf] <- NA Now let’s compare a 7x7 subset (centered on point A) between the four different cumulative distance rasters. To highlight the differences between all four rasters, we will assign a red color to all cells that are within 20 cell units of point A. It’s obvious that the accuracy of the cumulative distance raster can be greatly influenced by how we define adjacent nodes. The number of red cells (i.e. area identified as being within a 20 units cumulative distance) ranges from 925 to 2749 cells. Working example In the following example, we will generate a raster layer with barriers (defined as NA cell values). 
The goal will be to identify all cells that fall within a 290 km traveling distance from the upper left-hand corner of the raster layer (the green point in the maps). Results between an 8-node and 16-node adjacency definition will be compared. # create an empty raster r <- rast(nrows=300,ncols=150,xmin=0,ymin=0,xmax=150000, ymax=300000) # Define a UTM projection (this sets map units to meters) crs(r) = "+proj=utm +zone=19 +datum=NAD83" # Each cell is assigned a value of 1 r[] <- rep(1, ncell(r)) # Generate 'baffles' by assigning NA to cells. Cells are identified by # their index and not their coordinates. # Baffles need to be 2 cells thick to prevent the 16-node # case from "jumping" a one pixel thick NA cell. a <- c(seq(3001,3100,1),seq(3151,3250,1)) a <- c(a, a+6000, a+12000, a+18000, a+24000, a+30000, a+36000) a <- c(a , a+3050) r[a] <- NA # Let's check that the baffles are properly placed tm_shape(r) + tm_raster(colorNA="red") + tm_legend(legend.show=FALSE) # Next, generate a transition matrix for the 8-node case and the 16-node case h8 <- transition(raster(r), transitionFunction = function(x){1}, directions = 8) h16 <- transition(raster(r), transitionFunction = function(x){1}, directions = 16) # Now assign distance cost to the matrices. h8 <- geoCorrection(h8) h16 <- geoCorrection(h16) # Define a point source and assign a projection A <- SpatialPoints(cbind(50,290000)) crs(A) <- "+proj=utm +zone=19 +datum=NAD83 +units=m +no_defs" # Compute the cumulative cost raster h8.acc <- accCost(h8, A) h16.acc <- accCost(h16,A) # Replace Inf with NA h8.acc[h8.acc == Inf] <- NA h16.acc[h16.acc == Inf] <- NA Let’s plot the results. Yellow cells will identify cumulative distances within 290 km. tm_shape(h8.acc) + tm_raster(n=2, style="fixed", breaks=c(0,290000,Inf)) + tm_facets() + tm_shape(A) + tm_bubbles(col="green", size = .5) + tm_legend(outside = TRUE, text.size = .8) tm_shape(h16.acc) + tm_raster(n=2, style="fixed", breaks=c(0,290000,Inf)) + tm_facets() + tm_shape(A) + tm_bubbles(col="green", size = .5) + tm_legend(outside = TRUE, text.size = .8) We can compute the difference between the 8-node and 16-node cumulative distance rasters: table(h8.acc[] <= 290000) FALSE TRUE 31458 10742 table(h16.acc[] <= 290000) FALSE TRUE 30842 11358 The number of cells identified as being within a 290 km cumulative distance of point A for the 8-node case is 10742 whereas it’s 11358 for the 16-node case, a difference of 5.4%. "],["coordinate-systems-in-r.html", "G Coordinate Systems in R A note about the changes to the PROJ environment Sample files for this exercise Loading the sf package Checking for a coordinate system Understanding the Proj4 coordinate syntax Assigning a coordinate system Transforming coordinate systems A note about containment Creating Tissot indicatrix circles", " G Coordinate Systems in R R terra sf tmap geosphere 4.3.1 1.7.55 1.0.14 3.3.3 1.5.18 A note about the changes to the PROJ environment Newer versions of sf make use of the PROJ 6.0 C library or greater. Note that the version of PROJ is not to be confused with the version of the proj4 R package–the proj4 and sf packages make use of the PROJ C library that is developed independent of R. You can learn more about the PROJ development at proj.org. There has been a significant change in the PROJ library since the introduction of version 6.0. This has had serious implications in the development of the R spatial ecosystem. 
As such, if you are using an older version of sf or proj4 that was developed with a version of PROJ older than 6.0, some of the input/output presented in this appendix may differ from yours. Sample files for this exercise Data used in this exercise can be loaded into your current R session by running the following chunk of code. library(terra) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/elev.RDS")) elev.r <- unwrap(readRDS(z)) z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s.sf <- readRDS(z) We’ll make use of two data layers in this exercise: a Maine counties polygon layer (s.sf) and an elevation raster layer (elev.r). The former is in an sf format and the latter is in a SpatRaster format. Loading the sf package library(sf) Note the versions of GEOS, GDAL and PROJ the package sf is linked to. Different versions of these libraries may result in different outcomes than those shown in this appendix. You can check the linked library versions as follows: sf_extSoftVersion()[1:3] GEOS GDAL proj.4 "3.11.2" "3.7.2" "9.3.0" Checking for a coordinate system To extract coordinate system (CS) information from an sf object use the st_crs function. st_crs(s.sf) Coordinate Reference System: User input: EPSG:26919 wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Engineering survey, topographic mapping."], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]], ID["EPSG",26919]] With the newer version of the PROJ C library, the coordinate system is defined using the Well Known Text (WTK/WTK2) format which consists of a series of [...] tags. The WKT format will usually start with a PROJCRS[...] tag for a projected coordinate system, or a GEOGCRS[...] tag for a geographic coordinate system. The CRS output will also consist of a user defined CS definition which can be an EPSG code (as is the case in this example), or a string defining the datum and projection type. You can also extract CS information from a SpatRaster object use the st_crs function. 
st_crs(elev.r) Coordinate Reference System: User input: BOUNDCRS[ SOURCECRS[ PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]]], TARGETCRS[ GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4326]]], ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84", METHOD["Geocentric translations (geog2D domain)", ID["EPSG",9603]], PARAMETER["X-axis translation",0, ID["EPSG",8605]], PARAMETER["Y-axis translation",0, ID["EPSG",8606]], PARAMETER["Z-axis translation",0, ID["EPSG",8607]]]] wkt: BOUNDCRS[ SOURCECRS[ PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]]], TARGETCRS[ GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4326]]], ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84", METHOD["Geocentric translations (geog2D domain)", ID["EPSG",9603]], PARAMETER["X-axis translation",0, ID["EPSG",8605]], PARAMETER["Y-axis translation",0, ID["EPSG",8606]], PARAMETER["Z-axis translation",0, ID["EPSG",8607]]]] Up until recently, there has been two ways of defining a coordinate system: via the EPSG numeric code or via the PROJ4 formatted string. 
Both can be used with the sf and SpatRast objects. With the newer version of the PROJ C library, you can also define an sf object’s coordinate system using the Well Known Text (WTK/WTK2) format. This format has a more elaborate syntax (as can be seen in the previous outputs) and may not necessarily be the easiest way to manually define a CS. When possible, adopt an EPSG code which comes from a well established authority. However, if customizing a CS, it may be easiest to adopt a PROJ4 syntax. Understanding the Proj4 coordinate syntax The PROJ4 syntax consists of a list of parameters, each prefixed with the + character. For example, elev.r’s CS is in a UTM projection (+proj=utm) for zone 19 (+zone=19) and in an NAD 1983 datum (+datum=NAD83). A list of a few of the PROJ4 parameters used in defining a coordinate system follows. Click here for a full list of parameters. +a Semimajor radius of the ellipsoid axis +b Semiminor radius of the ellipsoid axis +datum Datum name +ellps Ellipsoid name +lat_0 Latitude of origin +lat_1 Latitude of first standard parallel +lat_2 Latitude of second standard parallel +lat_ts Latitude of true scale +lon_0 Central meridian +over Allow longitude output outside -180 to 180 range, disables wrapping +proj Projection name +south Denotes southern hemisphere UTM zone +units meters, US survey feet, etc. +x_0 False easting +y_0 False northing +zone UTM zone You can view the list of available projections +proj= here. Assigning a coordinate system A coordinate system definition can be passed to a spatial object. It can either fill a spatial object’s empty CS definition or it can overwrite its existing CS definition (the latter should only be executed if there is good reason to believe that the original definition is erroneous). Note that this step does not change an object’s underlying coordinate values (this process will be discussed in the next section). We’ll pretend that a CS definition was not assigned to s.sf and assign one manually using the st_set_crs() function. In the following example, we will define the CS using the proj4 syntax. s.sf <- st_set_crs(s.sf, "+proj=utm +zone=19 +ellps=GRS80 +datum=NAD83") Let’s now check the object’s CS. st_crs(s.sf) Coordinate Reference System: User input: +proj=utm +zone=19 +ellps=GRS80 +datum=NAD83 wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] You’ll note that the User input: field now shows the proj4 string as defined in our call to the st_set_crs() function. But you’ll also note that some of the parameters in the WKT string such as the PROJCRS[...] and BASEGEOGCRS[...] tags are not defined (unknown). 
This is not necessarily a problem given that key datum and projection information are present in that WKT string (make sure to scroll down in the output box to see the other WKT parameters). Nonetheless, it’s not a bad idea to define the CS using EPSG code when one is available. We’ll do this next. The UTM NAD83 Zone 19N EPSG code equivalent is 26919. s.sf <- st_set_crs(s.sf, 26919) Let’s now check the object’s CS. st_crs(s.sf) Coordinate Reference System: User input: EPSG:26919 wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1]], USAGE[ SCOPE["Engineering survey, topographic mapping."], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]], ID["EPSG",26919]] Key projection parameters remain the same. But additional information is added to the WKT header. You can use the PROJ4 string defined earlier for s.sf to define a raster’s CRS using the crs() function as follows (here too we’ll assume that the spatial object had a missing reference system or an incorrectly defined one). crs(elev.r) <- "+proj=utm +zone=19 +ellps=GRS80 +datum=NAD83" Note that we do not need to define all of the parameters so long as we know that the default values for these undefined parameters are correct. Also note that we do not need to designate a hemisphere since the NAD83 datum applies only to North America. 
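To convince yourself that (re)assigning a coordinate system only changes the metadata and not the stored coordinate values, you can compare bounding boxes before and after an assignment. The following is a small illustrative sketch (not part of the original exercise); note that sf will warn that replacing a CRS does not reproject the data, which is precisely the point:
bb1 <- st_bbox(s.sf)                    # bounding box under the current definition
bb2 <- st_bbox(st_set_crs(s.sf, 4326))  # deliberately assign a different CS (WGS 84)
rbind(bb1, bb2)                         # identical coordinate values; only the CS label changed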
Let’s check the raster’s CS: st_crs(elev.r) Coordinate Reference System: User input: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]], ID["EPSG",6269]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] To define a raster’s CS using an EPSG code, use the following PROJ4 syntax: crs(elev.r) <- "+init=EPSG:26919" st_crs(elev.r) Coordinate Reference System: User input: NAD83 / UTM zone 19N wkt: PROJCRS["NAD83 / UTM zone 19N", BASEGEOGCRS["NAD83", DATUM["North American Datum 1983", ELLIPSOID["GRS 1980",6378137,298.257222101, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], ID["EPSG",4269]], CONVERSION["UTM zone 19N", METHOD["Transverse Mercator", ID["EPSG",9807]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",-69, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["Scale factor at natural origin",0.9996, SCALEUNIT["unity",1], ID["EPSG",8805]], PARAMETER["False easting",500000, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]], ID["EPSG",16019]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]], USAGE[ SCOPE["unknown"], AREA["North America - between 72°W and 66°W - onshore and offshore. Canada - Labrador; New Brunswick; Nova Scotia; Nunavut; Quebec. Puerto Rico. 
United States (USA) - Connecticut; Maine; Massachusetts; New Hampshire; New York (Long Island); Rhode Island; Vermont."], BBOX[14.92,-72,84,-66]]] To recreate a CS defined in a software such as ArcGIS, it is best to extract the CS’ WKID/EPSG code, then use that number to look up the PROJ4 syntax on http://spatialreference.org/ref/. For example, in ArcGIS, the WKID number can be extracted from the coordinate system properties output. Figure G.1: An ArcGIS dataframe coordinate system properties window. Note the WKID/EPSG code of 26919 (highlighted in red) associated with the NAD 1983 UTM Zone 19 N CS. That number can then be entered in the http://spatialreference.org/ref/’s search box to pull the Proj4 parameters (note that you must select Proj4 from the list of syntax options). Figure G.2: Example of a search result for EPSG 26919 at http://spatialreference.org/ref/. Note that after clicking the EPSG:269191 link, you must then select the Proj4 syntax from a list of available syntaxes to view the projection parameters Here are examples of a few common projections: Projection WKID Authority Syntax UTM NAD 83 Zone 19N 26919 EPSG +proj=utm +zone=19 +ellps=GRS80 +datum=NAD83 +units=m +no_defs USA Contiguous albers equal area 102003 ESRI +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs Alaska albers equal area 3338 EPSG +proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs World Robinson 54030 ESRI +proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs Transforming coordinate systems The last section showed you how to define or modify the coordinate system definition. This section shows you how to transform the coordinate values associated with the spatial object to a different coordinate system. This process calculates new coordinate values for the points or vertices defining the spatial object. For example, to transform the s.sf vector object to a WGS 1984 geographic (long/lat) coordinate system, we’ll use the st_transform function. 
s.sf.gcs <- st_transform(s.sf, "+proj=longlat +datum=WGS84") st_crs(s.sf.gcs) Coordinate Reference System: User input: +proj=longlat +datum=WGS84 wkt: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] Using the EPSG code equivalent (4326) instead of the proj4 string yields: s.sf.gcs <- st_transform(s.sf, 4326) st_crs(s.sf.gcs) Coordinate Reference System: User input: EPSG:4326 wkt: GEOGCRS["WGS 84", ENSEMBLE["World Geodetic System 1984 ensemble", MEMBER["World Geodetic System 1984 (Transit)"], MEMBER["World Geodetic System 1984 (G730)"], MEMBER["World Geodetic System 1984 (G873)"], MEMBER["World Geodetic System 1984 (G1150)"], MEMBER["World Geodetic System 1984 (G1674)"], MEMBER["World Geodetic System 1984 (G1762)"], MEMBER["World Geodetic System 1984 (G2139)"], ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ENSEMBLEACCURACY[2.0]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], USAGE[ SCOPE["Horizontal component of 3D system."], AREA["World."], BBOX[-90,-180,90,180]], ID["EPSG",4326]] This approach may add a few more tags (These reflect changes in datum definitions in newer versions of the PROJ library) but, the coordinate values should be the same To transform a raster object, use the project() function. elev.r.gcs <- project(elev.r, y="+proj=longlat +datum=WGS84") st_crs(elev.r.gcs) Coordinate Reference System: User input: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] wkt: GEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]]] If an EPSG code is to be used, adopt the \"+init=EPSG: ...\" syntax used earlier in this tutorial. 
elev.r.gcs <- project(elev.r, y="+init=EPSG:4326") st_crs(elev.r.gcs) Coordinate Reference System: User input: WGS 84 wkt: GEOGCRS["WGS 84", ENSEMBLE["World Geodetic System 1984 ensemble", MEMBER["World Geodetic System 1984 (Transit)", ID["EPSG",1166]], MEMBER["World Geodetic System 1984 (G730)", ID["EPSG",1152]], MEMBER["World Geodetic System 1984 (G873)", ID["EPSG",1153]], MEMBER["World Geodetic System 1984 (G1150)", ID["EPSG",1154]], MEMBER["World Geodetic System 1984 (G1674)", ID["EPSG",1155]], MEMBER["World Geodetic System 1984 (G1762)", ID["EPSG",1156]], MEMBER["World Geodetic System 1984 (G2139)", ID["EPSG",1309]], ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1], ID["EPSG",7030]], ENSEMBLEACCURACY[2.0], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]], CS[ellipsoidal,2], AXIS["longitude",east, ORDER[1], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], AXIS["latitude",north, ORDER[2], ANGLEUNIT["degree",0.0174532925199433, ID["EPSG",9122]]], USAGE[ SCOPE["unknown"], AREA["World."], BBOX[-90,-180,90,180]]] A geographic coordinate system is often desired when overlapping a layer with a web based mapping service such as Google, Bing or OpenStreetMap (even though these web based services end up projecting to a projected coordinate system–most likely a Web Mercator projection). To check that s.sf.gcs was properly transformed, we’ll overlay it on top of an OpenStreetMap using the leaflet package. library(leaflet) leaflet(s.sf.gcs) %>% addPolygons() %>% addTiles() Next, we’ll explore other transformations using a tmap dataset of the world library(tmap) data(World) # The dataset is stored as an sf object # Let's check its current coordinate system st_crs(World) Coordinate Reference System: User input: EPSG:4326 wkt: GEOGCRS["WGS 84", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433]], CS[ellipsoidal,2], AXIS["geodetic latitude (Lat)",north, ORDER[1], ANGLEUNIT["degree",0.0174532925199433]], AXIS["geodetic longitude (Lon)",east, ORDER[2], ANGLEUNIT["degree",0.0174532925199433]], USAGE[ SCOPE["unknown"], AREA["World"], BBOX[-90,-180,90,180]], ID["EPSG",4326]] The following chunk transforms the world map to a custom azimuthal equidistant projection centered on latitude 0 and longitude 0. Here, we’ll use the proj4 syntax. 
World.ae <- st_transform(World, "+proj=aeqd +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") Let’s check the CRS of the newly created vector layer st_crs(World.ae) Coordinate Reference System: User input: +proj=aeqd +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs wkt: PROJCRS["unknown", BASEGEOGCRS["unknown", DATUM["World Geodetic System 1984", ELLIPSOID["WGS 84",6378137,298.257223563, LENGTHUNIT["metre",1]], ID["EPSG",6326]], PRIMEM["Greenwich",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8901]]], CONVERSION["unknown", METHOD["Modified Azimuthal Equidistant", ID["EPSG",9832]], PARAMETER["Latitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8801]], PARAMETER["Longitude of natural origin",0, ANGLEUNIT["degree",0.0174532925199433], ID["EPSG",8802]], PARAMETER["False easting",0, LENGTHUNIT["metre",1], ID["EPSG",8806]], PARAMETER["False northing",0, LENGTHUNIT["metre",1], ID["EPSG",8807]]], CS[Cartesian,2], AXIS["(E)",east, ORDER[1], LENGTHUNIT["metre",1, ID["EPSG",9001]]], AXIS["(N)",north, ORDER[2], LENGTHUNIT["metre",1, ID["EPSG",9001]]]] Here’s the mapped output: tm_shape(World.ae) + tm_fill() The following chunk transforms the world map to an Azimuthal equidistant projection centered on Maine, USA (69.8° West, 44.5° North) . World.aemaine <- st_transform(World, "+proj=aeqd +lat_0=44.5 +lon_0=-69.8 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.aemaine) + tm_fill() The following chunk transforms the world map to a World Robinson projection. World.robin <- st_transform(World,"+proj=robin +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.robin) + tm_fill() The following chunk transforms the world map to a World sinusoidal projection. World.sin <- st_transform(World,"+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.sin) + tm_fill() The following chunk transforms the world map to a World Mercator projection. World.mercator <- st_transform(World,"+proj=merc +lon_0=0 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.mercator) + tm_fill() Reprojecting to a new meridian center An issue that can come up when transforming spatial data is when the location of the tangent line(s) or points in the CS definition forces polygon features to be split across the 180° meridian. For example, re-centering the Mercator projection to -69° will create the following output. World.mercator2 <- st_transform(World, "+proj=merc +lon_0=-69 +k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") tm_shape(World.mercator2) + tm_borders() The polygons are split and R does not know how to piece them together. One solution is to split the polygons at the new meridian using the st_break_antimeridian function before projecting to a new re-centered coordinate system. # Define new meridian meridian2 <- -69 # Split world at new meridian wld.new <- st_break_antimeridian(World, lon_0 = meridian2) # Now reproject to Mercator using new meridian center wld.merc2 <- st_transform(wld.new, paste("+proj=merc +lon_0=", meridian2 , "+k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") ) tm_shape(wld.merc2) + tm_borders() This technique can be applied to any other projections. Here’s an example of a Robinson projection. 
wld.rob.sf <- st_transform(wld.new, paste("+proj=robin +lon_0=", meridian2 , "+k=1 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs") ) tm_shape(wld.rob.sf) + tm_borders() A note about containment While in theory, a point completely enclosed by a bounded area should always remain bounded by that area in any projection, this is not always the case in practice. This is because the transformation applies to the vertices that define the line segments and not the lines themselves. So if a point is inside of a polygon and very close to one of its boundaries in its native projection, it may find itself on the other side of that line segment in another projection hence outside of that polygon. In the following example, a polygon layer and point layer are created in a Miller coordinate system where the points are enclosed in the polygons. # Define a few projections miller <- "+proj=mill +lat_0=0 +lon_0=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs" lambert <- "+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs" # Subset the World data layer and reproject to Miller wld.mil <- subset(World, iso_a3 == "CAN" | iso_a3 == "USA") |> st_transform(miller) # Create polygon and point layers in the Miller projection sf1 <- st_sfc( st_polygon(list(cbind(c(-13340256,-13340256,-6661069, -6661069, -13340256), c(7713751, 5326023, 5326023,7713751, 7713751 )))), crs = miller) pt1 <- st_sfc( st_multipoint(rbind(c(-11688500,7633570), c(-11688500,5375780), c(-10018800,7633570), c(-10018800,5375780), c(-8348960,7633570), c(-8348960,5375780))), crs = miller) pt1 <- st_cast(pt1, "POINT") # Create single part points # Plot the data layers in their native projection tm_shape(wld.mil) +tm_fill(col="grey") + tm_graticules(x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf1) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1) + tm_dots(size=0.2) The points are close to the boundaries, but they are inside of the polygon nonetheless. To confirm, we can run st_contains on the dataset: st_contains(sf1, pt1) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 2, 3, 4, 5, 6 All six points are selected, as expected. Now, let’s reproject the data into a Lambert conformal projection. # Transform the data wld.lam <- st_transform(wld.mil, lambert) pt1.lam <- st_transform(pt1, lambert) sf1.lam <- st_transform(sf1, lambert) # Plot the data in the Lambert coordinate system tm_shape(wld.lam) +tm_fill(col="grey") + tm_graticules( x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf1.lam) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1.lam) + tm_dots(size=0.2) Only three of the points are contained. We can confirm this using the st_contains function: st_contains(sf1.lam, pt1.lam) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 3, 5 To resolve this problem, one should densify the polygon by adding more vertices along the line segment. The vertices density will be dictated by the resolution needed to preserve the map’s containment properties and is best determined experimentally. We’ll use the st_segmentize function to create vertices at 1 km (1000 m) intervals. 
# Add vertices every 1000 meters along the polygon's line segments sf2 <- st_segmentize(sf1, 1000) # Transform the newly densified polygon layer sf2.lam <- st_transform(sf2, lambert) # Plot the data tm_shape(wld.lam) + tm_fill(col="grey") + tm_graticules( x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey90") + tm_shape(sf2.lam) + tm_polygons("red", alpha = 0.5, border.col = "yellow") + tm_shape(pt1.lam) + tm_dots(size=0.2) Now all points remain contained by the polygon. We can check via: st_contains(sf2.lam, pt1.lam) Sparse geometry binary predicate list of length 1, where the predicate was `contains' 1: 1, 2, 3, 4, 5, 6 Creating Tissot indicatrix circles Most projections will distort some aspect of a spatial property, especially area and shape. A nice way to visualize the distortion afforded by a projection is to create geodesic circles. First, create a point layer that will define the circle centers in a lat/long coordinate system. tissot.pt <- st_sfc( st_multipoint(rbind(c(-60,30), c(-60,45), c(-60,60), c(-80,30), c(-80,45), c(-80,60), c(-100,30), c(-100,45), c(-100,60), c(-120,30), c(-120,45), c(-120,60) )), crs = "+proj=longlat") tissot.pt <- st_cast(tissot.pt, "POINT") # Create single part points Next we'll construct geodesic circles from these points using the geosphere package. library(geosphere) cr.pt <- list() # Create an empty list # Loop through each point in tissot.pt and generate 360 vertices at 300 km # from each point in all directions at 1 degree increment. These vertices # will be used to approximate the Tissot circles for (i in 1:length(tissot.pt)){ cr.pt[[i]] <- list( destPoint( as(tissot.pt[i], "Spatial"), b=seq(0,360,1), d=300000) ) } # Create a closed polygon from the previously generated vertices tissot.sfc <- st_cast( st_sfc(st_multipolygon(cr.pt ),crs = "+proj=longlat"), "POLYGON" ) We'll check that these are indeed geodesic circles by computing the geodesic area of each polygon. We'll use the st_area function from sf which will revert to geodesic area calculation if a lat/long coordinate system is present. tissot.sf <- st_sf( geoArea = st_area(tissot.sfc), tissot.sfc ) The true area of the circles should be \\(\\pi * r^2\\), or about \\(2.83 \\times 10^{11}\\) square meters in our example. Let's compute the error in the Tissot output. The values will be reported as fractions. ( (pi * 300000^2) - as.vector(tissot.sf$geoArea) ) / (pi * 300000^2) [1] -0.0008937164 0.0024530577 0.0057943110 -0.0008937164 [5] 0.0024530577 0.0057943110 -0.0008937164 0.0024530577 [9] 0.0057943110 -0.0008937164 0.0024530577 0.0057943110 In all cases, the error is less than 0.6% (recall that the values are fractions, so the largest, 0.0058, corresponds to about 0.58%). The error is primarily due to the discretization of the circle perimeter. Let's now take a look at the distortions associated with a few popular coordinate systems. We'll start by exploring the Mercator projection. 
# Transform geodesic circles and compute area error as a percentage tissot.merc <- st_transform(tissot.sf, "+proj=merc +ellps=WGS84") tissot.merc$area_err <- round( (st_area(tissot.merc) - tissot.merc$geoArea) / tissot.merc$geoArea * 100 , 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.merc), projection = st_crs(tissot.merc)) + tm_borders() + tm_shape(tissot.merc) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x = c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") The Mercator projection does a good job at preserving shape, but the area's distortion increases dramatically poleward. Next, we'll explore the Lambert azimuthal equal area projection centered at 45 degrees north and 100 degrees west. # Transform geodesic circles and compute area error as a percentage tissot.laea <- st_transform(tissot.sf, "+proj=laea +lat_0=45 +lon_0=-100 +ellps=WGS84") tissot.laea$area_err <- round( (st_area(tissot.laea ) - tissot.laea$geoArea) / tissot.laea$geoArea * 100, 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.laea), projection = st_crs(tissot.laea)) + tm_borders() + tm_shape(tissot.laea) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x=c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") The area error across the 48 states is near 0. But note that the shape does become slightly distorted as we move away from the center of projection. Next, we'll explore the Robinson projection. # Transform geodesic circles and compute area error as a percentage tissot.robin <- st_transform(tissot.sf, "+proj=robin +ellps=WGS84") tissot.robin$area_err <- round( (st_area(tissot.robin ) - tissot.robin$geoArea) / tissot.robin$geoArea * 100, 2) # Plot the map tm_shape(World, bbox = st_bbox(tissot.robin), projection = st_crs(tissot.robin)) + tm_borders() + tm_shape(tissot.robin) + tm_polygons(col="grey", border.col = "red", alpha = 0.3) + tm_graticules(x=c(-60,-80,-100, -120, -140), y = c(30,45, 60), labels.col = "white", col="grey80") + tm_text("area_err", size=.8, alpha=0.8, col="blue") Both shape and area are measurably distorted for the North American continent. "],["point-pattern-analysis-in-r.html", "H Point pattern analysis in R Sample files for this exercise Prepping the data Density based analysis Distance based analysis Hypothesis tests", " H Point pattern analysis in R R spatstat 4.3.1 3.0.7 For a basic theoretical treatise on point pattern analysis (PPA) the reader is encouraged to review the point pattern analysis lecture notes. This section is intended to supplement the lecture notes by implementing PPA techniques in the R programming environment. Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. load(url("https://github.com/mgimond/Spatial/raw/main/Data/ppa.RData")) The data objects consist of three spatial data layers: starbucks: A ppp point layer of Starbucks stores in Massachusetts; ma: An owin polygon layer of Massachusetts boundaries; pop: An im raster layer of population density distribution. All layers are in a format supported by the spatstat (Baddeley, Rubak, and Turner 2016) package. Note that these layers are not authoritative and are to be used for instructional purposes only. 
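Before prepping the layers, a quick sanity check (a sketch, not part of the original workflow) confirms that the three objects were loaded in the expected spatstat formats:
class(starbucks)  # should include "ppp"
class(ma)         # should include "owin"
class(pop)        # should include "im"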
Prepping the data All point pattern analysis tools used in this tutorial are available in the spatstat package. These tools are designed to work with points stored as ppp objects and not SpatialPointsDataFrame or sf objects. Note that a ppp object may or may not have attribute information (also referred to as marks). Knowing whether or not a function requires that an attribute table be present in the ppp object matters if the operation is to complete successfully. In this tutorial we will only concern ourselves with the pattern generated by the points and not their attributes. We’ll therefore remove all marks from the point object. library(spatstat) marks(starbucks) <- NULL Many point pattern analyses such as the average nearest neighbor analysis should have their study boundaries explicitly defined. This can be done in spatstat by “binding” the Massachusetts boundary polygon to the Starbucks point feature object using the Window() function. Note that the function name starts with an upper case W. Window(starbucks) <- ma We can plot the point layer to ensure that the boundary is properly defined for that layer. plot(starbucks, main=NULL, cols=rgb(0,0,0,.2), pch=20) We’ll make another change to the dataset. Population density values for an administrative layer are usually quite skewed. The population density for Massachusetts is no exception. The following code chunk generates a histogram from the pop raster layer. hist(pop, main=NULL, las=1) Transforming the skewed distribution in the population density covariate may help reveal relationships between point distributions and the covariate in some of the point pattern analyses covered later in this tutorial. We’ll therefore create a log-transformed version of pop. pop.lg <- log(pop) hist(pop.lg, main=NULL, las=1) We’ll be making use of both expressions of the population density distribution in the following exercises. Density based analysis Quadrat density You can compute the quadrat count and intensity using spatstat’s quadratcount() and intensity() functions. The following code chunk divides the state of Massachusetts into a grid of 3 rows and 6 columns then tallies the number of points falling in each quadrat. Q <- quadratcount(starbucks, nx= 6, ny=3) The object Q stores the number of points inside each quadrat. You can plot the quadrats along with the counts as follows: plot(starbucks, pch=20, cols="grey70", main=NULL) # Plot points plot(Q, add=TRUE) # Add quadrat grid You can compute the density of points within each quadrat as follows: # Compute the density for each quadrat Q.d <- intensity(Q) # Plot the density plot(intensity(Q, image=TRUE), main=NULL, las=1) # Plot density raster plot(starbucks, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) # Add points The density values are reported as the number of points (stores) per square meters, per quadrat. The Length dimension unit is extracted from the coordinate system associated with the point layer. In this example, the length unit is in meters, so the density is reported as points per square meter. Such a small length unit is not practical at this scale of analysis. It’s therefore desirable to rescale the spatial objects to a larger length unit such as the kilometer. starbucks.km <- rescale(starbucks, 1000, "km") ma.km <- rescale(ma, 1000, "km") pop.km <- rescale(pop, 1000, "km") pop.lg.km <- rescale(pop.lg, 1000, "km") The second argument to the rescale function divides the current unit (meter) to get the new unit (kilometer). This gives us more sensible density values to work with. 
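Before recomputing the quadrat densities, a quick check (again a sketch rather than part of the original workflow) confirms that the rescaled point layer now carries kilometer units:
unitname(starbucks.km)             # should report the new unit, km
diff(starbucks.km$window$xrange)   # east-west extent of the study window, now in km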
# Compute the density for each quadrat (in counts per km2) Q <- quadratcount(starbucks.km, nx= 6, ny=3) Q.d <- intensity(Q) # Plot the density plot(intensity(Q, image=TRUE), main=NULL, las=1) # Plot density raster plot(starbucks.km, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) # Add points Quadrat density on a tessellated surface We can use a covariate such as the population density raster to define non-uniform quadrats. We’ll first divide the population density covariate into four regions (aka tessellated surfaces) following an equal interval classification scheme. Recall that we are working with the log transformed population density values. The breaks will be defined as follows: Break Logged population density value 1 ] -Inf; 4 ] 2 ] 4 ; 6 ] 3 ] 3 ; 8 ] 4 ] 8 ; Inf ] brk <- c( -Inf, 4, 6, 8 , Inf) # Define the breaks Zcut <- cut(pop.lg.km, breaks=brk, labels=1:4) # Classify the raster E <- tess(image=Zcut) # Create a tesselated surface The tessellated object can be mapped to view the spatial distribution of quadrats. plot(E, main="", las=1) Next, we’ll tally the quadrat counts within each tessellated area then compute their density values (number of points per quadrat area). Q <- quadratcount(starbucks.km, tess = E) # Tally counts Q.d <- intensity(Q) # Compute density Q.d tile 1 2 3 4 0.0000000000 0.0003706106 0.0103132964 0.0889370933 Recall that the length unit is kilometer so the above density values are number of points per square kilometer within each quadrat unit. Plot the density values across each tessellated region. plot(intensity(Q, image=TRUE), las=1, main=NULL) plot(starbucks.km, pch=20, cex=0.6, col=rgb(1,1,1,.5), add=TRUE) Let’s modify the color scheme. cl <- interp.colours(c("lightyellow", "orange" ,"red"), E$n) plot( intensity(Q, image=TRUE), las=1, col=cl, main=NULL) plot(starbucks.km, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=TRUE) Kernel density raster The spatstat package has a function called density which computes an isotropic kernel intensity estimate of the point pattern. Its bandwidth defines the kernel’s window extent. This next code chunk uses the default bandwidth. K1 <- density(starbucks.km) # Using the default bandwidth plot(K1, main=NULL, las=1) contour(K1, add=TRUE) In this next chunk, a 50 km bandwidth (sigma = 50) is used. Note that the length unit is extracted from the point layer’s mapping units (which was rescaled to kilometers earlier in this exercise). K2 <- density(starbucks.km, sigma=50) # Using a 50km bandwidth plot(K2, main=NULL, las=1) contour(K2, add=TRUE) The kernel defaults to a gaussian smoothing function. The smoothing function can be changed to a quartic, disc or epanechnikov function. For example, to change the kernel to a disc function type: K3 <- density(starbucks.km, kernel = "disc", sigma=50) # Using a 50km bandwidth plot(K3, main=NULL, las=1) contour(K3, add=TRUE) Kernel density adjusted for covariate In the following example, a Starbucks store point process’ intensity is estimated following the population density raster covariate. The outputs include a plot of \\(\\rho\\) vs. population density and a raster map of \\(\\rho\\) controlled for population density. # Compute rho using the ratio method rho <- rhohat(starbucks.km, pop.lg.km, method="ratio") # Generate rho vs covariate plot plot(rho, las=1, main=NULL, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) It’s important to note that we are not fitting a parametric model to the data. Instead, a non-parametric curve is fit to the data. 
Its purpose is to describe/explore the shape of the relationship between point density and covariate. Note the exponentially increasing intensity of Starbucks stores with increasing population density values when the population density is expressed as a log. The grey envelope represents the 95% confidence interval. The following code chunk generates the map of the predicted Starbucks density if population density were the sole driving process. (Note the use of the gamma parameter to “stretch” the color scheme in the map). pred <- predict(rho) cl <- interp.colours(c("lightyellow", "orange" ,"red"), 100) # Create color scheme plot(pred, col=cl, las=1, main=NULL, gamma = 0.25) The predicted intensity’s spatial pattern mirrors the covariate’s population distribution pattern. The predicted intensity values range from 0 to about 5 stores per square kilometer. You’ll note that this maximum value does not match the maximum value of ~3 shown in the rho vs population density plot. This is because the plot did not show the full range of population density values (the max density value shown was 10). The population raster layer has a maximum pixel value of 11.03 (this value can be extracted via max(pop.lg.km)). We can compare the output of the predicted Starbucks stores intensity function to that of the observed Starbucks stores intensity function. We’ll use the variable K1 computed earlier to represent the observed intensity function. K1_vs_pred <- pairs(K1, pred, plot = FALSE) plot(K1_vs_pred$pred ~ K1_vs_pred$K1, pch=20, xlab = "Observed intensity", ylab = "Predicted intensity", col = rgb(0,0,0,0.1)) If the modeled intensity was comparable to the observed intensity, we would expect the points to cluster along a one-to-one diagonal. An extreme example is to compare the observed intensity with itself which offers a perfect match of intensity values. K1_vs_K1 <- pairs(K1, K1, labels = c("K1a", "K1b"), plot = FALSE) plot(K1_vs_K1$K1a ~ K1_vs_K1$K1b, pch=20, xlab = "Observed intensity", ylab = "Observed intensity") So going back to our predicted vs observed intensity plot, we note a strong skew in the predicted intensity values. We also note an overestimation of intensity around higher values. summary(as.data.frame(K1_vs_pred)) K1 pred Min. :8.846e-05 Min. :0.000000 1st Qu.:1.207e-03 1st Qu.:0.000282 Median :3.377e-03 Median :0.001541 Mean :8.473e-03 Mean :0.007821 3rd Qu.:1.078e-02 3rd Qu.:0.005904 Max. :5.693e-02 Max. :5.105112 The predicted maximum intensity value is two orders of magnitude greater than that observed. The overestimation of intenstity values can also be observed at lower values. The following plot limits the data to observed intensities less than 0.04. A red one-to-one line is added for reference. If intensities were similar, they would aggregate around this line. plot(K1_vs_pred$pred ~ K1_vs_pred$K1, pch=20, xlab = "Observed intensity", ylab = "Predicted intensity", col = rgb(0,0,0,0.1), xlim = c(0, 0.04), ylim = c(0, 0.1)) abline(a=0, b = 1, col = "red") Modeling intensity as a function of a covariate The relationship between the predicted Starbucks store point pattern intensity and the population density distribution can be modeled following a Poisson point process model. We’ll generate the Poisson point process model then plot the results. 
# Create the Poisson point process model PPM1 <- ppm(starbucks.km ~ pop.lg.km) # Plot the relationship plot(effectfun(PPM1, "pop.lg.km", se.fit=TRUE), main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) Note that this is not the same relationship as \\(\\rho\\) vs. population density shown in the previous section. Here, we’re fitting a well defined model to the data whose parameters can be extracted from the PPM1 object. PPM1 Nonstationary Poisson process Fitted to point pattern dataset 'starbucks.km' Log intensity: ~pop.lg.km Fitted trend coefficients: (Intercept) pop.lg.km -13.710551 1.279928 Estimate S.E. CI95.lo CI95.hi Ztest Zval (Intercept) -13.710551 0.46745489 -14.626746 -12.794356 *** -29.33021 pop.lg.km 1.279928 0.05626785 1.169645 1.390211 *** 22.74705 Problem: Values of the covariate 'pop.lg.km' were NA or undefined at 0.57% (4 out of 699) of the quadrature points The model takes on the form: \\[ \\lambda(i) = e^{-13.71 + 1.27(logged\\ population\\ density)} \\] Here, the base intensity is close to zero (\\(e^{-13.71}\\)) when the logged population density is zero and for every increase in one unit of the logged population density, the Starbucks point density increases by a factor of \\(e^{1.27}\\) units. Distance based analysis Next, we’ll explore three different distance based analyses: The average nearest neighbor, the \\(K\\) and \\(L\\) functions and the pair correlation function \\(g\\). Average nearest neighbor analysis Next, we’ll compute the average nearest neighbor (ANN) distances between Starbucks stores. To compute the average first nearest neighbor distance (in kilometers) set k=1: mean(nndist(starbucks.km, k=1)) [1] 3.275492 To compute the average second nearest neighbor distance set k=2: mean(nndist(starbucks.km, k=2)) [1] 5.81173 The parameter k can take on any order neighbor (up to n-1 where n is the total number of points). The average nearest neighbor function can be expended to generate an ANN vs neighbor order plot. In the following example, we’ll plot ANN as a function of neighbor order for the first 100 closest neighbors: ANN <- apply(nndist(starbucks.km, k=1:100),2,FUN=mean) plot(ANN ~ eval(1:100), type="b", main=NULL, las=1) The bottom axis shows the neighbor order number and the left axis shows the average distance in kilometers. K and L functions To compute the K function, type: K <- Kest(starbucks.km) plot(K, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) The plot returns different estimates of \\(K\\) depending on the edge correction chosen. By default, the isotropic, translate and border corrections are implemented. To learn more about these edge correction methods type ?Kest at the command line. The estimated \\(K\\) functions are listed with a hat ^. The black line (\\(K_{pois}\\)) represents the theoretical \\(K\\) function under the null hypothesis that the points are completely randomly distributed (CSR/IRP). Where \\(K\\) falls under the theoretical \\(K_{pois}\\) line the points are deemed more dispersed than expected at distance \\(r\\). Where \\(K\\) falls above the theoretical \\(K_{pois}\\) line the points are deemed more clustered than expected at distance \\(r\\). To compute the L function, type: L <- Lest(starbucks.km, main=NULL) plot(L, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) To plot the L function with the Lexpected line set horizontal: plot(L, . 
-r ~ r, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) Pair correlation function g To compute the pair correlation function type: g <- pcf(starbucks.km) plot(g, main=NULL, las=1, legendargs=list(cex=0.8, xpd=TRUE, inset=c(1.01, 0) )) As with the Kest and Lest functions, the pcf function outputs different estimates of \\(g\\) using different edge correction methods (Ripley and Translate). The theoretical \\(g\\)-function \\(g_{Pois}\\) under a CSR process (green dashed line) is also displayed for comparison. Where the observed \\(g\\) is greater than \\(g_{Pois}\\) we can expect more clustering than expected and where the observed \\(g\\) is less than \\(g_{Pois}\\) we can expect more dispersion than expected. Hypothesis tests Test for clustering/dispersion First, we’ll run an ANN analysis for Starbucks locations assuming a uniform point density across the state (i.e. a completely spatially random process). ann.p <- mean(nndist(starbucks.km, k=1)) ann.p [1] 3.275492 The observed average nearest neighbor distance is 3.28 km. Next, we will generate the distribution of expected ANN values given a homogeneous (CSR/IRP) point process using Monte Carlo methods. This is our null model. n <- 599L # Number of simulations ann.r <- vector(length = n) # Create an empty object to be used to store simulated ANN values for (i in 1:n){ rand.p <- rpoint(n=starbucks.km$n, win=ma.km) # Generate random point locations ann.r[i] <- mean(nndist(rand.p, k=1)) # Tally the ANN values } In the above loop, the function rpoint is passed two parameters: n=starbucks.km$n and win=ma.km. The first tells the function how many points to randomly generate (starbucks.km$n extracts the number of points from object starbucks.km). The second tells the function to confine the points to the extent defined by ma.km. Note that the latter parameter is not necessary if the ma boundary was already defined as the starbucks window extent. You can plot the last realization of the homogeneous point process to see what a completely random placement of Starbucks stores could look like. plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5)) Our observed distribution of Starbucks stores certainly does not look like the outcome of a completely independent random process. Next, let’s plot the histogram of expected values under the null and add a blue vertical line showing where our observed ANN value lies relative to this distribution. hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r)) abline(v=ann.p, col="blue") It’s obvious from the test that the observed ANN value is far smaller than the expected ANN values one could expect under the null hypothesis. A smaller observed value indicates that the stores are far more clustered than expected under the null. Next, we’ll run the same test but control for the influence due to population density distribution. Recall that the ANN analysis explores the 2nd order process underlying a point pattern thus requiring that we control for the first order process (e.g. population density distribution). This is a non-homogeneous test. Here, we pass the parameter f=pop.km to the function rpoint telling it that the population density raster pop.km should be used to define where a point should be most likely placed (high population density) and least likely placed (low population density) under this new null model. Here, we’ll use the non-transformed representation of the population density raster, pop.km. 
n <- 599L ann.r <- vector(length=n) for (i in 1:n){ rand.p <- rpoint(n=starbucks.km$n, f=pop.km) ann.r[i] <- mean(nndist(rand.p, k=1)) } You can plot the last realization of the non-homogeneous point process to convince yourself that the simulation correctly incorporated the covariate raster in its random point function. Window(rand.p) <- ma.km # Replace raster mask with ma.km window plot(rand.p, pch=16, main=NULL, cols=rgb(0,0,0,0.5)) Note the cluster of points near the highly populated areas. This pattern is different from the one generated from a completely random process. Next, let’s plot the histogram and add a blue line showing where our observed ANN value lies. hist(ann.r, main=NULL, las=1, breaks=40, col="bisque", xlim=range(ann.p, ann.r)) abline(v=ann.p, col="blue") Even though the distribution of ANN values we would expect when controlled for the population density nudges closer to our observed ANN value, we still cannot say that the clustering of Starbucks stores can be explained by a completely random process when controlled for population density. Computing a pseudo p-value from the simulation A (pseudo) p-value can be extracted from a Monte Carlo simulation. We’ll work off of the last simulation. First, we need to find the number of simulated ANN values greater than our observed ANN value. N.greater <- sum(ann.r > ann.p) To compute the p-value, find the end of the distribution closest to the observed ANN value, then divide that count by the total count. Note that this is a so-called one-sided P-value. See lecture notes for more information. p <- min(N.greater + 1, n + 1 - N.greater) / (n +1) p [1] 0.001666667 In our working example, you’ll note that or simulated ANN value was nowhere near the range of ANN values computed under the null yet we don’t have a p-value of zero. This is by design since the strength of our estimated p will be proportional to the number of simulations–this reflects the chance that given an infinite number of simulations at least one realization of a point pattern could produce an ANN value more extreme than ours. Test for a poisson point process model with a covariate effect The ANN analysis addresses the 2nd order effect of a point process. Here, we’ll address the 1st order process using the poisson point process model. We’ll first fit a model that assumes that the point process’ intensity is a function of the logged population density (this will be our alternate hypothesis). PPM1 <- ppm(starbucks.km ~ pop.lg.km) PPM1 Nonstationary Poisson process Fitted to point pattern dataset 'starbucks.km' Log intensity: ~pop.lg.km Fitted trend coefficients: (Intercept) pop.lg.km -13.710551 1.279928 Estimate S.E. CI95.lo CI95.hi Ztest Zval (Intercept) -13.710551 0.46745489 -14.626746 -12.794356 *** -29.33021 pop.lg.km 1.279928 0.05626785 1.169645 1.390211 *** 22.74705 Problem: Values of the covariate 'pop.lg.km' were NA or undefined at 0.57% (4 out of 699) of the quadrature points Next, we’ll fit the model that assumes that the process’ intensity is not a function of population density (the null hypothesis). PPM0 <- ppm(starbucks.km ~ 1) PPM0 Stationary Poisson process Fitted to point pattern dataset 'starbucks.km' Intensity: 0.008268627 Estimate S.E. 
CI95.lo CI95.hi Ztest Zval log(lambda) -4.795287 0.07647191 -4.945169 -4.645405 *** -62.70651 In our working example, the null model (homogeneous intensity) takes on the form: \\[ \\lambda(i) = e^{-4.795} \\] \\(\\lambda(i)\\) under the null is nothing more than the observed density of Starbucks stores within the State of Massachusetts, or: starbucks.km$n / area(ma.km) [1] 0.008268627 The alternate model takes on the form: \\[ \\lambda(i) = e^{-13.71 + 1.27\\ (logged\\ population\\ density)} \\] The models are then compared using the likelihood ratio test which produces the following output: anova(PPM0, PPM1, test="LRT") Npar Df Deviance Pr(>Chi) 5 NA NA NA 6 1 537.218 0 The value under the heading PR(>Chi) is the p-value which gives us the probability that we would be wrong in rejecting the null. Here p~0 suggests that there is close to a 0% chance that we would be wrong in rejecting the base model in favor of the alternate model–put another way, the alternate model (that the logged population density can help explain the distribution of Starbucks stores) is a significant improvement over the null. Note that if you were to compare two competing non-homogeneous models such as population density and income distributions, you would need to compare the model with one of the covariates with an augmented version of that model using the other covariate. In other words, you would need to compare PPM1 <- ppm(starbucks.km ~ pop.lg.km) with something like PPM2 <- ppm(starbucks.km ~ pop.lg.km + income.km). References "],["spatial-autocorrelation-in-r.html", "I Spatial autocorrelation in R Sample files for this exercise Introduction Define neighboring polygons Computing the Moran’s I statistic: the hard way Computing the Moran’s I statistic: the easy way Moran’s I as a function of a distance band", " I Spatial autocorrelation in R R tmap spdep 4.3.1 3.3.3 1.2.8 For a basic theoretical treatise on spatial autocorrelation the reader is encouraged to review the lecture notes. This section is intended to supplement the lecture notes by implementing spatial autocorrelation techniques in the R programming environment. Sample files for this exercise Data used in the following exercises can be loaded into your current R session by running the following chunk of code. z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/s_sf.RDS")) s1 <- readRDS(z) The data object consists of an sf vector layer representing income and education data aggregated at the county level for the state of Maine. Introduction The spatial object s1 has five attributes. The one of interest for this exercise is Income (per capita, in units of dollars). Let’s map the income distribution using a quantile classification scheme. We’ll make use of the tmap package. library(tmap) tm_shape(s1) + tm_polygons(style="quantile", col = "Income") + tm_legend(outside = TRUE, text.size = .8) Define neighboring polygons The first step requires that we define “neighboring” polygons. This could refer to contiguous polygons, polygons within a certain distance band, or it could be non-spatial in nature and defined by social, political or cultural “neighbors”. Here, we’ll adopt a contiguous neighbor definition where we’ll accept any contiguous polygon that shares at least on vertex (this is the “queen” case and is defined by setting the parameter queen=TRUE). If we required that at least one edge be shared between polygons then we would set queen=FALSE. 
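The practical difference between the two neighbor definitions can be checked empirically. The following sketch builds both versions of the neighbor list and compares the total number of neighbor links each produces (it makes use of spdep's poly2nb and card functions; the package is loaded again in the next chunk as part of the regular workflow):
library(spdep)
nb.queen <- poly2nb(s1, queen=TRUE)    # a shared vertex is enough
nb.rook  <- poly2nb(s1, queen=FALSE)   # a shared edge is required
c(queen = sum(card(nb.queen)), rook = sum(card(nb.rook)))  # total number of links under each rule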
library(spdep) nb <- poly2nb(s1, queen=TRUE) For each polygon in our polygon object, nb lists all neighboring polygons. For example, to see the neighbors for the first polygon in the object, type: nb[[1]] [1] 2 3 4 5 Polygon 1 has 4 neighbors. The numbers represent the polygon IDs as stored in the spatial object s1. Polygon 1 is associated with the County attribute name Aroostook: s1$NAME[1] [1] Aroostook 16 Levels: Androscoggin Aroostook Cumberland Franklin Hancock Kennebec ... York Its four neighboring polygons are associated with the counties: s1$NAME[c(2,3,4,5)] [1] Somerset Piscataquis Penobscot Washington 16 Levels: Androscoggin Aroostook Cumberland Franklin Hancock Kennebec ... York Next, we need to assign weights to each neighboring polygon. In our case, each neighboring polygon will be assigned equal weight (style=\"W\"). This is accomplished by assigning the fraction \\(1/ (\\# of neighbors)\\) to each neighboring county then summing the weighted income values. While this is the most intuitive way to summaries the neighbors’ values it has one drawback in that polygons along the edges of the study area will base their lagged values on fewer polygons thus potentially over- or under-estimating the true nature of the spatial autocorrelation in the data. For this example, we’ll stick with the style=\"W\" option for simplicity’s sake but note that other more robust options are available, notably style=\"B\". lw <- nb2listw(nb, style="W", zero.policy=TRUE) The zero.policy=TRUE option allows for lists of non-neighbors. This should be used with caution since the user may not be aware of missing neighbors in their dataset however, a zero.policy of FALSE would return an error. To see the weight of the first polygon’s four neighbors type: lw$weights[1] [[1]] [1] 0.25 0.25 0.25 0.25 Each neighbor is assigned a quarter of the total weight. This means that when R computes the average neighboring income values, each neighbor’s income will be multiplied by 0.25 before being tallied. Finally, we’ll compute the average neighbor income value for each polygon. These values are often referred to as spatially lagged values. Inc.lag <- lag.listw(lw, s1$Income) The following table shows the average neighboring income values (stored in the Inc.lag object) for each county. Computing the Moran’s I statistic: the hard way We can plot lagged income vs. income and fit a linear regression model to the data. # Create a regression model M <- lm(Inc.lag ~ s1$Income) # Plot the data plot( Inc.lag ~ s1$Income, pch=20, asp=1, las=1) The slope of the regression line is the Moran’s I coefficient. coef(M)[2] s1$Income 0.2828111 To assess if the slope is significantly different from zero, we can randomly permute the income values across all counties (i.e. we are not imposing any spatial autocorrelation structure), then fit a regression model to each permuted set of values. The slope values from the regression give us the distribution of Moran’s I values we could expect to get under the null hypothesis that the income values are randomly distributed across the counties. We then compare the observed Moran’s I value to this distribution. 
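For reference, the quantity estimated by that regression slope is the familiar Moran's I statistic, \\[ I = \\frac{n}{\\sum_i \\sum_j w_{ij}} \\frac{\\sum_i \\sum_j w_{ij}(x_i - \\bar{x})(x_j - \\bar{x})}{\\sum_i (x_i - \\bar{x})^2} \\] which, when the weights are row-standardized (the style=\"W\" option adopted above), reduces to the slope of the regression of the spatially lagged values on the original values.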
n <- 599L # Define the number of simulations I.r <- vector(length=n) # Create an empty vector for (i in 1:n){ # Randomly shuffle income values x <- sample(s1$Income, replace=FALSE) # Compute new set of lagged values x.lag <- lag.listw(lw, x) # Compute the regression slope and store its value M.r <- lm(x.lag ~ x) I.r[i] <- coef(M.r)[2] } # Plot the histogram of simulated Moran's I values # then add our observed Moran's I value to the plot hist(I.r, main=NULL, xlab="Moran's I", las=1) abline(v=coef(M)[2], col="red") The simulation suggests that our observed Moran's I value is not consistent with a Moran's I value one would expect to get if the income values were not spatially autocorrelated. In the next step, we'll compute a pseudo p-value from this simulation. Computing a pseudo p-value from an MC simulation First, we need to find the number of simulated Moran's I values greater than our observed Moran's I value. N.greater <- sum(coef(M)[2] > I.r) To compute the p-value, find the end of the distribution closest to the observed Moran's I value, then divide that count by the total count. Note that this is a so-called one-sided P-value. See lecture notes for more information. p <- min(N.greater + 1, n + 1 - N.greater) / (n + 1) p [1] 0.02166667 In our working example, the p-value suggests that there is a small chance (roughly 2.2%) of being wrong in stating that the income values are not randomly distributed (i.e. that they are spatially clustered) at the county level. Computing the Moran's I statistic: the easy way To get the Moran's I value, simply use the moran.test function. moran.test(s1$Income,lw) Moran I test under randomisation data: s1$Income weights: lw Moran I statistic standard deviate = 2.2472, p-value = 0.01231 alternative hypothesis: greater sample estimates: Moran I statistic Expectation Variance 0.28281108 -0.06666667 0.02418480 Note that the p-value computed from the moran.test function is not computed from an MC simulation but analytically instead. This may not always prove to be the most accurate measure of significance. To test for significance using the MC simulation method instead, use the moran.mc function. MC<- moran.mc(s1$Income, lw, nsim=599) # View results (including p-value) MC Monte-Carlo simulation of Moran I data: s1$Income weights: lw number of simulations + 1: 600 statistic = 0.28281, observed rank = 588, p-value = 0.02 alternative hypothesis: greater # Plot the distribution (note that this is a density plot instead of a histogram) plot(MC, main="", las=1) Moran's I as a function of a distance band In this section, we will explore spatial autocorrelation as a function of distance bands. Instead of defining neighbors as contiguous polygons, we will define neighbors based on distances to polygon centers. We therefore need to extract the center of each polygon. coo <- st_centroid(s1) The object coo stores all sixteen pairs of coordinate values. Next, we will define the search radius to include all neighboring polygon centers within 50 km (or 50,000 meters). S.dist <- dnearneigh(coo, 0, 50000) The dnearneigh function takes on three parameters: the coordinate values coo, the inner radius of the annulus band, and the outer radius of the annulus band. In our example, the inner annulus radius is 0 which implies that all polygon centers up to 50 km are considered neighbors. Note that if we chose to restrict the neighbors to all polygon centers between 50 km and 100 km, for example, then we would define a search annulus (instead of a circle) as dnearneigh(coo, 50000, 100000). 
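As a quick check of how inclusive this 50 km search radius is, one can tally the number of neighbors found for each of the sixteen polygon centers (a sketch using spdep's card function):
table(card(S.dist))   # distribution of neighbor counts across the 16 counties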
Now that we defined our search circle, we need to identify all neighboring polygons for each polygon in the dataset. lw <- nb2listw(S.dist, style="W",zero.policy=T) Run the MC simulation. MI <- moran.mc(s1$Income, lw, nsim=599,zero.policy=T) Plot the results. plot(MI, main="", las=1) Display p-value and other summary statistics. MI Monte-Carlo simulation of Moran I data: s1$Income weights: lw number of simulations + 1: 600 statistic = 0.31361, observed rank = 597, p-value = 0.005 alternative hypothesis: greater "],["interpolation-in-r.html", "J Interpolation in R Thiessen polygons IDW 1st order polynomial fit 2nd order polynomial Kriging", " J Interpolation in R R sf tmap spatstat gstat terra sp 4.3.1 1.0.14 3.3.3 3.0.7 2.1.1 1.7.55 2.0.0 First, let’s load the data from the website. The data are vector layers stored as sf objects. library(sf) library(tmap) # Load precipitation data z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/precip.rds")) P <- readRDS(z) p <- st_as_sf(P) # Load Texas boudary map z <- gzcon(url("https://github.com/mgimond/Spatial/raw/main/Data/texas.rds")) W <- readRDS(z) w <- st_as_sf(W) # # Replace point boundary extent with that of Texas tm_shape(w) + tm_polygons() + tm_shape(p) + tm_dots(col="Precip_in", palette = "RdBu", auto.palette.mapping = FALSE, title="Sampled precipitation \\n(in inches)", size=0.7) + tm_text("Precip_in", just="left", xmod=.5, size = 0.7) + tm_legend(legend.outside=TRUE) The p point layer defines the sampled precipitation values. These points will be used to predict values at unsampled locations. The w polygon layer defines the boundary of Texas. This will be the extent for which we will interpolate precipitation data. Thiessen polygons The Thiessen polygons (or proximity interpolation) can be created using spatstat’s dirichlet function. Note that this function will require that the input point layer be converted to a spatstat ppp object–hence the use of the inline as.ppp(P) syntax in the following code chunk. library(spatstat) # Used for the dirichlet tessellation function # Create a tessellated surface th <- dirichlet(as.ppp(p)) |> st_as_sfc() |> st_as_sf() # The dirichlet function does not carry over projection information # requiring that this information be added manually st_crs(th) <- st_crs(p) # The tessellated surface does not store attribute information # from the point data layer. We'll join the point attributes to the polygons th2 <- st_join(th, p, fn=mean) # Finally, we'll clip the tessellated surface to the Texas boundaries th.clp <- st_intersection(th2, w) # Map the data tm_shape(th.clp) + tm_polygons(col="Precip_in", palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_legend(legend.outside=TRUE) IDW Unlike the Thiessen method shown in the previous section, the IDW interpolation will output a raster. This requires that we first create an empty raster grid, then interpolate the precipitation values to each unsampled grid cell. An IDW power value of 2 (idp=2.0) will be used in this example. Many packages share the same function names. This can be a problem when these packages are loaded in a same R session. For example, the idw function is available in both spatstat.explore and gstat. Here, we make use of gstat’s idw function. This requires that we either detach the spatstat.explore package (this package was automatically installed when we installed spatstat) or that we explicitly identify the package by typing gstat::idw. Here, we opted for the former approach. 
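Had we opted for the latter approach instead, no detach call would be needed; the interpolation call shown further down would simply be qualified with the package name. A one-line sketch of what that would look like (mirroring the chunk that follows):
P.idw <- gstat::idw(Precip_in ~ 1, P, newdata=grd, idp=2.0)  # explicit namespace, no detach required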
detach("package:spatstat.explore", unload = TRUE, force=TRUE) library(gstat) library(terra) library(sp) # Create an empty grid where n is the total number of cells grd <- as.data.frame(spsample(P, "regular", n=50000)) names(grd) <- c("X", "Y") coordinates(grd) <- c("X", "Y") gridded(grd) <- TRUE # Create SpatialPixel object fullgrid(grd) <- TRUE # Create SpatialGrid object # Add P's projection information to the empty grid proj4string(P) <- proj4string(P) # Temp fix until new proj env is adopted proj4string(grd) <- proj4string(P) # Interpolate the grid cells using a power value of 2 (idp=2.0) P.idw <- idw(Precip_in ~ 1, P, newdata=grd, idp = 2.0) # Convert to raster object then clip to Texas r <- rast(P.idw) r.m <- mask(r, st_as_sf(W)) # Plot tm_shape(r.m["var1.pred"]) + tm_raster(n=10,palette = "RdBu", auto.palette.mapping = FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Fine-tuning the interpolation The choice of power function can be subjective. To fine-tune the choice of the power parameter, you can perform a leave-one-out validation routine to measure the error in the interpolated values. # Leave-one-out validation routine IDW.out <- vector(length = length(P)) for (i in 1:length(P)) { IDW.out[i] <- idw(Precip_in ~ 1, P[-i,], P[i,], idp=2.0)$var1.pred } # Plot the differences OP <- par(pty="s", mar=c(4,3,0,0)) plot(IDW.out ~ P$Precip_in, asp=1, xlab="Observed", ylab="Predicted", pch=16, col=rgb(0,0,0,0.5)) abline(lm(IDW.out ~ P$Precip_in), col="red", lw=2,lty=2) abline(0,1) par(OP) The RMSE can be computed from IDW.out as follows: # Compute RMSE sqrt( sum((IDW.out - P$Precip_in)^2) / length(P)) [1] 6.989294 Cross-validation In addition to generating an interpolated surface, you can create a 95% confidence interval map of the interpolation model. Here we’ll create a 95% CI map from an IDW interpolation that uses a power parameter of 2 (idp=2.0). 
# Create the interpolated surface (using gstat's idw function) img <- idw(Precip_in~1, P, newdata=grd, idp=2.0) n <- length(P) Zi <- matrix(nrow = length(img$var1.pred), ncol = n) # Remove a point then interpolate (do this n times for each point) st <- rast() for (i in 1:n){ Z1 <- gstat::idw(Precip_in~1, P[-i,], newdata=grd, idp=2.0) st <- c(st,rast(Z1)) # Calculated pseudo-value Z at j Zi[,i] <- n * img$var1.pred - (n-1) * Z1$var1.pred } # Jackknife estimator of parameter Z at location j Zj <- as.matrix(apply(Zi, 1, sum, na.rm=T) / n ) # Compute (Zi* - Zj)^2 c1 <- apply(Zi,2,'-',Zj) # Compute the difference c1 <- apply(c1^2, 1, sum, na.rm=T ) # Sum the square of the difference # Compute the confidence interval CI <- sqrt( 1/(n*(n-1)) * c1) # Create (CI / interpolated value) raster img.sig <- img img.sig$v <- CI /img$var1.pred # Clip the confidence raster to Texas r <- rast(img.sig, layer="v") r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m["var1.pred"]) + tm_raster(n=7,title="95% confidence interval \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) 1st order polynomial fit To fit a first order polynomial model of the form \\(precip = intercept + aX + bY\\) to the data, # Define the 1st order polynomial equation f.1 <- as.formula(Precip_in ~ X + Y) # Add X and Y to P P$X <- coordinates(P)[,1] P$Y <- coordinates(P)[,2] # Run the regression model lm.1 <- lm( f.1, data=P) # Use the regression model output to interpolate the surface dat.1st <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.1, newdata=grd))) # Clip the interpolated raster to Texas r <- rast(dat.1st) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) 2nd order polynomial To fit a second order polynomial model of the form \\(precip = intercept + aX + bY + dX^2 + eY^2 +fXY\\) to the data, # Define the 2nd order polynomial equation f.2 <- as.formula(Precip_in ~ X + Y + I(X*X)+I(Y*Y) + I(X*Y)) # Add X and Y to P P$X <- coordinates(P)[,1] P$Y <- coordinates(P)[,2] # Run the regression model lm.2 <- lm( f.2, data=P) # Use the regression model output to interpolate the surface dat.2nd <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.2, newdata=grd))) # Clip the interpolated raster to Texas r <- rast(dat.2nd) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Kriging Fit the variogram model First, we need to create a variogram model. Note that the variogram model is computed on the de-trended data. This is implemented in the following chunk of code by passing the 1st order trend model (defined in an earlier code chunk as formula object f.1) to the variogram function. # Define the 1st order polynomial equation f.1 <- as.formula(Precip_in ~ X + Y) # Compute the sample variogram; note that the f.1 trend model is one of the # parameters passed to variogram(). This tells the function to create the # variogram on the de-trended data. var.smpl <- variogram(f.1, P, cloud = FALSE, cutoff=1000000, width=89900) # Compute the variogram model by passing the nugget, sill and range values # to fit.variogram() via the vgm() function. 
dat.fit <- fit.variogram(var.smpl, fit.ranges = FALSE, fit.sills = FALSE, vgm(psill=14, model="Sph", range=590000, nugget=0)) # The following plot allows us to assess the fit plot(var.smpl, dat.fit, xlim=c(0,1000000)) Generate Kriged surface Next, use the variogram model dat.fit to generate a kriged interpolated surface. The krige function allows us to include the trend model thus saving us from having to de-trend the data, krige the residuals, then combine the two rasters. Instead, all we need to do is pass krige the trend formula f.1. # Define the trend model f.1 <- as.formula(Precip_in ~ X + Y) # Perform the krige interpolation (note the use of the variogram model # created in the earlier step) dat.krg <- krige( f.1, P, grd, dat.fit) # Convert kriged surface to a raster object for clipping r <- rast(dat.krg) r.m <- mask(r, st_as_sf(W)) # Plot the map tm_shape(r.m["var1.pred"]) + tm_raster(n=10, palette="RdBu", auto.palette.mapping=FALSE, title="Predicted precipitation \\n(in inches)") + tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) Generate the variance and confidence interval maps The dat.krg object stores not just the interpolated values, but the variance values as well. These are also passed to the raster object for mapping as follows: tm_shape(r.m["var1.var"]) + tm_raster(n=7, palette ="Reds", title="Variance map \\n(in squared inches)") +tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) A more readily interpretable map is the 95% confidence interval map which can be generated from the variance object as follows (the map values should be interpreted as the number of inches above and below the estimated rainfall amount). r <- rast(dat.krg) r.m <- mask(sqrt(r["var1.var"])* 1.96, st_as_sf(W)) tm_shape(r.m) + tm_raster(n=7, palette ="Reds", title="95% CI map \\n(in inches)") +tm_shape(P) + tm_dots(size=0.2) + tm_legend(legend.outside=TRUE) "]]
diff --git a/spatial-autocorrelation-in-r.html b/spatial-autocorrelation-in-r.html
index e8dd4ad..fe390c2 100755
--- a/spatial-autocorrelation-in-r.html
+++ b/spatial-autocorrelation-in-r.html
@@ -734,8 +734,8 @@ Computing a pseudo p-value from an MC simulationp <- min(N.greater + 1, n + 1 - N.greater) / (n + 1)
p
-[1] 0.01333333
-
-In our working example, the p-value suggests that there is a small chance (0.013%) of being wrong in stating that the income values are not clustered at the county level.
+[1] 0.02166667
+In our working example, the p-value suggests that there is a small chance (roughly 2.2%) of being wrong in stating that the income values are not randomly distributed (i.e. that they are spatially clustered) at the county level.
# Plot the distribution (note that this is a density plot instead of a histogram)
plot(MC, main="", las=1)
Let’s explore elements of the Moran’s I equation using the following sample dataset.
The first step in the computation of a Moran’s I index is the generation of weights. The weights can take on many different values. For example, one could assign a value of 1
to a neighboring cell as shown in the following matrix.
For example, cell ID 1
(whose value is 25 and whose standardized value is 0.21) has for neighbors cells 2
, 5
and 6
. Computationally (working with the standardized values), this gives us a summarized neighboring value (aka lagged value), \(y_1(lag)\) of:
For example, cell ID 1
(whose value is 25 and whose standardized value, \(z_1\), is 0.21) has for neighbors cells 2
, 5
and 6
. Computationally (working with the standardized values), this gives us a summarized neighboring value (aka lagged value), \(y_1(lag)\) of:
\[
\begin{align*}
y_1 = \sum\limits_j w_{1j} z_j {}={} & (0)(0.21)+(1)(1.17)+(0)(1.5)+ ... + \\
@@ -1689,9 +1689,9 @@ 13.4 Moran’s I equatio
\]
Computing the spatially lagged values for the other 15 cells generates the following scatterplot:
You’ll note that the range of neighboring values along the \(y\)-axis is much greater than that of the original values on the \(x\)-axis. This is not necessarily an issue given that the Moran’s \(I\) correlation coefficient standardizes the values by recentering them on the overall mean \((X - \bar{X})/s\). This is simply to re-emphasize that we are interested in how a neighboring value varies relative to a feature’s value, regardless of the scale of values in either batches.