-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathun-sdg-analysis-report.Rmd
717 lines (522 loc) · 29.8 KB
/
un-sdg-analysis-report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
---
title: "UN Sustainable Development Goals Data Analysis"
author: "Fredrik Randsborg Bølstad"
date: "June 16, 2019"
output:
pdf_document:
latex_engine: xelatex
toc: true
toc_depth: 3
number_sections: true
highlight: tango
df_print: kable
urlcolor: blue
---
```{r setup, include = F}
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(reshape2)
```
# Introduction
The purpose of this project is to explore the global progress towards the United Nations (UN) Sustainable Development Goals (SDG). The main dataset is the UN SDG Indicators dataset provided by the UN Statistics Division. It is complimented by data from the Gapminder Foundation and the World Bank. Finding powerful ways of visualizing the dataset will be of particular importance to the project. With the help of the maps and mapdata R packages, a key ambition is to visualize reported progress for each country on a world map. Finally, the report will touch on what the current data may say about future results. Key goals:
- Programmatically download data from the relevant sources
- Clean and merge data into unified datasets
- Communicate reported results graphically
- Use trend analysis to predict future results
# Data analysis
## Downloading data
The main data for this project is provided by the UN Statistics Division via Google BigQuery. The geographic areas in the UN dataset includes countries, regions and continents. The geographical scope of this analysis is limited to the country list used by the Gapminder Foundation. The UN dataset will thus have to be filtered to only keep the countries that are in the Gapminder's data. Population data is separately downloaded from The World Bank's repository. Country coordinate data is extracted from the maps and mapsdata R packages.
**Data source** | **Fetch method**
----------------|--------------
[UN Sustainable Development Goals Indicators](https://console.cloud.google.com/marketplace/details/un-statistics-division/un-sdgs) | [bigrquery](https://cran.r-project.org/web/packages/bigrquery/) - R interface to [BigQuery](https://cloud.google.com/bigquery/)
[Gapminder geography data](https://www.gapminder.org/data/geo/) | [googlesheets](https://cran.r-project.org/web/packages/googlesheets/)
World Bank population data | [wbstats](https://cran.r-project.org/web/packages/wbstats/) - R interface to World Bank API
Spatial map data | [maps](https://cran.r-project.org/web/packages/maps/index.html) and [mapdata](https://cran.r-project.org/web/packages/mapdata/index.html)
Table: Data sources
The raw data is downloaded and saved for processing. The raw and clean data are thus stored separately to avoid having to download the raw data over again, in case an unexpected or undesired result flow from modification of the dataset during data cleaning.
The code used to download and save the above data from each source can be found below. The code is commented to explain exactly what it does.
### UN SDG Indicators data from Google BigQuery
```{r dowbload-bigquery, eval = F}
library(bigrquery) # Query Google's BigQuery database
library(DBI) # Interface for communication with relational database systems
# Local destination for downloaded raw data
un_sdg_file = "./raw_data/bigquery_un_sdg.rda"
# Google Cloud project ID required for BigQuery interactions
# Create your own project ID at https://console.cloud.google.com/
google_cloud_project_id <- "edx-ds-capstone-2019"
# BigQuery connection interface
dbi_connection <- dbConnect(
bigrquery::bigquery(),
project = "publicdata",
dataset = "un_sdg",
billing = google_cloud_project_id
)
# SQL query for entire dataset (~300 MB)
sql_query <- "SELECT * FROM `bigquery-public-data.un_sdg.indicators`"
# Download dataset
un_sdg_data <- dbGetQuery(dbi_connection, sql_query)
# Save dataframe to file
save(un_sdg_data, file = un_sdg_file)
```
### Gapminder country list
```{r download-gapminder, eval = F}
library(googlesheets) # Google spreadsheet API
# Local destination for downloaded raw data
gapminder_file = "./raw_data/gapminder_country_list.rda"
# Connect to spreadsheet (URL from https://www.gapminder.org/data/geo/)
gapminder_spreadsheet <- gs_url(paste("https://docs.google.com/spreadsheets/",
"d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/",
sep = "") # formatted to fit on page
# Download list-of-countries worksheet
gapminder_country_list <- gapminder_spreadsheet %>%
gs_read(ws = "list-of-countries-etc")
# Keep only country and region.
#> Mutate colname to 'geoareaname' to match UN SDG data colname
#> Make geo uppercase to match World Bank data
gapminder_country_list <- gapminder_country_list %>%
mutate(geoareaname = name, region = four_regions) %>%
select(geo, geoareaname, region) %>%
mutate(geo = toupper(geo))
# Save dataframe to file
save(gapminder_country_list, file = gapminder_file)
```
### World Bank population data
```{r download-worldbank, eval = F}
library(wbstats) # World Bank API interface
# Local destination for downloaded raw data
wb_pop_file = "./raw_data/world_bank_population_data.rda"
# Create list of ISO 3 country codes from Gapminder
# which will be used to query World Bank's data
iso3c_codes <- gapminder_country_list$geo
# Remove HOS (Holy See) which isn't in World Bank data
# https://datahelpdesk.worldbank.org/knowledgebase/articles/898590-country-api-queries
iso3c_codes <- iso3c_codes[-which(iso3c_codes == 'HOS', )]
# Get year interval for which there is UN SDG data
if(!exists("un_sdg_data")) load(un_sdg_file);
first_year <- min(un_sdg_data$timeperiod)
last_year <- max(un_sdg_data$timeperiod)
# Download population size from the World Bank
wb_pop <- wb(country = iso3c_codes,
indicator = 'SP.POP.TOTL',
startdate = first_year, enddate = last_year)
# Save dataframe to file
save(wb_pop, file = wb_pop_file)
```
## Cleaning data
### Merging datasets
The data from the UN, Gapminder Foundation and the World Bank are merged into one unified dataset. There are two main goals for this process:
1. Join UN SDG data with Gapminder data, such that all rows relating to geographical areas not in the Gapminder country list are dropped.
2. Append the World Bank population data for each country and the respective years in the dataset.
```{r tidy-merge, eval = F}
### Join UN SDG and Gapminder dataset such that:
# > Drop from UN SDG all countries not in Gapminder
# > Append to UN SDG the region column from Gapminder
dat <- inner_join(un_sdg_data, gapminder_country_list)
### Append World Bank population data, by country and year
# Rename and select World Bank columns to match UN SDG dataset
wb_pop <- wb_pop %>%
rename(geo = iso3c, timeperiod = date, population = value) %>%
select(geo, timeperiod, population)
# Append population column to UN SDG dataset
dat <- left_join(dat, wb_pop)
```
### Selecting relevant data
After preliminary exploration of the dataset, including manual inspection, it became clear that the reported UN data were severely lacking. Many indicators had no, little or inconsistent data reporting. Only a minority of indicators were consistently or semi-consistently reported on by a broad community of countries. A selection of indicators with sufficient data were therefore handpicked to be used for the continued analysis. The target description for each indicator were manually added to the dataset for convenience (so that the goal of the indicator can quickly be queried). The complete description of all indicators in the dataset can be found in the annex of the UN General Assembly Resolution [A/RES/71/313](https://undocs.org/A/RES/71/313). This is also the source of the text descriptions copied into the the code below.
**Indicator** | **Description**
--------------|--------------
6.1.1 | Population proportion using safe drinking water services
7.1.1 | Population proportion with access to electricity
7.2.1 | Renewable energy (as share of total consumption)
8.1.1 | Annual growth rate, real GDP per capita
9.4.1 | CO~2~ emission per unit of value added
12.2.1 | Material footprint
12.2.2 | Domestic material consumption
17.6.2 | Population proportion with broadband subscriptions
17.8.1 | Population proportion using the internet
Table: Selected UN SDG Indicators
```{r tidy-select-indicators, eval = F}
### Indicators to keep
indicators <- c("6.1.1", "7.1.1", "7.2.1", "8.1.1", "9.4.1",
"12.2.1", "12.2.2", "17.6.2", "17.8.1")
### Text description of targets for indicators (https://undocs.org/A/RES/71/313)
target_description <-
c('By 2030, achieve universal and equitable access to
safe and affordable drinking water for all',
'By 2030, ensure universal access to affordable, reliable and modern energy services',
'By 2030, increase substantially the share of
renewable energy in the global energy mix',
'Sustain per capita economic growth in accordance with
national circumstances and, in particular, at least 7 per cent
gross domestic product growth per annum in the least developed countries',
'By 2030, upgrade infrastructure and retrofit industries to make them sustainable,
with increased resource-use efficiency and greater adoption of clean and
environmentally sound technologies and industrial processes, with all countries
taking action in accordance with their respective capabilities',
'By 2030, achieve the sustainable management and efficient use of natural resources',
'By 2030, achieve the sustainable management and efficient use of natural resources',
'Enhance North-South, South-South and triangular regional and international
cooperation on and access to science, technology and innovation and
enhance knowledge-sharing on mutually agreed terms, including through
improved coordination among existing mechanisms, in particular at the
United Nations level, and through a global technology facilitation mechanism',
'Fully operationalize the technology bank and science, technology and innovation
apacity-building mechanism for least developed countries by 2017 and enhance the
use of enabling technology, in particular information and communications technology')
# Data frame of indicators and text descriptions (later appended to the main dataset)
descriptions_dat <- data.frame(indicators, target_description) %>%
rename(indicator = indicators)
```
### Dropping, renaming and coercing data
The entire UN SDG Indicators dataset were downloaded to begin with. After the preliminary exploration of the data and the selection of indicators, many variables are irrelevant to this project. Those are dropped from the data frame that will later be used for visualization and analysis. A few columns are also renamed to something more descriptive (e.g. _geoareaname_ has been filtered to only include countries and is thus renamed to _country_). Finally, each column is coerced into either a numeric (e.g. reported values), integer (e.g. years) or factor (e.g. region) to optimize performance.
```{r tidy-column-manipulation, eval = F}
### Manipulate columns with dplyr
dat <- dat %>%
# Filter for chosen indicators
filter(indicator %in% indicators) %>%
# Remove seriescode, geoareacode, time_detail, freq
select(-c('goal', 'seriescode', 'geoareacode', 'time_detail', 'freq')) %>%
# Rename geoareaname to country, timeperiod to year
rename(country = geoareaname, year = timeperiod) %>%
# Capitalize region column
mutate(region = sub("(.)", "\\U\\1", region, perl=TRUE)) %>%
# Remove empty columns
select_if(~sum(!is.na(.)) > 0) %>%
# Coerce the type of each column
mutate(target = as.factor(target),
indicator = as.factor(indicator),
year = as.integer(year),
value = as.numeric(value),
nature = as.factor(nature),
location = as.factor(location),
type_of_product = as.factor(type_of_product),
type_of_speed = as.factor(type_of_speed),
units = as.factor(units),
region = as.factor(region))
```
### Calculate absolute population numbers
The UN SDG data only lists a value for each indicator, e.g. the population proportion that has access to safe drinking water. With the population data from the World Bank, it is also possible to estimate the absolute number of people with access to safe drinking water. This is interesting in itself and could be relevant for future analysis. Future analysis could e.g. weigh global progress according to population numbers instead of treating each country as equally important to aggregate global progress.
```{r tidy-add-population-nrs, eval = F}
# Use value and population to create column containing the absolute number of people
dat <- dat %>%
mutate(nr_people = as.integer(floor(population*value/100)))
```
### Append target descriptions for SDG indicators
The target descriptions for each indicator, which were previously saved in a separate data frame, is appended to the main data frame.
```{r tidy-add-sdg-targets, eval = F}
# Append target descriptions to dat
dat <- left_join(dat, descriptions_dat)
```
### Harmonizing map data with the main dataset
Map data from the *map_data* R package is saved to a variable and the country column is renamed to match with the main dataset.
```{r tidy-map-data-intro, eval = F}
### Save map coordinates in dataframe
map_world <- map_data('world')
# Rename region as country to harmonize with 'dat'
map_world <- map_world %>%
rename(country = region)
```
Find and list all instances of spelling discrepancies for country names.
```{r tidy-map-find-discrepancies, eval = F}
# List all mismatches in country names between dat and map_world
dat %>%
anti_join(map_world, by = 'country') %>%
select(country) %>%
unique()
```
Rename countries for which there are spelling discrepancies in order to cooperate fully with the main dataset.
```{r tidy-map-renaming, eval = F}
# Rename countries in dat to match those of map_world
dat <- dat %>%
mutate(country = recode(country,
'Antigua and Barbuda' = 'Antigua',
'Holy See' = 'Vatican',
'Trinidad and Tobago' = 'Trinidad')) %>%
# Remove rows with country Tuvalu
filter(country != 'Tuvalu')
```
### Save tidy data to disk
```{r tidy-save-data, eval = F}
### Save data frame to file
save(dat, file = tidy_file)
save(map_world, file = world_map_file)
```
## Exploring and visualizing data
```{r viz-load-tidy-data, include = F}
# Load tidy data to namespace
tidy_file = "../tidy_data/tidy_dat.rda"
world_map_file = "../tidy_data/world_map.rda"
load(tidy_file); rm(tidy_file)
load(world_map_file); rm(world_map_file)
```
Most of the work on this data analysis project ended up being dedicated to visualizing the data. A significant amount of thought and research went into different options. A key achievement of this project was the successful creation of a function that generates a progress heatmap of the world, given any UN SDG indicator and year. Moreover, data for population proportion indicators were separated into the rural and urban population; a side-by-side comparison of the world heatmaps for each population category, and with a scatterplot showing progress per year, was the culmination of this visualization effort. Several functions were programmed to achieve this result in am intuitive and user friendly manner.
Functions for generating GIFs of these maps were also created. This enables a year-by-year animation of progress. Because the GIFs can't be displayed in this PDF report, the code is not included here. Examples of GIF animations are however uploaded to this project's github repository and linked to in the appropriate section below. The full code for the GIF creation is in the main code file (also available on github).
Titles, subtitles, units, year and regions are programmatically added to the plots.
### World heatmaps
The functions needed to generate a world heatmap for any indicator:
```{r viz-function-heatmap}
### Function to filter for given indicator and year; delete empty columns
select_ind_year <- function(data, ind, yr) {
data %>%
filter(indicator == ind, year %in% yr) %>%
select_if(~sum(!is.na(.)) > 0)
}
### Function to create world heatmap
plot_world_heatmap <- function(data, ind, year) {
# Create filtered dataset
data_filtered <- select_ind_year(data, ind, year)
# Join filtered dataset with world map
map_data <- left_join(map_world, data_filtered, by = 'country')
# Text for title and legend
title <- map_data$seriesdescription %>% na.omit() %>% unique()
unit <- map_data$units %>% na.omit() %>% unique() %>% as.character()
# Plot graph
ggplot(map_data, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = value)) +
# Titles and axis
labs(title = paste(title),
subtitle = paste('Year', year),
fill = paste(unit),
x = NULL, y = NULL) +
# Legend
scale_fill_gradient(low = "#56B1F7", high = "#132B43", na.value = 'lightgrey') +
# Theme adjustments
theme(plot.title = element_text(size = 9, face = 'bold'),
plot.subtitle = element_text(size = 8, hjust = 0),
axis.ticks = element_blank(),
axis.text = element_blank(),
legend.key.size = unit(14, 'pt'),
legend.title = element_text(size = 7),
legend.text = element_text(size = 7),
panel.background = element_rect(fill = 'white'),
plot.background = element_rect(fill = 'white'),
panel.border = element_rect(size = 1, colour = 'gray50', fill = NA))
}
```
```{r viz-indicators-variable, include = F}
indicators <- c("6.1.1", "7.1.1", "7.2.1", "8.1.1", "9.4.1",
"12.2.1", "12.2.2", "17.6.2", "17.8.1")
```
Looping through the above function for all selected indicators:
```{r viz-heatmap-all-indicators}
for (ind in indicators) {
heatmap <- plot_world_heatmap(dat, ind, 2000)
print(heatmap)
}
```
### World heatmap - rural vs urban grid
Function to create the world heatmap grid of urban vs rural.
```{r viz-function-heatmap-grid}
### Filter for given indicator, year and loc; delete empty columns
select_ind_year_loc <- function(data, ind, yr, loc) {
data %>%
filter(indicator == ind, year %in% yr, location == loc) %>%
select_if(~sum(!is.na(.)) > 0)
}
### Create world heatmap, rural vs. urban 1x2 grid
facet_heatmap <- function(data, ind, yr) {
# Create filtered datasets for each location
data_urban <- select_ind_year_loc(data, ind, yr, "URBAN")
data_rural <- select_ind_year_loc(data, ind, yr, "RURAL")
# Joined filtered data with world map
map_urban <- left_join(map_world, data_urban, by = 'country')
map_rural <- left_join(map_world, data_rural, by = 'country')
# Bind all data together
map <- rbind(map_urban, map_rural)
# Create data frame with facet property
map <- data.frame(map, Facet = rep(c("map_urban","map_rural"),
times=c(nrow(map_urban),nrow(map_rural))))
# Text for plot title
title <- map$seriesdescription %>% na.omit() %>% unique()
unit <- map$units %>% na.omit() %>% unique() %>% as.character()
# Plot graph
ggplot(map, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = value)) +
# Labels and legend
labs(title = paste(title),
subtitle = paste('Year', yr),
fill = paste(unit),
x = NULL, y = NULL) +
scale_fill_gradient(low = "#56B1F7", high = "#132B43", na.value = 'lightgrey',
breaks = seq(0, 100, 20)) +
# Theme adjustments
theme(plot.title = element_text(size = 12, face = 'bold'),
plot.subtitle = element_text(size = 9, hjust = 0),
axis.ticks = element_blank(),
axis.text = element_blank(),
legend.key.size = unit(14, 'pt'),
panel.background = element_rect(fill = 'white'),
panel.border = element_rect(size = 1, colour = 'gray50', fill = NA),
plot.margin = unit(c(0,0,0,11),"mm")) + # top, right, bottom, left
# Faceting
facet_wrap(.~Facet)
}
```
Example of world heatmap grid (this plot is more useful in larger dimensions but the concept is still illustrated).
```{r viz-heatmap-grid-711}
facet_heatmap(dat, "7.1.1", 2000)
```
### Scatterplot - rural vs urban grid
Function for year-to-year scatterplot of the above development, comparing rural to urban. Requires both a start year and an end year as input.
```{r viz-function-scatterplot-grid}
### Create scatterplot, rural vs. urban 1x2 grid
facet_scatterplot <- function(data, ind, start, end, x_max = end) {
# Create function that filters data used for plotting
filter_data <- function(data, ind, start, end, loc) {
dat %>%
# Filter for indicator, given year interval and location
filter(indicator == ind, year %in% c(start:end), location == loc) %>%
# Add column for nr. of people covered (in millions) for all countries
group_by(year) %>%
mutate(total_people_million = sum(nr_people/10^6, na.rm = TRUE))
}
# Create filtered datasets for each location
rural <- filter_data(data, ind, start, end, "RURAL")
urban <- filter_data(data, ind, start, end, "URBAN")
# Joined filtered data with world map
plot <- rbind(rural, urban)
# Create data frame with facet property
plot <- data.frame(plot, Facet = rep(c("rural","urban"),
times=c(nrow(rural),nrow(urban))))
# Specify limits for plot
x_min <- start
#x_max defaults to 'end' but can be overriden via user argument
y_min <- 3400
y_max <- 6750
# Plot graph
plot %>%
ggplot(aes(year, total_people_million)) +
geom_point(size = 2) +
geom_line(size = 0.1) +
# Set limits
scale_x_continuous("Year", breaks = seq(2000, 2015, by = 5), limits = c(x_min, x_max)) +
scale_y_continuous("Million people", breaks = seq(3500, 6500, by = 500), limits = c(y_min, y_max)) +
# Labels
labs(x = "Year") +
# Theme adjustments
theme(panel.background = element_rect(linetype = 5),
panel.border = element_rect(size = 1, colour = 'gray50', fill = NA),
plot.margin = unit(c(0,26,0,0),"mm"), # top, right, bottom, left
axis.title = element_text(size = 10),
axis.text = element_text(size = 8)) +
# Faceting
facet_grid(~Facet)
}
```
Example of scatterplot.
```{r viz-scatterplot-grid-711}
facet_scatterplot(dat, "7.1.1", 2000, 2015)
```
### Combined heatmap and scatterplot - rural vs urban grid
It is possible to combine the two grids of plots. Here follows an example for the year 2016 and with the scatterplot starting from year 2000.
```{r viz-heatmap-scatterplot-grid-711}
heatmap <- facet_heatmap(dat, "7.1.1", 2016)
scatterplot <- facet_scatterplot(dat, "7.1.1", 2000, 2016)
grid.arrange(heatmap, scatterplot, nrow = 2, heights = c(0.6, 0.4))
```
### Animated heatmaps and scatterplots
It is possible to stitch together the above plot to create a time series animation. This is accomplished by stitching the PNGs together into a GIF. An example of the above plot animated in high resolution can be found [here](https://raw.githubusercontent.com/randsborg/un-sdg-data-analysis/master/figs/electricity_access_urban_vs_rural.gif). While the functions can be used to create many more animations, a larger selection of examples are found [here](https://github.com/randsborg/un-sdg-data-analysis/tree/master/figs).
### Country bar chart comparison
Below is a function that takes indicator, year and continent as inputs and returns an ordered bar chart of country performance (with the best performing countries first).
```{r viz-function-bar-chart-country-comparison}
country_barchart <- function(data, ind, yr, continent) {
# Filter data accoring to input arguments and select relevant columns
filtered_data <- filter(data, year == yr, indicator == ind, region == continent) %>%
select(c(indicator, seriesdescription, country, year, value, units, location, region))
# Select only ALLAREA so as to not add up RURAL and URBAN for indicators 6.1.1 and 7.1.1.
if (ind %in% c("6.1.1", "7.1.1")) {
filtered_data <- filtered_data %>% filter(location == "ALLAREA")
}
# Get bar chart title and unit
title <- filtered_data$seriesdescription %>% na.omit() %>% unique()
unit <- filtered_data$units %>% na.omit() %>% unique()
# Plot graph
filtered_data %>%
ggplot(aes(x = reorder(country, -value), y = value)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = paste(title),
subtitle = paste(continent, '- Year', yr),
x = "Country",
y = paste(unit))
}
```
Here are the example charts generated for indicator 8.1.1 (annual RGDP growth) for year 2000, for each continent.
```{r viz-bar-chart-country-comparison}
continents <- dat$region %>% unique()
for (cont in continents) {
plot <- country_barchart(dat, "8.1.1", 2000, cont)
print(plot)
}
```
### Wide form time-series table
To inspect indicators numerically, a function was created that takes indicator and continent as input, returning a table in long form for all years there is data.
```{r viz-function-long-form}
long_form_table <- function(data, ind, continent) {
# Filter data accoring to input arguments and select relevant columns
filtered_data <- data %>%
select(indicator, seriesdescription, country, year, value, units, location, region) %>%
filter(indicator == ind, region == continent)
# Select only ALLAREA so as to not add up RURAL and URBAN for indicators 6.1.1 and 7.1.1.
if (ind %in% c("6.1.1", "7.1.1")) {
filtered_data <- filtered_data %>% filter(location == "ALLAREA")
}
# Cast data frame to wide form
dcast(filtered_data, country ~ year)
}
```
Below follows a selection of examples using this function. While all years are recorded in the variable, a smaller selection of years is displayed in the report such that the tables fit on the page.
**Proportion of individuals using the Internet in the Americas**
```{r viz-long-form-internet-americas}
internet_long_americas <- long_form_table(dat, "17.8.1", "Americas")
internet_long_americas[,c(1, 7:17)]
```
**Proportion of individuals with access to electricity in Asia**
```{r viz-long-form-electricity-asia}
electricity_long_asia <- long_form_table(dat, "7.1.1", "Asia")
electricity_long_asia[,c(1, 7:17)]
```
**Renewable energy share as percentage of total consumption in Africa**
```{r viz-long-form-renewables-africa}
renewables_long_africa <- long_form_table(dat, "7.2.1", "Africa")
renewables_long_africa[,c(1, 7:17)]
```
## Forecasting
Due to the perhaps over-ambitious visualization efforts given my skills and limited time, this report regrettably has not contributed much in terms of forecasting. I believe a solid foundation has been created to enable powerful forecasting techniques for future analysis. It is certainly my intention to keep improving this project, especially with regard to forecasting values, as well as visualizing those forecasts. A simple example, and perhaps the most obvious starting point, has been included in the report to demonstrate some of the elementary forecasting techniques available.
### Fitting a linear model using `lm`
Example for Afghanistan on indicator 7.1.1 (population proportion with access to electricity).
```{r forecast-lm-711-afg-model}
lm_711_afg <- dat %>%
filter(country == "Afghanistan", indicator == "7.1.1", location == "ALLAREA") %>%
select(country, indicator, year, value)
fit <- lm(value ~ year, data = lm_711_afg)
summary(fit)$coeff
```
We can use this linear model to predict the future values until 2030, under the assumption of linearity:
```{r forecast-lm-711-afg-prediction}
new <- data.frame(year = 2001:2030)
predict(fit, new)
```
The best fit of the model suggests that 100% of Afghanistan's will have access to electricity by 2019. Using the lower confidence bound gives us a more conservative prediction of 2023:
```{r forecast-lm-711-afg-prediction-confidence}
predict(fit, new, interval = "prediction")
```
A plot of the `lm` model can be created with `ggplot`.
```{r forecast-lm-711-afg-plot}
ggplot(lm_711_afg, aes(x = year, y = value)) +
geom_point() +
stat_smooth(method = "lm", col = "red")
```
# Results
This project has successfully fetched data from all the desired sources and created a unified dataset that is convenient for exploration and analysis. In my opinion, the key achievement of this project is the creation of functions that visualize the results from the dataset. These functions also serves as a baseline for communicating the results of analysis and forecasting. Due to time constraints and the significant effort I have needed to put into the project, I have unfortunately not been able to produce much forecasting. One of the main goals of the projects were to use trend analysis to predict future results. While a simple example of such analysis is included, there is much potential to improve this aspect of the project.
**Goal** | **Status**
--------------|--------------
Programmatically download data from the relevant sources | Achieved
Clean and merge data into unified datasets | Achieved
Communicate reported results graphically | Achieved^(1)^
Use trend analysis to predict future results | Lacking
Table: Achievement of goals
^(1)^ More visualization techniques can always be added.
# Conclusion
The desired data has been downloaded, cleaned and explored in detail. Visualization of this dataset were considered important from the outset and much effort has been spent on creating good graphics. This has come at the expense of forecasting, which will require more work. The project required more time than anticipated but has been very fun and immensely rewarding; I have learned a lot and only regret not having more time to further the project before submission. I intend to continue working on improving and adding visualizations, and developing forecasting methods including the visualization of those forecasts. I also plan to add a function to pull new data and record the accuracy of the forecasting methods against the new data.