Author: Brad Boyle ([email protected])
The GNRS is a batch application for resolving & standardizing political division names against the GADM Global Administrative Divisions Database (https://gadm.org/), with additional names and codes from Geonames (http://www.geonames.org/) and Natural Earth (https://www.naturalearthdata.com/). The GNRS resolves political division names at three levels: country (admin_0), state/province (admin_1) and county/parish (admin_2). Resolution is performed in a series of steps, beginning with direct matching to standard names, followed by direct matching to alternate names in different languages, followed by direct matching to standard codes (such as ISO and FIPS codes). If direct matching fails, the GNRS attempts to match to standard and then alternate names using fuzzy matching, but does not perform fuzzy matching of political division codes. The GNRS works down the political division hierarchy, stopping at the current level if all matches fail. In other words, if a country cannot be matched, the GNRS does not attempt to match state or county.
Results returned by the GNRS include the original political division names, the resolved political division names and IDs from GADM and Geonames, with additional information on how each name was resolved and the quality of the overall match.
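The matching order described above can be illustrated with a toy sketch. This is not the GNRS implementation (which runs against a PostgreSQL database); the lookup table, the function names `match_country` and `resolve`, and the step labels are all hypothetical, and fuzzy matching is left out.

```shell
# Toy reference data (illustrative only; not taken from GADM or Geonames).
# Columns: standard name | alternate name | code
countries='United States|Estados Unidos|US
Mexico|Mejico|MX'

# Try direct matching on standard name, then alternate name, then code.
# Prints "<standard name>|<step>" on success and returns 1 on failure.
match_country() {
  for step in 1 2 3; do
    hit=$(printf '%s\n' "$countries" |
      awk -F '|' -v q="$1" -v col="$step" '$col == q { print $1; exit }')
    if [ -n "$hit" ]; then
      case "$step" in
        1) echo "$hit|standard name" ;;
        2) echo "$hit|alternate name" ;;
        3) echo "$hit|code" ;;
      esac
      return 0
    fi
  done
  return 1   # the real application would attempt fuzzy name matching here
}

# Work down the hierarchy: only move to the next level if the country matched.
resolve() {   # $1 = country, $2 = state/province
  if result=$(match_country "$1"); then
    echo "country matched: $result"
    echo "next level to resolve: $2"
  else
    echo "country not matched; state/province not attempted"
  fi
}

resolve "US" "Arizona"
resolve "Atlantis" "Nowhere"
```

The point of the sketch is the ordering: each level tries the cheapest direct match first, and a failure at one level stops resolution of the levels below it.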
Ubuntu 16.04 or higher
PostgreSQL 12.2 or higher
Perl v5.26.1 or higher
Perl module Text::CSV
PHP 7.2.24 or higher
PHP extensions:
- php-cli
- php-mbstring
- php-curl
- php-xml
- php-json
- php-services-json
- php-pgsql
- Local installation of database `geonames`
  - Required for building the GNRS database
  - See repo: `https://github.com/ojalaquellueva/geonames.git`
- Local installation of database `gadm`
  - Required for building the GNRS database
  - See repo: `https://github.com/ojalaquellueva/gadm.git`
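As a convenience, the minimum versions listed above can be checked with a short script. This is a sketch only: the helper names `version_ge` and `report` are made up for this example, and the comparison assumes GNU `sort -V` is available (as it is on Ubuntu).

```shell
# Return 0 if version $1 >= version $2 (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print a one-line status for a tool.
# $1 = tool name, $2 = minimum version, $3 = detected version (empty if not found)
report() {
  if [ -z "$3" ]; then
    echo "MISSING  $1 (need >= $2)"
  elif version_ge "$3" "$2"; then
    echo "OK       $1 $3"
  else
    echo "TOO OLD  $1 $3 (need >= $2)"
  fi
}

# Detect installed versions; each stays empty if the tool is not on the PATH.
psql_v=""; perl_v=""; php_v=""
if command -v psql >/dev/null 2>&1; then
  psql_v=$(psql --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
fi
if command -v perl >/dev/null 2>&1; then
  perl_v=$(perl -e 'printf "%vd", $^V')
fi
if command -v php >/dev/null 2>&1; then
  php_v=$(php -r 'echo PHP_VERSION;')
fi

report "PostgreSQL client (psql)" "12.2" "$psql_v"
report "Perl" "5.26.1" "$perl_v"
report "PHP" "7.2.24" "$php_v"

# Text::CSV has no minimum version listed, so only check that it loads.
if command -v perl >/dev/null 2>&1 && perl -MText::CSV -e 1 2>/dev/null; then
  echo "OK       Perl module Text::CSV"
else
  echo "MISSING  Perl module Text::CSV"
fi
```

Note that `psql --version` reports the client version, which may differ from the server version on some installations.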
I recommend the following setup:
# Create application base directory (call it whatever you want)
mkdir -p gnrs
cd gnrs
# Create application code directory
mkdir src
# Install application code to application code directory
cd src
git clone https://github.com/ojalaquellueva/gnrs
# Move data and sensitive parameters directories outside of code directory
# Be sure to change paths to these directories (in params.sh) accordingly
mv data ../
mv config ../
Note: temporary data directory in /tmp/gnrs (used by gnrs api) is installed on the fly by the application.
To avoid filling up the gnrs temp directory, consider adding a crontab entry to delete files older than a certain number of days. For example, the following cron job finds and deletes all files older than 7 days, every day at 4:02 am:
02 4 * * * find /tmp/gnrs/* -type f -mtime +7 -print0 | xargs -0 rm
Another version for systems that don't support -print0:
02 4 * * * find /tmp/gnrs/* -type f -mtime +7 -exec rm {} \;
Whichever you use, be sure to test first to verify that the list of files makes sense:
find /tmp/gnrs/* -type f -mtime +7
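To see how the `-mtime +7` test behaves before touching the real temp directory, the same `find` expression can be tried on a throwaway directory. This sketch assumes GNU `touch -d` and a `find` that supports `-delete`; the file names are made up.

```shell
# Build a scratch directory with one stale file and one fresh file.
demo_dir=$(mktemp -d)
touch -d '10 days ago' "$demo_dir/old_result.csv"
touch "$demo_dir/new_result.csv"

# Dry run: list what would be deleted (only the 10-day-old file should appear).
stale=$(find "$demo_dir" -type f -mtime +7)
echo "Would delete: $stale"

# Delete for real, then confirm that only the fresh file remains.
find "$demo_dir" -type f -mtime +7 -delete
remaining=$(ls "$demo_dir")
echo "Remaining: $remaining"

# Clean up the scratch directory.
rm -r "$demo_dir"
```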
The input file for the GNRS must be a UTF-8 plain text file with the following fields:
Field name | Required? | Meaning |
---|---|---|
user_id | No | User-supplied integer id for each row, if desired |
country | Yes | Country name |
state_province | No | State/province name |
county_parish | No | County/parish name |
The header `user_id,country,state_province,county_parish` must be the first line of the file. Place this file in the GNRS user data directory (`data/user/`; path and directory name set in file `params.sh`).
- Numeric IDs are optional, but the header and all tab delimiters must be included
user_id<tab>country<tab>state_province<tab>county_parish
1<tab>Russia<tab>Lipetsk<tab>Dobrovskiy rayon
2<tab>Mexico<tab>Sonora, Estado de<tab>Huépac
3<tab>Guatemala<tab>Izabal<tab>
4<tab>USA<tab>Arizona<tab>Pima County
5<tab>U.S.A<tab>Arizona<tab>Pima<tab>
6<tab>Mexico<tab>Quintana Roo<tab>Lázaro Cárdenas
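A quick pre-flight check can catch a missing header or a dropped tab before a file is submitted. The sketch below is not part of the GNRS; the function name `check_input` is hypothetical, and it only covers the tab-delimited layout shown above.

```shell
# Write a small tab-delimited input file (rows taken from the example above).
input_file=$(mktemp)
printf 'user_id\tcountry\tstate_province\tcounty_parish\n' > "$input_file"
printf '1\tRussia\tLipetsk\tDobrovskiy rayon\n' >> "$input_file"
printf '3\tGuatemala\tIzabal\t\n' >> "$input_file"

# Check the header, the number of fields per line, and the required country field.
# Prints one message per problem and returns non-zero if any problem was found.
check_input() {
  awk -F '\t' '
    NR == 1 && $0 != "user_id\tcountry\tstate_province\tcounty_parish" {
      print "line 1: unexpected header"; bad = 1
    }
    NF != 4 { print "line " NR ": expected 4 fields, found " NF; bad = 1 }
    NR > 1 && $2 == "" { print "line " NR ": country is required"; bad = 1 }
    END { exit bad }
  ' "$1"
}

check_input "$input_file" && echo "input file looks valid"
```

Empty trailing fields (as in the Guatemala row) still count as fields, which is why the trailing tab must be kept.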
- `gnrspar.pl`: must be tab delimited
- `gnrs_batch.sh`: tab delimited or comma delimited. Specify on command line (see below).
GNRS output is saved as a utf-8 CSV file with header. By default, the name of the output file is the basename of the input file, plus suffix "gnrs_results.csv". Fields are as follows:
Field name | Meaning |
---|---|
id | gnrs ID of each record |
poldiv_full | Verbatim country, state/province and county/parish, concatenated with '@' delimiter |
country_verbatim | Verbatim country |
state_province_verbatim | Verbatim state/province |
county_parish_verbatim | Verbatim county/parish |
country | Resolved country |
state_province | Resolved state/province |
county_parish | Resolved county/parish |
country_id | Geonames ID of resolved country |
state_province_id | Geonames ID of resolved state/province |
county_parish_id | Geonames ID of resolved county/parish |
match_method_country | Method used to match country |
match_method_state_province | Method used to match state/province |
match_method_county_parish | Method used to match county/parish |
match_score_country | Country match score (if fuzzy matched) |
match_score_state_province | State/province match score (if fuzzy matched) |
match_score_county_parish | County/parish match score (if fuzzy matched) |
poldiv_submitted | Lowest political division submitted |
poldiv_matched | Lowest political division matched |
match_status | Completeness of overall match |
user_id | User id, if supplied |
Place your input file in the gnrs user data directory (path and directory name set in param file). Input file must be named "gnrs_submitted.csv".
- This should be considered the default application as it is by far the fastest
- Splits submitted file into batches, removing duplicates, and processes several batches at once using multiple cores.
- Reassembles batches into single file when all batches complete
- Invokes `gnrs_batch.sh` (see below)
./gnrspar.pl -in <input_filename_and_path> -nbatch <batches> -opt <makeflow_options> <other options>
Option | Meaning | Required? | Default value | Values |
---|---|---|---|---|
-in | Input file and path | Yes | ||
-out | Output file and path | No | /path/to/_gnrs_results.tsv | |
-nbatch | Number of batches | Yes | ||
-opt | Makeflow options | No | ||
-d | Output file delimiter | No | t | c (CSV), t (TSV) |
./gnrspar.pl -in "../data/user/gnrs_testfile.csv" -nbatch 3
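The README does not say how to choose `-nbatch`. One possible starting point, offered here purely as an assumption, is one batch per available core. The sketch below only prints the command rather than running it; the function name `gnrspar_cmd` is hypothetical.

```shell
# Print a gnrspar.pl command line.
# $1 = input file, $2 = number of batches (optional; defaults to the core count)
gnrspar_cmd() {
  nbatch=${2:-$(nproc 2>/dev/null || echo 1)}   # fall back to 1 if nproc is unavailable
  echo "./gnrspar.pl -in \"$1\" -nbatch $nbatch"
}

# Show the command for the test file, with and without an explicit batch count.
gnrspar_cmd "../data/user/gnrs_testfile.csv"
gnrspar_cmd "../data/user/gnrs_testfile.csv" 3
```

Whether one batch per core is actually the best setting depends on file size and database load, so treat the default as something to tune.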
Import, name resolution and export of results are run as a single operation by invoking the following script:
./gnrs_batch.sh [-option1] [-option2] ...
Option | Purpose | Required? | Default value | Comments |
---|---|---|---|---|
-f | Input file and path | Yes | ||
-o | Output file and path | No | /path/to/_gnrs_results.csv | |
-d | Output file delimiter | No | c | c=comma (CSV), t=tab (TSV) |
-n | No header | No | FALSE | Input file does not contain header. Default value (FALSE) means file contains header as first line. |
-a | Api call | No (yes for api) | | Invokes other options such as -s and -p |
-s | Silent mode: suppress all (confirmations & progress messages) | No | | |
-m | Send notification emails | No | | Must be followed by valid email |
-r | Remove from cache | No | FALSE | Remove any results corresponding to submitted political divisions from cache. Forces resolution from scratch of all values in current batch. |
-c | Clear cache | No | FALSE | Clear entire cache |
Example:
./gnrs_batch.sh -f "../data/user/gnrs_testfile.csv" -o "/home/boyle/testing/gnrs_testfile_scrubbed.csv"
- The above assumes the command is being run from the same directory as the target script, `gnrs_batch.sh`.
- If running from a different directory, prepend the command with the path to `gnrs_batch.sh`, unless you have added this path to your environment.
- In this example, the path to the data directory is relative to the working directory. You could also use the full path.
- Output file "gnrs_testfile_scrubbed.csv" will be written to directory "/home/boyle/testing/"
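When calling the script from another directory, it can help to turn every path into an absolute one first. The following sketch only prints the resulting command; `GNRS_SRC` and `gnrs_batch_cmd` are hypothetical names, and the default location assumes the directory layout recommended earlier.

```shell
# Hypothetical variable: directory that contains gnrs_batch.sh.
GNRS_SRC="${GNRS_SRC:-$HOME/gnrs/src/gnrs}"

# Print a gnrs_batch.sh command that works from any working directory.
# $1 = input file (relative or absolute), $2 = output file
gnrs_batch_cmd() {
  in_dir=$(cd "$(dirname "$1")" && pwd) || return 1   # absolute directory of the input file
  echo "$GNRS_SRC/gnrs_batch.sh -f \"$in_dir/$(basename "$1")\" -o \"$2\""
}

# Example: show the command for a file in the current directory.
demo_input="./gnrs_testfile.csv"
gnrs_batch_cmd "$demo_input" "/home/boyle/testing/gnrs_testfile_scrubbed.csv"
```

Adding `-d t` to the printed command would request tab-delimited output instead of the default CSV.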
For up-to-date examples of API usage in PHP and R, see the following example files in the `api/` subdirectory of this repository:
- `gnrs_api_example.php`
- `gnrs_api_example.R`
Also see API documentation at http://bien.nceas.ucsb.edu/bien/tools/gnrs/gnrs-api/