feat(dedupe): Handle Geonames 'City of' prefixes
A common cause of deduplication errors is Geonames locality/localadmin
records that start with 'City of'.

Our name comparison logic is fairly conservative: it only normalizes
things like punctuation, diacriticals, etc. Beyond that, we have to
treat differing names as meaning that the underlying records
represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we
can handle specific outliers just fine.

Geonames records that start with 'City of' are one of these cases.
Often, there is a Geonames `locality` record with just the name (like
'New York'), and then a Geonames `localadmin` record with the 'City of'
prefix. Usually only one of those records will have a WOF concordance,
so this is still helpful even combined with
#1606
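
To illustrate the case described above, here is a minimal sketch of the prefix stripping. The name `stripCityOfPrefix` is hypothetical, and the sketch avoids lodash so it runs on its own; the actual helper in this commit is `nameForComparison`, shown in the diff below.

```javascript
// Hypothetical stand-in for the commit's nameForComparison helper,
// written without lodash so that it is self-contained.
function stripCityOfPrefix(name) {
  // recurse into plain objects such as { default: '...', en: '...' }
  if (name !== null && typeof name === 'object' && !Array.isArray(name)) {
    const result = {};
    for (const key of Object.keys(name)) {
      result[key] = stripCityOfPrefix(name[key]);
    }
    return result;
  }

  // anything that is not a string is returned unchanged
  if (typeof name !== 'string') {
    return name;
  }

  // same pattern as the commit: case-insensitive 'City of <rest>'
  const matches = name.match(/City of (.*)/i);
  return matches ? matches[1] : name;
}

// Illustrative records modelled on the example in the commit message
const localadmin = { name: { default: 'City of New York' } };
const locality = { name: { default: 'New York' } };

console.log(stripCityOfPrefix(localadmin.name)); // { default: 'New York' }
console.log(
  stripCityOfPrefix(localadmin.name).default ===
  stripCityOfPrefix(locality.name).default
); // true
```

After normalization both records carry the same name, so the existing conservative comparison can treat them as duplicates without being loosened in general.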
orangejulius committed Feb 28, 2022
1 parent 6aa997d commit 5619a12
Showing 2 changed files with 45 additions and 2 deletions.
32 changes: 30 additions & 2 deletions helper/diffPlaces.js
@@ -100,13 +100,41 @@ function isParentHierarchyDifferent(item1, item2){
   });
 }

+/* Generate a 'name' value for comparison
+ * This includes normalizations for specific dataset features
+ */
+function nameForComparison(name) {
+  // recurse into object properties if this is an object
+  if (_.isPlainObject(name)) {
+    const new_object = {};
+    Object.keys(name).forEach((key) => {
+      new_object[key] = nameForComparison(name[key]);
+    });
+
+    return new_object;
+  }
+
+  // otherwise, only handle strings
+  if (!_.isString(name)) {
+    return name;
+  }
+
+  const city_of_regex = new RegExp(/City of (.*)/, 'i');
+  const matches = name.match(city_of_regex);
+  if (matches) {
+    return matches[1];
+  }
+
+  return name;
+}
+
 /**
  * Compare the name properties if they exist.
  * Returns false if the objects are the same, else true.
  */
 function isNameDifferent(item1, item2, requestLanguage){
-  let names1 = _.get(item1, 'name');
-  let names2 = _.get(item2, 'name');
+  let names1 = nameForComparison(_.get(item1, 'name'));
+  let names2 = nameForComparison(_.get(item2, 'name'));

   // check if these are plain 'ol javascript objects
   let isPojo1 = _.isPlainObject(names1);
15 changes: 15 additions & 0 deletions test/unit/helper/diffPlaces.js
@@ -539,6 +539,21 @@ module.exports.tests.isNameDifferent = function (test, common) {
   });
 };

+module.exports.tests.nameForcomparison = function (test, common) {
+  test('geonames City of', function (t) {
+    t.false(isNameDifferent(
+      { name: { default: 'City of New York' } },
+      { name: { default: 'New York' } }
+    ), 'Geonames \'City of\' prefix is ignored');
+
+    t.false(isNameDifferent(
+      { name: { en: 'City of New York' } },
+      { name: { default: 'New York' } }
+    ), 'Geonames \'City of\' prefix is ignored across languages');
+    t.end();
+  });
+};
+
 module.exports.tests.normalizeString = function (test, common) {
   test('lowercase', function (t) {
     t.equal(normalizeString('Foo Bar'), 'foo bar');
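
One detail of the pattern in `nameForComparison` is easy to miss: the commit message talks about names that start with 'City of', but the regex itself is not anchored to the start of the string. The sketch below compares the committed pattern with an anchored variant. The anchored regex, the `strip` helper, and the name 'Twin City of Example' are assumptions made up for illustration and are not part of this commit.

```javascript
// Pattern as it appears in the diff (unanchored, case-insensitive)
const committed = /City of (.*)/i;
// Hypothetical anchored variant, NOT part of the commit
const anchored = /^City of (.*)/i;

// Hypothetical helper: return the captured remainder, or the input unchanged
function strip(name, regex) {
  const matches = name.match(regex);
  return matches ? matches[1] : name;
}

console.log(strip('City of New York', committed));     // New York
console.log(strip('City of New York', anchored));      // New York
console.log(strip('Twin City of Example', committed)); // Example
console.log(strip('Twin City of Example', anchored));  // Twin City of Example
```

Whether the unanchored behaviour matters in practice depends on the names actually present in the data; the tests in this commit only cover the prefix case.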
