Deduplicate Geonames 'City of' prefixes #1609

Closed

@orangejulius commented Feb 28, 2022

A common cause of missed deduplication is Geonames locality/localadmin records that start with 'City of'.

Our name comparison logic is fairly conservative: it only ignores differences in things like punctuation and diacritics. Beyond that, we have to assume that differing names mean the underlying records represent genuinely different places.

Getting too far away from this general stance could be dangerous, but we can handle specific exceptions just fine.

Geonames records that start with 'City of' are one of these cases. Often, there is a Geonames locality record with just the name (like 'New York'), and then a Geonames localadmin record with the 'City of' prefix. Usually only one of those records will have a WOF concordance, so this is still helpful even combined with #1606.
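
To make the idea concrete, here is a minimal sketch of a conservative name comparison with a single targeted exception for the 'City of' prefix. It is not the code from this PR or from the Pelias API: `normalizeName` and `namesMatch` are hypothetical names, and the exact normalization steps are assumptions chosen for illustration.

```javascript
// Hypothetical helpers for illustration only; not the actual Pelias implementation.

// Reduce a name to a comparable form: ignore case, diacritics and punctuation,
// then drop a leading 'City of' as the one targeted exception.
function normalizeName(name) {
  return name
    .normalize('NFD')                   // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, '')    // remove the combining diacritical marks
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, '')   // remove punctuation, keep letters and digits
    .replace(/\s+/g, ' ')               // collapse runs of whitespace
    .trim()
    .replace(/^city of\s+/, '');        // strip the 'City of' prefix
}

// Two names match only if their normalized forms are identical.
function namesMatch(a, b) {
  return normalizeName(a) === normalizeName(b);
}

console.log(namesMatch('City of New York', 'New York'));
console.log(namesMatch('New York', 'Newark'));
```

Everything other than the prefix rule stays strict, so names that differ in any other way are still treated as different places.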

@missinglink
Member

FYI there is some similar logic and IIRC tests too here
https://github.com/pelias/placeholder/blob/master/lib/analysis.js#L87

@orangejulius
Member Author

Ah very nice. That logic is quite a bit simpler so I'll bring it into this PR.

I think it's ok to deduplicate across all of those differences in name. Things like county and locality will (generally) not be deduped anyway because they have different layers, unless they hit one of the exceptions, like one being a parent of the other.
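
As a rough sketch of why the layer check keeps looser name matching safe: the record shape (`id`, `name`, `layer`, `parentIds`) and the functions `looseNameMatch`, `isParentOf` and `isLikelyDuplicate` below are all hypothetical, and the real deduplication logic has more rules than this.

```javascript
// Hypothetical sketch; the record shape and function names are assumptions.

// Loose name comparison: case-insensitive, ignoring a leading 'City of'.
function looseNameMatch(a, b) {
  const clean = (s) => s.toLowerCase().trim().replace(/^city of\s+/, '');
  return clean(a) === clean(b);
}

// True if record `a` is listed among the parents of record `b`.
function isParentOf(a, b) {
  return b.parentIds.includes(a.id);
}

// Names must match first. After that, records are only treated as duplicates
// when they share a layer, or when one is a parent of the other (the exception).
function isLikelyDuplicate(a, b) {
  if (!looseNameMatch(a.name, b.name)) {
    return false;
  }
  if (a.layer === b.layer) {
    return true;
  }
  return isParentOf(a, b) || isParentOf(b, a);
}

const locality = { id: 'a', name: 'New York', layer: 'locality', parentIds: ['b'] };
const localadmin = { id: 'b', name: 'City of New York', layer: 'localadmin', parentIds: [] };
const county = { id: 'c', name: 'New York', layer: 'county', parentIds: [] };

console.log(isLikelyDuplicate(locality, localadmin)); // parent/child exception applies
console.log(isLikelyDuplicate(locality, county));     // different layers, unrelated
```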

@orangejulius
Member Author

I just realized this PR basically re-implements #1371. They solve the same problem and even in almost exactly the same way.

#1371 is a bit more sophisticated, so I'm actually tempted to merge that one.

@orangejulius
Member Author

Closing in favor of #1371
