Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dedupe geonames self parent #1615

Merged
merged 1 commit into from
Mar 11, 2022
Merged

Conversation

orangejulius
Copy link
Member

@orangejulius orangejulius commented Mar 8, 2022

this PR is branched off of #1606, since we'll likely merge that one first. comparison for only this PR is here

Background

In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the locality and localadmin layer have themselves as parents in that layer.

Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names.

Hierarchy checks

This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the lower record.

So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin.

Geonames localadmins

Geonames seems to have a penchant for having both a locality and a localadmin record for a given city, even when the local administrative divisions don't really support such nuance.

These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results.

Deduplication

Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:

  • A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin
  • A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record has no localadmin parent at all

Concordances (from #1606) generally don't help either, since there often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin.

Adding a hierarchy exception

This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible.

Impact

Of the 314 cities in our top_us_cities fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions

Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a county, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a few improvements.

Background
==========

In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.

Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.

Hierarchy checks
================

This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.

So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.

Geonames localadmins
====================

Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.

These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.

Deduplication
=============

Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.

Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
  localadmin beause the WOF record has no localadmin parent at all

Concordances (from #1606) generally
don't help either, since ther often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.

Adding a hierarchy exception
============================

This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that assuming
all the other layers are the same, the names are compatible, etc,
deduplication is still possible.

Impact
======

Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, most of them (125) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.

With this PR, there are only 15 cases.

Potential regressions
=====================

Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

That said, there are no regressions in our acceptance tests, and quite a
few improvements.
@orangejulius orangejulius force-pushed the dedupe-geonames-self-parent branch from 3eadc2e to 0040864 Compare March 10, 2022 17:19
@orangejulius orangejulius merged commit c2eac2c into master Mar 11, 2022
@orangejulius orangejulius deleted the dedupe-geonames-self-parent branch March 11, 2022 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant