Difficult names match to higher-order taxa #235

Open
charvolant opened this issue Jan 24, 2023 · 3 comments

@charvolant
Contributor

For example,

https://lists.ala.org.au/speciesListItem/list/dr884?q=Caladenia+dilatata

where Caladenia dilatata is matched to the genus Caladenia rather than to the species. The taxonomy of C. dilatata is complex and messy, with many misapplications, resulting in the higher-order match.

Suggested fix: allow exact match parameters to be passed to the namematching service, with it choosing accepted taxa over synonyms if there are multiple possibilities.
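A minimal sketch of the "accepted over synonyms" rule, to make the suggestion concrete. This is not the namematching service's API: the `Candidate` record, its `status` values and `pick_exact_match` are all hypothetical names, and the scores are only illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    # Hypothetical record for one search hit; not a type from the real library.
    name: str
    status: str   # assumed values: "accepted", "synonym", "misapplied"
    score: float


def pick_exact_match(query: str, candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the best exact-name hit, preferring accepted taxa over the rest."""
    # Keep only hits whose name equals the supplied name (case-insensitive).
    exact = [c for c in candidates if c.name.lower() == query.lower()]
    if not exact:
        return None
    # If any exact hit is accepted, ignore the synonyms and misapplications.
    accepted = [c for c in exact if c.status == "accepted"]
    pool = accepted or exact
    # The search score is used only to break ties inside the chosen pool.
    return max(pool, key=lambda c: c.score)


candidates = [
    Candidate("Caladenia dilatata", "misapplied", 7.7),
    Candidate("Caladenia dilatata", "accepted", 6.9),
]
best = pick_exact_match("Caladenia dilatata", candidates)
print(best.status)
```

With this ordering the accepted taxon wins even when a misapplication scores higher, provided all exact hits are available to choose from.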

@charvolant charvolant self-assigned this Jan 24, 2023
@charvolant
Contributor Author

All SA examples have a supplied name of "Arachnorchis dilatata (R.Br.) D.L.Jones & M.A.Clem." Check that the SDS is correctly identifying these names.

@charvolant
Contributor Author

charvolant commented Jan 24, 2023

The search in the namematching library returns only 10 results. There are 17 matching results, only one of which is accepted, and that one gets left off.

The accepted value has a lower score than the misapplications: 6.9 vs 7.7. Explain, please.

Misapplied score:

7.71838 sum of:
  4.915918 weight(name:caladenia dilatata in 597110) [BM25Similarity], result of:
    4.915918 score(freq=1.0), product of:
      10.559289 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        17 n, number of documents containing term
        674339 N, total number of documents with field
      0.46555385 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        2.0 dl, length of field
        2.1226935 avgdl, average length of field
  2.8024619 weight(genus:caladenia in 597110) [BM25Similarity], result of:
    2.8024619 score(freq=1.0), product of:
      6.1654162 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1299 n, number of documents containing term
        618560 N, total number of documents with field
      0.45454544 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        1.0 dl, length of field
        1.0 avgdl, average length of field

Accepted score:

6.9079895 sum of:
  4.1055274 weight(name:caladenia dilatata in 329202) [BM25Similarity], result of:
    4.1055274 score(freq=1.0), product of:
      10.559289 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        17 n, number of documents containing term
        674339 N, total number of documents with field
      0.38880718 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        3.0 dl, length of field
        2.1226935 avgdl, average length of field
  2.8024619 weight(genus:caladenia in 329202) [BM25Similarity], result of:
    2.8024619 score(freq=1.0), product of:
      6.1654162 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
        1299 n, number of documents containing term
        618560 N, total number of documents with field
      0.45454544 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
        1.0 freq, occurrences of term within document
        1.2 k1, term saturation parameter
        0.75 b, length normalization parameter
        1.0 dl, length of field
        1.0 avgdl, average length of field

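The key element here is dl / avgdl in the name field, where dl is the document length (really the field length for a specific field) and avgdl is the average document length in the corpus. The accepted document has two entries, Caladenia dilatata and Caladenia dilatata Caladenia dilatata R.Br. (what?!), and has dl = 3.0. The synonym document has just Caladenia dilatata and has dl = 2.0. A larger dl lowers the tf factor, so the accepted document scores lower on the name field even though the idf is identical.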
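To check the arithmetic, the two totals can be reproduced from the formulas printed in the explain output. This is a minimal sketch: the function names are my own, the numbers are copied from the output above, and Lucene works in single precision, so the last digits may differ slightly.

```python
import math


def bm25_idf(n: float, N: float) -> float:
    # idf = log(1 + (N - n + 0.5) / (n + 0.5)), as printed in the explain output.
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# name field: same idf for both documents, only dl differs.
idf_name = bm25_idf(n=17, N=674339)
name_misapplied = idf_name * bm25_tf(freq=1.0, dl=2.0, avgdl=2.1226935)
name_accepted = idf_name * bm25_tf(freq=1.0, dl=3.0, avgdl=2.1226935)

# genus field: identical contribution for both documents.
genus_weight = bm25_idf(n=1299, N=618560) * bm25_tf(freq=1.0, dl=1.0, avgdl=1.0)

misapplied_total = name_misapplied + genus_weight
accepted_total = name_accepted + genus_weight

print(f"misapplied: {misapplied_total:.4f}")
print(f"accepted:   {accepted_total:.4f}")
print(f"difference: {misapplied_total - accepted_total:.4f}")
```

The genus term cancels out, so the whole gap between the two scores comes from the name-field tf, that is, from dl being 3.0 rather than 2.0.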

@charvolant
Contributor Author

The complete name Caladenia dilatata R.Br. is correctly supplied in combined-20210811-4, which means that the erroneous name is being constructed during index creation.

charvolant added a commit to charvolant/specieslist-webapp that referenced this issue Mar 7, 2023
Note this will work with the 1.9 version of the namematching client and its inherited name-matching library