Resolution success highly sensitive to Unicode character choice #52

Open
pkgw opened this issue Feb 28, 2022 · 3 comments

pkgw commented Feb 28, 2022

While working on the new arXiv reference extraction pipeline (adsabs/arxiv-reference-extractor), I've been using the reference resolution microservice to measure how well the extracted references resolve. The results I get from the resolver seem to depend sensitively on which form of Unicode punctuation a reference uses.

Here is a file with some examples: examples.txt. It can be processed with:

python test_reference_service.py -i examples.txt

It contains a number of pairs of nearly identical references. For each pair, the first one resolves but the second one doesn't. In some cases, the only difference is a nearly invisible change of an ASCII hyphen (-) to a Unicode en-dash (–), or vice versa. In bulk tests, pure-ASCII references resolve most successfully on average, but sometimes a Unicode version resolves when the plain-ASCII version doesn't.
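Because a hyphen and an en-dash are hard to tell apart by eye, it can help to list exactly which code points differ between the two members of a pair, and to produce an ASCII-dash variant for comparison. The following is a minimal sketch under stated assumptions: `describe_differences`, `ascii_dashes`, and `DASH_MAP` are hypothetical helpers, not part of the resolver service or of `test_reference_service.py`, and the sample strings are made up for illustration.

```python
import unicodedata

# Hypothetical mapping (an assumption, not the resolver's behavior):
# fold common Unicode dash-like characters to the ASCII hyphen-minus.
DASH_MAP = {
    0x2010: "-",  # HYPHEN
    0x2011: "-",  # NON-BREAKING HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def ascii_dashes(refstring: str) -> str:
    """Return a copy of refstring with Unicode dashes replaced by '-'."""
    return refstring.translate(DASH_MAP)


def describe_differences(first: str, second: str):
    """List (index, name_in_first, name_in_second) for each differing position.

    Only meaningful for strings of equal length, which is the case when a
    single punctuation character has been swapped for another.
    """
    if len(first) != len(second):
        raise ValueError("strings must have the same length")
    return [
        (i, unicodedata.name(a, "UNKNOWN"), unicodedata.name(b, "UNKNOWN"))
        for i, (a, b) in enumerate(zip(first, second))
        if a != b
    ]


# Made-up pair that differs only in the dash of the page range.
plain = "Journal 12, 100-110"
fancy = "Journal 12, 100\u2013110"
print(describe_differences(plain, fancy))
print(ascii_dashes(fancy) == plain)
```

Running both variants of each reference through the resolver and comparing the outcomes would show whether the dash alone accounts for a failure.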

pkgw commented Oct 6, 2022

Here's another interesting example. In this file:

2017SoPh__292__117L.txt

about half of the refstrings (24 out of 50) fail to resolve as-is. If I simply replace Alfvén with Alfven everywhere, most of the failures go away: only 8 out of 50 still fail. There are 17 occurrences of the accented "Alfvén" in this particular file, so here its presence turns a resolvable refstring into an unresolvable one virtually every time.
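The manual Alfvén to Alfven replacement can be generalized to any accented Latin letter with Unicode decomposition. This is only a sketch of a client-side workaround: `strip_diacritics` is a hypothetical helper name, and nothing here is claimed to be what the resolver does or how it was fixed.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    'é' decomposes into 'e' followed by U+0301 COMBINING ACUTE ACCENT;
    dropping the combining mark leaves the bare ASCII letter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Alfvén"))
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged, so a fully ASCII variant would need an additional explicit mapping for them.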

@aaccomazzi
Member

This is still an issue, in production at least. For example:

Hasegawa, A., Chen, L.: 1975, Kinetic process of plasma heating due to Alfvén wave excitation. Phys. Rev. Lett. 35, 370. DOI. ADS.

I believe there is a fix. Is there any reason not to put it in production, @golnazads?

golnazads commented Oct 14, 2022 via email
