Resolution success highly sensitive to Unicode character choice #52
Comments
Here's another interesting example. In this file: about half of the refstrings (24 out of 50) fail to resolve as-is. If I simply replace |
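This is still an issue in production at least, for example:
Hasegawa, A., Chen, L.: 1975, Kinetic process of plasma heating due to Alfvén wave excitation. Phys. Rev. Lett. 35, 370. DOI. ADS.
I believe there is a fix; is there any reason not to put it in production, @golnazads?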
Alberto, the latest release is v2.0.17 and is running everywhere. However, last week I did address what Peter brought up here, and it is now fixed locally:
{
  "resolved": [
    {
      "refstring": "Hasegawa, A., Chen, L.: 1975, Kinetic process of plasma heating due to Alfvén wave excitation. Phys. Rev. Lett. 35, 370. DOI. ADS.",
      "score": "1.0",
      "bibcode": "1975PhRvL..35..370H"
    }
  ]
}
I have not released it yet, since I am still trying to find one other issue that Jenny has reported. Sorry. I shall release it on Monday.
On Fri, Oct 14, 2022 at 5:18 PM Alberto Accomazzi wrote:
> This is still an issue in production at least, for example:
> Hasegawa, A., Chen, L.: 1975, Kinetic process of plasma heating due to Alfvén wave excitation. Phys. Rev. Lett. 35, 370. DOI. ADS.
> I believe there is a fix, any reason not to put it in production @golnazads <https://github.com/golnazads>?
In the process of working on the new arXiv reference extraction pipeline (adsabs/arxiv-reference-extractor), I've been using the reference resolution microservice to test how successful it is. It seems that the results I get out of the resolver can depend sensitively on which form of Unicode punctuation a reference uses.

Here is a file with some examples: examples.txt. It can be processed with:

It has a bunch of pairs of nearly identical references. I find that for each pair, the first one resolves but the second one doesn't. In some cases, the only difference is a nearly invisible change of an ASCII hyphen (-) to a Unicode en-dash (–), or vice versa. In some bulk tests, I find that pure-ASCII references resolve the most successfully on average, but sometimes a Unicode version resolves when the plain-ASCII version doesn't.
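
For anyone trying to spot these near-invisible differences, here is a minimal Python sketch that lists the non-ASCII characters in a refstring and folds dash-like characters to an ASCII hyphen. It does not call the resolver or reflect its code. The function names, the `DASH_LIKE` table, and the sample page range are all made up for illustration.

```python
import unicodedata

# Assumed set of dash-like characters; this is an illustrative list,
# not one taken from the resolver's code.
DASH_LIKE = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2013": "-",  # EN DASH
    "\u2014": "-",  # EM DASH
    "\u2212": "-",  # MINUS SIGN
}


def non_ascii_report(refstring):
    """Return (index, character, Unicode name) for each non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(refstring)
        if ord(ch) > 127
    ]


def fold_dashes(refstring):
    """Replace the dash-like characters listed above with an ASCII hyphen."""
    return refstring.translate(str.maketrans(DASH_LIKE))


# Made-up pair differing only in the page-range dash.
ascii_ref = "Phys. Rev. Lett. 35, 370-373"
unicode_ref = "Phys. Rev. Lett. 35, 370\u2013373"

print(non_ascii_report(unicode_ref))
print(fold_dashes(unicode_ref) == ascii_ref)
```

Since the ASCII form does not always resolve better, this is meant as a diagnostic for comparing the two members of a pair, not as a recommended preprocessing step.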