-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider capitalization #162
base: dev
Are you sure you want to change the base?
Changes from all commits
f61b04d
e0b0596
47fc9ff
368435b
94a6afd
8095df2
d11c1cd
17237f7
7ad9ecb
22e163c
336fae5
5cd0192
4567137
ec38051
4d70c3d
d387e79
633a4b2
b941b15
0e8cafa
378ef5f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,41 +3,95 @@ | |
:mod:`Quantulum` disambiguation functions. | ||
""" | ||
|
||
import logging | ||
|
||
from . import classifier as clf | ||
from . import load | ||
from . import no_classifier as no_clf | ||
|
||
_LOGGER = logging.getLogger(__name__) | ||
|
||
############################################################################### | ||
|
||
|
||
def disambiguate_unit(unit_surface, text, lang="en_US"): | ||
""" | ||
Resolve ambiguity between units with same names, symbols or abbreviations. | ||
:returns (str) unit name of the resolved unit | ||
""" | ||
if clf.USE_CLF: | ||
base = clf.disambiguate_unit(unit_surface, text, lang).name | ||
else: | ||
base = ( | ||
load.units(lang).symbols[unit_surface] | ||
or load.units(lang).surfaces[unit_surface] | ||
or load.units(lang).surfaces_lower[unit_surface.lower()] | ||
or load.units(lang).symbols_lower[unit_surface.lower()] | ||
) | ||
units = attempt_disambiguate_unit(unit_surface, text, lang) | ||
if units and len(units) == 1: | ||
return next(iter(units)).name | ||
|
||
# Change the capitalization of the last letter to find a better match. | ||
# Capitalization is sometimes cause of confusion, but the | ||
# capitalization of the prefix is too important to alter. | ||
|
||
if len(base) > 1: | ||
base = no_clf.disambiguate_no_classifier(base, text, lang) | ||
elif len(base) == 1: | ||
base = next(iter(base)) | ||
# If the unit is longer than two prefixes, we set everything to lower | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The whole following flow is convoluted and not concise. I don't really see how there is not a ton of issues here.
In general, this whole flow needs more explanation like "units that have more than two letters are most likely case insensitive. Only the capitalization of the prefix matters. We therefore only try out lowering the last letters" |
||
# except the first letter. | ||
if len(unit_surface) > 2: | ||
unit_changed = unit_surface[0] + unit_surface[1:].lower() | ||
if unit_changed == unit_surface: | ||
return resolve_ambiguity(units, unit_surface, text) | ||
text_changed = text.replace(unit_surface, unit_changed) | ||
new_units = attempt_disambiguate_unit(unit_changed, text_changed, lang) | ||
units = get_a_better_one(units, new_units) | ||
return resolve_ambiguity(units, unit_surface, text) | ||
|
||
if base: | ||
base = base.name | ||
if not unit_surface or unit_surface[0] not in load.METRIC_PREFIXES.keys(): | ||
# Only apply next work around if the first letter is a SI-prefix | ||
return resolve_ambiguity(units, unit_surface, text) | ||
|
||
unit_changed = unit_surface[:-1] + unit_surface[-1].swapcase() | ||
text_changed = text.replace(unit_surface, unit_changed) | ||
new_units = attempt_disambiguate_unit(unit_changed, text_changed, lang) | ||
units = get_a_better_one(units, new_units) | ||
return resolve_ambiguity(units, unit_surface, text) | ||
|
||
|
||
def attempt_disambiguate_unit(unit_surface, text, lang): | ||
"""Returns list of possibilities""" | ||
try: | ||
if clf.USE_CLF: | ||
return clf.attempt_disambiguate_unit(unit_surface, text, lang) | ||
else: | ||
base = "unk" | ||
return no_clf.attempt_disambiguate_no_classifier(unit_surface, text, lang) | ||
except KeyError: | ||
return None | ||
|
||
|
||
def get_a_better_one(old, new): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This method name is really not saying anything. I also don't get what it does - Why is the length of units important? Is there not a more concise way to filter out None (potentially at the source of creating it)? |
||
"""Decide if we pick new over old, considering them being None, and | ||
preferring the smaller one.""" | ||
if not new: | ||
return old | ||
if not old: | ||
return new | ||
if len(new) < len(old): | ||
return new | ||
return old | ||
|
||
|
||
return base | ||
def resolve_ambiguity(units, unit, text): | ||
if not units: | ||
if unit and clf.USE_CLF: | ||
raise KeyError('Could not find unit "%s" from "%s"' % (unit, text)) | ||
else: | ||
return "unk" | ||
if len(units) == 1: | ||
return next(iter(units)).name | ||
_LOGGER.warning( | ||
"Could not resolve ambiguous units: '{}'. For unit '{}' in text '{}'. ".format( | ||
", ".join(str(u) for u in units), unit, text | ||
) | ||
) | ||
# Deterministically getting something out of units. | ||
return next(iter(sorted(u.name for u in units))) | ||
|
||
|
||
############################################################################### | ||
|
||
|
||
def disambiguate_entity(key, text, lang="en_US"): | ||
""" | ||
Resolve ambiguity between entities with same dimensionality. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need way more test cases to check if this (rather big) change in the flow works as expected. Please include one test case for every edge case you have considered (i.e. more than two letters, metric prefix, no metric prefix, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would be nice, but that would mean that units.json needs to be updated. It misses for example MegaLitre. For the MLitre vs MLitrE vs mLitre cases.
It is weird that some units don't have all prefixes, and some units have more prefixes defined than others. Is it really a good idea to hardcode what prefixes belong to what units? Why don't have all prefixes available to all units?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because one of the criteria for a unit to be listed is for it to be common.
In our case, that is defined by having a Wikipedia page (or being redirected to it).
This is not the case for all units (Take for example "Deci-tonne").
If you find a prefix-unit combination that has a Wikipedia page and is missing from units.json, feel free to add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Megalitre
is correctly recognized by quantulum.Mlitre
and the such don't make sense IMO. However, I am easily convinced otherwise by some sensible body of text that uses "MLitre" (Please post one here as a comment). Adding<prefix><surface>
is then trivial by manipulating the load methodThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
mbar
andMbar
? Please add test cases for them.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because a tonne is a different kind of unit. It is just MegaGram. We don't have to support Tonne -> MG conversion, that is clearly a non-goal, but we can support full prefix for units for which this is generally expected, such as SI units. If you don't want to fully support SI units with prefixes, specify that as a non-goal on the README so that people won't have wrong expectations from using this. (But I think a lot of people expect all SI units to just work with all prefixes.)
Sure, but you have asked me in the past to support such "nonsensical" cases, so that's why I added support for that and wanted to add test cases for that:
Well apparently Mlitre isn't as nonsensical as it me seems, here you have it:
Source: https://www.climate-policy-watcher.org/water-quality/v-1.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will look into that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the issue of recognizing units like "Mlitre" to #183
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of expectations for supported units is pretty clear in my opinion. But it may be considered to add options like "all_SI" or "all_Bin" for adding a certain set of prefixes to a unit. I moved this to #184
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I will probably work on improving the flow for this PR when we I can also get these test cases available (because, yes, the flow is sometimes quite chaotic). Ping me when it gets ready.