Consider capitalization #162

Open
wants to merge 20 commits into
base: dev
10 changes: 10 additions & 0 deletions quantulum3/_lang/en_US/tests/quantities.json
@@ -1373,5 +1373,15 @@
                 "surface": "three million, two hundred & forty"
             }
         ]
+    },
Owner

We probably need many more test cases to check whether this (rather big) change in the flow works as expected. Please include one test case for every edge case you have considered (e.g. more than two letters, metric prefix, no metric prefix, ...).

Contributor Author

Yes, that would be nice, but it would mean that units.json needs to be updated. For example, it is missing MegaLitre, which is needed for the MLitre vs. MLitrE vs. mLitre cases.

It is weird that some units don't have all prefixes, and that some units have more prefixes defined than others. Is it really a good idea to hardcode which prefixes belong to which units? Why not make all prefixes available to all units?

Owner (@nielstron, Dec 4, 2020)

> Why not make all prefixes available to all units?

Because one of the criteria for a unit to be listed is for it to be common.
In our case, that is defined by having a Wikipedia page (or being redirected to it).
This is not the case for all units (Take for example "Deci-tonne").
If you find a prefix-unit combination that has a Wikipedia page and is missing from units.json, feel free to add it.

Owner (@nielstron, Dec 4, 2020)

> It is missing MegaLitre, for example. For the MLitre vs. MLitrE vs. mLitre cases.

Megalitre is correctly recognized by quantulum. Mlitre and the like don't make sense IMO. However, I am easily convinced otherwise by a sensible body of text that uses "MLitre" (please post one here as a comment). Adding <prefix><surface> is then trivial by modifying the load method.

Owner (@nielstron, Dec 4, 2020)

What about mbar and Mbar? Please add test cases for them.

Contributor Author

> This is not the case for all units (take, for example, "Deci-tonne").

Yes, because a tonne is a different kind of unit: it is just a megagram. We don't have to support Tonne -> Mg conversion; that is clearly a non-goal. But we can support the full set of prefixes for units where this is generally expected, such as SI units. If you don't want to fully support SI units with prefixes, specify that as a non-goal in the README so that people don't have wrong expectations when using this. (But I think a lot of people expect all SI units to just work with all prefixes.)

> Megalitre is correctly recognized by quantulum. Mlitre and the like don't make sense IMO. However, I am easily convinced otherwise by a sensible body of text that uses "MLitre"

Sure, but you have asked me in the past to support such "nonsensical" cases, which is why I added support for them and wanted to add test cases:

> (what about the unit "Km", we should swap the first letter here for a perfect match. What about "litre", is it useful to check for "litrE"?)

> .... sensible body of text that uses "MLitre" (please post one here as a comment)

Well, apparently Mlitre isn't as nonsensical as it seems to me. Here you go:

Secondary sewage from the Hyperion works is now used in a variety of ways to alleviate this. In one application, commissioned in 1997, Memcor processes are used in 15 Mlitre/day CMF/RO system to supply the nearby Mobil and Chevron refineries with feed for their boiler water plants. In an earlier project Memcor supplied an 11.5 Mlitre/day CMF plant as pretreatment for RO to provide water used for injection in a barrier scheme to keep out further sea water.

Source: https://www.climate-policy-watcher.org/water-quality/v-1.html

Owner

Thanks, I will look into that :)

Owner

I moved the issue of recognizing units like "Mlitre" to #183

Owner

> If you don't want to fully support SI units with prefixes, specify that as a non-goal in the README so that people don't have wrong expectations when using this.

The list of expectations for supported units is pretty clear in my opinion. But we could consider adding options like "all_SI" or "all_Bin" that add a certain set of prefixes to a unit. I moved this to #184.
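
A rough sketch of what an "all_SI"-style option could do, for illustration only. Nothing here is quantulum3 API: `SI_PREFIXES` (deliberately truncated) and `expand_prefixes` are hypothetical names, and how such an option would be wired into units.json is left open.

```python
# Hypothetical sketch: none of these names exist in quantulum3.
# Truncated, assumed prefix table (symbol -> name).
SI_PREFIXES = {
    "k": "kilo",
    "M": "mega",
    "m": "milli",
    "n": "nano",
}


def expand_prefixes(name, symbol, prefixes=SI_PREFIXES):
    """Return {prefixed symbol: prefixed name} for one base unit."""
    return {
        p_symbol + symbol: p_name + name
        for p_symbol, p_name in prefixes.items()
    }


variants = expand_prefixes("watt", "W")
print(variants)
```

Generating the combinations this way would sidestep hardcoding each prefix per unit, at the cost of also producing uncommon combinations such as the "Deci-tonne" case mentioned earlier in the thread.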

Contributor Author

Sounds good! I will probably work on improving the flow for this PR once I can also get these test cases in (because, yes, the flow is sometimes quite chaotic). Ping me when it is ready.

+    {
+        "req": "The battery has 2nw.",
+        "res": [
+            {
+                "value": 2,
+                "unit": "nanowatt",
+                "surface": "2nw"
+            }
+        ]
     }
 ]
54 changes: 26 additions & 28 deletions quantulum3/classifier.py
@@ -108,7 +108,6 @@ def _clean_text_lang(lang):
 def train_classifier(
     parameters=None, ngram_range=(1, 1), store=True, lang="en_US", n_jobs=None
 ):
-
     """
     Train the intent classifier
     TODO auto invoke if sklearn version is new or first install or sth
@@ -240,38 +239,37 @@ def disambiguate_entity(key, text, lang="en_US"):


 ###############################################################################
-def disambiguate_unit(unit, text, lang="en_US"):
-    """
-    Resolve ambiguity between units with same names, symbols or abbreviations.
-    """

+
+def attempt_disambiguate_unit(unit, text, lang):
+    """Resolve ambiguity between units with same names, symbols or abbreviations.
+    Returns list of possibilities"""
     new_unit = (
         load.units(lang).symbols.get(unit)
         or load.units(lang).surfaces.get(unit)
         or load.units(lang).surfaces_lower.get(unit.lower())
         or load.units(lang).symbols_lower.get(unit.lower())
     )
     if not new_unit:
-        return load.units(lang).names.get("unk")
-
-    if len(new_unit) > 1:
-        transformed = classifier(lang).tfidf_model.transform([clean_text(text, lang)])
-        scores = classifier(lang).classifier.predict_proba(transformed).tolist()[0]
-        scores = zip(scores, classifier(lang).target_names)
-
-        # Filter for possible names
-        names = [i.name for i in new_unit]
-        scores = [i for i in scores if i[1] in names]
-
-        # Sort by rank
-        scores = sorted(scores, key=lambda x: x[0], reverse=True)
-        try:
-            final = load.units(lang).names[scores[0][1]]
-            _LOGGER.debug('\tAmbiguity resolved for "%s" (%s)' % (unit, scores))
-        except IndexError:
-            _LOGGER.debug('\tAmbiguity not resolved for "%s"' % unit)
-            final = next(iter(new_unit))
-    else:
-        final = next(iter(new_unit))
-
-    return final
+        raise KeyError('Could not find unit "%s" from "%s"' % (unit, text))
+    if len(new_unit) == 1:
+        return new_unit
+
+    # Start scoring
+    transformed = classifier(lang).tfidf_model.transform([clean_text(text, lang)])
+    scores = classifier(lang).classifier.predict_proba(transformed).tolist()[0]
+    scores = zip(scores, classifier(lang).target_names)
+
+    # Filter for possible names
+    names = [i.name for i in new_unit]
+    scores = [i for i in scores if i[1] in names]
+
+    # Sort by rank
+    scores = sorted(scores, key=lambda x: x[0], reverse=True)
+    try:
+        new_unit = [load.units(lang).names[scores[0][1]]]
+        _LOGGER.debug('\tAmbiguity resolved for "%s" (%s)' % (unit, scores))
+        return new_unit
+    except IndexError:
+        _LOGGER.debug('\tAmbiguity not resolved for "%s"' % unit)
+        return new_unit
88 changes: 71 additions & 17 deletions quantulum3/disambiguate.py
@@ -3,41 +3,95 @@
 :mod:`Quantulum` disambiguation functions.
 """

+import logging
+
 from . import classifier as clf
 from . import load
 from . import no_classifier as no_clf

+_LOGGER = logging.getLogger(__name__)

 ###############################################################################
+
+
 def disambiguate_unit(unit_surface, text, lang="en_US"):
     """
     Resolve ambiguity between units with same names, symbols or abbreviations.
     :returns (str) unit name of the resolved unit
     """
-    if clf.USE_CLF:
-        base = clf.disambiguate_unit(unit_surface, text, lang).name
-    else:
-        base = (
-            load.units(lang).symbols[unit_surface]
-            or load.units(lang).surfaces[unit_surface]
-            or load.units(lang).surfaces_lower[unit_surface.lower()]
-            or load.units(lang).symbols_lower[unit_surface.lower()]
-        )
+    units = attempt_disambiguate_unit(unit_surface, text, lang)
+    if units and len(units) == 1:
+        return next(iter(units)).name
+
+    # Change the capitalization of the last letter to find a better match.
+    # Capitalization is sometimes cause of confusion, but the
+    # capitalization of the prefix is too important to alter.

-        if len(base) > 1:
-            base = no_clf.disambiguate_no_classifier(base, text, lang)
-        elif len(base) == 1:
-            base = next(iter(base))
+    # If the unit is longer than two prefixes, we set everything to lower
Owner

The whole following flow is convoluted and not concise. I don't see how this avoids causing a ton of issues.

  • If a unit has no prefix but starts with "m" (like meter), this breaks
  • Why do we selectively lower/swapcase certain parts of the surface? Should we not try out all possible capitalizations?
  • Why set everything to lower for units longer than two letters (nothing known about prefixes)?

In general, this whole flow needs more explanation like "units that have more than two letters are most likely case insensitive. Only the capitalization of the prefix matters. We therefore only try out lowering the last letters"
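
To make the "try out all possible capitalizations" question concrete, here is a minimal sketch. It is not code from this PR; `capitalization_variants` is a hypothetical helper, and keeping the first character fixed is an assumption based on the remark that the capitalization of the prefix matters.

```python
from itertools import product


def capitalization_variants(surface, keep_first=True):
    """Return every upper/lower-case variant of `surface`.

    With keep_first=True the first character (a possible metric prefix,
    where "m" and "M" mean different things) is left untouched.
    """
    if keep_first:
        head, tail = surface[:1], surface[1:]
    else:
        head, tail = "", surface
    # Characters without case (digits, symbols) contribute a single choice.
    choices = [sorted({c.lower(), c.upper()}) for c in tail]
    return [head + "".join(combo) for combo in product(*choices)]


print(capitalization_variants("mW"))
```

The number of variants doubles with every cased letter, so an exhaustive search over a surface like "MLitre" already means 32 lookups. That cost is one possible argument for trying only a few targeted variants instead.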

+    # except the first letter.
+    if len(unit_surface) > 2:
+        unit_changed = unit_surface[0] + unit_surface[1:].lower()
+        if unit_changed == unit_surface:
+            return resolve_ambiguity(units, unit_surface, text)
+        text_changed = text.replace(unit_surface, unit_changed)
+        new_units = attempt_disambiguate_unit(unit_changed, text_changed, lang)
+        units = get_a_better_one(units, new_units)
+        return resolve_ambiguity(units, unit_surface, text)

-        if base:
-            base = base.name
+    if not unit_surface or unit_surface[0] not in load.METRIC_PREFIXES.keys():
+        # Only apply next work around if the first letter is a SI-prefix
+        return resolve_ambiguity(units, unit_surface, text)
+
+    unit_changed = unit_surface[:-1] + unit_surface[-1].swapcase()
+    text_changed = text.replace(unit_surface, unit_changed)
+    new_units = attempt_disambiguate_unit(unit_changed, text_changed, lang)
+    units = get_a_better_one(units, new_units)
+    return resolve_ambiguity(units, unit_surface, text)
+
+
+def attempt_disambiguate_unit(unit_surface, text, lang):
+    """Returns list of possibilities"""
+    try:
+        if clf.USE_CLF:
+            return clf.attempt_disambiguate_unit(unit_surface, text, lang)
         else:
-            base = "unk"
+            return no_clf.attempt_disambiguate_no_classifier(unit_surface, text, lang)
+    except KeyError:
+        return None
+
+
+def get_a_better_one(old, new):
Owner

This method name says nothing about what the method does. I also don't understand the logic: why is the length of units important? Is there not a more concise way to filter out None (potentially at the point where it is created)?
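
One possible shape for a more descriptive helper, as a sketch only. `pick_less_ambiguous` is a hypothetical name, and the reading that "fewer candidates means less ambiguity" is an assumption about the intent behind the length comparison.

```python
def pick_less_ambiguous(old, new):
    """Return whichever candidate collection is non-empty and smaller.

    Fewer candidates means less ambiguity; on a tie, `old` is kept.
    """
    # Drop None and empty collections in one step.
    candidates = [c for c in (old, new) if c]
    if not candidates:
        return old
    # min() returns the first minimum, so `old` wins ties.
    return min(candidates, key=len)
```

This keeps the same outcomes as the function under review for None, empty, and tied inputs, while stating the selection rule in the name and docstring.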

+    """Decide if we pick new over old, considering them being None, and
+    preferring the smaller one."""
+    if not new:
+        return old
+    if not old:
+        return new
+    if len(new) < len(old):
+        return new
+    return old
+

-    return base
+def resolve_ambiguity(units, unit, text):
+    if not units:
+        if unit and clf.USE_CLF:
+            raise KeyError('Could not find unit "%s" from "%s"' % (unit, text))
+        else:
+            return "unk"
+    if len(units) == 1:
+        return next(iter(units)).name
+    _LOGGER.warning(
+        "Could not resolve ambiguous units: '{}'. For unit '{}' in text '{}'. ".format(
+            ", ".join(str(u) for u in units), unit, text
+        )
+    )
+    # Deterministically getting something out of units.
+    return next(iter(sorted(u.name for u in units)))


 ###############################################################################
+
+
 def disambiguate_entity(key, text, lang="en_US"):
     """
     Resolve ambiguity between entities with same dimensionality.
17 changes: 17 additions & 0 deletions quantulum3/no_classifier.py
@@ -33,3 +33,20 @@ def disambiguate_no_classifier(entities, text, lang="en_US"):
         if relative > max_relative or (relative == max_relative and count > max_count):
             max_entity, max_count, max_relative = entity, count, relative
     return max_entity
+
+
+def attempt_disambiguate_no_classifier(unit_surface, text, lang):
+    """Returns list of possibilities"""
+    base = (
+        load.units(lang).symbols[unit_surface]
+        or load.units(lang).surfaces[unit_surface]
+        or load.units(lang).surfaces_lower[unit_surface.lower()]
+        or load.units(lang).symbols_lower[unit_surface.lower()]
+    )
+    if not base:
+        raise KeyError('Could not find unit "%s" from "%s"' % (unit_surface, text))
+    if len(base) > 1:
+        possible_base = disambiguate_no_classifier(base, text, lang)
+        if possible_base:
+            return [possible_base]
+    return base
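
For reference, the capitalization fallback in disambiguate.py can be reduced to a small standalone function. This is a sketch under assumptions, not the PR's code: `fallback_surface` and `ASSUMED_PREFIXES` are hypothetical, and the prefix set stands in for `load.METRIC_PREFIXES`, whose contents are not shown in this diff.

```python
# Assumed subset of single-letter metric prefixes; the real table is
# load.METRIC_PREFIXES, which is not part of this diff.
ASSUMED_PREFIXES = {"k", "M", "m", "n"}


def fallback_surface(surface, prefixes=ASSUMED_PREFIXES):
    """Return the alternative spelling tried after the original, or None."""
    if len(surface) > 2:
        # Long surfaces: keep the first letter, lowercase the rest.
        changed = surface[0] + surface[1:].lower()
        return changed if changed != surface else None
    if not surface or surface[0] not in prefixes:
        # Short surfaces are only altered when they start with a prefix.
        return None
    # Short prefixed surfaces: flip the case of the last letter.
    return surface[:-1] + surface[-1].swapcase()


for s in ("nw", "MLitre", "Km", "litre"):
    print(s, "->", fallback_surface(s))
```

Under these assumptions, "nw" from the new test case falls back to "nW", while "Km" is left alone because "K" is not in the assumed prefix set.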