Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/pluralization #28

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from

Conversation

NeonJarbas
Copy link

port of the following PRs

MycroftAI#167
MycroftAI#36
MycroftAI#37

Rename translations to pluralizations

Add documentation about pluralization

Add new format for duration translations

jarbas changes:
- revert func rename + revert file deletions (keep fallback logic)
- add enums
@codecov
Copy link

codecov bot commented Jul 13, 2022

Codecov Report

❗ No coverage uploaded for pull request base (dev@392cc37). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 860d1ca differs from pull request most recent head 3e7dd61. Consider uploading reports for the commit 3e7dd61 to get more accurate results

@@          Coverage Diff          @@
##             dev     #28   +/-   ##
=====================================
  Coverage       ?   0.00%           
=====================================
  Files          ?      65           
  Lines          ?   16472           
  Branches       ?       0           
=====================================
  Hits           ?       0           
  Misses         ?   16472           
  Partials       ?       0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 392cc37...3e7dd61. Read the comment docs.

@JarbasAl JarbasAl added the enhancement New feature or request label Jul 13, 2022
port MycroftAI#36 + MycroftAI#37

add pt pluralizations.json

add tests
return word


def get_plural_form_pt(word, amount, type=PluralCategory.CARDINAL):
Copy link
Member

@emphasize emphasize Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be better worded "get_numerus_XX" (although latin it hints at the specific task - get me the correct numerus from word given the amount -, english: grammatical number)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i like that change! if this ever gets added in mycroft we can add back a get_plural_form method for compat with imports

Copy link
Member

@emphasize emphasize Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since grammar is a bitch, above is only working if the sentence is nominative case. In english language this might be not that relevant to the numerus (in general: flexion) of a noun, but unfortunately in german (depending on case, gender, ...)

This is a problem, since we don't know the sentence at this point. It seems to me that a localized "grammar checker" should be a part of the dialog renderer, checking the mustache replacements (and their direct dependencies [POS]) at construction time.

To give an example: Während ein(es) Monat(s) - singular, genitive - during one month
the case is here determined by "während".
If the dialog is written like "in {num} {period}" the correct flexion would be "in ein(em) Monat"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these new functions could be used internally for the dialog renderer in core, would you like to open an issue/PR in there for that discussion? :)

in regards to LF do you feel we need more helper functions? these more or less map to a standard so they are good to have, but if they are not enough let's explore further!

@NeonDaniel
Copy link
Member

@NeonMariia is this related to the numeric cases we were talking about last week?

@NeonMariia
Copy link

NeonMariia commented Aug 31, 2022

In Ukrainian and Russian languages numerals change their form not only depending on the case or plural/singular form of the dependent word, but also on the numeral itself. There are several numeral groups (7 in Ukrainian, 4 in Russian) each of them has its own rules of getting flexions. That's why putting numerals in correct form in Ukrainian, Russian and some other synthetic languages requires dependency sentence parsing and POS tags determination. Below there are some examples of those:

  • Group I:
  1. один (one) - changes flexions by gender, case and plural/singular forms of the following noun: одне, одна, одні, одного, однієї (1)

e.g. Ми бачили одного(1) хлопця. (We saw one guy.)
Numeral(1) depends on singular noun in genitive case (хлопця), according to this we put numeral (один -1) in certain form in genitive case (одного).

  • Group II:
  1. Actually quantitative numerals (власне кількісні: два, три, чотири), (2, 3, 4), (two, three, four).
    Change their flexions according to the case of the noun in plural form on which it depends.

e.g. По вулиці йдуть два(2) хлопці. (Two boys are walking on the street.)
Numeral (два - 2) depends on plural noun in nominative case (хлопці). That's why we put quantitative numeral (2 - два)

  1. Composite numerals (збірні числівники: двоє, троє, четверо, п’ятеро, …) (2 – 20 і 30) (two, three, four -, twenty, thirty).
    Change their flexions according to the case of the noun in plural form on which it depends. Can not be dependent on nouns in the nominative case.

e.g. По вулиці йшли двоє(2) хлопців. (Two boys were walking on the street.)
Numeral (двоє - 2) depends on plural noun in genitive case(хлопців). That's why we put composite numeral (2 - двоє)

@JarbasAl
Copy link
Member

JarbasAl commented Sep 10, 2022

Getting back to this, I will be refactoring the new methods to have a different signature and naming according to feedback above

this

def get_plural_form_XX(word, amount, type=PluralCategory.CARDINAL):
     ...

becomes

def get_numerus_XX(tokens, idx, amount, type=PluralCategory.CARDINAL):
     word = tokens[idx]
     ...

@NeonMariia would this be satisfactory? you could then use the tokens to perform postag or anything else you need

initially i thought about sending the raw utterance instead of tokens, but that makes code a little more complex since we need to account for multiple instances of same word, find it's index and length etc.

Is there any caveat to receiving a pre tokenized utterance?

I want to introduce this pattern in the other places we pass a single word as argument, so let's discuss what the ideal signature would look like that allows any language to do it's own thing. I think the tokens + location of word allow to do any utterance parsing needed. but the devil is in the details, what about multi-word arguments?

def get_numerus_XX(utterance, substr, idx, amount, type=PluralCategory.CARDINAL):
     word = utterance[idx:idx+len(substr)]
     ...

@emphasize
Copy link
Member

emphasize commented Sep 13, 2022

i don't get the last proposal. idx could be also of type range?

But another thought. Is get_numerus needed as a user facing function? Or just an internal helper, without the Is there a need to index at all? This only applies to nouns and given a correct POS tag we could return a dict of index/numerus pairs.

@JarbasAl
Copy link
Member

JarbasAl commented Sep 13, 2022

Is there a need to index at all? This only applies to nouns and given a correct POS tag we could return a dict of index/numerus pairs.

the aim is to future proof this for other random languages, is a postag enough? what if we dont have a postag model for this lang? does klingon need the previous 4 words? is there any language or better algorithm that needs the follow up words not the previous? passing along utterance + word position provides the language with everything it needs, then its up to the lang to use postags or something else if it needs it, some langs might get away with much simpler checks. in portuguese and english postag is not needed or can be worked around

related, LF does not have the concept of postag yet, but this is being pluginified and already used in neon, so we will need to discuss OPM integration in LF at some point, probably using the global config to assign a plugin per language i suppose

@NeonMariia
Copy link

Yes, I think this is great implementation (when we have whole utterance and span as certain word) for the future grammatical analysis

@emphasize
Copy link
Member

emphasize commented Sep 13, 2022

is a postag enough?

Certain language implementations would need a dependency model as well.

I don't think a span/range is practical, since different languages would need different spans (unless the class Span is also localized - "span" would equate to the depencency parsed chunk up until the root). How should it be used in skills?

The class Token should pipeline existing analytic models. If no plugin is existant, only pass the bare tokens.

@JarbasAl
Copy link
Member

JarbasAl commented Sep 13, 2022

Certain language implementations would need a dependency model as well.

I don't think a span/range is practical, since different languages would need different spans (unless the class Span is also localized - "span" would equate to the depencency parsed chunk up until the root). How should it be used in skills?

postag/dependency/whatever is internal to LF and used by the method, users/skills dont need to care about any of that

a skill only passes along utterance, word, idx, amount,

index is needed because the word may appear multiple times in the sentence, we can default index to None and in that case pick the first word

maybe we can reorder args a bit, below could all be valid

word = ...
utterance = ...
idx = N
plural = get_numerus(word, 2)
plural = get_numerus(word, 2, utterance)
plural = get_numerus(word, 2, utterance, idx)

but if we support that then english skill devs will never pass utterance and idx, meaning lang support will be defective until someone sends a patch to the skill. maybe we should force all args even if its a little more cumbersome ?

to make things simple maybe the signature should be range, utterance, amount and all of those required

@NeonMariia
Copy link

Certain language implementations would need a dependency model as well.
I don't think a span/range is practical, since different languages would need different spans (unless the class Span is also localized - "span" would equate to the depencency parsed chunk up until the root). How should it be used in skills?

postag/dependency/whatever is internal to LF and used by the method, users/skills dont need to care about any of that

a skill only passes along utterance, word, idx, amount,

index is needed because the word may appear multiple times in the sentence, we can default index to None and in that case pick the first word

maybe we can reorder args a bit, below could all be valid

word = ...
utterance = ...
idx = N
plural = get_numerus(word, 2)
plural = get_numerus(word, 2, utterance)
plural = get_numerus(word, 2, utterance, idx)

but if we support that then english skill devs will never pass utterance and idx, meaning lang support will be defective until someone sends a patch to the skill. maybe we should force all args even if its a little more cumbersome ?

to make things simple maybe the signature should be range, utterance, amount and all of those required

I think we should have an opportunity to pass just word itself for more simple cases?

@JarbasAl
Copy link
Member

So you think the snippet above is a good public facing api with all the optional arguments? It will require more docs and work in skills for lang support, as a dev i like it, but thinking in a larger scope im concerned skill devs will be lazy and this will result in hardcoded english support most of the times.

@NeonMariia
Copy link

So you think the snippet above is a good public facing api with all the optional arguments? It will require more docs and work in skills for lang support, as a dev i like it, but thinking in a larger scope im concerned skill devs will be lazy and this will result in hardcoded english support most of the times.

Idk for me it seems that these optional arguments should not cause any problems (but i don't have such experience in skills lang support)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants