Ordinal number support #31

arnavkapoor · 2020-08-04T14:22:52Z

Enhacing library I have made changes in the base structure of the files to support ordinals.

arnavkapoor · 2020-08-04T15:36:18Z

scripts/write_complete_language_data.py

        ordered_language_data["SKIP_TOKENS"] = skip_tokens
+        if is_long:


So initially i was just writing boolen true and false into the dict. However there was an issue while writing to python files using json dumps.

json has the boolean as 'false' which was being written into python files but python has boolean as 'False'. I couldn't see an elegant way so used 1/0 instead of True/False.
morse-simulator/morse#562 might help to expand what exactly the issue is.

You could rewrite this:

if is_long: ordered_language_data["IS_LONG"] = 1 else: ordered_language_data["IS_LONG"] = 0

like this:

ordered_language_data["IS_LONG"] = 1 if is_long else 0

However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"] and I would use true/false in the json files (and True/False in the python files). You can fix the translation_data by overriding the true and false values:

translation_data = re.sub(r'\bfalse\b', 'False', translation_data) translation_data = re.sub(r'\btrue\b', 'True', translation_data)

Said this, why did you changed from LONG_SCALE (#6 (comment)) to IS_LONG? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE or HAS_LONG_SCALE or IS_LONG_SCALE.

arnavkapoor · 2020-08-04T15:43:10Z

number_parser/parser.py

-        self.tens = _normalize_dict(language_info["TENS"])
-        self.hundreds = _normalize_dict(language_info["HUNDREDS"])
-        self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"])
+        self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"])


I was planning to now create more variables like self.unit_number_ordinal and so on. The logic part would also be updated first checking cardinal and then ordinal.
It might be a bit messy with all these instance variables, will try to remove any duplicacy in code. Is there a better approach ?

I'm not sure about the best approach, you could do somehting like this:

self.__dict__.update(language_info['NUMBERS']) self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})

and then access the properties by doing self.DIRECT_NUMBERS, etc.

noviluni

Hmm... If you remember when coding the cardinal parser we first built the parser and once we know what we needed we updated the files.

That's making me think about this: Are you sure that we need the same fields as in the cardinal parser?

    "ORDINAL_NUMBERS": {
        "UNIT_NUMBERS": {},
        "DIRECT_NUMBERS": {},
        "TENS": {},
        "HUNDREDS": {},
        "BIG_POWERS_OF_TEN": {}
    },

I think that we would need first to build first a simple logic in different languages to see what we need and then decide the fields we need.

As the ordinal numbers usually are really similar to cardinal numbers, I think that a better approach would be doing: ordinal_string --> cardinal_string --> parse.

However, it's also important to do some investigation first. From my perspective, the logic behind the English language (suffixes) is really different than chinese (prefixes) or Spanish (different words), for example. Don't worry about this, I think the next approach could fit all the cases.

In the case of English (and Russian), the ordinal numbers work with suffixes. We have some "direct numbers" and the other numbers are built by adding "th" to the cardinal number.

So I would build a function like this (only working for English):

CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'}

def parse_ordinal(input_string, language='en'): 
     input_string = input_string.lower() 
     for word, number in CARDINAL_DIRECT_NUMBERS.items(): 
         input_string = input_string.replace(word, number) 
     input_string = re.sub(r'ieth$', 'y', input_string) 
     input_string = re.sub(r'th$', '', input_string) 
     return parse_number(input_string)

I tried this with the 100 first ordinal numbers (https://github.com/noviluni/numbers-data/blob/master/ordinals/en_ordinals_calculators_ro.csv) and the "ordinal permutations" (https://github.com/noviluni/numbers-data/blob/master/ordinals_permutations/en_ordinals_permutations_calculators_ro.csv) in English and it worked for all of them.

So, what I would like you to do?

Open a new, different PR and, without editing the JSON files, add this function with the basic logic as well as the CSV tests for the ordinals in English.

From that starting point, I think we can decide what to put in the JSON files. I think using only a new field (example: CARDINAL_DIRECT_NUMBERS) could work, but we also need some other regex rules, so we could maybe change it to work with regex or add another field CARDINAL_RE_EQUIVALENCES containing something like: {r'ieth$': 'y', r'th$': ''}. The order is important here (we should first override "ieth" and then "th"), so we have to consider it.

For now, don't worry about the other languages (especially for Spanish).

On the other side, the "scale" issue should be handled in a different PR.

@arnavkapoor Let me know if you think what I'm saying makes sense for you :)

noviluni · 2020-08-05T09:26:47Z

scripts/write_complete_language_data.py

        ordered_language_data["SKIP_TOKENS"] = skip_tokens
+        if is_long:


You could rewrite this:

if is_long: ordered_language_data["IS_LONG"] = 1 else: ordered_language_data["IS_LONG"] = 0

like this:

ordered_language_data["IS_LONG"] = 1 if is_long else 0

However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"] and I would use true/false in the json files (and True/False in the python files). You can fix the translation_data by overriding the true and false values:

translation_data = re.sub(r'\bfalse\b', 'False', translation_data) translation_data = re.sub(r'\btrue\b', 'True', translation_data)

Said this, why did you changed from LONG_SCALE (#6 (comment)) to IS_LONG? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE or HAS_LONG_SCALE or IS_LONG_SCALE.

noviluni · 2020-08-05T09:31:45Z

number_parser/parser.py

@@ -175,6 +174,7 @@ def parse_number(input_string, language='en'):
        if token in lang_data.skip_tokens and index != 0:
            continue
        return None
+    print(normalized_tokens)


don't forget to remove this :)

noviluni · 2020-08-05T09:41:36Z

number_parser/parser.py

-        self.tens = _normalize_dict(language_info["TENS"])
-        self.hundreds = _normalize_dict(language_info["HUNDREDS"])
-        self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"])
+        self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"])


I'm not sure about the best approach, you could do somehting like this:

self.__dict__.update(language_info['NUMBERS']) self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})

and then access the properties by doing self.DIRECT_NUMBERS, etc.

arnavkapoor · 2020-08-06T07:54:35Z

Hi @noviluni sorry for the delayed response , I had composed a reply last night but forgot to hit send it seems. 😬

I did consider using the approach ordinal_numbers -> cardinal_numbers -> number , however I thought of certain cases that might give some issue. If we have something like
twentieth seventh fiftieth third this approach would give us 27 53 as output. As opposed to
20 7 50 3. A presence of an ordinal number indicates the end of the number so this would need to be incorporated. This along with the need to create a complete mapping for other languages like Spanish made me feel that this expanded structure (with same categories for ordinal numbers) would be more robust in the long term.

However if the primary goal is to have a basic integration with date-parser , definitely we can have some sort of mapping in place as suggested so that most of the dates and simple cases in English work. I will create a PR with the code that you have given.

Thanks for the inputs with respect to long scale implementation , I will create a separate PR with the suggested changes. I did change the variable because it was boolean , however wanted to change it to IS_LONG_SCALE didn't realize I had missed the scale. Going forward USE_LONG_SCALE seems to be the most verbose and accurate one , so will use it.

noviluni · 2020-08-06T13:04:33Z

If we have something like twentieth seventh fiftieth third this approach would give us 27 53 as output. As opposed to
20 7 50 3

Hmm... that's only true if you implement the parse() method, but I was talking about the parse_ordinal method and this is not happening.

Let's see (using the function I wrote):

>>> parse_ordinal('twentieth')
20

>>> parse_ordinal('twentieth seventh')
None

>>> parse_ordinal('twenty seventh')
27

So, this is what I want you to do:

Open a new draft PR and add these new functions:

def  is_cardinal_token(token, language):
    ...  # Using the LanguageData for the language detects if the word is part of a number

def is_ordinal_token(token, language):
    token = _apply_cardinal_converstion(input_string, language)
    retutrn is_ordinal_token(token, language)

def is_number_token(token, language):
    return is_cardinal_token(token, language) or is_ordinal_token(token, language)

def parse_ordinal(input_string, language='en'):
    _apply_cardinal_converstion(input_string, language)
    return parse_number(input_string)

def _apply_cardinal_converstion(input_string, language):
    CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'}  # thiw will be coming from the LanguageData
    input_string = input_string.lower() 
    for word, number in CARDINAL_DIRECT_NUMBERS.items(): 
        input_string = input_string.replace(word, number) 
    input_string = re.sub(r'ieth$', 'y', input_string) 
    input_string = re.sub(r'th$', '', input_string) 
    return input_string

We will rewrite completely the _apply_cardinal_converstion() but for now it's ok.

After building this, we will change the parse() method to use directly the parse_number() and parse_ordinal() instead of the _build_number() by checking the tokens with is_number_token() (and/or is_cardinal_token() and is_ordinal_token()) and we will refactor the _build_number() too, but let's discuss this after submitting the draft PR.

I think this approach will be better for testing and will ease a lot the integration with dateparser. 😄

arnavkapoor added 4 commits August 4, 2020 17:12

updating structure for ordinal support

fab230f

updating parser for ordinal number support

00bc9a3

updated structure of py files

1422130

removing temporary file

ec1d0bc

arnavkapoor commented Aug 4, 2020

View reviewed changes

noviluni suggested changes Aug 5, 2020

View reviewed changes

arnavkapoor mentioned this pull request Aug 7, 2020

Restructure code for ordinal support #35

Merged

arnavkapoor mentioned this pull request Aug 26, 2020

Support ordinals #6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ordinal number support #31

Ordinal number support #31

arnavkapoor commented Aug 4, 2020 •

edited

Loading

arnavkapoor Aug 4, 2020

noviluni Aug 5, 2020

arnavkapoor Aug 4, 2020

noviluni Aug 5, 2020

noviluni left a comment

noviluni Aug 5, 2020

noviluni Aug 5, 2020

noviluni Aug 5, 2020

arnavkapoor commented Aug 6, 2020

noviluni commented Aug 6, 2020 •

edited

Loading

		ordered_language_data["SKIP_TOKENS"] = skip_tokens
		if is_long:

Ordinal number support #31

Are you sure you want to change the base?

Ordinal number support #31

Conversation

arnavkapoor commented Aug 4, 2020 • edited Loading

arnavkapoor Aug 4, 2020

Choose a reason for hiding this comment

noviluni Aug 5, 2020

Choose a reason for hiding this comment

arnavkapoor Aug 4, 2020

Choose a reason for hiding this comment

noviluni Aug 5, 2020

Choose a reason for hiding this comment

noviluni left a comment

Choose a reason for hiding this comment

noviluni Aug 5, 2020

Choose a reason for hiding this comment

noviluni Aug 5, 2020

Choose a reason for hiding this comment

noviluni Aug 5, 2020

Choose a reason for hiding this comment

arnavkapoor commented Aug 6, 2020

noviluni commented Aug 6, 2020 • edited Loading

arnavkapoor commented Aug 4, 2020 •

edited

Loading

noviluni commented Aug 6, 2020 •

edited

Loading