-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ordinal number support #31
base: master
Are you sure you want to change the base?
Conversation
ordered_language_data["SKIP_TOKENS"] = skip_tokens | ||
if is_long: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So initially i was just writing boolen true and false into the dict. However there was an issue while writing to python files using json dumps.
json has the boolean as 'false' which was being written into python files but python has boolean as 'False'. I couldn't see an elegant way so used 1/0 instead of True/False.
morse-simulator/morse#562 might help to expand what exactly the issue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could rewrite this:
if is_long:
ordered_language_data["IS_LONG"] = 1
else:
ordered_language_data["IS_LONG"] = 0
like this:
ordered_language_data["IS_LONG"] = 1 if is_long else 0
However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"]
and I would use true/false
in the json files (and True/False
in the python files). You can fix the translation_data
by overriding the true
and false
values:
translation_data = re.sub(r'\bfalse\b', 'False', translation_data)
translation_data = re.sub(r'\btrue\b', 'True', translation_data)
Said this, why did you changed from LONG_SCALE
(#6 (comment)) to IS_LONG
? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE
or HAS_LONG_SCALE
or IS_LONG_SCALE
.
self.tens = _normalize_dict(language_info["TENS"]) | ||
self.hundreds = _normalize_dict(language_info["HUNDREDS"]) | ||
self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"]) | ||
self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was planning to now create more variables like self.unit_number_ordinal
and so on. The logic part would also be updated first checking cardinal and then ordinal.
It might be a bit messy with all these instance variables, will try to remove any duplicacy in code. Is there a better approach ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the best approach, you could do somehting like this:
self.__dict__.update(language_info['NUMBERS'])
self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})
and then access the properties by doing self.DIRECT_NUMBERS
, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... If you remember when coding the cardinal parser we first built the parser and once we know what we needed we updated the files.
That's making me think about this: Are you sure that we need the same fields as in the cardinal parser?
"ORDINAL_NUMBERS": {
"UNIT_NUMBERS": {},
"DIRECT_NUMBERS": {},
"TENS": {},
"HUNDREDS": {},
"BIG_POWERS_OF_TEN": {}
},
I think that we would need first to build first a simple logic in different languages to see what we need and then decide the fields we need.
As the ordinal numbers usually are really similar to cardinal numbers, I think that a better approach would be doing: ordinal_string --> cardinal_string --> parse
.
However, it's also important to do some investigation first. From my perspective, the logic behind the English language (suffixes) is really different than chinese (prefixes) or Spanish (different words), for example. Don't worry about this, I think the next approach could fit all the cases.
In the case of English (and Russian), the ordinal numbers work with suffixes. We have some "direct numbers" and the other numbers are built by adding "th" to the cardinal number.
So I would build a function like this (only working for English):
CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'}
def parse_ordinal(input_string, language='en'):
input_string = input_string.lower()
for word, number in CARDINAL_DIRECT_NUMBERS.items():
input_string = input_string.replace(word, number)
input_string = re.sub(r'ieth$', 'y', input_string)
input_string = re.sub(r'th$', '', input_string)
return parse_number(input_string)
I tried this with the 100 first ordinal numbers (https://github.com/noviluni/numbers-data/blob/master/ordinals/en_ordinals_calculators_ro.csv) and the "ordinal permutations" (https://github.com/noviluni/numbers-data/blob/master/ordinals_permutations/en_ordinals_permutations_calculators_ro.csv) in English and it worked for all of them.
So, what I would like you to do?
Open a new, different PR and, without editing the JSON files, add this function with the basic logic as well as the CSV tests for the ordinals in English.
From that starting point, I think we can decide what to put in the JSON files. I think using only a new field (example: CARDINAL_DIRECT_NUMBERS
) could work, but we also need some other regex rules, so we could maybe change it to work with regex or add another field CARDINAL_RE_EQUIVALENCES
containing something like: {r'ieth$': 'y', r'th$': ''}
. The order is important here (we should first override "ieth" and then "th"), so we have to consider it.
For now, don't worry about the other languages (especially for Spanish).
On the other side, the "scale" issue should be handled in a different PR.
@arnavkapoor Let me know if you think what I'm saying makes sense for you :)
ordered_language_data["SKIP_TOKENS"] = skip_tokens | ||
if is_long: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could rewrite this:
if is_long:
ordered_language_data["IS_LONG"] = 1
else:
ordered_language_data["IS_LONG"] = 0
like this:
ordered_language_data["IS_LONG"] = 1 if is_long else 0
However, I would do ordered_language_data["IS_LONG"] = data["IS_LONG"]
and I would use true/false
in the json files (and True/False
in the python files). You can fix the translation_data
by overriding the true
and false
values:
translation_data = re.sub(r'\bfalse\b', 'False', translation_data)
translation_data = re.sub(r'\btrue\b', 'True', translation_data)
Said this, why did you changed from LONG_SCALE
(#6 (comment)) to IS_LONG
? For me it's really hard to understand what means "IS_LONG". I think "LONG_SCALE" would work, but if you did it because it is a boolean, you can change it by USE_LONG_SCALE
or HAS_LONG_SCALE
or IS_LONG_SCALE
.
@@ -175,6 +174,7 @@ def parse_number(input_string, language='en'): | |||
if token in lang_data.skip_tokens and index != 0: | |||
continue | |||
return None | |||
print(normalized_tokens) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't forget to remove this :)
self.tens = _normalize_dict(language_info["TENS"]) | ||
self.hundreds = _normalize_dict(language_info["HUNDREDS"]) | ||
self.big_powers_of_ten = _normalize_dict(language_info["BIG_POWERS_OF_TEN"]) | ||
self.unit_numbers = _normalize_dict(language_info["NUMBERS"]["UNIT_NUMBERS"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the best approach, you could do somehting like this:
self.__dict__.update(language_info['NUMBERS'])
self.__dict__.update({f'{key}_ORDINAL': value for key, value in language_info['ORDINAL_NUMBERS'].items()})
and then access the properties by doing self.DIRECT_NUMBERS
, etc.
Hi @noviluni sorry for the delayed response , I had composed a reply last night but forgot to hit send it seems. 😬
However if the primary goal is to have a basic integration with date-parser , definitely we can have some sort of mapping in place as suggested so that most of the dates and simple cases in English work. I will create a PR with the code that you have given.
|
Hmm... that's only true if you implement the Let's see (using the function I wrote):
So, this is what I want you to do: Open a new draft PR and add these new functions: def is_cardinal_token(token, language):
... # Using the LanguageData for the language detects if the word is part of a number
def is_ordinal_token(token, language):
token = _apply_cardinal_converstion(input_string, language)
retutrn is_ordinal_token(token, language)
def is_number_token(token, language):
return is_cardinal_token(token, language) or is_ordinal_token(token, language)
def parse_ordinal(input_string, language='en'):
_apply_cardinal_converstion(input_string, language)
return parse_number(input_string)
def _apply_cardinal_converstion(input_string, language):
CARDINAL_DIRECT_NUMBERS = {'first': 'one', 'second': 'two', 'third': 'three', 'fifth': 'five', 'eighth': 'eight', 'ninth': 'nine', 'twelfth': 'twelve'} # thiw will be coming from the LanguageData
input_string = input_string.lower()
for word, number in CARDINAL_DIRECT_NUMBERS.items():
input_string = input_string.replace(word, number)
input_string = re.sub(r'ieth$', 'y', input_string)
input_string = re.sub(r'th$', '', input_string)
return input_string We will rewrite completely the After building this, we will change the I think this approach will be better for testing and will ease a lot the integration with |
Enhacing library I have made changes in the base structure of the files to support ordinals.