Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add logic to normalize comma-delimited decimals #69

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ChanceNCounter
Copy link
Contributor

@ChanceNCounter ChanceNCounter commented Jan 11, 2020

Closes #65

All languages, including English, will now normalize 34,5 to 34.5 before beginning to extract numbers.

Decimal markers can now be specified through extract_number() and extract_numbers() function calls, using a new parameter decimal='.'

Note that these functions will only normalize decimals if they are called as such. Individual parsers, such as extractnumber_es(), have not been modified in any way, but will produce correct output when called via extract_number(lang='es', decimal=',')

For those who don't speak regex, though I encourage you to run the regex through a regex tester, it means this:

\b\d+,{1}\d+\b:
\b a word boundary
\d+ any number of digits, followed by
,{1} exactly one comma, followed by
\d+ any number of digits
\b another word boundary

@devs-mycroft devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Jan 11, 2020
@ChanceNCounter
Copy link
Contributor Author

run test

@ChanceNCounter
Copy link
Contributor Author

A few minutes ago, I realized the way I was trying to distinguish comma-delimited decimals from comma-delimited thousands wasn't gonna work.

So I came here to mark this WIP, and I now see that commit didn't make it to the branch in any case 😳

@ChanceNCounter
Copy link
Contributor Author

It occurs to me that this argument will almost always be used to parse ',' as a decimal point, and that users who want ',' for decimal points will almost always want '.' for thousands separators.

Should this PR address that?

If so, should it be an additional keyword, thousands=',', or should the decimal keyword be scrapped in favor of actual keywords?

Although the function signatures may become bloated, I'm partial toward two keywords. Standards exist other than full stops or commas for decimal separators. Indeed, it might be worth handling spaces, if specifically indicated in the function call, but that's another layer of complexity, probably involving a while loop.

@JarbasAl
Copy link
Collaborator

JarbasAl commented Feb 4, 2020

in a computer a decimal number is always represented with a . regardless of language.

I don't think we will get any stt transcription ever where this isn't the case

It's still good to think about this, chat usage is also important !

@ChanceNCounter
Copy link
Contributor Author

in a computer a decimal number is always represented with a . regardless of language.

I don't think we will get any stt transcription ever where this isn't the case

It's still good to think about this, chat usage is also important !

Exactly. This PR is so that people can parse data written the opposite way. @TheLastProject asked and shall receive.

@ChanceNCounter
Copy link
Contributor Author

Alright, here we go, now that I'm free and at a desktop. Say you wanted TTS to read lines from a document, containing yearly something, and you knew the file's formatting: YYYY: #,#

With this PR, you could do:

>>> foo = parse.extract_numbers("1942: 4,5".replace(":",""), decimal=",")
>>> foo
[1942.0, 4.5]
>>> d = format.nice_year(datetime(year=int(foo[0]), month=1, day=1))
>>> d
'nineteen forty two'
>>> bar = d + ", " + format.pronounce_number(foo[1])
>>> bar
'nineteen forty two, four point five'

Pass that along to TTS and you've got some nice behavior.

@JarbasAl JarbasAl added enhancement New feature or request multi_lang relates to several languages labels Nov 2, 2020
JarbasAl added a commit to HelloChatterbox/lingua-nostra that referenced this pull request May 9, 2021
JarbasAl added a commit to HelloChatterbox/lingua-nostra that referenced this pull request May 9, 2021
rebase of MycroftAI#69

Co-authored-by: jarbasal <[email protected]>
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
rebase of MycroftAI#69

Co-authored-by: jarbasal <[email protected]>
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
port lingua_nostra/pull/20 - support decimal markers

rebase of MycroftAI#69

Co-authored-by: jarbasal <[email protected]>
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
feat/normalize_decimals

port lingua_nostra/pull/20 - support decimal markers

rebase of MycroftAI#69

Co-authored-by: jarbasal <[email protected]>
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
JarbasAl added a commit to OpenVoiceOS/ovos-lingua-franca that referenced this pull request Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) enhancement New feature or request multi_lang relates to several languages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

extract_number() functions need to convert '#,#' to float('#.#')
3 participants