Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/pronounce_digits #150

Draft
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

ChanceNCounter
Copy link
Contributor

No description provided.

works for smaller numbers, but sucks as soon as they're big.

*massive* WIP
@devs-mycroft devs-mycroft added the CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors) label Nov 3, 2020
treating each pair as a single number.

Examples:
>>> pronounce_number(127, all_digits=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pronounce_digits not pronounce_number

no_no_words = list(_SHORT_SCALE_EN.values())[:5]
no_no_words.append('and')
print(no_no_words)
print(result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug prints

@JarbasAl
Copy link
Collaborator

add unittests and this should be good, i'm not sure about the behaviour for all_digits=False, i bet there is some language where we will run into an issue like catalan having 4 ways to pronounce time....

@ChanceNCounter
Copy link
Contributor Author

I expect the variant decorator will be able to deal with that, if and when the time comes. I'm still getting my arms all the way around it, but it seems like you covered all the bases.

JarbasAl added a commit to HelloChatterbox/lingua-nostra that referenced this pull request May 9, 2021
@ChanceNCounter
Copy link
Contributor Author

@krisgesling This obviously needs a rebase, but the code itself is merged and working at Chatterbox, so I think it's good to go if you agree. Just need to get around to the rebase.

Copy link
Contributor

@krisgesling krisgesling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, it looks like a nice addition - thanks.

Before the rebase, can we address Jarbas's two comments above, and I've raised a few questions below.

I wrote a short suite of tests to help me think through different cases too - feel free to steal them or write new ones: e750f39
Commented out tests are how I expected the output, and the passing version of it should be directly above.

lingua_franca/lang/format_en.py Outdated Show resolved Hide resolved
lingua_franca/lang/format_en.py Outdated Show resolved Hide resolved
lingua_franca/lang/format_en.py Outdated Show resolved Hide resolved
result.insert(0, pronounce_number_en(int(op_val)))
if is_float:
result.append(decimal_part)
no_no_words = list(_SHORT_SCALE_EN.values())[:5]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we specifically care about the first 5 values? Is this just an optimisation because the chances of the rest being there are so slim?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it slices 2 or 3 digits at a time, the rest can't be there. Right now, I'm trying to remember why I included anything but 'hundred'.

result = " ".join(result)
else:
while len(op_val) > 1:
idx = -2 if len(op_val) in [2, 4] else -3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without first reading this code I wrote the following tests:

self.assertEqual(pronounce_digits(238513096), "twenty three eighty five thirteen zero ninety six")
self.assertEqual(pronounce_digits(238513696), "twenty three eighty five thirteen sixty nine six")

I like that you go from the end rather than beginning so the final numbers can be read closer to what they actually are - "ninety six".

However being a longer number, it ends up getting broken down into multiple groups of three so we get:

self.assertEqual(pronounce_digits(238513096), "two thirty eight five thirteen ninety six")

What's the intended outcome here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're aiming for speaking in two digit numbers, should we check for an odd number length, speak the first digit and then speak all remaining pairs? Something like:

if len(op_val) % 2 == 1:
  result.append(pronounce_number(op_val[0]))
  op_val = op_val[1:]
remaining_pairs = # some code
for pair in remaining_pairs:
  result.append(pronounce_number(pair))

Copy link
Contributor Author

@ChanceNCounter ChanceNCounter Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be speaking in pairs slightly more often than intended. It doesn't really work on large numbers, but my intention was to "end with" three digit groupings in most cases, which just sounded most natural to me.

I'm gonna go over the code again top to bottom tomorrow, but the gist is:

123 -> "one twenty three"
1234 -> "twelve thirty four"
12345 -> "twelve three forty five"
123456 -> "one twenty three four fifty six"

It's definitely bugged on large numbers atm. The above should be followed by "one two thirty four five sixty seven", but I'm getting "twelve thirty four five sixty seven".

Once you're looking at 9+ digits, I don't think the function is much use without all_digits:

>>> assert(format.pronounce_digits(238513096, all_digits=True) == "two three eight five one three zero nine six")
>>> 

(edit: "tomorrow" to commence mid-afternoon UTC")

no_no_words.append('and')
print(no_no_words)
print(result)
result = [word for word in result if word.strip() not in no_no_words]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where you think this might happen that we can test for? Or is it just a safety measure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens anytime the input is longer than two digits. The algorithm acts by running pronounce_number() on 2-3 digits at a time. This often returns the words hundred and and.

The latter stray debug print (=P) is the result prior to this operation:

>>> pronounce_digits(234534)
['two', 'hundred', 'and', 'thirty', 'four', 'five', 'hundred', 'and', 'thirty', 'four']
'two thirty four five thirty four'

pronounce_number(534), prepended with pronounce_number(234), sanitized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The strip, on the other hand, is probably unneeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA: Yes Contributor License Agreement exists (see https://github.com/MycroftAI/contributors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants