-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature/pronounce_digits #150
base: master
Are you sure you want to change the base?
feature/pronounce_digits #150
Conversation
works for smaller numbers, but sucks as soon as they're big. *massive* WIP
lingua_franca/format.py
Outdated
treating each pair as a single number. | ||
|
||
Examples: | ||
>>> pronounce_number(127, all_digits=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pronounce_digits not pronounce_number
lingua_franca/lang/format_en.py
Outdated
no_no_words = list(_SHORT_SCALE_EN.values())[:5] | ||
no_no_words.append('and') | ||
print(no_no_words) | ||
print(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug prints
add unittests and this should be good, i'm not sure about the behaviour for all_digits=False, i bet there is some language where we will run into an issue like catalan having 4 ways to pronounce time.... |
I expect the variant decorator will be able to deal with that, if and when the time comes. I'm still getting my arms all the way around it, but it seems like you covered all the bases. |
@krisgesling This obviously needs a rebase, but the code itself is merged and working at Chatterbox, so I think it's good to go if you agree. Just need to get around to the rebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, it looks like a nice addition - thanks.
Before the rebase, can we address Jarbas's two comments above, and I've raised a few questions below.
I wrote a short suite of tests to help me think through different cases too - feel free to steal them or write new ones: e750f39
Commented out tests are how I expected the output, and the passing version of it should be directly above.
lingua_franca/lang/format_en.py
Outdated
result.insert(0, pronounce_number_en(int(op_val))) | ||
if is_float: | ||
result.append(decimal_part) | ||
no_no_words = list(_SHORT_SCALE_EN.values())[:5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we specifically care about the first 5 values? Is this just an optimisation because the chances of the rest being there are so slim?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it slices 2 or 3 digits at a time, the rest can't be there. Right now, I'm trying to remember why I included anything but 'hundred'.
lingua_franca/lang/format_en.py
Outdated
result = " ".join(result) | ||
else: | ||
while len(op_val) > 1: | ||
idx = -2 if len(op_val) in [2, 4] else -3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without first reading this code I wrote the following tests:
self.assertEqual(pronounce_digits(238513096), "twenty three eighty five thirteen zero ninety six")
self.assertEqual(pronounce_digits(238513696), "twenty three eighty five thirteen sixty nine six")
I like that you go from the end rather than beginning so the final numbers can be read closer to what they actually are - "ninety six".
However being a longer number, it ends up getting broken down into multiple groups of three so we get:
self.assertEqual(pronounce_digits(238513096), "two thirty eight five thirteen ninety six")
What's the intended outcome here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're aiming for speaking in two digit numbers, should we check for an odd number length, speak the first digit and then speak all remaining pairs? Something like:
if len(op_val) % 2 == 1:
result.append(pronounce_number(op_val[0]))
op_val = op_val[1:]
remaining_pairs = # some code
for pair in remaining_pairs:
result.append(pronounce_number(pair))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be speaking in pairs slightly more often than intended. It doesn't really work on large numbers, but my intention was to "end with" three digit groupings in most cases, which just sounded most natural to me.
I'm gonna go over the code again top to bottom tomorrow, but the gist is:
123 -> "one twenty three"
1234 -> "twelve thirty four"
12345 -> "twelve three forty five"
123456 -> "one twenty three four fifty six"
It's definitely bugged on large numbers atm. The above should be followed by "one two thirty four five sixty seven", but I'm getting "twelve thirty four five sixty seven".
Once you're looking at 9+ digits, I don't think the function is much use without all_digits
:
>>> assert(format.pronounce_digits(238513096, all_digits=True) == "two three eight five one three zero nine six")
>>>
(edit: "tomorrow" to commence mid-afternoon UTC")
lingua_franca/lang/format_en.py
Outdated
no_no_words.append('and') | ||
print(no_no_words) | ||
print(result) | ||
result = [word for word in result if word.strip() not in no_no_words] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any case where you think this might happen that we can test for? Or is it just a safety measure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens anytime the input is longer than two digits. The algorithm acts by running pronounce_number()
on 2-3 digits at a time. This often returns the words hundred
and and
.
The latter stray debug print (=P) is the result prior to this operation:
>>> pronounce_digits(234534)
['two', 'hundred', 'and', 'thirty', 'four', 'five', 'hundred', 'and', 'thirty', 'four']
'two thirty four five thirty four'
pronounce_number(534)
, prepended with pronounce_number(234)
, sanitized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strip, on the other hand, is probably unneeded.
No description provided.