Handling Preceding Zeroes #204

firebladed · 2021-07-12T17:16:48Z

Describe the bug
Zeroes preceding a non-zero digit are ignored, either at the start of the input or following a pause.

The problem is partly related to the unpredictability of pauses when number sequences are read aloud,
as
"0 1 4 6 0 6" is correctly interpreted as [0.0, 1.0, 4.0, 6.0, 0.0, 6.0]
but "01 46 06" incorrectly goes to [1.0, 46.0, 6.0]

To Reproduce
Steps to reproduce the behavior:

install lingua_franca
open python3

>>> from lingua_franca import load_languages, set_default_lang, parse
>>> from lingua_franca.parse import extract_numbers
>>> load_languages(['en'])

>>> extract_numbers("010 101")
[10.0, 101.0]
>>> extract_numbers("01 010 101")
[1.0, 10.0, 101.0]
>>> extract_numbers("51 21 05")
[51.0, 21.0, 5.0]
>>> extract_numbers("01 46 06")
[1.0, 46.0, 6.0]

Expected behavior

Zeroes should be added to the output as separate numbers.

I think zeroes preceding a single non-zero digit should be treated as a separate number, either by default or as an option.

e.g.
"0 1" (zero one) -> [0, 1]
"01 46 06" (zero one four six zero six) -> [0, 1, 46, 0, 6]

Additional context
This is problematic when reading out code numbers, e.g. TOTP codes,
which could have a zero in any position and can be read aloud in multiple ways:

e.g. 0 1 4 6 0 6 (zero one four six zero six)
34 45 65 (three four four five six five, thirty four forty five sixty five)
234 567 (two hundred and thirty four five hundred and sixty seven)

One aspect I'm not sure of: should 46 read as "four six" be interpreted as [46] or [4, 6] when it precedes a decimal point (or when there is no decimal point)? After a decimal point it is different, as the "normal" reading is digit by digit, e.g. 0.01475 (zero point zero one four seven five).

However, "46" (forty six) can always be converted to "4 6", whereas missing zeroes cannot be recovered.

JarbasAl · 2021-07-12T17:44:25Z

you want to keep an eye on #150

EDIT: nvm, it's the reverse problem....

ChanceNCounter · 2021-07-12T18:34:27Z

Partially misplaced, I think. The format.pronounce_digits() function planned in #150 would apparently be a more appropriate function call for the suggested behavior.

However, I'm not sure if it retains leading zeroes at the moment, either, because it uses extract_number() along the way.

The fundamental challenge here is continuing to treat the input as a string while parsing.

Relating this back to the code side, the English number extractors "chunk" numbers as they go based on powers of 10. While parsing a base-10 number left-to-right, whenever you encounter a power of 10, you scan the remainder of the number for larger powers of 10. If you do not find any, you have identified the end of a "place."

"1,075,018" -> 1000000 | 75,000 | 18 -> sum() -> 1075018
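
As a rough illustration of the chunking idea (not the actual lingua_franca extractor, which works on tokens and is more involved), the sketch below splits a comma-grouped numeral into its "places" and sums them. The helper name split_places is hypothetical. It also shows where leading zeroes disappear: each group becomes a numeric value as soon as it is parsed.

```python
def split_places(number_text):
    """Split a comma-grouped base-10 numeral into the value of each place."""
    groups = number_text.split(",")
    places = []
    for index, group in enumerate(groups):
        # Each group is worth 1000 times more than the group to its right.
        power = len(groups) - index - 1
        # int() turns "075" into 75: the leading zero is gone from here on.
        places.append(int(group) * 1000 ** power)
    return places


places = split_places("1,075,018")
print(places)       # [1000000, 75000, 18]
print(sum(places))  # 1075018
```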

ChanceNCounter · 2021-07-13T01:39:40Z

I stand corrected. In the current version of the PR, format.pronounce_digits() does indeed preserve leading zeroes:

>>> format.pronounce_digits("014606")
'zero one four six zero six'

ChanceNCounter · 2021-07-15T23:21:41Z

On reflection, the "fail" case above is OOS. If the input appears to mean something specific - "46" == 46.0 - LF can't account for whether the program calling its parsers meant to feed it "46".

I vote one of two things:

1. Add a sugar parameter extract_numbers(..., max_digits=0), where falsy values retain the current behavior
   - Pros: sugar, function signature isn't very long
   - Cons: edge case, needs localization and some non-English extractors already need work
2. wontfix

krisgesling · 2021-07-16T00:51:36Z

Hey @firebladed,

If we're looking at STT output, another option might be something like an extract_digits() method that intentionally pulls out all the digits in a string as individual numbers. I think this will be more straightforward than trying to determine when people meant to have digits expressed together or not.
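
A minimal sketch of what such a helper might look like, assuming the input already contains numerals rather than spelled-out words. The name extract_digits comes from the suggestion above; this function is hypothetical and not part of lingua_franca, and a real implementation would also need to handle words like "zero" and other languages.

```python
def extract_digits(text):
    """Return every ASCII digit in text as an individual int, in order."""
    # Only the characters 0-9 are kept; spaces, letters and punctuation are skipped.
    return [int(char) for char in text if char in "0123456789"]


print(extract_digits("01 46 06"))  # [0, 1, 4, 6, 0, 6]
```

Because every digit is taken on its own, the grouping of the input no longer matters and leading zeroes are preserved.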

Can anyone think of cases other than codes or phone numbers, where this would come up?

If it won't be supported in the extract_number(s) methods we probably need to add a note to the docstring that leading zeroes will be ignored.

Probably not what you're referring to, but just in case...
If it's something that you know is a number like a TOTP or PIN returned from another system, then I'd suggest that extract_numbers() is probably overkill. For example, if you typecast the string to a list you get your list of digits:

>>> totp = "012345"
>>> list(totp)
['0', '1', '2', '3', '4', '5']

If there might be spaces in the source:

>>> totp = "01 2 3 45"
>>> list(totp.replace(" ",""))
['0', '1', '2', '3', '4', '5']

or if the source may be an int you would need to do something slightly more verbose:

>>> totp = 123456   # note an int cannot have a leading zero
>>> [digit for digit in str(totp)]
['1', '2', '3', '4', '5', '6']

This could possibly act as a workaround for the STT case:

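from lingua_franca.parse import extract_numbers

# Compare everything as digit strings; totp is the expected code as a str.
extracted_codes = [utterance.replace(" ", "")]  # keeps leading zeroes
numbers = extract_numbers(utterance)
if numbers:
    # extract_numbers() returns floats, so 12345.0 becomes "12345"
    extracted_codes.append(str(int(numbers[0])))
if totp in extracted_codes:
    authenticated = True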

firebladed added the bug (Something isn't working) label on Jul 12, 2021
