`get_errors` fails with invalid unicode error strings #561

oditorium · 2024-12-07T06:34:46Z

The current code from line 170 reads as follows

def get_errors(error_string):
    return ' '.join(
        line for line in error_string.decode(DEFAULT_ENCODING).splitlines()
    ).strip()

However, tesseract can sometimes produce invalid Unicode values in error_string (eg MacOS Sequioa 15.1.1, see #562 ) in which case this raises an exception and the original error message is lost.

The fix is easy and probably uncontroversial: just add , errors="replace" and this fails gracefully

def get_errors(error_string):
    return ' '.join(
        line for line in error_string.decode(DEFAULT_ENCODING, errors="replace").splitlines()
    ).strip()

The text was updated successfully, but these errors were encountered:

bozhodimitrov · 2024-12-07T12:00:58Z

Silencing (aka "replacing") errors is not a fix, but just a workaround and a recipe for a disaster.

oditorium · 2024-12-07T12:04:14Z

Silencing errors is not a fix, but just a workaround and a recipe for a disaster.

This does the opposite -- it is not silencing the error but it actually allows you access to the original error message that gets lost otherwise. Essentially without this fix you will permanently fail with a unicode error and you will find it hard to understand what went wrong. Here you still fail -- worst case the new error message is as useless as the previous one. But usually it is stricly better because rather than getting a completely meaningless unicode error message you will get a message that more or less tells you what went wrong, with some funny characters added (see #562)

Surely this here

OCR failed [(1, 'Error in fopenReadStream: failed to open locally with tail
 tess_pqyhbx6k_input.JPEG for filename /tmp/tess_pqyhbx6k_input.JPEG 
Leptonica Error in findFileFormat: image file not found: /tmp/tess_pqyhbx6k_input.JPEG 
Error in fopenReadStream: failed to open locally with tail ���� for 
filename ���� Leptonica Error in pixRead: image file not found: ���� 
Image file ���� cannot be read! Error during processing.')]

is better than

Unicode error in get_error

bozhodimitrov · 2024-12-07T12:22:59Z

Yes, I understand your point, but the underlying fundamental problem here is that there is non UTF-8 character encoding/decoding errors, which are masked as the replacement character instead of fixing the actual issue, which is to use the proper decoding codec, which will result in no errors.

Your solution is partly acceptable, because it allows the user to get a better error message from the third-party layer tesseract.
But I am pretty sure it will break current expected logic, because the behavior changes, no?

Does the OCR failed ... message comes from a raised TesseractError exception?

oditorium · 2024-12-07T12:28:16Z

Yes, I understand your point, but the underlying fundamental problem here is that there is non UTF-8 encoding errors, which are masked as the replacement character instead of fixing the actual issue, which is to use the proper decoding codec, which will result in no errors.

Your solution is partly acceptable, because it allows the user to get a better error message from the third-party layer tesseract. But I am pretty sure it will break current expected logic, because the behavior changes, no?

Does the OCR failed ... message comes from a raised TesseractError exception?

It is a thin wrapper -- the part between the [.] is what you get when you print the exception that is raised. Without this change you essentially get [Unicode error in get_error] which is completely useless.

I agree that in an ideal world it would be better to address the unicode error -- but surely this solution is strictly better than what it does now? Currently you get the feedback "an error occured somewhere and I won't tell you where". After this you get the feedback "this error occurred. Sorry for the unreadable character".

I would be surprised if this would break something -- all it does is to pass on a useful error message upstream.

bozhodimitrov · 2024-12-07T12:36:31Z

Can you provide an example of such broken input for the purposes of unit testing?
And when you report issues consider this approach: The most important skill I've learned for software engineering (beginner) anthony explains #340

Not trying to teach you, just wanted to share that such info will make it a lot easier to accept such requests.

oditorium · 2024-12-07T12:57:39Z

Can you provide an example of such broken input for the purposes of unit testing? And when you report issues consider this approach: The most important skill I've learned for software engineering (beginner) anthony explains #340

Not trying to teach you, just wanted to share that such info will make it a lot easier to accept such requests.

Thanks

Unfortunately all I have is what is written here and in #562 -- the underlying error was that the temp directory was badly chosen (/tmp in recent MacOS is not a good choice, but apparently the tests don't show that). It took me maybe 2 hours to get to the underlying error message. But if you can replicate #562 you will be able to replicate #561. In other words -- if you have access to a fully up to date Mac and just remove the TMPDIR environment variable you should be able to replicate #562. Or if you want to force it set export TMPDIR=/tmp but be careful not to do this globally because this will break a lot of other stuff.

bozhodimitrov · 2024-12-07T14:23:51Z

The temp dir used by pytesseract is just the builtin option chosen by the standard library of Python itself.
So this is not a concern. And the user can modify that as you pointed. So if we want a failure test case, we need to reproduce the error. Otherwise there is no point of reporting such issue. This is like saying - the environment is broken, adjust the code to work in broken environments. So my answer to this is simply, no.

oditorium · 2024-12-08T18:43:24Z

as I said — run it on MacOs with TMPDIR not set. or dummy up an exception where the string is improperly encoded. my point is — replace is better than what is currently there in every imaginable case. but if you don’t agree this is fine. it works for me now.

…

________________________________ From: Bozho Dimitrov ***@***.***> Sent: Saturday, December 7, 2024 2:24:13 PM To: madmaze/pytesseract ***@***.***> Cc: Stefan Loesch ***@***.***>; Author ***@***.***> Subject: Re: [madmaze/pytesseract] `get_errors` fails with invalid unicode error strings (Issue #561) The temp dir used by pytesseract is just the builtin option chosen by the standard library of Python itself. So this is not a concern. And the user can modify that as you pointed. So if we want a failure test case, we need to reproduce the error. Otherwise there is no point of reporting such issue. — Reply to this email directly, view it on GitHub<#561 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACADTXAFSXCQ5BJIG6HCBM32EMAI3AVCNFSM6AAAAABTF5VX3OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMRVGE4DOMJVHE>. You are receiving this because you authored the thread.Message ID: ***@***.***>

oditorium mentioned this issue Dec 7, 2024

Failure on MacOS with supervisord #562

Closed

bozhodimitrov closed this as completed Dec 7, 2024

bozhodimitrov reopened this Dec 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`get_errors` fails with invalid unicode error strings #561

`get_errors` fails with invalid unicode error strings #561

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 7, 2024

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 8, 2024 via email

get_errors fails with invalid unicode error strings #561

get_errors fails with invalid unicode error strings #561

Comments

oditorium commented Dec 7, 2024 • edited Loading

bozhodimitrov commented Dec 7, 2024 • edited Loading

oditorium commented Dec 7, 2024 • edited Loading

bozhodimitrov commented Dec 7, 2024 • edited Loading

oditorium commented Dec 7, 2024

bozhodimitrov commented Dec 7, 2024 • edited Loading

oditorium commented Dec 7, 2024 • edited Loading

bozhodimitrov commented Dec 7, 2024 • edited Loading

oditorium commented Dec 8, 2024 via email

`get_errors` fails with invalid unicode error strings #561

`get_errors` fails with invalid unicode error strings #561

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading

oditorium commented Dec 7, 2024 •

edited

Loading

bozhodimitrov commented Dec 7, 2024 •

edited

Loading