Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

get_errors fails with invalid unicode error strings #561

Open
oditorium opened this issue Dec 7, 2024 · 8 comments
Open

get_errors fails with invalid unicode error strings #561

oditorium opened this issue Dec 7, 2024 · 8 comments

Comments

@oditorium
Copy link

oditorium commented Dec 7, 2024

The current code from line 170 reads as follows

def get_errors(error_string):
    return ' '.join(
        line for line in error_string.decode(DEFAULT_ENCODING).splitlines()
    ).strip()

However, tesseract can sometimes produce invalid Unicode values in error_string (eg MacOS Sequioa 15.1.1, see #562 ) in which case this raises an exception and the original error message is lost.

The fix is easy and probably uncontroversial: just add , errors="replace" and this fails gracefully

def get_errors(error_string):
    return ' '.join(
        line for line in error_string.decode(DEFAULT_ENCODING, errors="replace").splitlines()
    ).strip()
@bozhodimitrov
Copy link
Collaborator

bozhodimitrov commented Dec 7, 2024

Silencing (aka "replacing") errors is not a fix, but just a workaround and a recipe for a disaster.

@oditorium
Copy link
Author

oditorium commented Dec 7, 2024

Silencing errors is not a fix, but just a workaround and a recipe for a disaster.

This does the opposite -- it is not silencing the error but it actually allows you access to the original error message that gets lost otherwise. Essentially without this fix you will permanently fail with a unicode error and you will find it hard to understand what went wrong. Here you still fail -- worst case the new error message is as useless as the previous one. But usually it is stricly better because rather than getting a completely meaningless unicode error message you will get a message that more or less tells you what went wrong, with some funny characters added (see #562)

Surely this here

OCR failed [(1, 'Error in fopenReadStream: failed to open locally with tail
 tess_pqyhbx6k_input.JPEG for filename /tmp/tess_pqyhbx6k_input.JPEG 
Leptonica Error in findFileFormat: image file not found: /tmp/tess_pqyhbx6k_input.JPEG 
Error in fopenReadStream: failed to open locally with tail ���� for 
filename ���� Leptonica Error in pixRead: image file not found: ���� 
Image file ���� cannot be read! Error during processing.')]

is better than

Unicode error in get_error

@bozhodimitrov
Copy link
Collaborator

bozhodimitrov commented Dec 7, 2024

Yes, I understand your point, but the underlying fundamental problem here is that there is non UTF-8 character encoding/decoding errors, which are masked as the replacement character instead of fixing the actual issue, which is to use the proper decoding codec, which will result in no errors.

Your solution is partly acceptable, because it allows the user to get a better error message from the third-party layer tesseract.
But I am pretty sure it will break current expected logic, because the behavior changes, no?

Does the OCR failed ... message comes from a raised TesseractError exception?

@bozhodimitrov bozhodimitrov reopened this Dec 7, 2024
@oditorium
Copy link
Author

Yes, I understand your point, but the underlying fundamental problem here is that there is non UTF-8 encoding errors, which are masked as the replacement character instead of fixing the actual issue, which is to use the proper decoding codec, which will result in no errors.

Your solution is partly acceptable, because it allows the user to get a better error message from the third-party layer tesseract. But I am pretty sure it will break current expected logic, because the behavior changes, no?

Does the OCR failed ... message comes from a raised TesseractError exception?

It is a thin wrapper -- the part between the [.] is what you get when you print the exception that is raised. Without this change you essentially get [Unicode error in get_error] which is completely useless.

I agree that in an ideal world it would be better to address the unicode error -- but surely this solution is strictly better than what it does now? Currently you get the feedback "an error occured somewhere and I won't tell you where". After this you get the feedback "this error occurred. Sorry for the unreadable character".

I would be surprised if this would break something -- all it does is to pass on a useful error message upstream.

@bozhodimitrov
Copy link
Collaborator

bozhodimitrov commented Dec 7, 2024

Can you provide an example of such broken input for the purposes of unit testing?
And when you report issues consider this approach: The most important skill I've learned for software engineering (beginner) anthony explains #340

Not trying to teach you, just wanted to share that such info will make it a lot easier to accept such requests.

@oditorium
Copy link
Author

oditorium commented Dec 7, 2024

Can you provide an example of such broken input for the purposes of unit testing? And when you report issues consider this approach: The most important skill I've learned for software engineering (beginner) anthony explains #340

Not trying to teach you, just wanted to share that such info will make it a lot easier to accept such requests.

Thanks

Unfortunately all I have is what is written here and in #562 -- the underlying error was that the temp directory was badly chosen (/tmp in recent MacOS is not a good choice, but apparently the tests don't show that). It took me maybe 2 hours to get to the underlying error message. But if you can replicate #562 you will be able to replicate #561. In other words -- if you have access to a fully up to date Mac and just remove the TMPDIR environment variable you should be able to replicate #562. Or if you want to force it set export TMPDIR=/tmp but be careful not to do this globally because this will break a lot of other stuff.

@bozhodimitrov
Copy link
Collaborator

bozhodimitrov commented Dec 7, 2024

The temp dir used by pytesseract is just the builtin option chosen by the standard library of Python itself.
So this is not a concern. And the user can modify that as you pointed. So if we want a failure test case, we need to reproduce the error. Otherwise there is no point of reporting such issue. This is like saying - the environment is broken, adjust the code to work in broken environments. So my answer to this is simply, no.

@oditorium
Copy link
Author

oditorium commented Dec 8, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants