`get_errors` fails with invalid unicode error strings #561
Comments
Silencing (aka "replacing") errors is not a fix, but just a workaround and a recipe for disaster. |
This does the opposite -- it is not silencing the error; it actually gives you access to the original error message, which otherwise gets lost. Essentially, without this fix you will permanently fail with a unicode error and you will find it hard to understand what went wrong. Here you still fail -- worst case, the new error message is as useless as the previous one. But usually it is strictly better, because rather than getting a completely meaningless unicode error message you will get a message that more or less tells you what went wrong, with some funny characters added (see #562). Surely this here
is better than
|
Yes, I understand your point, but the underlying fundamental problem here is that there are non-UTF-8 character encoding/decoding errors, which are masked with the replacement character instead of fixing the actual issue, which is to use the proper decoding codec, which will result in no errors. Your solution is partly acceptable, because it allows the user to get a better error message from the third-party layer tesseract. Does the |
It is a thin wrapper -- the part between the I agree that in an ideal world it would be better to address the unicode error -- but surely this solution is strictly better than what it does now? Currently you get the feedback "an error occurred somewhere and I won't tell you where". After this you get the feedback "this error occurred. Sorry for the unreadable character". I would be surprised if this broke something -- all it does is pass a useful error message on upstream. |
Can you provide an example of such broken input for the purposes of unit testing? Not trying to teach you, just wanted to share that such info will make it a lot easier to accept such requests. |
Thanks Unfortunately all I have is what is written here and in #562 -- the underlying error was that the temp directory was badly chosen ( |
The temp dir used by pytesseract is just the builtin option chosen by the standard library of Python itself. |
As I said, run it on macOS with TMPDIR not set, or dummy up an exception where the string is improperly encoded. My point is: replace is better than what is currently there in every imaginable case. But if you don't agree, this is fine. It works for me now.
Quoted from Bozho Dimitrov's reply (Saturday, December 7, 2024): The temp dir used by pytesseract is just the builtin option chosen by the standard library of Python itself. So this is not a concern. And the user can modify that, as you pointed out. So if we want a failure test case, we need to reproduce the error. Otherwise there is no point in reporting such an issue.
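For anyone wanting a starting point for the failure test case discussed above, here is a rough sketch of the "dummy up an improperly encoded string" idea. It does not call pytesseract at all; the byte string, the path in it, and the test class name are made up for illustration, and UTF-8 is assumed as the decoding codec.

```python
import unittest

# Hypothetical stand-in for tesseract's stderr: a Latin-1 encoded "é" (0xE9)
# followed by an ASCII byte is not a valid UTF-8 sequence.
BAD_ERROR_BYTES = "Error opening file /tmp/caf\u00e9/out.txt".encode("latin-1")


class InvalidUnicodeErrorString(unittest.TestCase):
    def test_strict_decode_raises(self):
        # Strict decoding hides the original message behind a UnicodeDecodeError.
        with self.assertRaises(UnicodeDecodeError):
            BAD_ERROR_BYTES.decode("utf-8")

    def test_replace_keeps_message(self):
        # errors="replace" keeps the readable part and marks the bad byte.
        text = BAD_ERROR_BYTES.decode("utf-8", errors="replace")
        self.assertIn("Error opening file", text)
        self.assertIn("\ufffd", text)
```

Saved in a test module, this can be run with `python -m unittest`. It only demonstrates the decoding behaviour, not the macOS temp-directory situation from #562.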
|
The current code from line 170 reads as follows
However, tesseract can sometimes produce invalid Unicode values in `error_string` (e.g. macOS Sequoia 15.1.1, see #562), in which case this raises an exception and the original error message is lost.
The fix is easy and probably uncontroversial: just add `, errors="replace"` and this fails gracefully.
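To illustrate the difference, here is a minimal sketch, assuming the error string arrives as bytes and is decoded as UTF-8. The two function names and the sample bytes are hypothetical and are not pytesseract's actual code; only `bytes.decode(..., errors="replace")` is standard Python.

```python
def get_errors_strict(error_bytes: bytes, encoding: str = "utf-8") -> str:
    """Strict decoding: raises UnicodeDecodeError on any invalid byte."""
    return " ".join(error_bytes.decode(encoding).splitlines()).strip()


def get_errors_replace(error_bytes: bytes, encoding: str = "utf-8") -> str:
    """Same, but invalid bytes become U+FFFD instead of raising."""
    decoded = error_bytes.decode(encoding, errors="replace")
    return " ".join(decoded.splitlines()).strip()


# Made-up stderr bytes; 0xFF can never appear in valid UTF-8.
sample = b"Error, could not create output file: \xff\nNo such file or directory\n"

try:
    get_errors_strict(sample)
except UnicodeDecodeError as exc:
    # The original tesseract message is lost; only the codec complaint remains.
    print("strict decode failed:", exc.reason)

# The message survives, with the bad byte shown as a replacement character.
print(get_errors_replace(sample))
```

With the replacing variant the caller still fails, but with the original message attached rather than a decoding error.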