Multiple output formats #159
As a matter of fact, pytesseract partially supports this scenario: you can use the config argument to pass the second extension (which is a weird way of specifying both output extensions). I am OK with solutions 1 and 3. But for 2, we need to agree that those will be the final function signatures for run_tesseract/run_and_get_output. Judging by the commits, we change those every year, so I prefer not to expose them, or if we do expose them, there should be a clear default warning that the interface is not final. My vote is for 3, and we can use image_to_outputs or something like that.
OK for solution number 3 with a new function. Regarding the output, I see multiple scenarios:
I like the tuple approach (3) the most, since we can also extend it (if we are crazy enough :D). As for the output, I guess we can keep the same approach as before: asking the user for the Output format and returning that (again) as a tuple. Let's think a bit on this, because if we do it nice and simple now, we will have fewer problems maintaining it in the future.
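To make the tuple idea concrete, here is a minimal sketch of what such a function could look like. It is not part of pytesseract: the name image_to_outputs is only the one floated above, and the `extensions` and `runner` parameters are hypothetical. It assumes the tesseract executable is on PATH and that each requested format is written to `<outputbase>.<ext>`, which holds for pdf, tsv, hocr and txt.

```python
import subprocess
import tempfile
from pathlib import Path


def image_to_outputs(image_path, extensions=("pdf", "tsv"), runner=subprocess.run):
    """Hypothetical helper: one tesseract call, one bytes object per extension.

    `runner` is only injectable so the function can be exercised without
    tesseract installed; by default it is subprocess.run.
    """
    if not extensions:
        raise ValueError("at least one extension is required")
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp) / "out"
        # CLI layout: tesseract <image> <outputbase> [configfile ...]
        runner(["tesseract", str(image_path), str(base), *extensions], check=True)
        # Read every output before the temporary directory is removed,
        # and return them in the order the caller asked for.
        return tuple(Path(f"{base}.{ext}").read_bytes() for ext in extensions)
```

A caller would unpack the result in the same order as the request, for example `pdf_bytes, tsv_bytes = image_to_outputs("page.png", ("pdf", "tsv"))`. Returning raw bytes sidesteps the question of per-format decoding, which is one of the points still open in this discussion.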
New function signature
Implementation
For part 1: we can reuse
For part 2: I think we need to develop a
I'd love to see this feature as well. I see there have been no updates since 2018, so I'm picking this back up.
I would also like to see this feature.
Hi,
Tesseract feature
Tesseract allows a single call to produce multiple output formats, for example:
This will generate an out.pdf and an out.tsv, so a single run yields both OCR results in a format Python can read and a searchable PDF.
Producing both formats at the same time is interesting because, in my experience, it is twice as fast. I believe this is because the OCR computation is not redone.
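As an illustration of the single-call behaviour described above, the sketch below builds and runs one tesseract command from Python. The helper names and the file names (page.png, out) are made up for this example; the only thing taken from Tesseract itself is the command-line layout `tesseract <image> <outputbase> [configfile ...]`, where pdf and tsv are config names.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, formats):
    """Return the argv list for one tesseract call producing several formats."""
    if not formats:
        raise ValueError("at least one output format is required")
    # Each format name (pdf, tsv, hocr, txt) is passed as a config file name.
    return ["tesseract", str(image_path), str(output_base), *formats]


def run_multi_output(image_path, output_base, formats=("pdf", "tsv")):
    """Run tesseract once and return the paths it is expected to write."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, formats), check=True
    )
    # Assumption: each format is written to <output_base>.<format>.
    return [Path(f"{output_base}.{fmt}") for fmt in formats]
```

With tesseract installed, `run_multi_output("page.png", "out")` would be expected to leave out.pdf and out.tsv next to each other after a single OCR pass.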
Not possible with pytesseract
But using this feature is not possible with pytesseract, since you expose only task-specific functions (one for each task).
Potential solutions
image_to_pdf_or_hocr: make it accept an extension such as pdf tsv, meaning modifying run_tesseract and (with some precaution on extension) run_and_get_output