Research the option of using stdin/stdout instead of saving the image on disk #172
Comments
Hi @cgallay, this question is also relevant for stdout.
I found some issues with the tesseract stdin/stdout, and some modes/versions are affected.
It seems that the problems with stdin are fixed in tesseract 5.0, but testing is needed.
Hi, I wanted to ask if this feature has been added in the latest release of pytesseract. Saving the image to disk for the purpose of running the tesseract command is quite slow as of now (I think that is what pytesseract is doing right now). Can I work on this feature if it's not implemented? (I will need help from you to implement it.)
At the moment pytesseract supports passing the path of the image itself, which will skip creating temp files on disk. You can also pass the path to a text file with a list of images for batch processing - this will also skip the pytesseract temp files. Both of those options are documented. About the feature itself - we first need to test the tesseract stdin on the most recent versions in order to be sure that it will work correctly with pytesseract.
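The two path-based options mentioned above could look roughly like this. It is a minimal sketch, assuming pytesseract and the tesseract binary are installed; `write_image_list` and `ocr_without_image_temp_files` are hypothetical helper names made up for illustration, and `images.txt` is a placeholder file name.

```python
from pathlib import Path


def write_image_list(image_paths, list_path):
    """Write one image path per line, for use as a batch input list."""
    list_path = Path(list_path)
    lines = "\n".join(str(p) for p in image_paths) + "\n"
    list_path.write_text(lines, encoding="utf-8")
    return list_path


def ocr_without_image_temp_files(image_paths, list_path="images.txt"):
    """OCR by path, so pytesseract has no in-memory image to save first."""
    import pytesseract  # imported lazily; assumed to be installed

    # Single image: pass the path string instead of a PIL image or array.
    first = pytesseract.image_to_string(str(image_paths[0]))

    # Batch: pass the path of a text file that lists the image paths.
    batch_file = write_image_list(image_paths, list_path)
    batch = pytesseract.image_to_string(str(batch_file))
    return first, batch
```

This only helps when the images already exist as files; an image held in memory still has to be written somewhere before tesseract can read it by path.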
I just fiddled around with it and this does the trick, for tesseract 4.1.1 and 5.0.0-alpha-20201231-243-gff83 at least (I'm on Windows 10 with both tesseract versions in MSYS2). It does not really honor the interface yet, because it does not give you any output files, even if an extension is given, but I was just trying to get tesseract with stdin to work.
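The modification itself did not survive in this copy of the thread. As a rough illustration of the general idea only (not the commenter's patch and not pytesseract's implementation), tesseract's command line accepts `stdin` as the input name and `stdout` as the output base, so already-encoded image bytes can be piped through `subprocess`. The helper names `build_stdin_command` and `ocr_image_bytes` are hypothetical.

```python
import shlex
import subprocess


def build_stdin_command(tesseract_cmd="tesseract", lang=None, config=""):
    """Build an argv list that reads the image from stdin and prints text."""
    # "stdin" replaces the image path and "stdout" replaces the output base.
    cmd = [tesseract_cmd, "stdin", "stdout"]
    if lang:
        cmd += ["-l", lang]
    # Extra options such as "--psm 6" are split into separate arguments.
    cmd += shlex.split(config)
    return cmd


def ocr_image_bytes(image_bytes, **kwargs):
    """Pipe encoded image bytes (for example PNG) to tesseract; return text."""
    proc = subprocess.run(
        build_stdin_command(**kwargs),
        input=image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.decode("utf-8", errors="replace"))
    return proc.stdout.decode("utf-8")
```

With Pillow, the bytes could come from saving the image into an `io.BytesIO` buffer with `image.save(buf, format="PNG")` and passing `buf.getvalue()`. As noted in the comment, a sketch like this returns only what tesseract prints to stdout and produces no output files.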
Nice, you can always monkey patch the module-level function and make it work for you.
I just compiled version 3.05.02 from https://github.com/tesseract-ocr/tesseract/tree/3.05.02 and ran the same modification as above with no problems with traineddata from https://github.com/tesseract-ocr/tessdata/tree/3.04.00. I know it's just a sample and not a full-fledged test, but maybe that helps.
Yes, this is very helpful. This means that this feature can be added to pytesseract. And finally, based on the above - we should decide if this will be the default implementation (and if we should keep the old one around or not).
I just wrote a CLI wrapper for my own use, which uses stdin and stdout to communicate with the tesseract executable. Also I convert the
Hi there,
@GreenCobalt
@GreenCobalt you can try https://github.com/sirfz/tesserocr for your use case, but I am not sure what underlying version of the tesseract implementation is used.
Wow! It really works, but it does not work for OpenCV users. They should convert their image into a PIL Image.
updated run_and_get_output based on madmaze/pytesseract#172 (comment) to use stdin to avoid unnecessary disk writes, minor speed improvement
Hi,
I am wondering why you don't use the stdin argument to send the image to tesseract instead of saving it on the disk?
pytesseract/src/pytesseract.py
Line 208 in 25a9d38