How to fix OCR errors? #1254

endolith · 2024-02-17T16:18:49Z

This works pretty well, but is not perfect. How can I modify the invisible text to fix OCR errors without otherwise affecting the layout of the document? For example, I want to change society. ' into society.¹.

I found a lot of people asking for this:

endolith · 2024-02-17T16:21:21Z

endolith
Feb 17, 2024
Author

One comment says:

With ocrmypdf, you could run --output-type hocr -k to generate the .hocr file, which contains an HTML-like description of the recognized characters, edit this file, and then manually use hocrtransform to finish the file. It won't be exactly right unless you also fiddle with the pixel coordinates.

3 replies

endolith Feb 17, 2024
Author

But that doesn't work:

ocrmypdf: error: argument --output-type: invalid choice: 'hocr' (choose from 'pdfa', 'pdf', 'pdfa-1', 'pdfa-2', 'pdfa-3')

endolith Feb 17, 2024
Author

Oh I see:

> ocrmypdf -l eng -k --pdf-renderer hocr "png2pdf.pdf" "ocred.pdf"
… 
Temporary working files retained at:
/tmp/com.github.ocrmypdf.7791wzq8

Which creates a bunch of files in the temp folder:

/tmp/com.github.ocrmypdf.7791wzq8:
000001_ocr.png        000003_ocr_hocr.pdf   000005_rasterize.png  000008_ocr_hocr.hocr  000010_ocr_hocr.txt   000013_ocr.png        000015_ocr_hocr.pdf   000017_rasterize.png  000020_ocr_hocr.hocr  origin.pdf@
000001_ocr_hocr.hocr  000003_ocr_hocr.txt   000006_ocr.png        000008_ocr_hocr.pdf   000010_rasterize.png  000013_ocr_hocr.hocr  000015_ocr_hocr.txt   000018_ocr.png        000020_ocr_hocr.pdf   pdfa.pdf
000001_ocr_hocr.pdf   000003_rasterize.png  000006_ocr_hocr.hocr  000008_ocr_hocr.txt   000011_ocr.png        000013_ocr_hocr.pdf   000015_rasterize.png  000018_ocr_hocr.hocr  000020_ocr_hocr.txt   pdfa.ps
000001_ocr_hocr.txt   000004_ocr.png        000006_ocr_hocr.pdf   000008_rasterize.png  000011_ocr_hocr.hocr  000013_ocr_hocr.txt   000016_ocr.png        000018_ocr_hocr.pdf   000020_rasterize.png
000001_rasterize.png  000004_ocr_hocr.hocr  000006_ocr_hocr.txt   000009_ocr.png        000011_ocr_hocr.pdf   000013_rasterize.png  000016_ocr_hocr.hocr  000018_ocr_hocr.txt   debug.log
000002_ocr.png        000004_ocr_hocr.pdf   000006_rasterize.png  000009_ocr_hocr.hocr  000011_ocr_hocr.txt   000014_ocr.png        000016_ocr_hocr.pdf   000018_rasterize.png  fix_docinfo.pdf@
000002_ocr_hocr.hocr  000004_ocr_hocr.txt   000007_ocr.png        000009_ocr_hocr.pdf   000011_rasterize.png  000014_ocr_hocr.hocr  000016_ocr_hocr.txt   000019_ocr.png        graft_layers.pdf
000002_ocr_hocr.pdf   000004_rasterize.png  000007_ocr_hocr.hocr  000009_ocr_hocr.txt   000012_ocr.png        000014_ocr_hocr.pdf   000016_rasterize.png  000019_ocr_hocr.hocr  images/
000002_ocr_hocr.txt   000005_ocr.png        000007_ocr_hocr.pdf   000009_rasterize.png  000012_ocr_hocr.hocr  000014_ocr_hocr.txt   000017_ocr.png        000019_ocr_hocr.pdf   metafix.pdf
000002_rasterize.png  000005_ocr_hocr.hocr  000007_ocr_hocr.txt   000010_ocr.png        000012_ocr_hocr.pdf   000014_rasterize.png  000017_ocr_hocr.hocr  000019_ocr_hocr.txt   optimize.opt.pdf
000003_ocr.png        000005_ocr_hocr.pdf   000007_rasterize.png  000010_ocr_hocr.hocr  000012_ocr_hocr.txt   000015_ocr.png        000017_ocr_hocr.pdf   000019_rasterize.png  optimize.pdf
000003_ocr_hocr.hocr  000005_ocr_hocr.txt   000008_ocr.png        000010_ocr_hocr.pdf   000012_rasterize.png  000015_ocr_hocr.hocr  000017_ocr_hocr.txt   000020_ocr.png        origin@

So then I have to modify the …_ocr_hocr.hocr files and not the …_ocr_hocr.txt files?
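
For anyone scripting this step, here is a minimal Python sketch that runs the same command and pulls the retained folder out of the log. It assumes the message text shown above stays the same between versions and that the path sits on the next non-empty line. `retained_folder` and `run_and_locate` are hypothetical helper names, not part of ocrmypdf.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Message text copied from the log output shown above.
MARKER = "Temporary working files retained at:"


def retained_folder(log: str) -> Optional[Path]:
    """Return the path on the first non-empty line after MARKER, or None."""
    lines = log.splitlines()
    for i, line in enumerate(lines):
        if MARKER in line:
            for following in lines[i + 1:]:
                if following.strip():
                    return Path(following.strip())
    return None


def run_and_locate(input_pdf: str, output_pdf: str) -> Optional[Path]:
    """Run the command from the post above with -k and find the kept folder."""
    result = subprocess.run(
        ["ocrmypdf", "-l", "eng", "-k", "--pdf-renderer", "hocr",
         input_pdf, output_pdf],
        capture_output=True,
        text=True,
        check=True,
    )
    # Search both streams, since the sketch does not assume which one is used.
    return retained_folder(result.stdout + "\n" + result.stderr)
```

Calling `run_and_locate("png2pdf.pdf", "ocred.pdf")` would then give the folder that holds the per-page `…_ocr_hocr.hocr` files.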

endolith Feb 17, 2024
Author

OK I made 72 small manual edits to the .hocr files… 😩 Now I need to figure out how to convert them back into a PDF.
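
If many of the edits are the same misread word, something like the following might save part of the manual work. It is only a sketch: `fix_hocr_text` and `fix_folder` are made-up helper names, the replacement table is illustrative, and it assumes each recognized word sits in its own text node, as in the hOCR that Tesseract usually writes. It only touches text between tags, so the bbox coordinates stored in the attributes are left alone.

```python
import html
import re
from pathlib import Path
from typing import Dict

# A text node: anything between ">" and the next "<". Attributes are skipped.
TEXT_NODE = re.compile(r">([^<>]+)(?=<)")


def fix_hocr_text(hocr: str, fixes: Dict[str, str]) -> str:
    """Replace whole words found in `fixes`, leaving tags and bboxes untouched."""

    def repl(match: "re.Match[str]") -> str:
        text = html.unescape(match.group(1))  # e.g. "&#39;" becomes "'"
        word = text.strip()
        if word not in fixes:
            return match.group(0)  # keep the original bytes for this node
        text = text.replace(word, fixes[word])
        return ">" + html.escape(text, quote=False)

    return TEXT_NODE.sub(repl, hocr)


def fix_folder(folder: Path, fixes: Dict[str, str]) -> int:
    """Rewrite every *_ocr_hocr.hocr file in `folder`; return how many changed."""
    changed = 0
    for page in sorted(folder.glob("*_ocr_hocr.hocr")):
        original = page.read_text(encoding="utf-8")
        edited = fix_hocr_text(original, fixes)
        if edited != original:
            page.write_text(edited, encoding="utf-8")
            changed += 1
    return changed
```

Note that a table like `{"'": "¹"}` would change every stray apostrophe word in the document, so one-off fixes such as the footnote marker are probably still safer to do by hand.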

jbarlow83 · 2024-02-17T20:40:06Z

jbarlow83
Feb 17, 2024
Maintainer

You can use the ocrmypdf API: api._pdf_to_hocr creates the hOCR files, and api._hocr_to_ocr_pdf then finishes the PDF. Together they split the normal pipeline into two steps. There is no command line option for this, because I don't think it would be very useful and I don't want to support people who try to manually edit hOCR files; there are lots of ways you can mess up.

You could do small edits this way, but it mostly amounts to changing words in place. It would take a full word processor to do more complex changes. The APIs are currently private because they are subject to change.

The author of gscan2pdf is rewriting that program in Python and planning to use ocrmypdf's features, including an interface to edit hOCR files. No idea of his timeline.

6 replies

endolith Feb 17, 2024
Author

Is that not available in the version installed by apt install ocrmypdf in WSL?

In [4]: ocrmypdf.__version__
Out[4]: '9.6.0+dfsg'

endolith Feb 17, 2024
Author

I tried the API but it doesn't seem to exist.

I tried the pdfbook solution in some comment but it won't build.

OMG why is it soooo difficult to just intercept the OCR and replace a few characters before it writes to PDF? This is like a day-long project just to replace a few characters.

endolith Feb 17, 2024
Author

Also not in the version installed by pip?

In [2]: ocrmypdf.__version__
Out[2]: '14.4.0'
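
A quick way to tell whether a given install exposes the two private functions is to probe for them by name. The sketch below is generic Python; `missing_names` is a hypothetical helper, and the only ocrmypdf-specific parts are the module and function names mentioned in the maintainer's reply.

```python
import importlib
from typing import Iterable, List


def missing_names(module_name: str, names: Iterable[str]) -> List[str]:
    """Return the names that `module_name` does not provide.

    If the module itself cannot be imported, every name counts as missing.
    """
    names = list(names)
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return names
    return [name for name in names if not hasattr(module, name)]


# Usage, run inside the environment you want to check:
# missing_names("ocrmypdf.api", ["_pdf_to_hocr", "_hocr_to_ocr_pdf"])
```

An empty list means both functions are present in that environment; anything else lists what is missing.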

endolith Feb 18, 2024
Author

OK I finally got it working on a different machine:

sudo apt install ocrmypdf
conda create --name ocrmypdf pip ipython
conda activate ocrmypdf
pip install git+https://github.com/ocrmypdf/OCRmyPDF.git
ipython

Then inside ipython:

import ocrmypdf
from pathlib import Path
ocrmypdf.api._pdf_to_hocr(input_pdf=Path("png2pdf.pdf"), output_folder=Path("./output"))
ocrmypdf.api._hocr_to_ocr_pdf(work_folder=Path("./output/"), output_file=Path("OCRed.pdf"))

Without any text modifications it outputs slightly different text than what I had before, with extra line breaks, but I guess that's from newer versions of various things.
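
Putting the pieces together, the two calls above can be wrapped around an edit step. This is a rough sketch, not an official workflow: `ocr_with_edits` is a hypothetical helper, the keyword arguments are the ones used in the calls above, and it assumes the hOCR pages land as `*.hocr` files directly in the work folder. Both functions are private, so they may change or disappear in other versions. The two steps can be swapped out through parameters, which also makes the wrapper testable without ocrmypdf installed.

```python
from pathlib import Path
from typing import Callable, Optional


def ocr_with_edits(
    input_pdf: Path,
    output_pdf: Path,
    work_folder: Path,
    edit_hocr: Callable[[str], str],
    pdf_to_hocr: Optional[Callable[..., object]] = None,
    hocr_to_pdf: Optional[Callable[..., object]] = None,
) -> int:
    """OCR to hOCR, apply `edit_hocr` to each page, then build the PDF.

    Returns the number of hOCR files that were actually modified.
    """
    if pdf_to_hocr is None or hocr_to_pdf is None:
        # Imported lazily; these are the private functions used above.
        from ocrmypdf import api

        pdf_to_hocr = pdf_to_hocr or api._pdf_to_hocr
        hocr_to_pdf = hocr_to_pdf or api._hocr_to_ocr_pdf

    # Step 1: run OCR and leave the per-page hOCR files in work_folder.
    pdf_to_hocr(input_pdf=input_pdf, output_folder=work_folder)

    # Step 2: edit the recognized text in place.
    changed = 0
    for page in sorted(work_folder.glob("*.hocr")):
        original = page.read_text(encoding="utf-8")
        edited = edit_hocr(original)
        if edited != original:
            page.write_text(edited, encoding="utf-8")
            changed += 1

    # Step 3: render the final PDF from the (edited) hOCR files.
    hocr_to_pdf(work_folder=work_folder, output_file=output_pdf)
    return changed
```

For example, `ocr_with_edits(Path("png2pdf.pdf"), Path("OCRed.pdf"), Path("./output"), lambda s: s.replace("teh", "the"))` would correct one misread word on every page before the PDF is written.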

Answer selected by endolith

jbarlow83 Feb 21, 2024
Maintainer

Yes, there are differences between hocr and straight PDF rendering so this workflow is not identical.

diegodlh · 2024-06-19T19:10:43Z

diegodlh
Jun 19, 2024

Just a side comment you may find interesting: while looking for hOCR-editing tools, I found Scribe OCR today, which seems quite promising!

0 replies
