improve segmentation #104

bertsky · 2020-01-10T00:49:58Z

This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:

- interpret overwrite_regions more consistently
- annotate @orientation (independent of the dedicated deskewing processor) for vertical text blocks and @type for all other text blocks
- keep separators and noise regions out of the reading order
- segment tables into cells and lines so they can be OCRed, too

codecov · 2020-01-10T00:59:15Z

Codecov Report

Merging #104 into master will decrease coverage by 7.67%.
The diff coverage is 27.04%.

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
- Coverage   46.82%   39.14%   -7.68%     
==========================================
  Files           8        9       +1     
  Lines         692      894     +202     
  Branches      129      190      +61     
==========================================
+ Hits          324      350      +26     
- Misses        337      492     +155     
- Partials       31       52      +21

Impacted Files                      Coverage Δ
ocrd_tesserocr/segment_table.py     0% <0%> (ø)
ocrd_tesserocr/segment_region.py    59.5% <36.11%> (-15.73%)
ocrd_tesserocr/recognize.py         51.68% <55.95%> (-1.82%)
ocrd_tesserocr/segment_line.py      77.58% <66.66%> (-2.42%)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ceae71f...36e7ba4.

- segment-table: run page segmentation on TableRegions, filtering only TextRegion results and adding recursive reading order
- segment-line: also search for TextRegion inside TableRegion
- recognize: also search for TextLine/TextRegion inside TableRegion

stweil · 2020-01-10T10:12:27Z

@bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?

bertsky · 2020-01-10T10:29:00Z

> @bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?

If it's not urgent, let's please do it the other way round: I have more fixes in the queue here and don't want to resynchronize everything just for cosmetics. Most of your changes are also in my commits here BTW (but that's a total coincidence). We can rebase and merge your PR more easily after this and #102 and #98 are solved.

bertsky · 2020-01-11T23:30:31Z

5b39cc0 fixes the overly simplistic page_update_higher_textequiv_levels to propagate text inside tables, too.

71550a8 takes that to a whole new level by finally attempting to adhere to readingDirection, textLineOrder, recursive ReadingOrder, and Relations of @type='join', e.g. drop-capital vs paragraph or cross-columns paragraph or cross-lines word.

This should also be covered in the PAGE validator ASAP! (And it would not hurt to bring these functions into ocrd_utils.)

…relations
- apart from joining TextEquiv.Unicode, average all TextEquiv.conf
- when concatenating glyphs, respect readingDirection
- when concatenating words, respect readingDirection
- when concatenating lines, respect textLineOrder
- when concatenating subregions, respect ReadingOrder
- also look for TextRegion in TextRegion (but depth-first)
- avoid joining by whitespace if predecessor and successor appear in a Relation of type=join
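
The concatenation rules listed in this commit message can be sketched roughly as follows. This is only an illustration, not the code in recognize.py: the `Segment` class, the `join_segments` function and the representation of join relations as ID pairs are hypothetical, and reversing the sequence for right-to-left or bottom-to-top order is an assumption about how the direction attributes could be honoured.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE element (Glyph, Word, TextLine, region)."""
    id: str
    text: str
    conf: float


def join_segments(segments: List[Segment],
                  separator: str = " ",
                  direction: str = "left-to-right",
                  joins: FrozenSet[Tuple[str, str]] = frozenset()) -> Tuple[str, float]:
    """Concatenate child texts for the parent level and average their confidences.

    `joins` holds ID pairs that appear together in a Relation of type=join;
    no separator is inserted between such neighbours.
    """
    if not segments:
        return "", 0.0
    # Assumption: segments arrive in document order and are reversed
    # for the two "backwards" direction values.
    if direction in ("right-to-left", "bottom-to-top"):
        segments = list(reversed(segments))
    text = segments[0].text
    for prev, curr in zip(segments, segments[1:]):
        joined = (prev.id, curr.id) in joins or (curr.id, prev.id) in joins
        if not joined:
            text += separator
        text += curr.text
    conf = sum(s.conf for s in segments) / len(segments)
    return text, conf


words = [Segment("w1", "Haus", 0.9), Segment("w2", "tür", 0.7), Segment("w3", "offen", 0.8)]
line_text, line_conf = join_segments(words, joins=frozenset({("w1", "w2")}))
```

Here `w1` and `w2` are glued together without whitespace because they share a join relation, while `w3` is appended after a space; the line confidence is the plain mean of the three word confidences.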

kba

Fantastic, support for tables is something many users want.

kba · 2020-01-13T11:02:43Z

> This should also be covered in the PAGE validator ASAP!

@tboenig What GT do we have which uses these features and is robust enough for testing?

wrznr

This is a major breakthrough! Having RO handled correctly is a long-standing desideratum. Many thanks! However, some (minor) change requests remain. In addition, I know it is sometimes inconvenient but would smaller PRs be an option for you? In this case, IMHO RO handling and table segmentation could have been split up into separate feature-related PRs.

bertsky · 2020-01-13T12:45:00Z

> In addition, I know it is sometimes inconvenient but would smaller PRs be an option for you? In this case, IMHO RO handling and table segmentation could have been split up into separate feature-related PRs.

Sure, I can split the last commit as a separate PR, but that would of course have to be based on this PR anyway. (I still have not received an answer from anyone whether this is better.) Just for my understanding: you don't want to see and comment on the diffs accumulated over all commits in files changed? (If that is so, why not view/comment on the commits individually?)

wrznr · 2020-01-13T12:52:40Z

Because these commits are not independent. I review the first commit, encounter a problem and go nuts for nothing because you have fixed that problem in a later commit. Feature-based PRs make it much easier to understand the internals of what you are proposing. As it stands, I can only do a formal review wrt. syntax and language issues.

> I can split the last commit as a separate PR

Pls. do not, but maybe keep your PRs smaller in the future.

> (I still have not received an answer from anyone whether this is better.)

It is. Okay?

kba · 2020-01-13T14:13:57Z

> but that would of course have to be based on this PR anyway
> ...
> Because these commits are not independent.

We all appreciate concise PRs, but in my experience splitting up a PR into smaller dependent chunks is often very frustrating: you have to maintain multiple branches, and while the "dependency" PRs are not merged, the diff of the PR is against the master branch, so it's even more confusing and redundant.

kba · 2020-01-13T17:20:05Z

> > @bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?
>
> If it's not urgent, let's please do it the other way round

I would not call it urgent but I also don't see why this should block #103 . Just a few conflicts in ocrd-tool.json, feel free to pull https://github.com/kba/ocrd_tesserocr/pull/new/raw-line if you don't want to solve them yourself.

And I try to avoid rebasing as much as possible. Merge commits might "pollute" the log but there's always git log --no-merges and a consistent state across forks.

bertsky · 2020-01-13T20:09:58Z

@wrznr @kba thanks for clarifying! Let's please discuss this further just once:

> We all appreciate concise PRs, but in my experience splitting up a PR into smaller dependent chunks is often very frustrating: you have to maintain multiple branches, and while the "dependency" PRs are not merged, the diff of the PR is against the master branch, so it's even more confusing and redundant.

Okay, I'm glad we all agree on that.

> Because these commits are not independent. I review the first commit, encounter a problem and go nuts for nothing because you have fixed that problem in a later commit.

I always try to make sequences of compact commits, using rebasing, which can be understood. The tendency is rather towards too large commits for me. So I am surprised to read it's the opposite in your view. Which change(s) did you find difficult to follow here?

Also, if it is this way around, why not use the accumulated diff perspective instead?

> Feature-based PRs make it much easier to understand the internals of what you are proposing.

I agree this PR can be said to combine improvements across several features. (That's why I have used several commits.) But they all relate to segmentation, and I dug them up together by testing and Tesseract debugging. Are you serious that I should separate them into individual branches, switching between branches while testing workspaces? How would you then test these yourself?

> but maybe keep your PRs smaller in the future.

It boils down to this: non-trivial changes necessitate larger PRs IMO. But I will try!

bertsky · 2020-01-13T20:15:27Z

> I would not call it urgent but I also don't see why this should block #103

As I said: I have got more (stashed/uncommitted) changes in the queue, and this would mean more synchronization work for me. I don't see why I should pay that (unnecessary) price.

stweil · 2020-01-14T07:00:46Z

I tested this PR with one of our old books (https://digi.bib.uni-mannheim.de/urn/urn:nbn:de:bsz:180-digad-22977) using this process:

time -p ocrd process -m mets.xml \
  "tesserocr-segment-region -I MAX -O OCR-D-SEG-BLOCK -p '{\"crop_polygons\": true}'" \
  "tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{\"model\": \"Fraktur_5000000_0.502_198857\"}'"

The results with the previous code and without crop_polygons were better in most cases where there was a difference. This might be caused by the use of raw lines.

The region detection only changed slightly. There is still a problem with an overlap of the image region for the initial A and the text region, resulting in wrong text at the beginning of the lines.

bertsky · 2020-01-14T11:50:32Z

@stweil thanks for testing!

However, I think there is a misunderstanding here: this PR is not about crop_polygons – not at all. As I have already stated, my fixes for BlockPolygon are not enough to make this usable yet. (And when they do become ready, we agreed to rename the option and make it default.)

So, please repeat your test with crop_polygons=false (the default)!

Also, if you do still find regressions, could you please use ocrd-dinglehopper to compare before/after and post the relevant pages here?

(To embed the HTML, you can use the https://jsbin.com service – just paste the HTML on the left, then click Share, Output-only, and Link.)

wrznr · 2020-01-14T13:19:51Z

@bertsky First and foremost, I am grateful for your work and it is up to you how you structure your commits and PRs. When I ask for smaller PRs then it is a personal preference (and best practice in many, many companies and projects) which helps me to provide substantial reviews. The sheer amount of changes and code additions in this PR is not trackable for me, so my review is, again, purely formal. (@kba I also doubt that merging is any faster and less frustrating than for multiple smaller PRs.)

bertsky · 2020-01-15T09:02:44Z

@kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?

bertsky · 2020-01-15T12:15:47Z

Sorry @wrznr, bad timing. Just added 68a4948 to complete the PR (removing a FIXME).

What do you think: Is it okay to import page_get_reading_order from another processor as long as we don't have these in ocrd_utils?

wrznr · 2020-01-15T12:18:29Z

Since the processor is from the same package, I think it is okay to do so. (Maybe with the corresponding issue in core and a Fixme comment?)

kba · 2020-01-15T12:24:24Z

> Since the processor is from the same package

Even if it were not, as long as that's properly documented and installable. Better to avoid code duplication and have a slightly more involved setup.

PR to ocrd_utils of course always welcome.

kba · 2020-01-15T17:40:28Z

> @kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?

Sure, let me know when it's ready.

bertsky · 2020-01-16T11:00:35Z

> PR to ocrd_utils of course always welcome.

I will make an attempt. Have to think of the general API though (RO iterator, modification, region type filters) and harmonize with ideas from ocrd_segment.

bertsky · 2020-01-16T11:02:44Z

> > @kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?
>
> Sure, let me know when it's ready.

It's ready alright. Of course – again – some extra effort should be made to add test coverage, but I don't have the time now.

@stweil have you run your test to satisfaction?

stweil · 2020-01-16T12:22:35Z

I have re-run the test without crop_polygons and still see a regression with the new code. For the page shown above, the new code does not recognize the first line ("Vorred ...") and misses an "a" in line 5 ("das"). The rest of that page is identical now. Other pages show similar small differences, and in most cases the old code gave better results.

bertsky · 2020-01-16T14:07:07Z

@stweil ok, thanks – could you please upload the original file of that page here, so I can have a look myself?

wrznr

Wow. Many thanks!

stweil · 2020-01-17T09:12:54Z

> could you please upload the original file of that page here

@bertsky, is the METS file with all file links sufficient, or do you need other files, too? I tested physical page 7.

bertsky · 2020-01-17T10:07:39Z

@stweil this was sufficient, thanks.

> and still see a regression with the new code. For the page shown above, the new code does not recognize the first line ("Vorred ...") and misses an "a" in line 5 ("das"). The rest of that page is identical now. Other pages show similar small differences, and in most cases the old code gave better results.

I see similar problems with other Fraktur models.

Yet, this is to be expected: your workflow is too simplistic for raw_lines=true:

- ocrd-tesserocr-segment-line gives only bboxes, not polygons (as would ocrd-cis-ocropy-segment, or ocrd-tesserocr-segment-line+ocrd-cis-ocropy-resegment)
- ocrd-tesserocr-segment-line does not suppress foreground components of neighbouring/intruding regions (as would ocrd-cis-ocropy-clip)

Now, I already documented this in the JSON description of raw_lines:

> Do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) when using line images (i.e. when textequiv_level<region). Disable when line segments/images likely contain components of more than 1 line.

So to me this is a wontfix issue. Unless you would like to have more documentation, perhaps in README.md or the description of ocrd-tesserocr-segment-line?
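
For readers unfamiliar with the option, the choice that raw_lines controls can be sketched as below. This is a simplified illustration, not the processor's code: `line_psm` is a hypothetical helper, the level names are assumed, and the only facts taken from this thread are that raw_lines applies to line images (textequiv_level below region) and that it switches between Tesseract's PSM_RAW_LINE and PSM_SINGLE_LINE page segmentation modes.

```python
from typing import Optional

# Assumed names of the levels below "region"; line images are used for these.
LEVELS_BELOW_REGION = ("line", "word", "glyph")


def line_psm(textequiv_level: str, raw_lines: bool) -> Optional[str]:
    """Return the name of the page segmentation mode for line images.

    Returns None for levels that do not operate on line images
    (not covered by this sketch).
    """
    if textequiv_level not in LEVELS_BELOW_REGION:
        return None
    # raw_lines=true: treat the image as one text line without
    # additional baseline/x-height/ascender/descender prediction.
    return "PSM_RAW_LINE" if raw_lines else "PSM_SINGLE_LINE"
```

With raw_lines disabled, Tesseract still does its own line-internal analysis, which is more forgiving when the line image contains intruding components from neighbouring lines.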

stweil · 2020-01-19T13:45:53Z

ocrd_tesserocr/segment_region.py

@@ -71,7 +74,9 @@ def process(self):
                tessapi.SetVariable("textord_tabfind_find_tables", "1") # (default)
                # this should yield additional blocks within the table blocks
                # from the page iterator, but does not in fact (yet?):
-                tessapi.SetVariable("textord_tablefind_recognize_tables", "1")
+                # (and it can run into assertion errors when the table structure
+                #  does not meet certain homogenity expectations)


Suggested change

-                #  does not meet certain homogenity expectations)
+                #  does not meet certain homogeneity expectations)

bertsky · 2020-01-21T23:27:18Z

> Yet, this is to be expected: your workflow is too simplistic for raw_lines=true:
> So to me this is a wontfix issue. Unless you would like to have more documentation, perhaps in README.md or the description of ocrd-tesserocr-segment-line?

@stweil Another problem for raw lines just came up, which is almost independent of the workflow configuration. This has swung the decision in favour of the old behaviour as default.

I can still remove the new mode entirely, if this is what everyone wants.

bertsky · 2020-01-23T14:06:59Z

@kba I think you can merge and release now.

kba · 2020-01-23T14:33:15Z

Thanks, released as v0.7.0

bertsky added 6 commits January 9, 2020 02:50

segment-region: overwrite_regions applies to all types alike

08083c9

segment-region: PT.VERTICAL_TEXT gets 90° orientation, too

98c2dc0

segment-region: fix reading order...

b2730ff

- do not start new entries at index 0 when not deleting existing regions (but after last index)
- do not add separator and noise regions to the reading order
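
The two rules in this commit message can be illustrated with a minimal sketch. It does not reproduce segment_region.py: `extend_reading_order` and the tuple-based representation of the reading order are hypothetical, and only the behaviour (continue after the last index unless existing regions are deleted, skip separator and noise regions) is taken from the message.

```python
from typing import List, Tuple

# Region types that are kept out of the reading order.
SKIPPED_TYPES = {"SeparatorRegion", "NoiseRegion"}


def extend_reading_order(existing_refs: List[Tuple[int, str]],
                         new_regions: List[Tuple[str, str]],
                         overwrite_regions: bool = False) -> List[Tuple[int, str]]:
    """Append new regions to a flat reading order.

    existing_refs: (index, region_id) pairs already in the reading order
    new_regions:   (region_id, region_type) pairs in detection order
    """
    # When existing regions are deleted, start from scratch at index 0;
    # otherwise continue after the last index in use.
    refs = [] if overwrite_regions else list(existing_refs)
    next_index = max((index for index, _ in refs), default=-1) + 1
    for region_id, region_type in new_regions:
        if region_type in SKIPPED_TYPES:
            continue
        refs.append((next_index, region_id))
        next_index += 1
    return refs


existing = [(0, "r1"), (1, "r2")]
detected = [("r3", "TextRegion"), ("s1", "SeparatorRegion"),
            ("r4", "TableRegion"), ("n1", "NoiseRegion")]
kept = extend_reading_order(existing, detected)
```

In this example `r3` and `r4` receive indices 2 and 3, while the separator and noise regions get no reading order entry at all.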

segment-region: set TextRegion/@type (paragraph/caption/heading/floating)

a17e46c

recognize: add option raw_lines (and default true)...

ca628ff

- with raw_lines=true, avoid line segmentation for textline images
- default to textequiv_level=word (i.e. add more segmentation)

segment-region: revert to textord_tablefind_recognize_tables=0

a20da26

bertsky requested review from kba and wrznr January 10, 2020 00:49

bertsky force-pushed the raw-line branch from f2fe470 to d7e03f2 Compare January 10, 2020 00:53

bertsky force-pushed the raw-line branch from d7e03f2 to 79e4bba Compare January 10, 2020 09:08

stweil reviewed Jan 10, 2020

recognize: follow TableRegion/TextRegion when updating higher levels

5b39cc0

bertsky force-pushed the raw-line branch from c828085 to 71550a8 Compare January 12, 2020 20:23

kba approved these changes Jan 13, 2020

wrznr suggested changes Jan 13, 2020

tool json: swap segment-line/region description

8064c0a

wrznr self-requested a review January 15, 2020 09:07

improve comments/docstrings

66da5f6

wrznr approved these changes Jan 15, 2020

segment-table: re-use recursive region order implementation from recognize

68a4948

segment-line: restrict line polygon to region outline

a7811cc

wrznr approved these changes Jan 17, 2020

stweil reviewed Jan 19, 2020

bertsky mentioned this pull request Jan 21, 2020

recognize: use PSM_RAW_LINE instead of PSM_SINGLE_LINE #101

Closed

recognize: default to raw_lines=false, improve documentation

36e7ba4

kba merged commit 502382c into OCR-D:master Jan 23, 2020

bertsky mentioned this pull request Jan 24, 2020

all: add dpi parameter as manual override to image metadata #108

Merged

bertsky deleted the raw-line branch February 21, 2020 16:51

bertsky mentioned this pull request May 18, 2020

ocrd_utils.coordinates_for_segment: clip to parent? OCR-D/core#489

Open
