improve segmentation #104

Merged
merged 14 commits into from
Jan 23, 2020

bertsky
Collaborator

@bertsky bertsky commented Jan 10, 2020

This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:

  • interpret overwrite_regions more consistently
  • annotate @orientation (independent of dedicated deskewing processor) for vertical and @type for all other text blocks
  • no separators and noise regions in reading order
  • segment tables into cells and lines so they can be OCRed, too

- do not start new entries at index 0
  when not deleting existing regions
  (but after last index)
- do not add separator and noise regions
  to the reading order
- with raw_lines=true, avoid line segmentation for textline images
- default to textequiv_level=word (i.e. add more segmentation)
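The first two points of this commit message can be illustrated with a minimal sketch. It assumes plain Python dicts and tuples in place of the PAGE objects; `extend_reading_order`, `next_index` and `EXCLUDED_FROM_READING_ORDER` are hypothetical names chosen for illustration, not code from this PR.

```python
# Region types that should not appear in the reading order.
EXCLUDED_FROM_READING_ORDER = {"SeparatorRegion", "NoiseRegion"}


def next_index(existing_entries, overwrite):
    """Return the index at which new reading-order entries should start."""
    if overwrite or not existing_entries:
        return 0
    # continue after the last existing index instead of restarting at 0
    return max(entry["index"] for entry in existing_entries) + 1


def extend_reading_order(existing_entries, new_regions, overwrite=False):
    """Append new regions to the reading order, skipping separators and noise.

    existing_entries: list of {"index": int, "regionRef": str} dicts
    new_regions:      list of (region_id, region_type) tuples
    """
    entries = [] if overwrite else list(existing_entries)
    index = next_index(existing_entries, overwrite)
    for region_id, region_type in new_regions:
        if region_type in EXCLUDED_FROM_READING_ORDER:
            continue
        entries.append({"index": index, "regionRef": region_id})
        index += 1
    return entries
```

With two existing entries and `overwrite=False`, the first new text or table region would receive index 2, while separator and noise regions are left out of the order entirely.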
@codecov

codecov bot commented Jan 10, 2020

Codecov Report

Merging #104 into master will decrease coverage by 7.67%.
The diff coverage is 27.04%.

@@            Coverage Diff             @@
##           master     #104      +/-   ##
==========================================
- Coverage   46.82%   39.14%   -7.68%     
==========================================
  Files           8        9       +1     
  Lines         692      894     +202     
  Branches      129      190      +61     
==========================================
+ Hits          324      350      +26     
- Misses        337      492     +155     
- Partials       31       52      +21
Impacted Files                      Coverage Δ
ocrd_tesserocr/segment_table.py     0% <0%> (ø)
ocrd_tesserocr/segment_region.py    59.5% <36.11%> (-15.73%) ⬇️
ocrd_tesserocr/recognize.py         51.68% <55.95%> (-1.82%) ⬇️
ocrd_tesserocr/segment_line.py      77.58% <66.66%> (-2.42%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ceae71f...36e7ba4. Read the comment docs.

- segment-table: run page segmentation on TableRegions, filtering only
  TextRegion results and adding recursive reading order
- segment-line: also search for TextRegion inside TableRegion
- recognize: also search for TextLine/TextRegion inside TableRegion
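The lookup described in this commit message (text regions at the top level plus those nested inside tables) could be sketched as follows. The `Region` dataclass and `iter_text_regions` are hypothetical stand-ins for the PAGE element classes, used here only to show the recursion; the PR's actual traversal may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Region:
    """Hypothetical stand-in for a PAGE region element."""
    id: str
    kind: str  # e.g. "TextRegion", "TableRegion", "ImageRegion"
    children: List["Region"] = field(default_factory=list)


def iter_text_regions(regions: List[Region]) -> Iterator[Region]:
    """Yield text regions at this level and, recursively, inside tables."""
    for region in regions:
        if region.kind == "TextRegion":
            yield region
        elif region.kind == "TableRegion":
            # table cells are assumed to be text regions nested in the table
            yield from iter_text_regions(region.children)
```

Any processor that previously iterated only over top-level text regions could, under these assumptions, iterate over `iter_text_regions(page_regions)` instead and reach table cells as well.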
@stweil
Contributor

stweil commented Jan 10, 2020

@bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?

@bertsky
Collaborator Author

bertsky commented Jan 10, 2020

@bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?

If it's not urgent, let's please do it the other way round: I have more fixes in the queue here and don't want to resynchronize everything just for cosmetics. Most of your changes are also in my commits here BTW (but that's a total coincidence). We can rebase and merge your PR more easily after this and #102 and #98 are solved.

@bertsky
Collaborator Author

bertsky commented Jan 11, 2020

5b39cc0 fixes the overly simplistic page_update_higher_textequiv_levels to propagate text inside tables, too.

71550a8 takes that to a whole new level by finally attempting to adhere to readingDirection, textLineOrder, recursive ReadingOrder, and Relations of @type='join', e.g. drop-capital vs paragraph or cross-columns paragraph or cross-lines word.

This should also be covered in the PAGE validator ASAP! (And it would not hurt to bring these functions into ocrd_utils.)

…relations

- apart from joining TextEquiv.Unicode, average all TextEquiv.conf
- when concatenating glyphs, respect readingDirection
- when concatenating words, respect readingDirection
- when concatenating lines, respect textLineOrder
- when concatenating subregions, respect ReadingOrder
- also look for TextRegion in TextRegion (but depth-first)
- avoid joining by whitespace if predecessor and successor
  appear in a Relation of type=join
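The concatenation rules listed in this commit message can be sketched roughly as below. This is not the PR's implementation: `join_segments` is a hypothetical helper, segments are plain `(id, text, conf)` tuples, and the sketch assumes that a reversed reading direction or line order is handled by reversing the stored order before joining.

```python
def join_segments(segments, reverse=False, separator=" ", join_relations=()):
    """Concatenate child segments into one text and one averaged confidence.

    segments:       list of (segment_id, text, conf) tuples in stored order
    reverse:        True if the declared direction/order runs against the
                    stored order (an assumption of this sketch)
    separator:      string placed between neighbours (empty for glyphs,
                    a space for words, a newline for lines or regions)
    join_relations: pairs of segment ids connected by a Relation of type
                    "join"; no separator is inserted between such neighbours
    """
    if not segments:
        return "", 0.0
    ordered = list(reversed(segments)) if reverse else list(segments)
    joined = {frozenset(pair) for pair in join_relations}
    text = ordered[0][1]
    for (prev_id, _, _), (next_id, next_text, _) in zip(ordered, ordered[1:]):
        if frozenset((prev_id, next_id)) not in joined:
            text += separator
        text += next_text
    # the confidence of the parent is the mean of its children's confidences
    conf = sum(c for _, _, c in ordered) / len(ordered)
    return text, conf
```

The same helper would be applied bottom-up: glyphs into words, words into lines, lines into regions, each level passing the separator and ordering appropriate to it.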
Member

@kba kba left a comment

Fantastic, support for tables is something many users want.

ocrd_tesserocr/recognize.py (resolved)
ocrd_tesserocr/recognize.py (resolved)
ocrd_tesserocr/recognize.py (resolved)
ocrd_tesserocr/segment_table.py (outdated, resolved)
@kba
Member

kba commented Jan 13, 2020

This should also be covered in the PAGE validator ASAP!

@tboenig What GT do we have which uses these features and is robust enough for testing?

Contributor

@wrznr wrznr left a comment

This is a major breakthrough! Having RO handled correctly is a long-standing desideratum. Many thanks! However, some (minor) change requests remain. In addition, I know it is sometimes inconvenient but would smaller PRs be an option for you? In this case, IMHO RO handling and table segmentation could have been split up into separate feature-related PRs.

ocrd_tesserocr/ocrd-tool.json (outdated, resolved)
ocrd_tesserocr/ocrd-tool.json (resolved)
ocrd_tesserocr/ocrd-tool.json (outdated, resolved)
ocrd_tesserocr/ocrd-tool.json (outdated, resolved)
ocrd_tesserocr/recognize.py (outdated, resolved)
ocrd_tesserocr/recognize.py (outdated, resolved)
@bertsky
Collaborator Author

bertsky commented Jan 13, 2020

In addition, I know it is sometimes inconvenient but would smaller PRs be an option for you? In this case, IMHO RO handling and table segmentation could have been split up into separate feature-related PRs.

Sure, I can split the last commit as a separate PR, but that would of course have to be based on this PR anyway. (I still have not received an answer from anyone whether this is better.) Just for my understanding: you don't want to see and comment the diffs accumulated over all commits in files changed? (If that is so, why not view/comment the commits individually?)

@wrznr
Contributor

wrznr commented Jan 13, 2020

Because these commits are not independent. I review the first commit, encounter a problem and go nuts for nothing because you have fixed that problem in a later commit. Feature-based PRs make it much easier to understand the internals of what you are proposing. As it stands, I can only do a formal review wrt. syntax and language issues.

I can split the last commit as a separate PR

Pls. do not but maybe keep your PRs smaller in the future.

(I still have not received an answer from anyone whether this is better.)

It is. Okay?

@kba
Member

kba commented Jan 13, 2020

but that would of course have to be based on this PR anyway
...
Because these commits are not independent.

We all appreciate concise PRs, but in my experience splitting up a PR into smaller dependent chunks is often very frustrating, because you have to maintain multiple branches, and while the "dependency" PRs are not merged, the diff of the PR is against the master branch, so it's even more confusing and redundant.

@kba
Member

kba commented Jan 13, 2020

@bertsky, could you please merge my PR #103 first and then rebase your PR to fix merge conflicts?

If it's not urgent, let's please do it the other way round

I would not call it urgent but I also don't see why this should block #103 . Just a few conflicts in ocrd-tool.json, feel free to pull https://github.com/kba/ocrd_tesserocr/pull/new/raw-line if you don't want to solve them yourself.

And I try to avoid rebasing as much as possible. Merge commits might "pollute" the log but there's always git log --no-merges and a consistent state across forks.

@bertsky
Collaborator Author

bertsky commented Jan 13, 2020

@wrznr @kba thanks for clarifying! Let's please discuss this further just once:

We all appreciate concise PRs, but in my experience splitting up a PR into smaller dependent chunks is often very frustrating, because you have to maintain multiple branches, and while the "dependency" PRs are not merged, the diff of the PR is against the master branch, so it's even more confusing and redundant.

Okay, I'm glad we all agree on that.

Because these commits are not independent. I review the first commit, encounter a problem and go nuts for nothing because you have fixed that problem in a later commit.

I always try to make sequences of compact commits, using rebasing, which can be understood. The tendency is rather towards too large commits for me. So I am surprised to read it's the opposite in your view. Which change(s) did you find difficult to follow here?

Also, if it is this way around, why not use the accumulated diff perspective instead?

Feature-based PRs make it much easier to understand the internals of what you are proposing.

I agree this PR can be said to combine improvements across several features. (That's why I have used several commits.) But they all relate to segmentation, and I dug them up together by testing and Tesseract debugging. Are you serious I should separate them into individual branches, switching between branches while testing workspaces? How would you then test these yourself?

but maybe keep your PRs smaller in the future.

It boils down to this: non-trivial changes necessitate larger PRs IMO. But I will try!

@bertsky
Collaborator Author

bertsky commented Jan 13, 2020

I would not call it urgent but I also don't see why this should block #103

As I said: I have got more (stashed/uncommitted) changes in the queue, and this would mean more synchronization work for me. I don't see why I should pay that (unnecessary) price.

@stweil
Contributor

stweil commented Jan 14, 2020

[Screenshot 2020-01-14 at 08 18 16]

I tested this PR with one of our old books (https://digi.bib.uni-mannheim.de/urn/urn:nbn:de:bsz:180-digad-22977) using this process:
time -p ocrd process -m mets.xml \
  "tesserocr-segment-region -I MAX -O OCR-D-SEG-BLOCK -p '{\"crop_polygons\": true}'" \
  "tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE" \
  "tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS -p '{\"model\": \"Fraktur_5000000_0.502_198857\"}'"

The results with the previous code and without crop_polygons were better in most cases where there was a difference. This might be caused by the use of raw lines.

The region detection only changed slightly. There is still a problem with an overlap of the image region for the initial A and the text region, resulting in wrong text at the beginning of the lines.

@bertsky
Collaborator Author

bertsky commented Jan 14, 2020

@stweil thanks for testing!

However, I think there is a misunderstanding here: this PR is not about crop_polygons – not at all. As I have already stated, my fixes for BlockPolygon are not enough to make this usable yet. (And when they do become ready, we agreed to rename the option and make it default.)

So, please repeat your test with crop_polygons=false (the default)!

Also, if you do still find regressions, could you please use ocrd-dinglehopper to compare before/after and post the relevant pages here?

(To embed the HTML, you can use the https://jsbin.com service – just paste the HTML on the left, then click Share, Output-only, and Link.)

@wrznr
Contributor

wrznr commented Jan 14, 2020

@bertsky First and foremost, I am grateful for your work and it is up to you how you structure your commits and PRs. When I ask for smaller PRs then it is a personal preference (and best practice in many, many companies and projects) which helps me to provide substantial reviews. The sheer amount of changes and code additions in this PR is not trackable for me, so my review is, again, purely formal. (@kba I also doubt that merging is any faster and less frustrating than for multiple smaller PRs.)

@bertsky
Collaborator Author

bertsky commented Jan 15, 2020

@kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?

@wrznr wrznr self-requested a review January 15, 2020 09:07
@bertsky
Collaborator Author

bertsky commented Jan 15, 2020

Sorry @wrznr bad timing. Just added 68a4948 to complete the PR (removing a FIXME).

What do you think: Is it okay to import page_get_reading_order from another processor as long as we don't have these in ocrd_utils?

@wrznr
Contributor

wrznr commented Jan 15, 2020

Since the processor is from the same package, I think it is okay to do so. (Maybe with the corresponding issue in core and a Fixme comment?)

@kba
Member

kba commented Jan 15, 2020

Since the processor is from the same package

Even if it were not, as long as that's properly documented and installable. Better to avoid code duplication and have a slightly more involved setup.

PR to ocrd_utils of course always welcome.

@kba
Member

kba commented Jan 15, 2020

@kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?

Sure, let me know when it's ready.

@bertsky
Collaborator Author

bertsky commented Jan 16, 2020

PR to ocrd_utils of course always welcome.

I will make an attempt. Have to think of the general API though (RO iterator, modification, region type filters) and harmonize with ideas from ocrd_segment.

@bertsky
Collaborator Author

bertsky commented Jan 16, 2020

@kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7?

Sure, let me know when it's ready.

It's ready alright. Of course – again – some extra effort should be made to add test coverage, but I don't have the time now.

@stweil have you run your test to satisfaction?

@stweil
Contributor

stweil commented Jan 16, 2020

I have re-run the test without crop_polygons and still see a regression with the new code. For the page shown above, the new code does not recognize the first line ("Vorred ...") and misses an "a" in line 5 ("das"). The rest of that page is identical now. Other pages show similar small differences, and in most cases the old code gave better results.

@bertsky
Collaborator Author

bertsky commented Jan 16, 2020

@stweil ok, thanks – could you please upload the original file of that page here, so I can have a look myself?

Contributor

@wrznr wrznr left a comment

Wow. Many thanks!

@stweil
Contributor

stweil commented Jan 17, 2020

could you please upload the original file of that page here

@bertsky, is the METS file with all file links sufficient, or do you need other files, too? I tested physical page 7.

@bertsky
Collaborator Author

bertsky commented Jan 17, 2020

@stweil this was sufficient, thanks.

and still see a regression with the new code. For the page shown above, the new code does not recognize the first line ("Vorred ...") and misses an "a" in line 5 ("das"). The rest of that page is identical now. Other pages show similar small differences, and in most cases the old code gave better results.

I see similar problems with other Fraktur models.

Yet, this is to be expected: your workflow is too simplistic for raw_lines=true:

  • ocrd-tesserocr-segment-line gives only bboxes, not polygons (as would ocrd-cis-ocropy-segment, or ocrd-tesserocr-segment-line+ocrd-cis-ocropy-resegment)
  • ocrd-tesserocr-segment-line does not suppress foreground components of neighbouring/intruding regions (as would ocrd-cis-ocropy-clip)

Now, I already documented this in the JSON description of raw_lines:

Do not attempt additional segmentation (baseline+xheight+ascenders/descenders prediction) when using line images (i.e. when textequiv_level<region). Disable when line segments/images likely contain components of more than 1 line.

So to me this is a wontfix issue. Unless you would like to have more documentation, perhaps in README.md or the description of ocrd-tesserocr-segment-line?

@@ -71,7 +74,9 @@ def process(self):
tessapi.SetVariable("textord_tabfind_find_tables", "1") # (default)
# this should yield additional blocks within the table blocks
# from the page iterator, but does not in fact (yet?):
tessapi.SetVariable("textord_tablefind_recognize_tables", "1")
# (and it can run into assertion errors when the table structure
# does not meet certain homogenity expectations)
Contributor

Suggested change
- # does not meet certain homogenity expectations)
+ # does not meet certain homogeneity expectations)

@bertsky
Collaborator Author

bertsky commented Jan 21, 2020

Yet, this is to be expected: your workflow is too simplistic for raw_lines=true:
So to me this is a wontfix issue. Unless you would like to have more documentation, perhaps in README.md or the description of ocrd-tesserocr-segment-line?

@stweil Another problem for raw lines just came up, which is next to independent of workflow configuration. This has swung the decision in favour of the old behaviour as default.

I can still remove the new mode entirely, if this is what everyone wants.

@bertsky
Collaborator Author

bertsky commented Jan 23, 2020

@kba I think you can merge and release now.

@kba kba merged commit 502382c into OCR-D:master Jan 23, 2020
@kba
Member

kba commented Jan 23, 2020

Thanks, released as v0.7.0

Development

Successfully merging this pull request may close these issues.

recognize: use PSM_RAW_LINE instead of PSM_SINGLE_LINE
4 participants