-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
improve segmentation #104
improve segmentation #104
Conversation
- do not start new entries at index 0 when not deleting existing regions (but after last index) - do not add separator and noise regions to the reading order
- with raw_lines=true, avoid line segmentation for textline images - default to textequiv_level=word (i.e. add more segmentation)
Codecov Report
@@ Coverage Diff @@
## master #104 +/- ##
==========================================
- Coverage 46.82% 39.14% -7.68%
==========================================
Files 8 9 +1
Lines 692 894 +202
Branches 129 190 +61
==========================================
+ Hits 324 350 +26
- Misses 337 492 +155
- Partials 31 52 +21
Continue to review full report at Codecov.
|
- segment-table: run page segmentation on TableRegions, filtering only TextRegion results and adding recursive reading order - segment-line: also search for TextRegion inside TableRegion - recognize: also search for TextLine/TextRegion inside TableRegion
If it's not urgent, let's please do it the other way round: I have more fixes in the queue here and don't want to resynchronize everything just for cosmetics. Most of your changes are also in my commits here BTW (but that's a total coincidence). We can rebase and merge your PR more easily after this and #102 and #98 are solved. |
5b39cc0 fixes the overly simplistic 71550a8 takes that to a whole new level by finally attempting to adhere to This should also be covered in the PAGE validator ASAP! (And it would not hurt to bring these functions into ocrd_utils.) |
…relations - apart from joining TextEquiv.Unicode, average all TextEquiv.conf - when concatenating glyphs, respect readingDirection - when concatenating words, respect readingDirection - when concatenating lines, respect textLineOrder - when concatenating subregions, respect ReadingOrder - also look for TextRegion in TextRegion (but depth-first) - avoid joining by whitespace if predecessor and successor appear in a Relation of type=join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic, support for tables is something many users want.
@tboenig What GT do we have which uses these features and is robust enough for testing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a major breakthrough! Having RO handled correctly is a long-standing desideratum. Many thanks! However, some (minor) change requests remain. In addition, I know it is sometimes inconvenient but would smaller PRs be an option for you? In this case, IMHO RO handling and table segmentation could have been split up into separate feature-related PRs.
Sure, I can split the last commit as a separate PR, but that would of course have to be based on this PR anyway. (I still have not received an answer from anyone whether this is better.) Just for my understanding: you don't want to see and comment the diffs accumulated over all commits in |
Because these commits are not independent. I review the first commit, encounter a problem and go nuts for nothing because you have fixed that problem in a later commit. Feature-based PRs make it much easier to understand the internals of what you are proposing. As it stands, I can only do a formal review wrt. syntax and language issues.
Pls. do not but maybe keep your PRs smaller in the future.
It is. Okay? |
We all appreciate concise PR but in my experience with splitting up PR into smaller dependent chunks is often very frustrating because you have to maintain multiple branches and while the "dependency" PR are not merged, the diff of the PR is against master branch so it's even more confusing and redundant. |
I would not call it urgent but I also don't see why this should block #103 . Just a few conflicts in ocrd-tool.json, feel free to pull https://github.com/kba/ocrd_tesserocr/pull/new/raw-line if you don't want to solve them yourself. And I try to avoid rebasing as much as possible. Merge commits might "pollute" the log but there's always |
@wrznr @kba thanks for clarifying! Let's please discuss this further just once:
Okay, I'm glad we all agree on that.
I always try to make sequences of compact commits, using rebasing, which can be understood. The tendency is rather towards too large commits for me. So I am surprised to read it's the opposite in your view. Which change(s) did you find difficult to follow here? Also, if it is this way around, why not use the accumulated diff perspective instead?
I agree this PR can be said to combine impovements across several features. (That's why I have used several commits.) But they all relate to segmentation, and I digged them up together by testing and Tesseract debugging. Are you serious I should separate them into individual branches, switching between branches while testing workspaces? How would you then test these yourself?
It boils down to this: non-trivial changes necessitate larger PRs IMO. But I will try! |
As I said: I have got more (stashed/uncommited) changes in the queue, and this would mean more synchronization work for me. I don't see why I should pay that (unnecessary) price. |
I tested this PR with one of our old book (https://digi.bib.uni-mannheim.de/urn/urn:nbn:de:bsz:180-digad-22977) using this process:
The results with the previous code and without The region detection only changed slightly. There is still a problem with an overlap of the image region for the initial A and the text region, resulting in wrong text at the beginning of the lines. |
@stweil thanks for testing! However, I think there is a misunderstanding here: this PR is not about So, please repeat your test with Also, if you do still find regressions, could you please use ocrd-dinglehopper to compare before/after and post the relevant pages here? (To embed the HTML, you can use the https://jsbin.com service – just paste the HTML on the left, then click |
@bertsky First and foremost, I am grateful for your work and it is up to you how you structure your commits and PRs. When I ask for smaller PRs then it is a personal preference (and best practice in many, many companies and projects) which helps me to provide substantial reviews. The sheer amount of changes and code additions in this PR is not trackable for me, so my review is, again, purely formal. (@kba I also doubt that merging is any faster and less frustrating than for multiple smaller PRs.) |
@kba updating version string in tool json and setup: Will you do that after the merge, along with release tag (or should I add it to the PR)? How about 0.7? |
Since the processor is from the same package, I think it is okay to do so. (Maybe with the corresponding issue in core and a Fixme comment?) |
Even if it were not, as long as that's properly documented and installable. Better to avoid code duplication and have a slightly more involved setup. PR to ocrd_utils of course always welcome. |
Sure, let me know when it's ready. |
I will make an attempt. Have to think of the general API though (RO iterator, modification, region type filters) and harmonize with ideas from ocrd_segment. |
It's ready alright. Of course – again – some extra effort should be made to add test coverage, but I don't have the time now. @stweil have you run your test to satisfaction? |
I have re-run the test without |
@stweil ok, thanks – could you please upload the original file of that page here, so I can have a look myself? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. Many thanks!
@bertsky, is the METS file with all file links sufficient, or do you need other files, too? I tested physical page 7. |
@stweil this was sufficient, thanks.
I see similar problems with other Fraktur models. Yet, this is to be expected: your workflow is too simplistic for
Now, I already documented this in the JSON description of
So to me this is a wontfix issue. Unless you would like to have more documentation, perhaps in README.md or the description of ocrd-tesserocr-segment-line? |
@@ -71,7 +74,9 @@ def process(self): | |||
tessapi.SetVariable("textord_tabfind_find_tables", "1") # (default) | |||
# this should yield additional blocks within the table blocks | |||
# from the page iterator, but does not in fact (yet?): | |||
tessapi.SetVariable("textord_tablefind_recognize_tables", "1") | |||
# (and it can run into assertion errors when the table structure | |||
# does not meet certain homogenity expectations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# does not meet certain homogenity expectations) | |
# does not meet certain homogeneity expectations) |
@stweil Another problem for raw lines just came up, which is next to independent of workflow configuration. This has swung the decision in favour of the old behaviour as default. I can still remove the new mode entirely, if this is what everyone wants. |
@kba I think you can merge and release now. |
Thanks, released as v0.7.0 |
This fixes #101 (using raw_lines by default for textline images, but there are still some corner cases that need to be fixed in Tesseract) and brings a number of segmentation-related improvements:
overwrite_regions
more consistently@orientation
(independent of dedicated deskewing processor) for vertical and@type
for all other text blocks