non-printspace regions partially missing #8

Open
bertsky opened this issue May 30, 2024 · 2 comments

@bertsky
Contributor

bertsky commented May 30, 2024

I noticed that on some pages only the segments within the print space are annotated, so there are no text regions for catch-words, page numbers, headers etc. There is only a Border annotation, no PrintSpace element, so this seems somewhat inconsistent. Also, it only affects some pages.

This is a problem if used as structural GT to train segmentation models.

I could run an incremental segmentation to automatically "find" these segments and make a PR or visual comparison if you want.

@M3ssman
Member

M3ssman commented May 30, 2024

@bertsky I encourage any kind of improvement to enhance data usability, but can you point me to an example?
I'm not sure whether I got the issue right and how something within the PrintSpace can be marked without being a TextRegion.

@bertsky
Contributor Author

bertsky commented May 31, 2024

phys1278993

Here, in the footer of the page, the signature mark and page number are not annotated.

phys1290695

In this example, the running title in the header and the catch-word in the footer are not annotated.

In both cases, there is a Border element (more or less precisely) around the physical page (as it should be), but no PrintSpace element. The latter is only required on GT level 3, but in practice, pages that have neither a PrintSpace element nor any segments outside of the print space (headers/footers) are difficult to use as layout training data.
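
To get a rough list of candidate pages, something like the following could be run over the PAGE-XML files. It is only a sketch under assumptions: the set of "marginal" region types is my own selection from the PAGE `TextRegion` `type` values, the helper names (`audit_page`, `is_suspicious`, `find_suspicious`) are made up for illustration, and a page flagged this way may legitimately have no header or footer, so the result still needs a visual check.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumption: region types that normally lie outside the print space.
MARGINAL_TYPES = {"header", "footer", "page-number", "catch-word", "signature-mark"}


def local(tag):
    """Drop the XML namespace so the check works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def audit_page(xml_text):
    """Report Border/PrintSpace presence and the marginal region types on a page."""
    root = ET.fromstring(xml_text)
    names = set()
    types = set()
    for elem in root.iter():
        name = local(elem.tag)
        names.add(name)
        if name == "TextRegion":
            types.add(elem.get("type", ""))
    return {
        "has_border": "Border" in names,
        "has_printspace": "PrintSpace" in names,
        "marginal_types": types & MARGINAL_TYPES,
    }


def is_suspicious(report):
    """Heuristic: Border present, but no PrintSpace and no marginal regions."""
    return (
        report["has_border"]
        and not report["has_printspace"]
        and not report["marginal_types"]
    )


def find_suspicious(directory):
    """Return the names of all *.xml files in `directory` that match the heuristic."""
    hits = []
    for path in sorted(Path(directory).glob("*.xml")):
        report = audit_page(path.read_text(encoding="utf-8"))
        if is_suspicious(report):
            hits.append(path.name)
    return hits
```

Calling `find_suspicious()` with the directory that holds the PAGE files would then return the file names worth looking at. The check deliberately ignores coordinates, so it cannot tell whether an existing region really lies outside the print space.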
