ocrd-segment-extract-lines - Lines are not extracted, in case they are in an area of other lines #61

Open
stefanCCS opened this issue Aug 15, 2022 · 5 comments

Comments

@stefanCCS

Hi,
I think I have found a bug in ocrd-segment-extract-lines:
I cannot prove it 100%, but in my environment I see that lines are not extracted (no images are created) when a line lies geometrically (in terms of its coordinates) within another line of the same region.
I extract only images in this case using this command:

ocrd-segment-extract-lines -I $infolder -O $extractLineImagesFolder  -P  output-types '[]' -P min-line-length 0 -P min-line-width 5 -P min-line-height 5

PAGE excerpt: here the line TR-15_line0002 was not extracted:

    <pc:TextRegion id="TR-15" orientation="0.">
      <pc:AlternativeImage filename="OCR-D-REG-VL-BL/OCR-D-REG-VL-BL_4749_007817786_00183_TR-15.IMG-DESKEW.png" comments=",binarized,deskewed,verticallinesremoved" />
      <pc:Coords points="237,383 237,438 443,438 443,383" />
      <pc:TextLine id="TR-15_line0001">
        <pc:Coords points="237,438 237,383 239,383 253,391 311,391 320,383 349,383 357,390 365,383 384,383 402,391 419,383 427,383 430,418 428,438 302,438 298,435 289,435 284,438" />
        <pc:Baseline points="227,415 430,418" />
      </pc:TextLine>
      <pc:TextLine id="TR-15_line0003">
        <pc:Coords points="261,438 269,433 274,433 295,438" />
        <pc:Baseline points="254,475 295,475" />
      </pc:TextLine>
      <pc:TextLine id="TR-15_line0002">
        <pc:Coords points="385,438 388,435 388,434 409,434 409,438" />
        <pc:Baseline points="343,478 412,475" />
      </pc:TextLine>
    </pc:TextRegion>

Logfile content for this case:

2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.189 WARNING processor.ExtractLines - Line 'TR-14_line0001' contains no text content
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.201 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-14_TR-14_line0001.bin, file_grp: OCR-D-SEG-LINE-CCS-IMG-BL, path: OCR-D-SEG-LINE-CCS-IMG-BL/OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-14_TR-14_line0001.bin.png
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.242 WARNING processor.ExtractLines - Line 'TR-15_line0001' contains no text content
2022-08-11_14-21-13-extractlines.log:2022-08-11 14:21:31.255 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-15_TR-15_line0001.bin, file_grp: OCR-D-SEG-LINE-CCS-IMG-BL, path: OCR-D-SEG-LINE-CCS-IMG-BL/OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-15_TR-15_line0001.bin.png
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.256 WARNING processor.ExtractLines - Line 'TR-15_line0003' contains no text content
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.267 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-15_TR-15_line0003.bin, file_grp: OCR-D-SEG-LINE-CCS-IMG-BL, path: OCR-D-SEG-LINE-CCS-IMG-BL/OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-15_TR-15_line0003.bin.png
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.268 WARNING processor.ExtractLines - Line 'TR-15_line0002' contains no text content
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.311 WARNING processor.ExtractLines - Line 'TR-16_line0001' contains no text content
2022-08-11_14-21-13-extractlines.log-2022-08-11 14:21:31.348 INFO ocrd.workspace.save_image_file - created file ID: OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-16_TR-16_line0001.bin, file_grp: OCR-D-SEG-LINE-CCS-IMG-BL, path: OCR-D-SEG-LINE-CCS-IMG-BL/OCR-D-SEG-LINE-CCS-IMG-BL-4749_007817786_00183_TR-16_TR-16_line0001.bin.png

@bertsky
Collaborator

bertsky commented Aug 15, 2022

According to these coords, TR-15_line0002 is only 4px high, but you specified -P min-line-height 5.
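To make the arithmetic concrete, here is a minimal sketch (not the processor's actual code) that computes the bounding-box size of a PAGE `points` string. The helper name `bbox_size` and its `inclusive` switch are made up for illustration; the two point strings are copied from the excerpt above.

```python
from typing import Tuple


def bbox_size(points: str, inclusive: bool = False) -> Tuple[int, int]:
    """Return (width, height) of the bounding box of a PAGE points string.

    With inclusive=False the size is max - min (coordinates as boundaries).
    With inclusive=True the border pixels are counted too (max - min + 1).
    """
    pairs = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    extra = 1 if inclusive else 0
    return max(xs) - min(xs) + extra, max(ys) - min(ys) + extra


line0002 = "385,438 388,435 388,434 409,434 409,438"
line0003 = "261,438 269,433 274,433 295,438"

print(bbox_size(line0002))                  # (24, 4)
print(bbox_size(line0003))                  # (34, 5)
print(bbox_size(line0002, inclusive=True))  # (25, 5)
```

Under the max - min reading, TR-15_line0002 comes out 4px high while TR-15_line0003 is 5px high, which matches the observation that only the former was dropped.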

@stefanCCS
Author

@bertsky:
Ahh, I understand - makes sense.
Two comments to this:

  • It would be very helpful to log a message in this case (at a log level that is typically visible). Maybe you can use this issue (or create a new one) to follow up on this ...
  • If an element has y-coordinates from 434 to 438, then in my view (so far) its height is 5 px (including the pixels at the borders). Is there a definition of what "height" means?

@bertsky
Collaborator

bertsky commented Aug 16, 2022

  • It would be very helpful to log a message in this case (at a log level that is typically visible). Maybe you can use this issue (or create a new one) to follow up on this ...

Agreed. Currently, there is only

  • an error message if the line has zero size
  • a warning message if the text is empty

So I could add an info message if the text is shorter than min-line-length or the size is smaller than min-line-height / min-line-width.
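A rough sketch of what such an info message could look like, purely as an assumption about the shape of the check: the function `should_skip_line`, its parameters and the strict `<` comparisons are hypothetical and not taken from the ocrd-segment sources. Only the logger name `processor.ExtractLines` appears in the log output quoted above.

```python
import logging

# Logger name as seen in the log excerpt; everything else is illustrative.
LOG = logging.getLogger("processor.ExtractLines")


def should_skip_line(line_id, width, height, text,
                     min_line_width, min_line_height, min_line_length):
    """Hypothetical filter: return True (and say why at INFO level) if the line is skipped."""
    if width < min_line_width or height < min_line_height:
        LOG.info("Skipping line '%s': size %dx%d is smaller than "
                 "min-line-width=%d / min-line-height=%d",
                 line_id, width, height, min_line_width, min_line_height)
        return True
    if len(text) < min_line_length:
        LOG.info("Skipping line '%s': text length %d is shorter than "
                 "min-line-length=%d", line_id, len(text), min_line_length)
        return True
    return False


logging.basicConfig(level=logging.INFO)
# Sizes of the two lines from the report, with the thresholds used there.
skipped_0002 = should_skip_line("TR-15_line0002", 24, 4, "", 5, 5, 0)
skipped_0003 = should_skip_line("TR-15_line0003", 34, 5, "", 5, 5, 0)
```

With a message like this, the dropped line would show up in the log next to the existing "contains no text content" warning instead of disappearing silently.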

  • If an element has y-coordinates from 434 to 438, then in my view (so far) its height is 5 px (including the pixels at the borders). Is there a definition of what "height" means?

That's how PRImA sees and implements it (but they fail to communicate it in their standards), but not OCR-D (so far). All coordinate/polygon/image handling libraries I know use the pixel-below-right convention, not the path-refers-to-inside-of-polygon interpretation. There's been a short discussion with PRImA on this, but as you can see their neglect of this important detail is striking.

@stefanCCS
Author

  • It would be very helpful to log a message in this case (at a log level that is typically visible). Maybe you can use this issue (or create a new one) to follow up on this ...

Agreed. Currently, there is only

  • an error message if the line has zero size
  • a warning message if the text is empty

So I could add an info message if the text is shorter than min-line-length or the size is smaller than min-line-height / min-line-width.

Yes, this would be very good :-)

  • If an element has y-coordinates from 434 to 438, then in my view (so far) its height is 5 px (including the pixels at the borders). Is there a definition of what "height" means?

That's how PRImA sees and implements it (but they fail to communicate it in their standards), but not OCR-D (so far). All coordinate/polygon/image handling libraries I know use the pixel-below-right convention, not the path-refers-to-inside-of-polygon interpretation. There's been a short discussion with PRImA on this, but as you can see their neglect of this important detail is striking.

Well, if I have an image, which has a height of 1000, it has the pixel numbers from y=0 to y=999 (or y=1 to y=1000, depending on library used).
Therefore, in my opinion including the border pixels is somehow more "natural".
But in the end, I think it does not matter so much (although it would be good to have a common view within one framework like OCR-D).

@bertsky
Collaborator

bertsky commented Aug 16, 2022

Well, if I have an image, which has a height of 1000, it has the pixel numbers from y=0 to y=999 (or y=1 to y=1000, depending on library used).

Yes, and in array terms you express that via a slice/interval 0:1000. And "pixel numbers" has a different semantics than a coordinate system (namely ordinal vs. cardinal). You can think of coordinates as the "cracks" between pixels.
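A small illustration of the slice view, using a plain Python list as a stand-in for the pixel rows of an image (an assumption made for simplicity; image arrays slice the same way along their row axis):

```python
# Pixel rows of a 1000px-high image, numbered 0..999 (ordinal "pixel numbers").
rows = list(range(1000))

# The whole image is the interval 0:1000; the upper bound is a boundary, not a row.
full = rows[0:1000]
print(len(full))   # 1000

# Treating y=434 and y=438 as "cracks" between pixels selects the rows in between.
crop = rows[434:438]
print(crop)        # [434, 435, 436, 437]
print(len(crop))   # 4
```

So the interval 434:438 contains four rows, just as 0:1000 contains one thousand, and the height is simply the difference of the two coordinates.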

Therefore, in my opinion including the border pixels is somehow more "natural".

For an implementation, the pixel-below-right coordinate semantics is by far easier than the path-refers-to-inside-of-polygon (which needs a notion of path and directionality; the former is undefined for baselines and subsegments, the latter is not agreed upon).
