blockno is not the same across get_text methods #2219

darwinharianto · 2023-02-08T00:44:15Z

darwinharianto
Feb 8, 2023

Is your feature request related to a problem? Please describe.
I was trying to connect the get_text("words") result with get_text("rawdict"). Looking at the documentation, I thought I could link them using blockno (also known as seqno, I believe). When I tried to link them, I had to add to the seqno from get_text("words") the number of image blocks that get_text("rawdict") returns up to that seqno.

Describe the solution you'd like
The same blockno and seqno across the different get_text methods, so that they are easier to link.


Answered by JorjMcKie


JorjMcKie · 2023-02-08T05:55:49Z

JorjMcKie
Feb 8, 2023
Maintainer

This is a typical "Discussions" item, so I will reclassify it first.


JorjMcKie · 2023-02-08T06:20:40Z

JorjMcKie
Feb 8, 2023
Maintainer

The number of blocks on a page is subject to MuPDF's heuristics for recognizing text blocks as such. A full / unlimited extraction will also identify image blocks. All these (desired) blocks are put into a TextPage object, from which extractions and searches then take place.
For performance reasons, not all blocks that are identifiable on a page are always selected in this process. For example, plain text, "words" and "xhtml" extraction as well as text search extract no image blocks.
Other differences occur if dehyphenation is switched on or off.
In particular, the inclusion / exclusion of images in the TextPage object has an enormous effect on the time needed to build it and, of course, on its size.
Therefore, different default text page flags are used when doing e.g. page.get_text("words") compared to page.get_text("rawdict"). These defaults are built into TEXTFLAGS_TEXT, TEXTFLAGS_RAWDICT and the others.
If you - for whatever reason - need an equal block count across different extractions / searches, just always use the same flags value.
Even better: build the TextPage separately with the desired flags, and subsequently re-use it in all extractions and searches by referring to it with the textpage parameter. This will speed up multiple extractions a lot! I have built a Jupyter notebook to demonstrate this effect.
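
The re-use pattern described here can be sketched roughly as follows. This is a minimal illustration, not code from the thread: the helper name extract_linked is hypothetical, and it only relies on page.get_textpage(flags=...) and the textpage parameter of page.get_text, both of which are mentioned in this discussion.

```python
def extract_linked(page, flags):
    """Build one TextPage with explicit flags and reuse it for two extractions.

    `page` is expected to behave like a PyMuPDF Page; `flags` would typically
    be something like fitz.TEXTFLAGS_RAWDICT (passed in by the caller).
    """
    # Build the TextPage once. This is the expensive step.
    textpage = page.get_textpage(flags=flags)

    # Both extractions read from the same TextPage, so they see the same
    # set of blocks (including image blocks, if the flags keep them).
    words = page.get_text("words", textpage=textpage)
    rawdict = page.get_text("rawdict", textpage=textpage)
    return words, rawdict
```

With PyMuPDF installed, a call might look like extract_linked(doc[0], fitz.TEXTFLAGS_RAWDICT) after import fitz and doc = fitz.open("input.pdf"), where the file name is only a placeholder.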


darwinharianto · 2023-02-08T07:21:37Z

darwinharianto
Feb 8, 2023
Author

Thank you, that get_text("rawdict", textpage=textpage) really helps. It sped up my process too!

I was having a problem with text that sits on top of other text. I thought that by looking at the bounding box of each character I could determine whether they overlap, using IoU. So first I run

                words = page.get_text("words", textpage = textPage)
                textsMeta = [w for w in words if Rect(w[:4]) in box]

Then I loop through that metadata, using blockno and lineno to index into page.get_text("rawdict", textpage = textPage), to get the bbox of each character.
Finally I reverse the order to find out which was written first or last.

I read #736, but it is hard to understand...
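
For readers unfamiliar with the IoU check mentioned above, a minimal sketch of intersection over union for two boxes might look like this. It assumes plain (x0, y0, x1, y1) tuples, which is the layout of the first four items of a "words" entry; the function name iou is illustrative and not part of PyMuPDF.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield 0.0 rather than a division error.
    return inter / union if union > 0 else 0.0
```

A value near 1.0 means the two character boxes almost coincide, while 0.0 means they do not overlap at all. Which threshold counts as "overlapping" is a choice left to the caller.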


darwinharianto · 2023-02-08T08:18:29Z

darwinharianto
Feb 8, 2023
Author

@JorjMcKie
I built the textpage with textPage = page.get_textpage(flags=fitz.TEXTFLAGS_RAWDICT),
but it still gives me a different blockno for
page.get_text("rawdict", textpage = textPage)
page.get_text("words", textpage = textPage)

Is it because page.get_text("words") reindexes the textpage's blocks?

6 replies

JorjMcKie Feb 8, 2023
Maintainer

Ah, and this is the output of blocks:

In [13]: blocks
Out[13]:
[(240.00100708007812,
  88.93600463867188,
  540.0009765625,
  388.9360046386719,
  '<image: DeviceRGB, width: 1200, height: 1200, bpc: 8>',
  0,
  1),
 (236.9040069580078,
  396.9154968261719,
  540.0006713867188,
  432.41064453125,
  'PyMuPDF Documentation\n',
  1,
  0),
 (422.281005859375,
  433.3641052246094,
  540.0000610351562,
  457.98211669921875,
  'Release 1.21.1\n',
  2,
  0),
 (485.56500244140625,
  515.5146484375,
  540.0001220703125,
  540.1671142578125,
  'Artifex\n',
  3,
  0),
 (469.5480041503906,
  652.2603759765625,
  540.0,
  669.3801879882812,
  'Jan 09, 2023\n',
  4,
  0)]
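
To read this output: in a "blocks" tuple the last two items are the block number and the block type, where type 1 marks an image block and type 0 a text block. The following sketch separates the two kinds. The coordinates are rounded copies of the values above, and the variable names are illustrative.

```python
# Rounded copy of the "blocks" output shown above:
# (x0, y0, x1, y1, content, block_no, block_type)
blocks = [
    (240.0, 88.9, 540.0, 388.9,
     "<image: DeviceRGB, width: 1200, height: 1200, bpc: 8>", 0, 1),
    (236.9, 396.9, 540.0, 432.4, "PyMuPDF Documentation\n", 1, 0),
    (422.3, 433.4, 540.0, 458.0, "Release 1.21.1\n", 2, 0),
    (485.6, 515.5, 540.0, 540.2, "Artifex\n", 3, 0),
    (469.5, 652.3, 540.0, 669.4, "Jan 09, 2023\n", 4, 0),
]

# block_type == 0 means text, block_type == 1 means image.
text_blocks = [b for b in blocks if b[6] == 0]
image_block_numbers = [b[5] for b in blocks if b[6] == 1]

for b in text_blocks:
    print(b[5], repr(b[4]))
```

Here the image block takes number 0, so the text blocks are numbered from 1. An extraction whose TextPage leaves out images would presumably number the same text blocks from 0, which matches the offset described in the original question.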

darwinharianto Feb 9, 2023
Author

Ah, I was comparing
rawdict = page.get_text("rawdict", textpage = textPage) with words = page.get_text("words", textpage = textPage)

I thought rawdict's data doesn't contain a blockno, so I used rawdict['blocks'][blockno]. I assumed that the blockno value from words corresponds to the block indices in rawdict's data.

Now that I take another look at the rawdict keys, I found a number key. This key is the blockno, right? It doesn't seem to be documented in the docs.

darwinharianto Feb 9, 2023
Author

For the lineno, I could just use it as block["lines"][lineNo], right?
I am not sure how to use the wordno..

JorjMcKie Feb 9, 2023
Maintainer

The number in the "dict" / "rawdict" dictionaries is indeed the block number. These two dictionaries are the same and share the same code; "rawdict" only goes one level deeper, that's all.
The line dictionary and the span dictionary have no "number" keys. So e.g. inside a block dict, the lines are a list which you can access via index. The same holds for spans inside a line dict, or chars inside a span dict.

The block "number" key is documented. It just may be missing in that image.
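
Put together, linking a "words" entry to its characters in "rawdict" could be sketched as below. This assumes both results come from the same TextPage, that a word tuple carries block_no and line_no at positions 5 and 6, and that line_no can be used as an index into block["lines"] as discussed in this thread. The helper names line_for_word and char_boxes are hypothetical.

```python
def line_for_word(rawdict, word):
    """Return the rawdict line dict that a "words" tuple belongs to, or None."""
    block_no, line_no = word[5], word[6]

    # Match on the block's "number" key instead of its list position.
    block = next(
        (b for b in rawdict["blocks"] if b.get("number") == block_no), None
    )
    # Image blocks carry no "lines" key, so there is nothing to look up.
    if block is None or "lines" not in block:
        return None

    # Lines have no "number" key; they are reached by list index.
    lines = block["lines"]
    return lines[line_no] if 0 <= line_no < len(lines) else None


def char_boxes(line):
    """List (character, bbox) pairs for every char in a rawdict line."""
    return [
        (ch["c"], ch["bbox"])
        for span in line["spans"]
        for ch in span["chars"]
    ]
```

Matching on "number" rather than on rawdict["blocks"][blockno] keeps the lookup valid even if the list position and the block number ever differ.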

JorjMcKie Feb 9, 2023
Maintainer

I am not sure how to use the wordno..

No need to use it at all. I am just outputting a counter because I have the info - ignoring it may happen later.
