blockno is not the same across get_text methods #2219
Replies: 4 comments 6 replies
This is a typical "Discussions" item, so I will first reclassify it.
Thank you, that get_text("rawdict", textpage=textpage) really helps. It sped up my process too! I was having a problem with texts that sit on top of other texts. I thought that by looking at the bounding box of each character I could determine whether they overlap, using IoU. So first I
then loop through the metadata on blockno and lineno inside page.get_text("rawdict", textpage=textPage) to get the bbox of each character. I read #736, but it is hard to understand...
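A minimal sketch of the overlap check described above, assuming the input has the shape returned by page.get_text("rawdict") (blocks, then lines, then spans, then chars, each char carrying "c" and "bbox"). The helper names iter_chars, iou and overlapping_chars and the 0.5 threshold are hypothetical choices for illustration, not part of PyMuPDF.

```python
from itertools import combinations


def iter_chars(rawdict):
    """Yield (block_no, line_no, char, bbox) for each character in a rawdict-shaped dict."""
    for block in rawdict["blocks"]:
        if block.get("type", 0) != 0:  # type 1 is an image block: it has no lines
            continue
        for line_no, line in enumerate(block["lines"]):
            for span in line["spans"]:
                for ch in span["chars"]:
                    yield block["number"], line_no, ch["c"], tuple(ch["bbox"])


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def overlapping_chars(rawdict, threshold=0.5):
    """Return all pairs of characters whose bboxes overlap with IoU >= threshold.

    The threshold is an assumed value; tune it for the documents at hand.
    """
    chars = list(iter_chars(rawdict))
    return [(p, q) for p, q in combinations(chars, 2) if iou(p[3], q[3]) >= threshold]
```

This compares every pair of characters, so the cost grows quadratically with the character count; for large pages it may be worth comparing only characters from different blocks or lines.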
@JorjMcKie is it because
The number of blocks on a page is subject to MuPDF's heuristics for recognizing text blocks as such. A full / unlimited extraction will also identify image blocks. All these (desired) blocks are put into a TextPage object, from which extractions and searches take place.
For performance reasons, not all blocks that are identifiable on a page are always selected in this process: for example, plain text, "words" and "xhtml" extraction as well as text search extract no image blocks.
Other differences occur if dehyphenation is switched on or off.
In particular, the inclusion / exclusion of images in the TextPage object has an enormous effect on the time needed to build it and …
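As a rough illustration of the point about TextPage objects: building one TextPage and passing it to every get_text call should make the block numbers refer to the same set of blocks. The sketch below assumes a PyMuPDF Page object and uses only page.get_textpage and page.get_text with the textpage argument; the helper name block_numbers is hypothetical.

```python
def block_numbers(page, flags):
    """Read block numbers from two extraction methods that share one TextPage.

    `page` is assumed to be a PyMuPDF Page; `flags` are the extraction flags
    used to build the TextPage (they decide, for example, whether image
    blocks are included).
    """
    textpage = page.get_textpage(flags=flags)  # built once, reused below

    # "blocks" items: (x0, y0, x1, y1, text, block_no, block_type)
    blocks = page.get_text("blocks", textpage=textpage)
    # "words" items: (x0, y0, x1, y1, word, block_no, line_no, word_no)
    words = page.get_text("words", textpage=textpage)

    text_block_nos = sorted({b[5] for b in blocks if b[6] == 0})  # type 0 is text
    word_block_nos = sorted({w[5] for w in words})
    return text_block_nos, word_block_nos


# Possible usage (assumes PyMuPDF is installed and "input.pdf" exists):
#   import fitz
#   page = fitz.open("input.pdf")[0]
#   print(block_numbers(page, fitz.TEXTFLAGS_DICT))
```

The two lists need not be identical (a text block without any words would appear only in the first), but because both come from the same TextPage, a block number found in one output can be looked up in the other.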