blockno is not the same across get_text methods #2219

Answered by JorjMcKie
darwinharianto asked this question in Q&A
The number of blocks on a page depends on MuPDF's heuristics for recognizing text blocks. A full (unrestricted) extraction will also identify image blocks. All selected blocks are put into a TextPage object, which then serves as the source for extractions and searches.
For performance reasons, not every block that is identifiable on a page is always selected in this process. For example, plain text, "words" and "xhtml" extraction, as well as text search, will extract no image blocks.
Other differences occur when dehyphenation is switched on or off.
In particular, the inclusion or exclusion of images in the TextPage object has an enormous effect on the time needed to build it and …
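
To make the effect concrete, here is a minimal sketch (not taken from the original answer) that builds one TextPage with explicit flags and reuses it for both the "blocks" and the "dict" extraction, so both calls should see the same set of blocks. It assumes PyMuPDF is installed and importable as `fitz`. The helper names and the `compare_block_numbers` function are hypothetical and exist only for illustration, and the chosen flag combination is just one possible choice.

```python
def text_block_numbers_from_tuples(blocks):
    """Return block numbers of text blocks from get_text("blocks") output.

    Each item is (x0, y0, x1, y1, text, block_no, block_type);
    block_type 0 means text, 1 means image.
    """
    return [b[5] for b in blocks if b[6] == 0]


def text_block_numbers_from_dicts(blocks):
    """Return block numbers of text blocks from get_text("dict")["blocks"]."""
    return [b["number"] for b in blocks if b["type"] == 0]


def compare_block_numbers(path, page_index=0):
    """Hypothetical helper: extract both formats from one shared TextPage."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(path)
    page = doc[page_index]

    # Assumption: include image blocks explicitly so both extractions
    # are based on the same block selection.
    flags = (
        fitz.TEXT_PRESERVE_LIGATURES
        | fitz.TEXT_PRESERVE_WHITESPACE
        | fitz.TEXT_PRESERVE_IMAGES
    )
    textpage = page.get_textpage(flags=flags)

    tuple_blocks = page.get_text("blocks", textpage=textpage)
    dict_blocks = page.get_text("dict", textpage=textpage)["blocks"]

    return (
        text_block_numbers_from_tuples(tuple_blocks),
        text_block_numbers_from_dicts(dict_blocks),
    )
```

Calling `compare_block_numbers("some.pdf")` (the file name is a placeholder) returns the text block numbers seen by each method. If the numbers differ when each method is called with its own default flags, passing a shared `textpage` as above is one way to check whether image blocks are the cause.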

Replies: 4 comments 6 replies

Answer selected by darwinharianto
This discussion was converted from issue #2218 on February 08, 2023 05:55.