-
First of all thank you very much for this great work. I particularly appreciate your layout preserving text extraction method.
import fitz
pdf_filename = 'my.pdf'
with fitz.open(pdf_filename) as doc:
    print(doc.get_toc())

doc.get_toc() seems to give results only if the TOC is already present at the beginning of the document. For this PDF, for example: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/Dart.pdf, is there a method in PyMuPDF to generate the PDF outline? In this case it would contain only the numbered paragraph titles. The same question applies to PDFs with more complex formatting, such as https://blog.xpgreat.com/file/lstm.pdf, with numbered parts and sub-parts. I also tested mupdf with the command:

Your configuration (mandatory)
In my case, I installed on macOS arm64 M2 (not Intel).
Please feel free to modify the README.md to let macOS users with an Apple chip know that it also works by following these steps; I'm sure it will be useful for some :).
-
This is a typical "Discussions" item, so let me convert it before answering.
-
Talking about TOC in PDF can easily become confusing:
1. The outline (bookmarks) stored in the PDF structure, which is what doc.get_toc() returns.
2. Ordinary pages whose text is laid out as a table of contents.
A PDF technically can have both of the above things, or just one, or neither. If neither exists, or just number 2 above, then doc.get_toc() returns an empty list. I think you are asking for a way to create number 1 if just number 2 exists, right? Because TOC pages in a PDF are just ordinary text, it is left to the programmer's wits to
There certainly is also no function in PyMuPDF that creates a TOC if none of the above exists at all.
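For documents whose section titles are numbered, like the ones linked in the question, one possible starting point is sketched below. It is only a heuristic under stated assumptions, not a PyMuPDF feature: the names HEADING_RE, headings_from_lines, normalize_levels and build_outline are hypothetical, and the regular expression simply assumes that a heading is a short line starting with a dotted section number followed by a capitalized title. The only PyMuPDF calls used are fitz.open(), page.get_text(), doc.set_toc() and doc.save().

```python
import re

# Hypothetical heuristic: a heading is a line that starts with a dotted
# section number ("1", "2.3", "4.1.2", optionally followed by a dot) and
# then a title that begins with a capital letter.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z].{0,80})$")


def headings_from_lines(lines, page_number):
    """Return [level, title, page] entries for lines that look like numbered headings."""
    entries = []
    for line in lines:
        match = HEADING_RE.match(line.strip())
        if match is None:
            continue
        number, title = match.groups()
        level = number.count(".") + 1  # "2.3" has one dot, so level 2
        entries.append([level, f"{number} {title}", page_number])
    return entries


def normalize_levels(entries):
    """Clamp levels so the first entry is level 1 and no entry jumps more
    than one level deeper than its predecessor (set_toc rejects such jumps)."""
    fixed = []
    previous = 0
    for level, title, page in entries:
        level = min(level, previous + 1)
        fixed.append([level, title, page])
        previous = level
    return fixed


def build_outline(pdf_path, out_path):
    """Scan every page for numbered headings and store them as the PDF outline."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        toc = []
        for page in doc:
            lines = page.get_text().splitlines()
            # page.number is 0-based, TOC page numbers are 1-based
            toc.extend(headings_from_lines(lines, page.number + 1))
        toc = normalize_levels(toc)
        if toc:
            doc.set_toc(toc)
        doc.save(out_path)
    return toc
```

Calling build_outline('my.pdf', 'my-with-outline.pdf') would write a copy of the file with the detected entries as its outline. Expect false positives (numbered list items, lines on an existing TOC page) and misses (headings whose number and title are extracted as separate lines), so the resulting list usually needs manual review before it is stored.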
Correct ... with the redaction.
-
Thank you for this explanation, I had difficulty in describing the problem precisely.
It is exactly that!
Yes, that sounds exciting! Before starting development, I am trying to find out what has already been done to solve this problem.
-
I once (long ago) wrote a GUI script that lets you create a TOC interactively: