Go over 100-200 PDF documents from the Common Crawl corpus. Retrieve the distribution of complexity scores and the bounding boxes, then inspect the layout (using the plot_bb function) and the detected text blocks according to their complexity.
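A minimal sketch of how the score distribution could be summarized once the scores are collected. The function name summarize_scores is hypothetical, and the sketch assumes each complexity score is a float in [0, 1]; adjust the binning if the scores in this repo use a different range.

```python
from collections import Counter
from statistics import mean, median


def summarize_scores(scores, n_bins=5):
    """Return basic statistics and a fixed-width histogram.

    Assumption: every score lies in [0, 1]. A score of exactly 1.0
    is placed in the last bin.
    """
    if not scores:
        raise ValueError("no scores to summarize")
    # Map each score to a bin index in 0 .. n_bins - 1.
    bins = Counter(min(int(s * n_bins), n_bins - 1) for s in scores)
    return {
        "count": len(scores),
        "mean": mean(scores),
        "median": median(scores),
        "histogram": [bins.get(i, 0) for i in range(n_bins)],
    }
```

Comparing the histogram across layout types (single column, multi-column, table-heavy) is one way to see which layouts push the scores up.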
Explore the patterns between the layouts (the detected bounding boxes) and PyMuPDF's ability to extract data correctly. It would be nice to add a regex for LaTeX to the metadata. Also, while inspecting different documents, consider how to handle the extraction of formatting lines (such as page numbers and footnotes) that are not relevant to the content.
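The two ideas above could start from something like the sketch below. Everything here is an assumption for illustration: the names LATEX_PATTERN, latex_metadata and is_margin_block are hypothetical, the regex is a rough heuristic that will miss some LaTeX and may flag some non-LaTeX text, and the 8% margin band is an arbitrary starting value. The bounding box is assumed to be (x0, y0, x1, y1) with the origin at the top-left of the page, which is the convention PyMuPDF uses.

```python
import re

# Heuristic LaTeX detector (not exhaustive).
LATEX_PATTERN = re.compile(
    r"\\(?:begin|end)\{[A-Za-z*]+\}"      # environments, e.g. \begin{equation}
    r"|\\[A-Za-z]+\s*\{"                  # commands with an argument, e.g. \frac{
    r"|\$[^$\n]*[\\^_][^$\n]*\$"          # inline math containing \, ^ or _
)


def latex_metadata(text):
    """Return metadata fields describing LaTeX-like content in a text block."""
    matches = LATEX_PATTERN.findall(text)
    return {"has_latex": bool(matches), "latex_match_count": len(matches)}


def is_margin_block(bbox, page_height, band=0.08):
    """Flag a block that sits entirely in the top or bottom margin band.

    Such blocks are candidates for page numbers, running headers,
    or footnotes; they still need manual inspection.
    """
    _, y0, _, y1 = bbox
    return y1 <= page_height * band or y0 >= page_height * (1 - band)
```

Requiring a backslash, caret, or underscore inside the dollar signs keeps plain prices such as "$5 and $10" from being counted as math. The margin check only looks at position, so a block flagged this way is a candidate for removal and not a confirmed formatting line.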
There is an example use case in ./notesboks/Example_llm.ipynb.
Feel free to create a new prompt, add or remove examples, or add new metadata to the prompt.