Testing protocol

Go over 100-200 PDF documents from the Common Crawl corpus. Retrieve the distribution of complexity scores and the bounding boxes, then inspect the layout (using the plot_bb function) and the detected text blocks according to their complexity.
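A minimal sketch of this step, assuming blocks are read with PyMuPDF's `page.get_text("blocks")` and that per-block complexity scores come from the project's own scoring code (represented here by a plain list of floats). The names `extract_blocks` and `summarize_scores`, and the bin edges, are hypothetical and not part of this repository.

```python
import bisect
import statistics


def extract_blocks(pdf_path):
    """Return (page_number, bbox, text) for every text block in a PDF."""
    # Imported lazily so summarize_scores works without PyMuPDF installed.
    # Older PyMuPDF releases expose the same module as `fitz`.
    import pymupdf

    blocks = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # Each entry is (x0, y0, x1, y1, text, block_no, block_type).
            for x0, y0, x1, y1, text, _no, block_type in page.get_text("blocks"):
                if block_type == 0:  # 0 = text block, 1 = image block
                    blocks.append((page.number, (x0, y0, x1, y1), text))
    return blocks


def summarize_scores(scores, edges=(0.25, 0.5, 0.75)):
    """Summarise complexity scores: count, mean, median and per-bin counts.

    `edges` are assumed bin boundaries; bin i holds scores below edges[i],
    and the last bin holds everything at or above the final edge.
    """
    counts = [0] * (len(edges) + 1)
    for score in scores:
        counts[bisect.bisect_right(edges, score)] += 1
    return {
        "count": len(scores),
        "mean": statistics.fmean(scores) if scores else None,
        "median": statistics.median(scores) if scores else None,
        "bin_counts": counts,
    }
```

The bounding boxes returned by `extract_blocks` can then be compared against the output of plot_bb for the same page, and the bin counts give a quick view of how the scores are spread across the sampled documents.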

Explore the patterns between the layouts (the detected bounding boxes) and the ability of pymupdf to extract data correctly. It would be nice to add a regex for LaTeX to the metadata. Also, while inspecting different documents, think about the extraction of formatting lines (such as page numbers and footnotes), which are not relevant to the content.
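As a starting point for the LaTeX regex and for flagging page-number lines, here is a rough sketch. Both patterns are heuristic assumptions that will need tuning against real documents, and the `block_metadata` helper and its field names are hypothetical. Footnotes are not covered, since detecting them likely needs layout information (position and font size) rather than text alone.

```python
import re

# Heuristic LaTeX detector (an assumption, not an exhaustive grammar).
LATEX_RE = re.compile(
    r"\\(?:begin|end)\{[A-Za-z*]+\}"       # \begin{equation}, \end{align*}
    r"|\\[A-Za-z]+\{"                      # commands with an argument: \frac{
    r"|\$[^$\n]*[\\^_][^$\n]*\$"           # inline math containing \, ^ or _
)

# Lines that consist only of a page number, e.g. "12", "Page 3 of 10", "3/10".
PAGE_NUMBER_RE = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def block_metadata(text):
    """Return simple boolean flags for one extracted text block."""
    return {
        "has_latex": bool(LATEX_RE.search(text)),
        "is_page_number": bool(PAGE_NUMBER_RE.match(text)),
    }
```

Requiring a backslash, `^` or `_` between the dollar signs keeps plain prices such as "$5 and $10" from being flagged as math, at the cost of missing very simple inline expressions like `$x$`.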

There is an example use case in ./notesboks/Example_llm.ipynb

Feel free to create a new prompt, remove/add examples, add new metadata to the prompt.