Testing protocol

Go over 100-200 PDF documents from the Common Crawl corpus. Retrieve the distribution of complexity scores and the bounding boxes, then inspect the layout (using the plot_bb function) and the detected text blocks according to their complexity.
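As a starting point for collecting bounding boxes, here is a minimal sketch using pymupdf's `page.get_text("blocks")`, which returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The helper names `text_block_boxes` and `pdf_text_blocks` are hypothetical and not part of this repository; the project's own complexity scoring and `plot_bb` function are not reproduced here.

```python
def text_block_boxes(blocks):
    """Keep non-empty text blocks and return (bbox, text) pairs.

    `blocks` is the list returned by pymupdf's page.get_text("blocks").
    block_type 0 is a text block, 1 is an image block.
    """
    result = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        stripped = text.strip()
        if block_type == 0 and stripped:
            result.append(((x0, y0, x1, y1), stripped))
    return result


def pdf_text_blocks(path):
    """Map each page number to its text blocks (hypothetical helper)."""
    # Imported lazily so the pure helper above works without pymupdf installed.
    try:
        import pymupdf
    except ImportError:
        import fitz as pymupdf  # older pymupdf releases use this module name

    doc = pymupdf.open(path)
    try:
        return {
            page.number: text_block_boxes(page.get_text("blocks"))
            for page in doc
        }
    finally:
        doc.close()
```

The resulting boxes can then be passed to whatever plotting or scoring step is being tested, one page at a time.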

Explore the patterns between the layouts (the detected bounding boxes) and the ability of pymupdf to extract data correctly. It would be nice to add a regex for LaTeX to the metadata. Also, while inspecting different documents, consider how to handle the extraction of formatting lines (such as page numbers and footnotes) that are not relevant to the content.
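One possible way to prototype the LaTeX regex and the page-number check is sketched below. Both patterns are rough heuristics chosen for illustration, and the names `LATEX_PATTERN`, `PAGE_NUMBER_PATTERN`, `looks_like_latex`, and `is_page_number` are assumptions rather than existing project code. They will likely need tuning against the inspected documents.

```python
import re

# Heuristic: LaTeX environments or commands followed by a brace argument.
LATEX_PATTERN = re.compile(
    r"\\(?:begin|end)\{[A-Za-z*]+\}"  # \begin{equation}, \end{align*}
    r"|\\[A-Za-z]+\s*\{"              # \frac{, \textbf{, \cite{
)

# Heuristic: a block that contains only "12", "Page 3 of 10", or "4 / 20".
PAGE_NUMBER_PATTERN = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def looks_like_latex(text):
    """Return True if the text contains LaTeX-style markup."""
    return LATEX_PATTERN.search(text) is not None


def is_page_number(text):
    """Return True if the whole block looks like a page number."""
    return PAGE_NUMBER_PATTERN.match(text) is not None
```

A flag such as `looks_like_latex(block_text)` could be stored in the block metadata, while blocks matching `is_page_number` could be dropped before the text is passed on. Footnotes are not covered by this sketch and would probably need a position-based rule.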

An example use case is available in ./notesboks/Example_llm.ipynb.

Feel free to create a new prompt, remove or add examples, or add new metadata to the prompt.
