Labeled data for creating machine learning models related to PDF consumption: token types, paragraph extraction, and reading order
- Requires Docker Desktop 4.25.0
Start the labeling tool:
make start
Once the tool is running, open the web interface at:
http://localhost:8080
To stop the server:
make stop
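Because the containers can take a while to come up after `make start`, it may help to wait for the web interface before opening it. The following is a minimal sketch, not part of the project: the helper name `wait_for_server` and the timeout values are illustrative assumptions, and only the URL comes from the instructions above.

```python
import time
import urllib.error
import urllib.request

# Hypothetical helper (not provided by the labeling tool): polls a URL
# until something answers over HTTP or the timeout expires.
def wait_for_server(url: str, timeout: float = 120.0, interval: float = 2.0) -> bool:
    # Ignore any system proxy settings, since the target is a local server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with opener.open(url, timeout=interval):
                return True  # a successful response means the server is up
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval)  # nothing listening yet; retry
    return False


if __name__ == "__main__":
    # URL taken from the instructions above; the timeout is an assumption.
    if wait_for_server("http://localhost:8080", timeout=120.0):
        print("Labeling tool is ready at http://localhost:8080")
    else:
        print("Server did not respond in time; check the `make start` output")
```

Run the script in a second terminal after `make start`. Any HTTP response, including an error status, is treated as "up", since the goal is only to detect that the server is listening.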
- Token Type: labels each word that appears in a PDF. Repository: https://github.com/huridocs/pdf-tokens-type-labeler
- Reading Order: sorts the content of a PDF into reading order. Repository: https://github.com/huridocs/pdf-reading-order
- Paragraph Extraction: segments a PDF into paragraphs. Repository: https://github.com/huridocs/pdf_paragraphs_extraction
- Table of Contents: extracts the table of contents. Repository: https://github.com/huridocs/pdf_paragraphs_extraction
This tool is a fork of the Allen AI project PAWLS (https://github.com/allenai/pawls), supported by HURIDOCS.