User Guide for Textharvester Python Toolkit

Introduction

HG TextHarvester is a basic toolkit designed for digitizing and extracting textual data from Historic Graves Project documents. This guide provides step-by-step instructions on how to use HG TextHarvester to convert PDFs to images, perform OCR (Optical Character Recognition), and handwriting recognition (HWR) to extract text, and compile the data into CSV files for easy analysis and record-keeping.

System Requirements

Python 3.x
Required Libraries: PIL, requests, csv, glob, os, json, base64
OpenAI API Key (for OCR)

Setup

Install Python: Ensure Python 3.x is installed on your system.
Install Libraries: Install necessary Python libraries using pip:

pip install pillow requests csv glob json base64

API Key: Set your OpenAI API key as an environment variable:

On Windows: set OPENAI_API_KEY=your_api_key
On Unix/Mac: export OPENAI_API_KEY=your_api_key

API key can be requested from dtcurragh if not already provided.

Usage

HG TextHarvester comprises three main scripts:

PDF to JPG Conversion (pdf2jpg.py):

Converts each page of a PDF document into separate JPG images.
Usage: open pdf2jpg.py in vscode, replace filepath placeholders with actual filepaths within the scripts. Save the file and run the script using the run button in vscode or python pdf2jpg.py

OCR and Text Categorization (vision_ndl.py):

Processes JPG images using OCR to extract text.
Saves OCR results in JSON format.
Usage: type python vision_ndl.py <input_folder_path> in your temrinal before pressing enter. Monitor the progress in the terminal, looking out for error codes returned by openai

JSON to CSV Conversion (json2csv_ndl.py):

Parses JSON files to extract data.
Compiles data into a CSV file.
Usage: json2csv_ndl.py in vscode, replace filepath placeholders with actual filepaths within the scripts. Save the file and run the script using the run button in vscode or python json2csv_ndl.py

Workflow

Prepare your PDFs: Place your PDFs in an accessible folder.
Convert PDFs to JPGs: Run pdf2jpg.py for each PDF.
Process Images with OCR: Run vision_ndl.py on the folder containing JPGs.
Compile Data to CSV: Run json2csv_ndl.py to collate all OCR data into a CSV file.

Troubleshooting

API Limitations: If you hit API rate limits, try processing in smaller batches.
Data Quality: For best OCR results, ensure images are clear and well-lit. Block capitals preferred for handwriting
Error Handling: If scripts encounter errors, check the console output for specific error messages.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
__pycache__		__pycache__
test_folder		test_folder
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
README.md		README.md
j2c_t_test.py		j2c_t_test.py
json2csv_ndl.py		json2csv_ndl.py
json2csv_trans.py		json2csv_trans.py
pdf2jpg.py		pdf2jpg.py
threading_test.py		threading_test.py
validate_ndl.py		validate_ndl.py
validate_transcript.py		validate_transcript.py
vision_ndl.log		vision_ndl.log
vision_ndl.py		vision_ndl.py
vision_transcript.log		vision_transcript.log
vision_transcript.py		vision_transcript.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

User Guide for Textharvester Python Toolkit

Introduction

System Requirements

Setup

Usage

Workflow

Troubleshooting

About

Contributors 2

Languages

donalotiarnaigh/textharvester-python

Folders and files

Latest commit

History

Repository files navigation

User Guide for Textharvester Python Toolkit

Introduction

System Requirements

Setup

Usage

Workflow

Troubleshooting

About

Resources

Stars

Watchers

Forks

Contributors 2

Languages