HG TextHarvester is a basic toolkit designed for digitizing and extracting textual data from Historic Graves Project documents. This guide provides step-by-step instructions on how to use HG TextHarvester to convert PDFs to images, perform OCR (Optical Character Recognition), and handwriting recognition (HWR) to extract text, and compile the data into CSV files for easy analysis and record-keeping.
- Python 3.x
- Required Libraries:
PIL
,requests
,csv
,glob
,os
,json
,base64
- OpenAI API Key (for OCR)
- Install Python: Ensure Python 3.x is installed on your system.
- Install Libraries: Install necessary Python libraries using pip:
pip install pillow requests csv glob json base64
- API Key: Set your OpenAI API key as an environment variable:
- On Windows:
set OPENAI_API_KEY=your_api_key
- On Unix/Mac:
export OPENAI_API_KEY=your_api_key
API key can be requested from dtcurragh if not already provided.
HG TextHarvester comprises three main scripts:
- PDF to JPG Conversion (
pdf2jpg.py
):
- Converts each page of a PDF document into separate JPG images.
- Usage: open
pdf2jpg.py
in vscode, replace filepath placeholders with actual filepaths within the scripts. Save the file and run the script using therun
button in vscode orpython pdf2jpg.py
- OCR and Text Categorization (
vision_ndl.py
):
- Processes JPG images using OCR to extract text.
- Saves OCR results in JSON format.
- Usage: type
python vision_ndl.py <input_folder_path>
in your temrinal before pressing enter. Monitor the progress in the terminal, looking out for error codes returned by openai
- JSON to CSV Conversion (
json2csv_ndl.py
):
- Parses JSON files to extract data.
- Compiles data into a CSV file.
- Usage:
json2csv_ndl.py
in vscode, replace filepath placeholders with actual filepaths within the scripts. Save the file and run the script using therun
button in vscode orpython json2csv_ndl.py
- Prepare your PDFs: Place your PDFs in an accessible folder.
- Convert PDFs to JPGs: Run
pdf2jpg.py
for each PDF. - Process Images with OCR: Run
vision_ndl.py
on the folder containing JPGs. - Compile Data to CSV: Run
json2csv_ndl.py
to collate all OCR data into a CSV file.
- API Limitations: If you hit API rate limits, try processing in smaller batches.
- Data Quality: For best OCR results, ensure images are clear and well-lit. Block capitals preferred for handwriting
- Error Handling: If scripts encounter errors, check the console output for specific error messages.