- Copy `subset_1_filtered_updated_final_output.csv` into the local folder
  - This is the raw output from the complete PDF extraction pipeline
- Process the raw OCR output using `process_raw_output.ipynb`
  - Demonstrates regex application via the `extract_meaningful_text` function
  - Pipeline:
    - Filter pipeline errors from dirty web-scraped PDFs
    - Apply regex to produce continuous training text
  - Modify the `extract_meaningful_text` function as needed for different outputs
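The two pipeline stages (filter errors, then apply regex) could look roughly like the sketch below. It is not the notebook's code: the error markers, the regex patterns, the `is_pipeline_error` and `clean_rows` helpers, and the body of `extract_meaningful_text` are all assumptions made for illustration, since the source does not show the notebook's actual logic.

```python
import re

# Hypothetical error markers; the real pipeline's failure strings are not given here.
ERROR_MARKERS = ("ERROR", "Traceback")


def is_pipeline_error(raw: str) -> bool:
    """Return True for empty rows or rows that start with an assumed error marker."""
    text = raw.strip()
    return not text or text.startswith(ERROR_MARKERS)


def extract_meaningful_text(raw: str) -> str:
    """Illustrative stand-in for the notebook function, using assumed patterns."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Drop lines that contain only a page number.
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse every run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_rows(rows):
    """Filter out error rows, then turn each remaining row into continuous text."""
    return [extract_meaningful_text(r) for r in rows if not is_pipeline_error(r)]


# Made-up sample rows standing in for the CSV's text column.
rows = [
    "Deep learn-\ning models need\nclean text.\n12\n",
    "ERROR: could not parse PDF",
    "",
]
cleaned = clean_rows(rows)
print(cleaned)
```

In practice the rows would be read from the text column of `subset_1_filtered_updated_final_output.csv`; that column's name is not stated here, so the sketch works on an in-memory list. Changing the patterns inside `extract_meaningful_text` is the natural place to adapt the output, for example to keep paragraph breaks instead of collapsing them.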