Merge pull request #7 from goldmermaid/main
Add prompt to extract sample notebook
Showing 4 changed files with 282 additions and 98 deletions.
280 changes: 280 additions & 0 deletions
examples/prompt_to_extract_table_from_pdf_to_json.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,280 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Prompt to Extract Key-Value Pairs into JSON from a W2 (PDF)\n", | ||
"\n", | ||
"Below is an example of using OpenParser to extract key-value pairs from a W2 PDF into JSON format. (Note: the model is still in beta and is NOT yet robust enough to produce identical output on every run. Please bear with it!)\n", | ||
"\n", | ||
"### 1. Load the libraries\n", | ||
"\n", | ||
"If you have not installed `open_parser` yet, uncomment and run the lines below." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# !pip3 install python-dotenv\n", | ||
"# !pip3 install --upgrade open_parser" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n", | ||
"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n", | ||
"(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n", | ||
"but was not found to be installed on your system.\n", | ||
"If this would cause problems for you,\n", | ||
"please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n", | ||
" \n", | ||
" import pandas as pd\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import os\n", | ||
"import pandas as pd\n", | ||
"\n", | ||
"from dotenv import load_dotenv\n", | ||
"from open_parser import OpenParser\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 2. Set up your OpenParser API key\n", | ||
"\n", | ||
"To set up your `CAMBIO_API_KEY`:\n", | ||
"\n", | ||
"1. create a `.env` file in your root folder;\n", | ||
"2. add the following line to your `.env` file:\n", | ||
" ```\n", | ||
" CAMBIO_API_KEY=17b************************\n", | ||
" ```\n", | ||
"\n", | ||
"Then run the cell below to load your API key." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"load_dotenv(override=True)\n", | ||
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### 3. Load sample data and Run OpenParser\n", | ||
"\n", | ||
"OpenParser supports both images and PDFs. First, let's pick a sample file to test OpenParser's capabilities.\n", | ||
"\n", | ||
"Now we can run OpenParser on the sample file with a prompt and then display the extracted key-value pairs." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Upload response: 204\n", | ||
"Extraction success.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"example_local_file = \"./sample_data/test1.pdf\"\n", | ||
"example_prompt = \"Return table in a JSON format with each box's key and value.\"\n", | ||
"\n", | ||
"op = OpenParser(example_apikey)\n", | ||
"qa_result = op.parse(example_local_file, example_prompt)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"[{'result': [{\"Employee's social security number\": '758-58-5787'},\n", | ||
" {'Employer identification number (EIN)': '78-8778788'},\n", | ||
" {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n", | ||
" {'Control number': '9'},\n", | ||
" {\"Employee's first name and initial\": 'Jesan'},\n", | ||
" {'Last name': 'Rahaman'},\n", | ||
" {\"State, Employer's state ID number\": 'AL,877878878'},\n", | ||
" {'State wages, tips, etc.': '80000.00'},\n", | ||
" {'Federal income tax withheld': '3835.00'}],\n", | ||
" 'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n", | ||
" 'source': '',\n", | ||
" 'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n", | ||
" 'page_num': 0}]" | ||
] | ||
}, | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"qa_result" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<div>\n", | ||
"<style scoped>\n", | ||
" .dataframe tbody tr th:only-of-type {\n", | ||
" vertical-align: middle;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe tbody tr th {\n", | ||
" vertical-align: top;\n", | ||
" }\n", | ||
"\n", | ||
" .dataframe thead th {\n", | ||
" text-align: right;\n", | ||
" }\n", | ||
"</style>\n", | ||
"<table border=\"1\" class=\"dataframe\">\n", | ||
" <thead>\n", | ||
" <tr style=\"text-align: right;\">\n", | ||
" <th></th>\n", | ||
" <th>Value</th>\n", | ||
" </tr>\n", | ||
" </thead>\n", | ||
" <tbody>\n", | ||
" <tr>\n", | ||
" <th>Employee's social security number</th>\n", | ||
" <td>758-58-5787</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Employer identification number (EIN)</th>\n", | ||
" <td>78-8778788</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Employer's name, address, and ZIP code</th>\n", | ||
" <td>DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Control number</th>\n", | ||
" <td>9</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Employee's first name and initial</th>\n", | ||
" <td>Jesan</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Last name</th>\n", | ||
" <td>Rahaman</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>State, Employer's state ID number</th>\n", | ||
" <td>AL,877878878</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>State wages, tips, etc.</th>\n", | ||
" <td>80000.00</td>\n", | ||
" </tr>\n", | ||
" <tr>\n", | ||
" <th>Federal income tax withheld</th>\n", | ||
" <td>3835.00</td>\n", | ||
" </tr>\n", | ||
" </tbody>\n", | ||
"</table>\n", | ||
"</div>" | ||
], | ||
"text/plain": [ | ||
" Value\n", | ||
"Employee's social security number 758-58-5787\n", | ||
"Employer identification number (EIN) 78-8778788\n", | ||
"Employer's name, address, and ZIP code DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n", | ||
"Control number 9\n", | ||
"Employee's first name and initial Jesan\n", | ||
"Last name Rahaman\n", | ||
"State, Employer's state ID number AL,877878878\n", | ||
"State wages, tips, etc. 80000.00\n", | ||
"Federal income tax withheld 3835.00" | ||
] | ||
}, | ||
"execution_count": 11, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"data = qa_result[0]['result']\n", | ||
"keys = [list(item.keys())[0] for item in data]\n", | ||
"values = [list(item.values())[0] for item in data]\n", | ||
"\n", | ||
"# Create a DataFrame\n", | ||
"df = pd.DataFrame(values, index=keys, columns=['Value'])\n", | ||
"\n", | ||
"df" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## End of the notebook\n", | ||
"\n", | ||
"Check out more [case studies](https://www.cambioml.com/blog) from CambioML!\n", | ||
"\n", | ||
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n", | ||
" <img src=\"./sample_data/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n", | ||
"</a>" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "open-parser", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.13" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
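The `qa_result` printed in the notebook is a list with one entry per page, and each entry's `result` is a list of single-key dicts. The following is a minimal sketch, assuming that shape holds (the notebook itself warns that output can differ between runs), of merging those dicts into one mapping per page and serializing it as JSON text. `flatten_result` and the `fields` key are hypothetical names introduced here for illustration and are not part of OpenParser; the sample values are copied from the output shown in the notebook.

```python
import json


def flatten_result(pages):
    """Merge the single-key dicts in each page's 'result' list into one dict per page.

    Hypothetical helper: assumes each page looks like
    {"result": [{key: value}, ...], "page_num": int, ...}.
    """
    flattened = []
    for page in pages:
        merged = {}
        for item in page.get("result", []):
            # Each item is expected to hold one key-value pair;
            # a repeated key would overwrite the earlier value.
            merged.update(item)
        flattened.append({"page_num": page.get("page_num"), "fields": merged})
    return flattened


# Sample shaped like the qa_result in the notebook (subset of its values).
sample = [
    {
        "result": [
            {"Control number": "9"},
            {"Last name": "Rahaman"},
            {"Federal income tax withheld": "3835.00"},
        ],
        "log": {"source": "", "usage": {"input_tokens": 1750, "output_tokens": 232}},
        "page_num": 0,
    }
]

flat = flatten_result(sample)

# json.dumps produces real JSON text (double quotes), unlike the Python repr above.
json_text = json.dumps(flat, indent=2)
print(json_text)
```

One flat mapping per page is easier to look up by box name than a list of one-entry dicts, though it silently drops duplicates, so the list form may be preferable if a form repeats a label.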