Merge pull request #7 from goldmermaid/main
Add prompt to extract sample notebook
CambioML authored Apr 5, 2024
2 parents e0c5318 + b89d57a commit a2c1e52
Showing 4 changed files with 282 additions and 98 deletions.
@@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Extract data from PDF Document into Markdown\n",
+"# Extract a Table from an Image into Markdown Format\n",
"\n",
-"Below it's simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
+"Below is a simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
"\n",
"### 1. Load the libraries\n",
"\n",
@@ -58,7 +58,6 @@
"metadata": {},
"outputs": [],
"source": [
-"from dotenv import load_dotenv\n",
"load_dotenv(override=True)\n",
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")\n"
]
280 changes: 280 additions & 0 deletions examples/prompt_to_extract_table_from_pdf_to_json.ipynb
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prompt to Extract Key-Value Pairs into JSON from a W-2 (PDF)\n",
"\n",
"Below is an example of using OpenParser to extract key-value pairs from a W-2 PDF into JSON format. (Note: the model is still in beta and may not produce the same output on every run. Please bear with it!)\n",
"\n",
"### 1. Load the libraries\n",
"\n",
"If you have not installed `open_parser`, uncomment the lines below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !pip3 install python-dotenv\n",
"# !pip3 install --upgrade open_parser"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n",
"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
"(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
"but was not found to be installed on your system.\n",
"If this would cause problems for you,\n",
"please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
" \n",
" import pandas as pd\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"\n",
"from dotenv import load_dotenv\n",
"from open_parser import OpenParser\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Set up your OpenParser API key\n",
"\n",
"To set up your `CAMBIO_API_KEY`:\n",
"\n",
"1. Create a `.env` file in your root folder.\n",
"2. Add the following line to your `.env` file:\n",
" ```\n",
" CAMBIO_API_KEY=17b************************\n",
" ```\n",
"\n",
"Then run the cell below to load your API key."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Load sample data and run OpenParser\n",
"\n",
"OpenParser supports both images and PDFs. First, let's load a sample PDF to test OpenParser's capabilities.\n",
"\n",
"Now we can run OpenParser on the sample file with a prompt and display the extracted key-value pairs."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Upload response: 204\n",
"Extraction success.\n"
]
}
],
"source": [
"example_local_file = \"./sample_data/test1.pdf\"\n",
"example_prompt = \"Return table in a JSON format with each box's key and value.\"\n",
"\n",
"op = OpenParser(example_apikey)\n",
"qa_result = op.parse(example_local_file, example_prompt)\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'result': [{\"Employee's social security number\": '758-58-5787'},\n",
" {'Employer identification number (EIN)': '78-8778788'},\n",
" {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n",
" {'Control number': '9'},\n",
" {\"Employee's first name and initial\": 'Jesan'},\n",
" {'Last name': 'Rahaman'},\n",
" {\"State, Employer's state ID number\": 'AL,877878878'},\n",
" {'State wages, tips, etc.': '80000.00'},\n",
" {'Federal income tax withheld': '3835.00'}],\n",
" 'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n",
" 'source': '',\n",
" 'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n",
" 'page_num': 0}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qa_result"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Employee's social security number</th>\n",
" <td>758-58-5787</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employer identification number (EIN)</th>\n",
" <td>78-8778788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employer's name, address, and ZIP code</th>\n",
" <td>DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Control number</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employee's first name and initial</th>\n",
" <td>Jesan</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Last name</th>\n",
" <td>Rahaman</td>\n",
" </tr>\n",
" <tr>\n",
" <th>State, Employer's state ID number</th>\n",
" <td>AL,877878878</td>\n",
" </tr>\n",
" <tr>\n",
" <th>State wages, tips, etc.</th>\n",
" <td>80000.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Federal income tax withheld</th>\n",
" <td>3835.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Value\n",
"Employee's social security number 758-58-5787\n",
"Employer identification number (EIN) 78-8778788\n",
"Employer's name, address, and ZIP code DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n",
"Control number 9\n",
"Employee's first name and initial Jesan\n",
"Last name Rahaman\n",
"State, Employer's state ID number AL,877878878\n",
"State wages, tips, etc. 80000.00\n",
"Federal income tax withheld 3835.00"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = qa_result[0]['result']\n",
"keys = [list(item.keys())[0] for item in data]\n",
"values = [list(item.values())[0] for item in data]\n",
"\n",
"# Create a DataFrame\n",
"df = pd.DataFrame(values, index=keys, columns=['Value'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check out more [case studies](https://www.cambioml.com/blog) from CambioML!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"./sample_data/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "open-parser",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
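A note on step 2 of the notebook above: `os.getenv` returns `None` when `CAMBIO_API_KEY` is missing, so a misconfigured `.env` file is not reported at the point where the key is loaded. The following is a minimal sketch of failing early, using only the standard library. `require_env` is a hypothetical helper name introduced for illustration; it is not part of `open_parser` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for illustration only; not part of open_parser
    or python-dotenv.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


if __name__ == "__main__":
    # Simulate a loaded .env file with a placeholder value (not a real key).
    os.environ["CAMBIO_API_KEY"] = "placeholder-key"
    print(require_env("CAMBIO_API_KEY"))
```

In the notebook this would presumably replace the bare `os.getenv("CAMBIO_API_KEY")` call after `load_dotenv(override=True)`, so that a missing key stops the run with a clear message.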
Binary file added examples/sample_data/test1.pdf
Binary file not shown.
95 changes: 0 additions & 95 deletions examples/test_information_extraction.ipynb

This file was deleted.
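The final cell of `prompt_to_extract_table_from_pdf_to_json.ipynb` turns `qa_result[0]['result']`, a list of single-key dicts, into a DataFrame. Since the notebook's stated goal is JSON, the same list can also be merged into one JSON object with the standard library alone. This is a minimal sketch, assuming the result keeps the shape shown in the notebook output. `merge_pairs` is a hypothetical helper, and the sample rows are copied (abridged) from that output rather than fetched from the API.

```python
import json

# Sample copied (abridged) from the notebook's `qa_result[0]['result']` output.
sample_result = [
    {"Employee's social security number": "758-58-5787"},
    {"Employer identification number (EIN)": "78-8778788"},
    {"Control number": "9"},
    {"Federal income tax withheld": "3835.00"},
]


def merge_pairs(pairs):
    """Merge a list of single-key dicts into one dict, rejecting duplicate keys.

    Hypothetical helper for illustration; not part of open_parser.
    """
    merged = {}
    for pair in pairs:
        for key, value in pair.items():
            if key in merged:
                raise ValueError(f"duplicate key: {key!r}")
            merged[key] = value
    return merged


record = merge_pairs(sample_result)

# Serialize the merged record as a single JSON object.
print(json.dumps(record, indent=2))
```

Raising on duplicate keys is a design choice in this sketch: a plain `dict.update` would silently keep only the last value if the model returned the same box label twice.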
