Merge pull request #29 from goldmermaid/main
edited readme to be up to dated
goldmermaid authored May 7, 2024
2 parents 694e333 + f81401d commit c3d1975
Showing 6 changed files with 50 additions and 243 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -23,7 +23,7 @@ conda activate any-parse
pip3 install any-parser
```
## :bashfile usage
## bashfile usage
To use AnyParser via `curl` requests, you can run the following bash command from the root folder of this repository:
```
bash parse.sh <your apiKey> <file path> <prompt for parse (optional, default="")>
@@ -38,8 +38,8 @@ bash parse.sh gl************************************** /path/to/your/file.pdf "
AnyParser can extract text, numbers and symbols from PDF, images, etc. Check out each notebook below to run AnyParser within 10 lines of code!
### [Prompt to Extract Key-values into JSON from W2 (PDF)](https://github.com/CambioML/any-parser/blob/main/examples/prompt_to_extract_table_from_pdf_to_json.ipynb)
Do you want to extract key-values from a W2 PDF into JSON format? Check out this notebook (3-min read)!
### [Extract a Table from PDF into Excel](https://github.com/CambioML/any-parser/blob/main/demo/pdf_to_html_to_excel.ipynb)
Do you want to extract a complicated table from a financial report (PDF) into Excel spread sheet? Check out this notebook (3-min read)!
### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/extract_table_from_image_to_markdown.ipynb)
Are you a financial analyst who need to extract ACCURATE number from a table in an image or a PDF. Check out this notebook (3-min read)!
93 changes: 47 additions & 46 deletions demo/pdf_to_html_to_excel.ipynb
@@ -4,30 +4,21 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install dependency"
"# Extract a Table from PDF into Excel\n",
"\n",
"Below it's an example of using AnyParser to extract a complicated table from a financial report (PDF) into Excel spread sheet. The sample data is from https://gaming.nv.gov/about/gaming-revenue/information/\n",
"\n",
"\n",
"## 1. Load the libraries\n",
"\n",
"If you have install `any_parser`, uncomment the below line."
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Collecting openpyxl\n",
" Downloading openpyxl-3.1.2-py2.py3-none-any.whl.metadata (2.5 kB)\n",
"Collecting et-xmlfile (from openpyxl)\n",
" Downloading et_xmlfile-1.1.0-py3-none-any.whl.metadata (1.8 kB)\n",
"Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m250.0/250.0 kB\u001b[0m \u001b[31m6.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hDownloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)\n",
"Installing collected packages: et-xmlfile, openpyxl\n",
"Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2\n"
]
}
],
"outputs": [],
"source": [
"# !pip3 install python-dotenv\n",
"# !pip3 install --upgrade any-parser\n",
@@ -38,19 +29,19 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import and update path"
"## 2. Import and update path"
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_76979/2399796935.py:2: DeprecationWarning: \n",
"/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_79818/3792744553.py:2: DeprecationWarning: \n",
"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
"(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
"but was not found to be installed on your system.\n",
@@ -74,12 +65,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load api key and Initialize AnyParser"
"## 3. Load Your API Key and Initialize AnyParser"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -93,12 +84,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Helper function: Convert HTML to Excel"
"## 4. Helper function: Convert HTML to Excel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -132,12 +123,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## parse into html: March_2024 page 8"
"## 5. Parse into HTML and Excel\n",
"\n",
"### 5.1 sample: March_2024 page 8"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 6,
"metadata": {},
"outputs": [
{
@@ -194,21 +187,22 @@
}
],
"source": [
"sample_page8 = \"./Nevada_Gaming_Revenue_Report_(March_2024)_page8.pdf\"\n",
"sample_page8_result = op.parse(sample_page8)\n",
"sample_page8 = \"Nevada_Gaming_Revenue_Report_(March_2024)_page8\"\n",
"sample_page8_result = op.parse(\"./{}.pdf\".format(sample_page8))\n",
"\n",
"display(HTML(sample_page8_result[0]))"
]
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Excel file saved to output/./Nevada_Gaming_Revenue_Report_(March_2024)_page8.pdf.xlsx\n"
"Excel file saved to output/Nevada_Gaming_Revenue_Report_(March_2024)_page8.xlsx\n"
]
}
],
@@ -222,12 +216,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## parse into html: March_2024 page 35"
"### 5.2 sample: March_2024 page 35"
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"outputs": [
{
@@ -250,21 +244,22 @@
}
],
"source": [
"sample_page35 = \"./Nevada_Gaming_Revenue_Report_(March_2024)_page35.pdf\"\n",
"sample_page35_result = op.parse(sample_page35)\n",
"sample_page35 = \"Nevada_Gaming_Revenue_Report_(March_2024)_page35\"\n",
"sample_page35_result = op.parse(\"./{}.pdf\".format(sample_page35))\n",
"\n",
"display(HTML(sample_page35_result[0]))"
]
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Excel file saved to output/./Nevada_Gaming_Revenue_Report_(March_2024)_page35.pdf.xlsx\n"
"Excel file saved to output/Nevada_Gaming_Revenue_Report_(March_2024)_page35.xlsx\n"
]
}
],
@@ -278,7 +273,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## parse into html: March_2024 page 43"
"### 5.3 sample: March_2024 page 43"
]
},
{
@@ -316,8 +311,8 @@
}
],
"source": [
"sample_page43 = \"Nevada_Gaming_Revenue_Report_(March_2024)_page43.pdf\"\n",
"sample_page43_result = op.parse(\"./\"+ sample_page43)\n",
"sample_page43 = \"Nevada_Gaming_Revenue_Report_(March_2024)_page43\"\n",
"sample_page43_result = op.parse(\"./{}.pdf\".format(sample_page43))\n",
"display(HTML(sample_page43_result[0]))"
]
},
@@ -330,7 +325,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Excel file saved to output/Nevada_Gaming_Revenue_Report_(March_2024)_page43.pdf.xlsx\n"
"Excel file saved to output/Nevada_Gaming_Revenue_Report_(March_2024)_page43.xlsx\n"
]
}
],
@@ -341,11 +336,17 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"metadata": {},
"outputs": [],
"source": []
"source": [
"## End of the notebook\n",
"\n",
"Check more [case studies](https://www.cambioml.com/blog) of CambioML!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"../examples//sample_data/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
"metadata": {
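Note on the output names: in the notebook hunks above, the saved file changes from `output/./Nevada_Gaming_Revenue_Report_(March_2024)_page8.pdf.xlsx` to `output/Nevada_Gaming_Revenue_Report_(March_2024)_page8.xlsx` because the sample variables now hold a bare base name instead of a path with the `.pdf` suffix. The cell that builds the output path is collapsed in this diff, so the following is only a sketch of one way to get the same naming when starting from a full path. `excel_output_path` and its `out_dir` parameter are hypothetical names used for illustration; they are not part of AnyParser or the notebook.

```python
from pathlib import PurePosixPath


def excel_output_path(pdf_path: str, out_dir: str = "output") -> str:
    """Hypothetical helper: map a source PDF path to an .xlsx path under out_dir."""
    # .stem drops the directory part and the final suffix (".pdf"),
    # which avoids names such as "output/./report.pdf.xlsx".
    stem = PurePosixPath(pdf_path).stem
    return str(PurePosixPath(out_dir) / f"{stem}.xlsx")


sample = "./Nevada_Gaming_Revenue_Report_(March_2024)_page8.pdf"
print(excel_output_path(sample))
```

Deriving the name from the path keeps a single source of truth for each sample file, at the cost of one extra helper; the commit's approach of storing the base name and formatting `"./{}.pdf"` at the call site reaches the same file names.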