Add prompt to extract sample notebook #7

Merged
merged 1 commit into from
Apr 5, 2024
@@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extract data from PDF Document into Markdown\n",
"# Extract a Table from an Image into Markdown Format\n",
"\n",
"Below it's simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
"Below it's a simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
"\n",
"### 1. Load the libraries\n",
"\n",
@@ -58,7 +58,6 @@
"metadata": {},
"outputs": [],
"source": [
"from dotenv import load_dotenv\n",
"load_dotenv(override=True)\n",
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")\n"
]
280 changes: 280 additions & 0 deletions examples/prompt_to_extract_table_from_pdf_to_json.ipynb
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Prompt to Extract Key-values into JSON from W2 (PDF)\n",
"\n",
"Below is an example of using OpenParser to extract key-value pairs from a W2 PDF into JSON format. (Note: the model is still in beta and is NOT yet robust enough to produce identical output on every run. Please bear with it!)\n",
"\n",
"### 1. Load the libraries\n",
"\n",
"If you have not installed `open_parser`, uncomment the lines below."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !pip3 install python-dotenv\n",
"# !pip3 install --upgrade open_parser"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n",
"Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
"(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
"but was not found to be installed on your system.\n",
"If this would cause problems for you,\n",
"please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
" \n",
" import pandas as pd\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"\n",
"from dotenv import load_dotenv\n",
"from open_parser import OpenParser\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Set up your OpenParser API key\n",
"\n",
"To set up your `CAMBIO_API_KEY`, follow these steps:\n",
"\n",
"1. create a `.env` file in your root folder;\n",
"2. add the following line to your `.env` file:\n",
" ```\n",
" CAMBIO_API_KEY=17b************************\n",
" ```\n",
"\n",
"Then run the cell below to load your API key."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"load_dotenv(override=True)\n",
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Load sample data and Run OpenParser\n",
"\n",
"OpenParser supports both images and PDFs. First, let's load a sample file to test OpenParser's capabilities.\n",
"\n",
"Now we can run OpenParser on our sample data with a prompt and then display the extracted key-values."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Upload response: 204\n",
"Extraction success.\n"
]
}
],
"source": [
"example_local_file = \"./sample_data/test1.pdf\"\n",
"example_prompt = \"Return table in a JSON format with each box's key and value.\"\n",
"\n",
"op = OpenParser(example_apikey)\n",
"qa_result = op.parse(example_local_file, example_prompt)\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'result': [{\"Employee's social security number\": '758-58-5787'},\n",
" {'Employer identification number (EIN)': '78-8778788'},\n",
" {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n",
" {'Control number': '9'},\n",
" {\"Employee's first name and initial\": 'Jesan'},\n",
" {'Last name': 'Rahaman'},\n",
" {\"State, Employer's state ID number\": 'AL,877878878'},\n",
" {'State wages, tips, etc.': '80000.00'},\n",
" {'Federal income tax withheld': '3835.00'}],\n",
" 'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n",
" 'source': '',\n",
" 'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n",
" 'page_num': 0}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"qa_result"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Employee's social security number</th>\n",
" <td>758-58-5787</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employer identification number (EIN)</th>\n",
" <td>78-8778788</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employer's name, address, and ZIP code</th>\n",
" <td>DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Control number</th>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Employee's first name and initial</th>\n",
" <td>Jesan</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Last name</th>\n",
" <td>Rahaman</td>\n",
" </tr>\n",
" <tr>\n",
" <th>State, Employer's state ID number</th>\n",
" <td>AL,877878878</td>\n",
" </tr>\n",
" <tr>\n",
" <th>State wages, tips, etc.</th>\n",
" <td>80000.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Federal income tax withheld</th>\n",
" <td>3835.00</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Value\n",
"Employee's social security number 758-58-5787\n",
"Employer identification number (EIN) 78-8778788\n",
"Employer's name, address, and ZIP code DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n",
"Control number 9\n",
"Employee's first name and initial Jesan\n",
"Last name Rahaman\n",
"State, Employer's state ID number AL,877878878\n",
"State wages, tips, etc. 80000.00\n",
"Federal income tax withheld 3835.00"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Each item in 'result' is a single-key dict; collect the keys and values separately\n",
"data = qa_result[0]['result']\n",
"keys = [list(item.keys())[0] for item in data]\n",
"values = [list(item.values())[0] for item in data]\n",
"\n",
"# Create a DataFrame\n",
"df = pd.DataFrame(values, index=keys, columns=['Value'])\n",
"\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of the notebook\n",
"\n",
"Check more [case studies](https://www.cambioml.com/blog) of CambioML!\n",
"\n",
"<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
" <img src=\"./sample_data/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
"</a>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "open-parser",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
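Step 2 of the notebook reads `CAMBIO_API_KEY` with `os.getenv`, which silently returns `None` when the variable is missing. The sketch below shows one way to fail early with a readable message. The helper name `require_api_key` is hypothetical and is not part of `open_parser` or `python-dotenv`; the snippet uses only the standard library and sets a placeholder value so it runs on its own. In the notebook you would still call `load_dotenv(override=True)` first.

```python
import os


def require_api_key(name="CAMBIO_API_KEY"):
    """Return the environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for illustration; not part of open_parser.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file "
            "or export it before running the notebook."
        )
    return value


# Demo only: set a placeholder so this snippet runs without a real key.
os.environ.setdefault("CAMBIO_API_KEY", "placeholder-key")

example_apikey = require_api_key()
print("API key loaded:", bool(example_apikey))
```

Raising at load time keeps a missing key from surfacing later as a less obvious failure when the parser is called.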
Binary file added examples/sample_data/test1.pdf
Binary file not shown.
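The notebook output shows that `op.parse` returned a list with one entry per page, where `result` is a list of single-key dicts. Because the notebook warns that the beta model may not return the same shape every time, a slightly defensive way to collapse that structure into plain JSON might look like the sketch below. The sample data is an abbreviated copy of the output shown in the notebook; `flatten_key_values` is a hypothetical helper, not an OpenParser API, and it assumes the `result` / `page_num` layout seen in that single run.

```python
import json

# Abbreviated copy of the notebook's `qa_result` output (three of the nine boxes).
qa_result = [
    {
        "result": [
            {"Employee's social security number": "758-58-5787"},
            {"Employer identification number (EIN)": "78-8778788"},
            {"Control number": "9"},
        ],
        "log": {
            "instruction": "Return table in a JSON format with each box's key and value.",
            "source": "",
            "usage": {"input_tokens": 1750, "output_tokens": 232},
        },
        "page_num": 0,
    }
]


def flatten_key_values(pages):
    """Merge the single-key dicts under 'result' into one dict per page.

    Hypothetical helper; items that are not dicts are skipped rather than
    raising, since the output shape is not guaranteed.
    """
    flattened = []
    for page in pages:
        merged = {}
        for item in page.get("result", []):
            if isinstance(item, dict):
                merged.update(item)
        flattened.append({"page_num": page.get("page_num"), "fields": merged})
    return flattened


flat = flatten_key_values(qa_result)
print(json.dumps(flat, indent=2))
```

Unlike indexing `list(item.keys())[0]`, merging with `dict.update` also tolerates an item that carries more than one key, at the cost of later duplicates overwriting earlier ones.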
95 changes: 0 additions & 95 deletions examples/test_information_extraction.ipynb

This file was deleted.
