CambioML · Apr 5, 2024
@@ -4,9 +4,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Extract data from PDF Document into Markdown\n",
+    "# Extract a Table from an Image into Markdown Format\n",
     "\n",
-    "Below it's simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
+    "Below is a simple example of using OpenParser to accurately extract a table from an image into Markdown format.\n",
     "\n",
     "### 1. Load the libraries\n",
     "\n",
@@ -58,7 +58,6 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from dotenv import load_dotenv\n",
     "load_dotenv(override=True)\n",
     "example_apikey = os.getenv(\"CAMBIO_API_KEY\")\n"
    ]

@@ -0,0 +1,280 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prompt to Extract Key-Value Pairs into JSON from a W-2 (PDF)\n",
+    "\n",
+    "Below is an example of using OpenParser to extract key-value pairs from a W-2 PDF into JSON format. (Note: the model is still in beta and is NOT yet robust enough to generate the same output on every run. Please bear with it!)\n",
+    "\n",
+    "### 1. Load the libraries\n",
+    "\n",
+    "If you have not installed `open_parser`, uncomment the lines below and run the cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# !pip3 install python-dotenv\n",
+    "# !pip3 install --upgrade open_parser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n",
+      "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
+      "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
+      "but was not found to be installed on your system.\n",
+      "If this would cause problems for you,\n",
+      "please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
+      "        \n",
+      "  import pandas as pd\n"
+     ]
+    }
+   ],
+   "source": [
+    "import os\n",
+    "import pandas as pd\n",
+    "\n",
+    "from dotenv import load_dotenv\n",
+    "from open_parser import OpenParser\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2. Set up your OpenParser API key\n",
+    "\n",
+    "To set up your `CAMBIO_API_KEY`, you will:\n",
+    "\n",
+    "1. create a `.env` file in your root folder;\n",
+    "2. add the following line to your `.env` file:\n",
+    "    ```\n",
+    "    CAMBIO_API_KEY=17b************************\n",
+    "    ```\n",
+    "\n",
+    "Then run the cell below to load your API key."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "load_dotenv(override=True)\n",
+    "example_apikey = os.getenv(\"CAMBIO_API_KEY\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3. Load sample data and run OpenParser\n",
+    "\n",
+    "OpenParser supports both images and PDFs. First, let's load a sample file to test OpenParser's capabilities.\n",
+    "\n",
+    "Now we can run OpenParser on the sample file with an extraction prompt and then display the returned key-value pairs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Upload response: 204\n",
+      "Extraction success.\n"
+     ]
+    }
+   ],
+   "source": [
+    "example_local_file = \"./sample_data/test1.pdf\"\n",
+    "example_prompt = \"Return table in a JSON format with each box's key and value.\"\n",
+    "\n",
+    "op = OpenParser(example_apikey)\n",
+    "qa_result = op.parse(example_local_file, example_prompt)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[{'result': [{\"Employee's social security number\": '758-58-5787'},\n",
+       "   {'Employer identification number (EIN)': '78-8778788'},\n",
+       "   {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n",
+       "   {'Control number': '9'},\n",
+       "   {\"Employee's first name and initial\": 'Jesan'},\n",
+       "   {'Last name': 'Rahaman'},\n",
+       "   {\"State, Employer's state ID number\": 'AL,877878878'},\n",
+       "   {'State wages, tips, etc.': '80000.00'},\n",
+       "   {'Federal income tax withheld': '3835.00'}],\n",
+       "  'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n",
+       "   'source': '',\n",
+       "   'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n",
+       "  'page_num': 0}]"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "qa_result"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>Value</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>Employee's social security number</th>\n",
+       "      <td>758-58-5787</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Employer identification number (EIN)</th>\n",
+       "      <td>78-8778788</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Employer's name, address, and ZIP code</th>\n",
+       "      <td>DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Control number</th>\n",
+       "      <td>9</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Employee's first name and initial</th>\n",
+       "      <td>Jesan</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Last name</th>\n",
+       "      <td>Rahaman</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>State, Employer's state ID number</th>\n",
+       "      <td>AL,877878878</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>State wages, tips, etc.</th>\n",
+       "      <td>80000.00</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>Federal income tax withheld</th>\n",
+       "      <td>3835.00</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "                                                                                    Value\n",
+       "Employee's social security number                                             758-58-5787\n",
+       "Employer identification number (EIN)                                           78-8778788\n",
+       "Employer's name, address, and ZIP code  DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n",
+       "Control number                                                                          9\n",
+       "Employee's first name and initial                                                   Jesan\n",
+       "Last name                                                                         Rahaman\n",
+       "State, Employer's state ID number                                            AL,877878878\n",
+       "State wages, tips, etc.                                                          80000.00\n",
+       "Federal income tax withheld                                                       3835.00"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "data = qa_result[0]['result']\n",
+    "keys = [list(item.keys())[0] for item in data]\n",
+    "values = [list(item.values())[0] for item in data]\n",
+    "\n",
+    "# Create a one-column DataFrame indexed by the extracted keys\n",
+    "df = pd.DataFrame(values, index=keys, columns=['Value'])\n",
+    "\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## End of the notebook\n",
+    "\n",
+    "Check out more [case studies](https://www.cambioml.com/blog) from CambioML!\n",
+    "\n",
+    "<a href=\"https://www.cambioml.com/\" title=\"Title\">\n",
+    "    <img src=\"./sample_data/cambioml_logo_large.png\" style=\"height: 100px; display: block; margin-left: auto; margin-right: auto;\"/>\n",
+    "</a>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "open-parser",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
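
A note on the last code cell of the W-2 notebook: it reshapes `qa_result[0]['result']`, a list of single-key dicts, into a DataFrame. As a rough sketch of the same reshaping without pandas, the snippet below merges those dicts into one flat dict. It assumes the response shape shown in the notebook output, which may vary because the model is still in beta. `flatten_result` is a hypothetical helper, not part of `open_parser`, and `sample` is a three-box excerpt copied from that output.

```python
def flatten_result(qa_result, page=0):
    """Merge the single-key dicts under qa_result[page]['result'] into one dict.

    Hypothetical helper: assumes each page is a dict with a 'result' list
    whose items look like {key: value}.
    """
    flat = {}
    for item in qa_result[page]["result"]:
        for key, value in item.items():
            # Fail loudly instead of silently overwriting a repeated key.
            if key in flat:
                raise ValueError(f"duplicate key in result: {key!r}")
            flat[key] = value
    return flat


# Three-box excerpt of the output shown in the notebook ('log' omitted).
sample = [
    {
        "result": [
            {"Employee's social security number": "758-58-5787"},
            {"Control number": "9"},
            {"Federal income tax withheld": "3835.00"},
        ],
        "page_num": 0,
    }
]

flat = flatten_result(sample)
print(flat["Control number"])
```

A flat dict like this can be serialized with `json.dumps` or looked up by key directly. Raising on duplicate keys is a design choice made for this sketch; collecting repeated keys into a list would be an equally reasonable alternative.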