diff --git a/examples/test_document_extraction.ipynb b/examples/extract_table_from_image_to_markdown.ipynb
similarity index 99%
rename from examples/test_document_extraction.ipynb
rename to examples/extract_table_from_image_to_markdown.ipynb
index ba4b513..0e35e3d 100644
--- a/examples/test_document_extraction.ipynb
+++ b/examples/extract_table_from_image_to_markdown.ipynb
@@ -4,9 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "# Extract data from PDF Document into Markdown\n",
+ "# Extract a Table from an Image into Markdown Format\n",
"\n",
- "Below it's simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
+ "Below is a simple example of using OpenParser to accurately extract a table from an image into markdown format.\n",
"\n",
"### 1. Load the libraries\n",
"\n",
@@ -58,7 +58,6 @@
"metadata": {},
"outputs": [],
"source": [
- "from dotenv import load_dotenv\n",
"load_dotenv(override=True)\n",
"example_apikey = os.getenv(\"CAMBIO_API_KEY\")\n"
]
diff --git a/examples/prompt_to_extract_table_from_pdf_to_json.ipynb b/examples/prompt_to_extract_table_from_pdf_to_json.ipynb
new file mode 100644
index 0000000..a0e3ef5
--- /dev/null
+++ b/examples/prompt_to_extract_table_from_pdf_to_json.ipynb
@@ -0,0 +1,280 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Prompt to Extract Key-Value Pairs from a W2 (PDF) into JSON\n",
+ "\n",
+ "Below is an example of using OpenParser to extract key-value pairs from a W2 PDF into JSON format. (Note: the model is still in beta and may not produce the same output on every run. Please bear with it!)\n",
+ "\n",
+ "### 1. Load the libraries\n",
+ "\n",
+ "If you have not installed `open_parser`, uncomment the lines below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# !pip3 install python-dotenv\n",
+ "# !pip3 install --upgrade open_parser"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n",
+ "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n",
+ "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n",
+ "but was not found to be installed on your system.\n",
+ "If this would cause problems for you,\n",
+ "please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n",
+ " \n",
+ " import pandas as pd\n"
+ ]
+ }
+ ],
+ "source": [
+ "import os\n",
+ "import pandas as pd\n",
+ "\n",
+ "from dotenv import load_dotenv\n",
+ "from open_parser import OpenParser\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 2. Set up your OpenParser API key\n",
+ "\n",
+ "To set up your `CAMBIO_API_KEY`, do the following:\n",
+ "\n",
+ "1. create a `.env` file in your root folder;\n",
+ "2. add the following line to your `.env` file:\n",
+ " ```\n",
+ " CAMBIO_API_KEY=17b************************\n",
+ " ```\n",
+ "\n",
+ "Then run the cell below to load your API key."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "load_dotenv(override=True)\n",
+ "example_apikey = os.getenv(\"CAMBIO_API_KEY\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### 3. Load sample data and run OpenParser\n",
+ "\n",
+ "OpenParser supports both images and PDFs. First, let's load a sample file to test OpenParser's capabilities.\n",
+ "\n",
+ "Then we can run OpenParser on the sample file with a prompt and display the extracted key-value pairs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Upload response: 204\n",
+ "Extraction success.\n"
+ ]
+ }
+ ],
+ "source": [
+ "example_local_file = \"./sample_data/test1.pdf\"\n",
+ "example_prompt = \"Return table in a JSON format with each box's key and value.\"\n",
+ "\n",
+ "op = OpenParser(example_apikey)\n",
+ "qa_result = op.parse(example_local_file, example_prompt)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[{'result': [{\"Employee's social security number\": '758-58-5787'},\n",
+ " {'Employer identification number (EIN)': '78-8778788'},\n",
+ " {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n",
+ " {'Control number': '9'},\n",
+ " {\"Employee's first name and initial\": 'Jesan'},\n",
+ " {'Last name': 'Rahaman'},\n",
+ " {\"State, Employer's state ID number\": 'AL,877878878'},\n",
+ " {'State wages, tips, etc.': '80000.00'},\n",
+ " {'Federal income tax withheld': '3835.00'}],\n",
+ " 'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n",
+ " 'source': '',\n",
+ " 'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n",
+ " 'page_num': 0}]"
+ ]
+ },
+ "execution_count": 6,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "qa_result"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
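+ "<div>\n",
+ "<style scoped>\n",
+ "    .dataframe tbody tr th:only-of-type {\n",
+ "        vertical-align: middle;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe tbody tr th {\n",
+ "        vertical-align: top;\n",
+ "    }\n",
+ "\n",
+ "    .dataframe thead th {\n",
+ "        text-align: right;\n",
+ "    }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>Value</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>Employee's social security number</th>\n",
+ "      <td>758-58-5787</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Employer identification number (EIN)</th>\n",
+ "      <td>78-8778788</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Employer's name, address, and ZIP code</th>\n",
+ "      <td>DesignNext\nKatham Dorbosto, Kashiani, Gopalga...</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Control number</th>\n",
+ "      <td>9</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Employee's first name and initial</th>\n",
+ "      <td>Jesan</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Last name</th>\n",
+ "      <td>Rahaman</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>State, Employer's state ID number</th>\n",
+ "      <td>AL,877878878</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>State wages, tips, etc.</th>\n",
+ "      <td>80000.00</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Federal income tax withheld</th>\n",
+ "      <td>3835.00</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"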
+ ],
+ "text/plain": [
+ " Value\n",
+ "Employee's social security number 758-58-5787\n",
+ "Employer identification number (EIN) 78-8778788\n",
+ "Employer's name, address, and ZIP code DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n",
+ "Control number 9\n",
+ "Employee's first name and initial Jesan\n",
+ "Last name Rahaman\n",
+ "State, Employer's state ID number AL,877878878\n",
+ "State wages, tips, etc. 80000.00\n",
+ "Federal income tax withheld 3835.00"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "data = qa_result[0]['result']\n",
+ "keys = [list(item.keys())[0] for item in data]\n",
+ "values = [list(item.values())[0] for item in data]\n",
+ "\n",
+ "# Create a DataFrame\n",
+ "df = pd.DataFrame(values, index=keys, columns=['Value'])\n",
+ "\n",
+ "df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## End of the notebook\n",
+ "\n",
+ "Check out more [case studies](https://www.cambioml.com/blog) from CambioML!\n",
+ "\n",
+ "\n",
+ " \n",
+ ""
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "open-parser",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/examples/sample_data/test1.pdf b/examples/sample_data/test1.pdf
new file mode 100644
index 0000000..d08fa12
Binary files /dev/null and b/examples/sample_data/test1.pdf differ
diff --git a/examples/test_information_extraction.ipynb b/examples/test_information_extraction.ipynb
deleted file mode 100644
index a0a5e83..0000000
--- a/examples/test_information_extraction.ipynb
+++ /dev/null
@@ -1,95 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Information Extraction"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%reload_ext autoreload\n",
- "%autoreload 2\n",
- "\n",
- "import sys\n",
- "\n",
- "sys.path.append(\".\")\n",
- "sys.path.append(\"..\")\n",
- "sys.path.append(\"../..\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
- "outputs": [],
- "source": [
- "import os\n",
- "from dotenv import load_dotenv\n",
- "from open_parser import OpenParser\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "load_dotenv(override=True)\n",
- "\n",
- "example_apikey = os.getenv(\"CAMBIO_API_KEY\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Upload response: 204\n",
- "Extraction success.\n",
- "[{'result': [{'FY23 Q1': {'Office Commercial products and cloud services revenue growth (y/y)': '7% / 13%', 'Office Consumer products and cloud services revenue growth (y/y)': '7% / 11%', 'Office 365 Commercial seat growth (y/y)': '14%', 'Microsoft 365 Consumer subscribers (in millions)': '65.1', 'Dynamics products and cloud services revenue growth (y/y)': '15% / 22%', 'LinkedIn revenue growth (y/y)': '17% / 21%'}}, {'FY23 Q2': {'Office Commercial products and cloud services revenue growth (y/y)': '7% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '2% / 3%', 'Office 365 Commercial seat growth (y/y)': '12%', 'Microsoft 365 Consumer subscribers (in millions)': '67.7', 'Dynamics products and cloud services revenue growth (y/y)': '13% / 20%', 'LinkedIn revenue growth (y/y)': '10% / 14%'}}, {'FY23 Q3': {'Office Commercial products and cloud services revenue growth (y/y)': '13% / 17%', 'Office Consumer products and cloud services revenue growth (y/y)': '1% / 4%', 'Office 365 Commercial seat growth (y/y)': '11%', 'Microsoft 365 Consumer subscribers (in millions)': '70.8', 'Dynamics products and cloud services revenue growth (y/y)': '17% / 21%', 'LinkedIn revenue growth (y/y)': '8% / 11%'}}, {'FY23 Q4': {'Office Commercial products and cloud services revenue growth (y/y)': '12% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '3% / 6%', 'Office 365 Commercial seat growth (y/y)': '11%', 'Microsoft 365 Consumer subscribers (in millions)': '74.9', 'Dynamics products and cloud services revenue growth (y/y)': '19% / 21%', 'LinkedIn revenue growth (y/y)': '6% / 8%'}}, {'FY24 Q1': {'Office Commercial products and cloud services revenue growth (y/y)': '15% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '3% / 4%', 'Office 365 Commercial seat growth (y/y)': '10%', 'Microsoft 365 Consumer subscribers (in millions)': '76.7', 'Dynamics products and cloud services revenue growth (y/y)': '22% / 21%', 'LinkedIn revenue growth (y/y)': '8%'}}], 'log': {'instruction': 'Return table under Investor Metrics in JSON format with year as the key and the column as subkeys.', 'source': '', 'usage': {'input_tokens': 1758, 'output_tokens': 771}}, 'page_num': 0}]\n"
- ]
- }
- ],
- "source": [
- "example_local_file = \"./test2.pdf\"\n",
- "example_prompt = \"Return table under Investor Metrics in JSON format with year as the key and the column as subkeys.\"\n",
- "\n",
- "op = OpenParser(example_apikey)\n",
- "qa_result = op.parse(example_local_file, example_prompt)\n",
- "\n",
- "print(qa_result)\n"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "open-parser",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.14"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}