diff --git a/examples/test_document_extraction.ipynb b/examples/extract_table_from_image_to_markdown.ipynb similarity index 99% rename from examples/test_document_extraction.ipynb rename to examples/extract_table_from_image_to_markdown.ipynb index ba4b513..0e35e3d 100644 --- a/examples/test_document_extraction.ipynb +++ b/examples/extract_table_from_image_to_markdown.ipynb @@ -4,9 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Extract data from PDF Document into Markdown\n", + "# Extract a Table from an Image into Markdown Format\n", "\n", - "Below it's simple example of using OpenParser to accurately extract a table from an image into markdown format.\n", + "Below is a simple example of using OpenParser to accurately extract a table from an image into Markdown format.\n", "\n", "### 1. Load the libraries\n", "\n", @@ -58,7 +58,6 @@ "metadata": {}, "outputs": [], "source": [ - "from dotenv import load_dotenv\n", "load_dotenv(override=True)\n", "example_apikey = os.getenv(\"CAMBIO_API_KEY\")\n" ] diff --git a/examples/prompt_to_extract_table_from_pdf_to_json.ipynb b/examples/prompt_to_extract_table_from_pdf_to_json.ipynb new file mode 100644 index 0000000..a0e3ef5 --- /dev/null +++ b/examples/prompt_to_extract_table_from_pdf_to_json.ipynb @@ -0,0 +1,280 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Prompt to Extract Key-values into JSON from W2 (PDF)\n", + "\n", + "Below is an example of using OpenParser to extract key-value pairs from a W2 PDF into JSON format. (Note: the model is still in beta and is not yet robust enough to generate the same output on every run. Please bear with it!)\n", + "\n", + "### 1. Load the libraries\n", + "\n", + "If you have not installed `open_parser`, uncomment the lines below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# !pip3 install python-dotenv\n", + "# !pip3 install --upgrade open_parser" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/mb/7wp0k3g17jd11kk9xlv5mh3m0000gn/T/ipykernel_67864/3281231558.py:2: DeprecationWarning: \n", + "Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),\n", + "(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)\n", + "but was not found to be installed on your system.\n", + "If this would cause problems for you,\n", + "please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466\n", + " \n", + " import pandas as pd\n" + ] + } + ], + "source": [ + "import os\n", + "import pandas as pd\n", + "\n", + "from dotenv import load_dotenv\n", + "from open_parser import OpenParser\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Set up your OpenParser API key\n", + "\n", + "To set up your `CAMBIO_API_KEY`:\n", + "\n", + "1. Create a `.env` file in your root folder.\n", + "2. Add the following line to your `.env` file:\n", + " ```\n", + " CAMBIO_API_KEY=17b************************\n", + " ```\n", + "\n", + "Then run the cell below to load your API key." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "load_dotenv(override=True)\n", + "example_apikey = os.getenv(\"CAMBIO_API_KEY\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Load sample data and Run OpenParser\n", + "\n", + "OpenParser supports both images and PDFs. 
First, let's load a sample file to test OpenParser's capabilities.\n", + "\n", + "Now we can run OpenParser on our sample data and then display the extracted key-value pairs." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Upload response: 204\n", + "Extraction success.\n" + ] + } + ], + "source": [ + "example_local_file = \"./sample_data/test1.pdf\"\n", + "example_prompt = \"Return table in a JSON format with each box's key and value.\"\n", + "\n", + "op = OpenParser(example_apikey)\n", + "qa_result = op.parse(example_local_file, example_prompt)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'result': [{\"Employee's social security number\": '758-58-5787'},\n", + " {'Employer identification number (EIN)': '78-8778788'},\n", + " {\"Employer's name, address, and ZIP code\": 'DesignNext\\nKatham Dorbosto, Kashiani, Gopalganj\\nGopalganj, AK 8133'},\n", + " {'Control number': '9'},\n", + " {\"Employee's first name and initial\": 'Jesan'},\n", + " {'Last name': 'Rahaman'},\n", + " {\"State, Employer's state ID number\": 'AL,877878878'},\n", + " {'State wages, tips, etc.': '80000.00'},\n", + " {'Federal income tax withheld': '3835.00'}],\n", + " 'log': {'instruction': \"Return table in a JSON format with each box's key and value.\",\n", + " 'source': '',\n", + " 'usage': {'input_tokens': 1750, 'output_tokens': 232}},\n", + " 'page_num': 0}]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "qa_result" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<div>\n", + "<table>\n", + "  <thead>\n", + "    <tr><th></th><th>Value</th></tr>\n", + "  </thead>\n", + "  <tbody>\n", +
"    <tr><th>Employee's social security number</th><td>758-58-5787</td></tr>\n", +
"    <tr><th>Employer identification number (EIN)</th><td>78-8778788</td></tr>\n", +
"    <tr><th>Employer's name, address, and ZIP code</th><td>DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...</td></tr>\n", +
"    <tr><th>Control number</th><td>9</td></tr>\n", +
"    <tr><th>Employee's first name and initial</th><td>Jesan</td></tr>\n", +
"    <tr><th>Last name</th><td>Rahaman</td></tr>\n", +
"    <tr><th>State, Employer's state ID number</th><td>AL,877878878</td></tr>\n", +
"    <tr><th>State wages, tips, etc.</th><td>80000.00</td></tr>\n", +
"    <tr><th>Federal income tax withheld</th><td>3835.00</td></tr>\n", +
"  </tbody>\n", + "</table>\n", + "</div>
" + ], + "text/plain": [ + " Value\n", + "Employee's social security number 758-58-5787\n", + "Employer identification number (EIN) 78-8778788\n", + "Employer's name, address, and ZIP code DesignNext\\nKatham Dorbosto, Kashiani, Gopalga...\n", + "Control number 9\n", + "Employee's first name and initial Jesan\n", + "Last name Rahaman\n", + "State, Employer's state ID number AL,877878878\n", + "State wages, tips, etc. 80000.00\n", + "Federal income tax withheld 3835.00" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = qa_result[0]['result']\n", + "keys = [list(item.keys())[0] for item in data]\n", + "values = [list(item.values())[0] for item in data]\n", + "\n", + "# Create a DataFrame\n", + "df = pd.DataFrame(values, index=keys, columns=['Value'])\n", + "\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## End of the notebook\n", + "\n", + "Check more [case studies](https://www.cambioml.com/blog) of CambioML!\n", + "\n", + "\n", + " \n", + "" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "open-parser", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/sample_data/test1.pdf b/examples/sample_data/test1.pdf new file mode 100644 index 0000000..d08fa12 Binary files /dev/null and b/examples/sample_data/test1.pdf differ diff --git a/examples/test_information_extraction.ipynb b/examples/test_information_extraction.ipynb deleted file mode 100644 index a0a5e83..0000000 --- a/examples/test_information_extraction.ipynb +++ /dev/null @@ -1,95 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# 
Information Extraction" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "%reload_ext autoreload\n", - "%autoreload 2\n", - "\n", - "import sys\n", - "\n", - "sys.path.append(\".\")\n", - "sys.path.append(\"..\")\n", - "sys.path.append(\"../..\")" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from dotenv import load_dotenv\n", - "from open_parser import OpenParser\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "load_dotenv(override=True)\n", - "\n", - "example_apikey = os.getenv(\"CAMBIO_API_KEY\")" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Upload response: 204\n", - "Extraction success.\n", - "[{'result': [{'FY23 Q1': {'Office Commercial products and cloud services revenue growth (y/y)': '7% / 13%', 'Office Consumer products and cloud services revenue growth (y/y)': '7% / 11%', 'Office 365 Commercial seat growth (y/y)': '14%', 'Microsoft 365 Consumer subscribers (in millions)': '65.1', 'Dynamics products and cloud services revenue growth (y/y)': '15% / 22%', 'LinkedIn revenue growth (y/y)': '17% / 21%'}}, {'FY23 Q2': {'Office Commercial products and cloud services revenue growth (y/y)': '7% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '2% / 3%', 'Office 365 Commercial seat growth (y/y)': '12%', 'Microsoft 365 Consumer subscribers (in millions)': '67.7', 'Dynamics products and cloud services revenue growth (y/y)': '13% / 20%', 'LinkedIn revenue growth (y/y)': '10% / 14%'}}, {'FY23 Q3': {'Office Commercial products and cloud services revenue growth (y/y)': '13% / 17%', 'Office Consumer products and cloud services revenue growth (y/y)': '1% / 4%', 'Office 365 Commercial seat growth (y/y)': '11%', 'Microsoft 365 
Consumer subscribers (in millions)': '70.8', 'Dynamics products and cloud services revenue growth (y/y)': '17% / 21%', 'LinkedIn revenue growth (y/y)': '8% / 11%'}}, {'FY23 Q4': {'Office Commercial products and cloud services revenue growth (y/y)': '12% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '3% / 6%', 'Office 365 Commercial seat growth (y/y)': '11%', 'Microsoft 365 Consumer subscribers (in millions)': '74.9', 'Dynamics products and cloud services revenue growth (y/y)': '19% / 21%', 'LinkedIn revenue growth (y/y)': '6% / 8%'}}, {'FY24 Q1': {'Office Commercial products and cloud services revenue growth (y/y)': '15% / 14%', 'Office Consumer products and cloud services revenue growth (y/y)': '3% / 4%', 'Office 365 Commercial seat growth (y/y)': '10%', 'Microsoft 365 Consumer subscribers (in millions)': '76.7', 'Dynamics products and cloud services revenue growth (y/y)': '22% / 21%', 'LinkedIn revenue growth (y/y)': '8%'}}], 'log': {'instruction': 'Return table under Investor Metrics in JSON format with year as the key and the column as subkeys.', 'source': '', 'usage': {'input_tokens': 1758, 'output_tokens': 771}}, 'page_num': 0}]\n" - ] - } - ], - "source": [ - "example_local_file = \"./test2.pdf\"\n", - "example_prompt = \"Return table under Investor Metrics in JSON format with year as the key and the column as subkeys.\"\n", - "\n", - "op = OpenParser(example_apikey)\n", - "qa_result = op.parse(example_local_file, example_prompt)\n", - "\n", - "print(qa_result)\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "open-parser", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.14" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}
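
Review note on the new JSON notebook: the DataFrame cell reads only the first key of each item in `qa_result[0]['result']`, so an item carrying more than one key would be silently truncated. Below is a minimal sketch of a more tolerant flattening step, assuming the response keeps the shape shown in the notebook output (a list of pages, each with a `result` list of small dicts and a `page_num`). `flatten_key_values` is a hypothetical helper and is not part of `open_parser`; the sample values are copied from the displayed output.

```python
import json

# Shape and values copied from the notebook's displayed `qa_result` (shortened).
sample_result = [
    {
        "result": [
            {"Employee's social security number": "758-58-5787"},
            {"Employer identification number (EIN)": "78-8778788"},
            {"Control number": "9"},
            {"Last name": "Rahaman"},
        ],
        "page_num": 0,
    }
]


def flatten_key_values(pages):
    """Merge each page's list of small dicts into one dict per page.

    Hypothetical helper for illustration; not an open_parser API.
    """
    flat = {}
    for page in pages:
        merged = {}
        for item in page.get("result", []):
            # update() keeps every key of the item, not just the first one.
            # A repeated key overwrites the earlier value.
            merged.update(item)
        flat[page.get("page_num", 0)] = merged
    return flat


flat = flatten_key_values(sample_result)
print(json.dumps(flat[0], indent=2))
```

If the DataFrame view is still wanted, the merged dict for a page can be passed to `pd.Series(flat[0], name="Value").to_frame()`, which should give the same one-column layout as the notebook cell.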