Merge pull request #45 from CambioML/rt-migration

RT-Migration
CambioML · Oct 1, 2024 · bd4b8c1
2 parents 909ff3b + ba3289c
Showing 23 changed files with 1,766 additions and 228 deletions.
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -16,6 +16,7 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+      max-parallel: 1  # Ensures the tests run sequentially
 
     steps:
     - uses: actions/checkout@v3
@@ -44,5 +45,7 @@ jobs:
       run: |
         isort . --profile=black --check-only --verbose
     - name: Test with unittest
+      env:
+        API_KEY: ${{ secrets.API_KEY }}
       run: |
-        poetry run python -m unittest discover
+        poetry run python -m unittest discover tests
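Because the workflow now passes `API_KEY` from the repository secrets into the test step, tests that call the live API can guard on that variable. The following is a minimal sketch, assuming the tests read the key with `os.getenv("API_KEY")`. The test class, its method, and `run_suite` are hypothetical names for illustration and are not taken from this repository.

```python
import os
import unittest


class TestLiveApi(unittest.TestCase):
    """Hypothetical test case; not part of this repository."""

    def setUp(self):
        # Assumption: the CI step exposes the secret as the API_KEY variable.
        self.api_key = os.getenv("API_KEY")
        if not self.api_key:
            self.skipTest("API_KEY is not set; skipping live API test")

    def test_key_is_available(self):
        # A real test would construct the parser with self.api_key here.
        self.assertTrue(self.api_key)


def run_suite() -> unittest.TestResult:
    """Run the hypothetical suite and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLiveApi)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Skipping rather than failing keeps local runs and runs without access to the secret from reporting spurious errors.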
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -1,14 +1,14 @@
 repos:
   - repo: https://github.com/psf/black
-    rev: 22.8.0
+    rev: 24.8.0
     hooks:
       - id: black
         args: [--exclude=""]
 
   # this is not technically always safe but usually is
   # use comments `# isort: off` and `# isort: on` to disable/re-enable isort
   - repo: https://github.com/pycqa/isort
-    rev: 5.12.0
+    rev: 5.13.2
     hooks:
       - id: isort
         args: [--profile=black]
@@ -17,7 +17,7 @@ repos:
   # and this tool removes unused imports, which may be providing
   # necessary side effects for the code to run
   - repo: https://github.com/PyCQA/autoflake
-    rev: v1.6.1
+    rev: v2.3.1
     hooks:
       - id: autoflake
         args:
@@ -35,6 +35,7 @@ repos:
         name: unittests
         entry: ./run_tests.sh
         language: script
+        pass_filenames: false
         # Optional: Specify types of files that trigger this hook
         # types: [python]
         # Optional: Specify files or directories to exclude

diff --git a/README.md b/README.md
@@ -1,54 +1,76 @@
 # 🌊 AnyParser
+<p align="center">
+  <a href="https://pypi.org/project/any-parser/"><img src="https://img.shields.io/pypi/v/any-parser.svg" alt="pypi_status" /></a>
+  <a href="https://github.com/cambioml/any-parser/graphs/commit-activity"><img alt="Commit activity" src="https://img.shields.io/github/commit-activity/m/cambioml/any-parser?style=flat-square"/></a>
+  <a href="https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ"><img src="https://badgen.net/badge/Join/Community/cyan?icon=slack" alt="Slack" /></a>
+</p>
 
-AnyParser provides an API to accurately extract your unstructured data (e.g. PDF, images, charts) into structured format.
+**AnyParser** provides an API to accurately extract unstructured data (e.g., PDFs, images, charts) into a structured format.
 
 ## :seedling: Set up your AnyParser API key
 
-You can generate your keys at the [Playground Account Page](https://www.cambioml.com/account) with up to 2 keys and 100 total free pages per account.
+To get started, generate your API key from the [Playground Account Page](https://www.cambioml.com/account). Each account comes with **100 free pages**.
 
 > ⚠️ **Note:** The free API is limited to 10 pages/call.
 
-If you're interested in more AnyParser usage and applications, please reach out at [email protected] for details.
+For more information or to inquire about larger usage plans, feel free to contact us at [email protected].
 
+To set up your API key (`CAMBIO_API_KEY`), follow these steps:
+1. Create a `.env` file in the root directory of your project.
+2. Add the following line to the `.env` file:
+```
+CAMBIO_API_KEY=0cam************************
+```
 
-To set up your API key `CAMBIO_API_KEY`, you will need to :
-
-1. create a `.env` file in your root folder;
-2. add the following one line to your `.env file:
-    ```
-    CAMBIO_API_KEY=0cam************************
-    ```
 
 ## :computer: Installation
-
-```
+### 1. Set Up a New Conda Environment and Install AnyParser
+First, create and activate a new Conda environment, then install AnyParser: 
+```bash
 conda create -n any-parse python=3.10 -y
 conda activate any-parse
 pip3 install any-parser
 ```
+### 2. Create an AnyParser Instance Using Your API Key
+Use your API key to create an instance of `AnyParser`. Make sure you’ve set up your `.env` file to store your API key securely:
+```python
+import os
+from dotenv import load_dotenv
+from any_parser import AnyParser  # Import the AnyParser class
 
-If you want to run pdf_to_markdown.ipynb, install the following:
-- Mac:
-    ```
-    brew install poppler
-    ```
-- Linux:
-    ```
-    sudo apt update
-    sudo apt install poppler-utils
-    ```
-- Windows:
-    ```
-    choco install poppler
-    ```
+# Load environment variables
+load_dotenv(override=True)
 
-## :scroll:  Examples
+# Get the API key from the environment
+example_apikey = os.getenv("CAMBIO_API_KEY")
+
+# Create an AnyParser instance
+ap = AnyParser(api_key=example_apikey)
+```
+
+### 3. Run Synchronous Extraction
+To extract data synchronously and receive immediate results:
+```python
+# Extract content from the file and get the markdown output along with processing time
+markdown, total_time = ap.extract(file_path="./data/test.pdf")
+```
+
+### 4. Run Asynchronous Extraction
+For asynchronous extraction, send the file for processing and fetch results later:
+```python
+# Send the file to begin asynchronous extraction
+file_id = ap.async_extract(file_path="./data/test.pdf")
 
-AnyParser can extract text, numbers and symbols from PDF, images, etc. Check out each notebook below to run AnyParser within 10 lines of code!
+# Fetch the extracted content using the file ID
+markdown = ap.async_fetch(file_id=file_id)
+```
+
+## :scroll:  Examples
+Check out these examples to see how you can utilize **AnyParser** to extract text, numbers, and symbols in fewer than 10 lines of code!
 
-### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/pdf_to_markdown.ipynb)
-Are you an AI engineer who need to ACCURATELY extract both the text and its layout (e.g. table of content or markdown headers hierarchy) from a PDF. Check out this notebook demo (3-min read)!
+### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb)
+Are you an AI engineer looking to **accurately** extract both the text and layout (e.g., table of contents or Markdown headers hierarchy) from a PDF? Check out this [3-minute notebook demo](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb).
 
-### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/extract_table_from_image_to_markdown.ipynb)
-Are you a financial analyst who need to extract ACCURATE number from a table in an image or a PDF. Check out this notebook (3-min read)!
+### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb)
+Are you a financial analyst needing to **accurately** extract numbers from a table within an image? Explore this [3-minute notebook example](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb).
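
In this version, `extract`, `async_extract`, and `async_fetch` report failures by returning a string that starts with `Error:` instead of raising an exception (see `any_parser/any_parser.py` in this diff). A minimal sketch of a caller-side check follows. The helper name `is_error_result` is hypothetical and not part of the library, and the values in the loop are stand-ins rather than real API responses.

```python
from typing import Optional


def is_error_result(result: Optional[str]) -> bool:
    """Return True if a parser return value looks like an error message.

    Assumption: errors are plain strings prefixed with "Error:", as in the
    any_parser.py added by this PR. None (extraction still in progress) is
    not treated as an error.
    """
    return isinstance(result, str) and result.startswith("Error:")


# Example with stand-in values instead of a live API call.
for value in ["# Title\nBody text", "Error: 403 Forbidden", None]:
    print(repr(value), "->", is_error_result(value))
```

A prefix check like this would misclassify a document whose extracted markdown itself begins with `Error:`, so treat it as a heuristic.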
 
diff --git a/any_parser/__init__.py b/any_parser/__init__.py
@@ -1,5 +1,7 @@
-from any_parser.base import AnyParser
+"""AnyParser module for parsing data."""
+
+from any_parser.any_parser import AnyParser
 
 __all__ = ["AnyParser"]
 
-__version__ = "0.0.13"
+__version__ = "0.0.14"
diff --git a/any_parser/any_parser.py b/any_parser/any_parser.py
@@ -0,0 +1,222 @@
+"""AnyParser RT: Real-time parser for any data format."""
+
+import base64
+import json
+import time
+from pathlib import Path
+from typing import Dict, Optional, Tuple
+
+import requests
+
+PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
+TIMEOUT = 60
+SUPPORTED_FILE_EXTENSIONS = [
+    "pdf",
+    "doc",
+    "docx",
+    "ppt",
+    "pptx",
+    "jpg",
+    "jpeg",
+    "png",
+    "gif",
+]
+
+
+class AnyParser:
+    """AnyParser RT: Real-time parser for any data format."""
+
+    def __init__(self, api_key: str, base_url: str = PUBLIC_SHARED_BASE_URL) -> None:
+        """Initialize the AnyParser RT object.
+
+        Args:
+            api_key (str): The API key for the AnyParser.
+            base_url (str): The base URL of the AnyParser RT API.
+
+        Returns:
+            None
+        """
+        self._sync_url = f"{base_url}/extract"
+        self._async_upload_url = f"{base_url}/async/upload"
+        self._async_fetch_url = f"{base_url}/async/fetch"
+        self._api_key = api_key
+        self._headers = {
+            "Content-Type": "application/json",
+            "x-api-key": self._api_key,
+        }
+
+    def extract(
+        self, file_path: str, extract_args: Optional[Dict] = None
+    ) -> Tuple[str, Optional[str]]:
+        """Extract data in real-time.
+
+        Args:
+            file_path (str): The path to the file to be parsed.
+            extract_args (Optional[Dict]): Additional extraction arguments added to the prompt.
+        Returns:
+            tuple(str, Optional[str]): The extracted data and the time taken, or (error message, None).
+        """
+        file_extension = Path(file_path).suffix.lower().lstrip(".")
+
+        # Check if the file exists
+        if not Path(file_path).is_file():
+            return f"Error: File does not exist: {file_path}", None
+
+        # Check for valid file extension
+        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
+            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
+            return (
+                f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}.",
+                None,
+            )
+
+        # Encode the file content in base64
+        with open(file_path, "rb") as file:
+            encoded_file = base64.b64encode(file.read()).decode("utf-8")
+
+        # Create the JSON payload
+        payload = {
+            "file_content": encoded_file,
+            "file_type": file_extension,
+        }
+
+        if extract_args is not None and isinstance(extract_args, dict):
+            payload["extract_args"] = extract_args
+
+        # Send the POST request
+        start_time = time.time()
+        response = requests.post(
+            self._sync_url,
+            headers=self._headers,
+            data=json.dumps(payload),
+            timeout=TIMEOUT,
+        )
+        end_time = time.time()
+
+        # Check if the request was successful
+        if response.status_code == 200:
+            try:
+                response_data = response.json()
+                response_list = []
+                for text in response_data["markdown"]:
+                    response_list.append(text)
+                markdown_text = "\n".join(response_list)
+                return (
+                    markdown_text,
+                    f"Time Elapsed: {end_time - start_time:.2f} seconds",
+                )
+            except json.JSONDecodeError:
+                return f"Error: Invalid JSON response: {response.text}", None
+        else:
+            return f"Error: {response.status_code} {response.text}", None
+
+    def async_extract(self, file_path: str, extract_args: Optional[Dict] = None) -> str:
+        """Extract data asynchronously.
+
+        Args:
+            file_path (str): The path to the file to be parsed.
+            extract_args (Optional[Dict]): Additional extraction arguments added to the prompt.
+        Returns:
+            str: The file id of the uploaded file, or an error message on failure.
+        """
+        file_extension = Path(file_path).suffix.lower().lstrip(".")
+
+        # Check if the file exists
+        if not Path(file_path).is_file():
+            return f"Error: File does not exist: {file_path}"
+
+        # Check for valid file extension
+        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
+            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
+            return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}."
+
+        file_name = Path(file_path).name
+        # Create the JSON payload
+        payload = {
+            "file_name": file_name,
+        }
+
+        if extract_args is not None and isinstance(extract_args, dict):
+            payload["extract_args"] = extract_args
+
+        # Send the POST request
+        response = requests.post(
+            self._async_upload_url,
+            headers=self._headers,
+            data=json.dumps(payload),
+            timeout=TIMEOUT,
+        )
+
+        # Check if the request was successful
+        if response.status_code == 200:
+            try:
+                file_id = response.json().get("fileId")
+                presigned_url = response.json().get("presignedUrl")
+                with open(file_path, "rb") as file_to_upload:
+                    files = {"file": (file_path, file_to_upload)}
+                    upload_resp = requests.post(
+                        presigned_url["url"],
+                        data=presigned_url["fields"],
+                        files=files,
+                        timeout=TIMEOUT,
+                    )
+                    if upload_resp.status_code != 204:
+                        return f"Error: {upload_resp.status_code} {upload_resp.text}"
+                return file_id
+            except json.JSONDecodeError:
+                return "Error: Invalid JSON response"
+        else:
+            return f"Error: {response.status_code} {response.text}"
+
+    def async_fetch(
+        self,
+        file_id: str,
+        sync: bool = True,
+        sync_timeout: int = 60,
+        sync_interval: int = 5,
+    ) -> Optional[str]:
+        """Fetches extraction results asynchronously.
+
+        Args:
+            file_id (str): The ID of the file to fetch results for.
+            sync (bool, optional): Whether to wait for the results synchronously. Defaults to True.
+            sync_timeout (int, optional): Maximum time to wait for results in seconds. Defaults to 60.
+            sync_interval (int, optional): Time interval between polling attempts in seconds. Defaults to 5.
+
+        Returns:
+            str: The extracted results as a markdown string, or an error message.
+            None: If the extraction is still in progress (when sync is False, or when sync_timeout elapses).
+        """
+        response = None
+        # Create the JSON payload
+        payload = {"file_id": file_id}
+        if sync:
+            start_time = time.time()
+            while time.time() < start_time + sync_timeout:
+                response = requests.post(
+                    self._async_fetch_url,
+                    headers=self._headers,
+                    data=json.dumps(payload),
+                    timeout=TIMEOUT,
+                )
+                if response.status_code == 202:
+                    print("Waiting for response...")
+                    time.sleep(sync_interval)
+                    continue
+                break
+        else:
+            response = requests.post(
+                self._async_fetch_url,
+                headers=self._headers,
+                data=json.dumps(payload),
+                timeout=TIMEOUT,
+            )
+
+        if response is None:
+            return "Error: timeout, no response received"
+        if response.status_code == 200:
+            markdown_list = response.json()["markdown"]
+            return "\n".join(markdown_list)
+        if response.status_code == 202:
+            return None
+        return f"Error: {response.status_code} {response.text}"
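
The polling behaviour of `async_fetch` (retry while the server answers 202, stop on any other status or once `sync_timeout` elapses) can be sketched independently of the network. This is an illustration under stated assumptions, not the library's implementation: `poll_until_ready`, the `Response` alias, and the injected `fetch`, `clock`, and `sleep` callables are hypothetical names introduced here.

```python
import time
from typing import Callable, Optional, Tuple

# (status_code, body): a stand-in for a requests.Response object.
Response = Tuple[int, str]


def poll_until_ready(
    fetch: Callable[[], Response],
    timeout: float = 60.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Response]:
    """Call fetch() until it returns a non-202 status or the timeout elapses.

    Returns the last response received, or None if fetch() was never called.
    """
    response = None
    deadline = clock() + timeout
    while clock() < deadline:
        response = fetch()
        if response[0] != 202:
            break  # Finished (or failed): stop polling.
        sleep(interval)  # Still processing: wait before the next attempt.
    return response


# Simulated server: two "still processing" answers, then the result.
answers = iter([(202, ""), (202, ""), (200, "# Parsed markdown")])
result = poll_until_ready(lambda: next(answers), timeout=1.0, sleep=lambda s: None)
print(result)
```

Injecting the clock and sleep functions makes the loop testable without real delays. If the timeout elapses while the server is still answering 202, the last 202 response is returned, which corresponds to `async_fetch` returning `None` in that case.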