Merge pull request #45 from CambioML/rt-migration
RT-Migration
Cambio ML authored Oct 1, 2024
2 parents 909ff3b + ba3289c commit bd4b8c1
Showing 23 changed files with 1,766 additions and 228 deletions.
5 changes: 4 additions & 1 deletion .github/workflows/python-app.yml
@@ -16,6 +16,7 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
+      max-parallel: 1 # Ensures the tests run sequentially

     steps:
     - uses: actions/checkout@v3
@@ -44,5 +45,7 @@ jobs:
       run: |
         isort . --profile=black --check-only --verbose
     - name: Test with unittest
+      env:
+        API_KEY: ${{ secrets.API_KEY }}
       run: |
-        poetry run python -m unittest discover
+        poetry run python -m unittest discover tests
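
The workflow change above passes the `API_KEY` secret to the unittest step. GitHub does not expose repository secrets to workflows triggered by pull requests from forks, so the variable can be empty in CI. A minimal sketch, assuming the tests read `API_KEY` from the environment, of how a test case could skip instead of failing when the key is absent. `get_api_key` and `ApiKeyTestCase` are hypothetical names, not taken from the repository.

```python
import os
import unittest


def get_api_key(env=os.environ):
    """Return the API key from the given mapping, or None if unset or blank."""
    key = env.get("API_KEY", "").strip()
    return key or None


class ApiKeyTestCase(unittest.TestCase):
    """Hypothetical base class: skips each test when no API key is configured."""

    def setUp(self):
        self.api_key = get_api_key()
        if self.api_key is None:
            self.skipTest("API_KEY is not set; skipping tests that call the API")

    def test_key_is_available(self):
        # Runs only when setUp found a non-empty key.
        self.assertTrue(self.api_key)
```

Test modules that need the live API could inherit from such a base class, so that `python -m unittest discover tests` still passes on runs without access to the secret.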
7 changes: 4 additions & 3 deletions .pre-commit-config.yaml
@@ -1,14 +1,14 @@
 repos:
   - repo: https://github.com/psf/black
-    rev: 22.8.0
+    rev: 24.8.0
     hooks:
       - id: black
         args: [--exclude=""]

   # this is not technically always safe but usually is
   # use comments `# isort: off` and `# isort: on` to disable/re-enable isort
   - repo: https://github.com/pycqa/isort
-    rev: 5.12.0
+    rev: 5.13.2
     hooks:
       - id: isort
         args: [--profile=black]
@@ -17,7 +17,7 @@ repos:
   # and this tool removes unused imports, which may be providing
   # necessary side effects for the code to run
   - repo: https://github.com/PyCQA/autoflake
-    rev: v1.6.1
+    rev: v2.3.1
     hooks:
       - id: autoflake
         args:
@@ -35,6 +35,7 @@ repos:
         name: unittests
         entry: ./run_tests.sh
         language: script
+        pass_filenames: false
         # Optional: Specify types of files that trigger this hook
         # types: [python]
         # Optional: Specify files or directories to exclude
86 changes: 54 additions & 32 deletions README.md
@@ -1,54 +1,76 @@
# 🌊 AnyParser
<p align="center">
<a href="https://pypi.org/project/any-parser/"><img src="https://img.shields.io/pypi/v/any-parser.svg" alt="pypi_status" /></a>
<a href="https://github.com/cambioml/any-parser/graphs/commit-activity"><img alt="Commit activity" src="https://img.shields.io/github/commit-activity/m/cambioml/any-parser?style=flat-square"/></a>
<a href="https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ"><img src="https://badgen.net/badge/Join/Community/cyan?icon=slack" alt="Slack" /></a>
</p>

AnyParser provides an API to accurately extract your unstructured data (e.g. PDF, images, charts) into structured format.
**AnyParser** provides an API to accurately extract unstructured data (e.g., PDFs, images, charts) into a structured format.

## :seedling: Set up your AnyParser API key

You can generate your keys at the [Playground Account Page](https://www.cambioml.com/account) with up to 2 keys and 100 total free pages per account.
To get started, generate your API key from the [Playground Account Page](https://www.cambioml.com/account). Each account comes with **100 free pages**.

> ⚠️ **Note:** The free API is limited to 10 pages/call.
If you're interested in more AnyParser usage and applications, please reach out at [email protected] for details.
For more information or to inquire about larger usage plans, feel free to contact us at [email protected].

To set up your API key (`CAMBIO_API_KEY`), follow these steps:
1. Create a `.env` file in the root directory of your project.
2. Add the following line to the `.env` file:
```
CAMBIO_API_KEY=0cam************************
```

To set up your API key `CAMBIO_API_KEY`, you will need to :

1. create a `.env` file in your root folder;
2. add the following one line to your `.env file:
```
CAMBIO_API_KEY=0cam************************
```

## :computer: Installation
```
### 1. Set Up a New Conda Environment and Install AnyParser
First, create and activate a new Conda environment, then install AnyParser:
```bash
conda create -n any-parse python=3.10 -y
conda activate any-parse
pip3 install any-parser
```
### 2. Create an AnyParser Instance Using Your API Key
Use your API key to create an instance of AnyParserRT. Make sure you’ve set up your .env file to store your API key securely:
```python
import os
from dotenv import load_dotenv
from any_parser import AnyParserRT # Import the AnyParserRT class

If you want to run pdf_to_markdown.ipynb, install the following:
- Mac:
```
brew install poppler
```
- Linux:
```
sudo apt update
sudo apt install poppler-utils
```
- Windows:
```
choco install poppler
```
# Load environment variables
load_dotenv(override=True)

## :scroll: Examples
# Get the API key from the environment
example_apikey = os.getenv("CAMBIO_API_KEY")

# Create an AnyParser instance
ap = AnyParserRT(api_key=example_apikey)
```

### 3. Run Synchronous Extraction
To extract data synchronously and receive immediate results:
```python
# Extract content from the file and get the markdown output along with processing time
markdown, total_time = ap.extract(file_path="./data/test.pdf")
```

### 4. Run Asynchronous Extraction
For asynchronous extraction, send the file for processing and fetch results later:
```python
# Send the file to begin asynchronous extraction
file_id = ap.async_extract(file_path="./data/test.pdf")

AnyParser can extract text, numbers and symbols from PDF, images, etc. Check out each notebook below to run AnyParser within 10 lines of code!
# Fetch the extracted content using the file ID
markdown = ap.async_fetch(file_id=file_id)
```
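
The client code added in this commit (`any_parser/any_parser.py`) reports failures by returning a string that starts with `Error:` rather than by raising, and `async_fetch` returns `None` while a job is still in progress. The sketch below shows one way calling code could turn those return values into exceptions. `is_error` and `unwrap` are hypothetical helpers, not part of the package, and the check is a heuristic: a document whose Markdown genuinely begins with `Error:` would be misclassified.

```python
from typing import Optional


def is_error(result: Optional[str]) -> bool:
    """Return True if a client return value looks like an 'Error: ...' string."""
    return isinstance(result, str) and result.startswith("Error:")


def unwrap(result: Optional[str]) -> str:
    """Return the Markdown text, or raise if the result is an error or pending."""
    if result is None:
        # async_fetch returns None while extraction is still in progress.
        raise RuntimeError("extraction still in progress")
    if is_error(result):
        raise RuntimeError(result)
    return result
```

With the `ap` instance from the steps above, usage might look like `markdown = unwrap(ap.async_fetch(file_id=file_id))`. For `extract`, only the first element of the returned tuple carries the text or the error message.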

## :scroll: Examples
Check out these examples to see how you can utilize **AnyParser** to extract text, numbers, and symbols in fewer than 10 lines of code!

### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/pdf_to_markdown.ipynb)
Are you an AI engineer who need to ACCURATELY extract both the text and its layout (e.g. table of content or markdown headers hierarchy) from a PDF. Check out this notebook demo (3-min read)!
### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb)
Are you an AI engineer looking to **accurately** extract both the text and layout (e.g., table of contents or Markdown headers hierarchy) from a PDF? Check out this [3-minute notebook demo](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb).

### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/extract_table_from_image_to_markdown.ipynb)
Are you a financial analyst who need to extract ACCURATE number from a table in an image or a PDF. Check out this notebook (3-min read)!
### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb)
Are you a financial analyst needing to **accurately** extract numbers from a table within an image? Explore this [3-minute notebook example](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb).

6 changes: 4 additions & 2 deletions any_parser/__init__.py
@@ -1,5 +1,7 @@
-from any_parser.base import AnyParser
+"""AnyParser module for parsing data."""
+
+from any_parser.any_parser import AnyParser

 __all__ = ["AnyParser"]

-__version__ = "0.0.13"
+__version__ = "0.0.14"
222 changes: 222 additions & 0 deletions any_parser/any_parser.py
@@ -0,0 +1,222 @@
"""AnyParser RT: Real-time parser for any data format."""

import base64
import json
import time
from pathlib import Path
from typing import Dict, Optional, Tuple

import requests

PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
TIMEOUT = 60
SUPPORTED_FILE_EXTENSIONS = [
    "pdf",
    "doc",
    "docx",
    "ppt",
    "pptx",
    "jpg",
    "jpeg",
    "png",
    "gif",
]


class AnyParser:
    """AnyParser RT: Real-time parser for any data format."""

    def __init__(self, api_key: str, base_url: str = PUBLIC_SHARED_BASE_URL) -> None:
        """Initialize the AnyParser RT object.
        Args:
            api_key (str): The API key for the AnyParser
            base_url (str): The base URL of the AnyParser RT API.
        Returns:
            None
        """
        self._sync_url = f"{base_url}/extract"
        self._async_upload_url = f"{base_url}/async/upload"
        self._async_fetch_url = f"{base_url}/async/fetch"
        self._api_key = api_key
        self._headers = {
            "Content-Type": "application/json",
            "x-api-key": self._api_key,
        }

    def extract(
        self, file_path: str, extract_args: Optional[Dict] = None
    ) -> Tuple[str, Optional[str]]:
        """Extract data in real-time.
        Args:
            file_path (str): The path to the file to be parsed.
            extract_args (Optional[Dict]): Additional extraction arguments added to prompt
        Returns:
            tuple(str, str): The extracted data and the time taken.
                On failure, an error message and None.
        """
        file_extension = Path(file_path).suffix.lower().lstrip(".")

        # Check if the file exists
        if not Path(file_path).is_file():
            return f"Error: File does not exist: {file_path}", None

        # Check for valid file extension
        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
            return (
                f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}.",
                None,
            )

        # Encode the file content in base64
        with open(file_path, "rb") as file:
            encoded_file = base64.b64encode(file.read()).decode("utf-8")

        # Create the JSON payload
        payload = {
            "file_content": encoded_file,
            "file_type": file_extension,
        }

        if extract_args is not None and isinstance(extract_args, dict):
            payload["extract_args"] = extract_args

        # Send the POST request
        start_time = time.time()
        response = requests.post(
            self._sync_url,
            headers=self._headers,
            data=json.dumps(payload),
            timeout=TIMEOUT,
        )
        end_time = time.time()

        # Check if the request was successful
        if response.status_code == 200:
            try:
                response_data = response.json()
                response_list = []
                for text in response_data["markdown"]:
                    response_list.append(text)
                markdown_text = "\n".join(response_list)
                return (
                    markdown_text,
                    f"Time Elapsed: {end_time - start_time:.2f} seconds",
                )
            except json.JSONDecodeError:
                return f"Error: Invalid JSON response: {response.text}", None
        else:
            return f"Error: {response.status_code} {response.text}", None

    def async_extract(self, file_path: str, extract_args: Optional[Dict] = None) -> str:
        """Extract data asynchronously.
        Args:
            file_path (str): The path to the file to be parsed.
            extract_args (Optional[Dict]): Additional extraction arguments added to prompt
        Returns:
            str: The file id of the uploaded file, or an error message on failure.
        """
        file_extension = Path(file_path).suffix.lower().lstrip(".")

        # Check if the file exists
        if not Path(file_path).is_file():
            return f"Error: File does not exist: {file_path}"

        # Check for valid file extension
        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
            return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}."

        file_name = Path(file_path).name
        # Create the JSON payload
        payload = {
            "file_name": file_name,
        }

        if extract_args is not None and isinstance(extract_args, dict):
            payload["extract_args"] = extract_args

        # Send the POST request
        response = requests.post(
            self._async_upload_url,
            headers=self._headers,
            data=json.dumps(payload),
            timeout=TIMEOUT,
        )

        # Check if the request was successful
        if response.status_code == 200:
            try:
                file_id = response.json().get("fileId")
                presigned_url = response.json().get("presignedUrl")
                # Upload the file itself to the presigned URL
                with open(file_path, "rb") as file_to_upload:
                    files = {"file": (file_path, file_to_upload)}
                    upload_resp = requests.post(
                        presigned_url["url"],
                        data=presigned_url["fields"],
                        files=files,
                        timeout=TIMEOUT,
                    )
                    if upload_resp.status_code != 204:
                        return f"Error: {upload_resp.status_code} {upload_resp.text}"
                return file_id
            except json.JSONDecodeError:
                return "Error: Invalid JSON response"
        else:
            return f"Error: {response.status_code} {response.text}"

    def async_fetch(
        self,
        file_id: str,
        sync: bool = True,
        sync_timeout: int = 60,
        sync_interval: int = 5,
    ) -> Optional[str]:
        """Fetches extraction results asynchronously.
        Args:
            file_id (str): The ID of the file to fetch results for.
            sync (bool, optional): Whether to wait for the results synchronously. Defaults to True.
            sync_timeout (int, optional): Maximum time to wait for results in seconds. Defaults to 60.
            sync_interval (int, optional): Time interval between polling attempts in seconds. Defaults to 5.
        Returns:
            str: The extracted results as a markdown string, or an error message.
            None: If the extraction is still in progress (when sync is False,
                or when sync_timeout elapses before the results are ready).
        """
        response = None
        # Create the JSON payload
        payload = {"file_id": file_id}
        if sync:
            start_time = time.time()
            while time.time() < start_time + sync_timeout:
                response = requests.post(
                    self._async_fetch_url,
                    headers=self._headers,
                    data=json.dumps(payload),
                    timeout=TIMEOUT,
                )
                if response.status_code == 202:
                    print("Waiting for response...")
                    time.sleep(sync_interval)
                    continue
                break
        else:
            response = requests.post(
                self._async_fetch_url,
                headers=self._headers,
                data=json.dumps(payload),
                timeout=TIMEOUT,
            )

        if response is None:
            return "Error: timeout, no response received"
        if response.status_code == 200:
            markdown_list = response.json()["markdown"]
            return "\n".join(markdown_list)
        if response.status_code == 202:
            return None
        return f"Error: {response.status_code} {response.text}"
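
With `sync=True`, `async_fetch` above polls at a fixed `sync_interval` and prints a message on each wait. Callers who want to control the waiting themselves can pass `sync=False`, which returns `None` while the job is pending, and drive the loop on their side. The following is a minimal sketch of that pattern using exponential backoff, which is a swapped-in technique and not what the committed code does. `poll_until_done` and its parameters are hypothetical, and the `sleep` and `clock` arguments exist only to make the helper testable.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch: Callable[[], Optional[str]],
    timeout: float = 60.0,
    initial_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Call fetch() until it returns a non-None value or the timeout is reached.

    The delay between attempts doubles after each pending result, capped at
    max_delay. Returns None if the deadline would pass before the next attempt.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        result = fetch()
        if result is not None:
            return result  # Markdown text or an "Error: ..." string
        if clock() + delay > deadline:
            return None  # still pending when time ran out
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Hypothetical usage with an AnyParser instance `ap` and a `file_id`:
# markdown = poll_until_done(lambda: ap.async_fetch(file_id=file_id, sync=False))
```

Passing the fetch step as a callable keeps the helper independent of the client class, so it can be exercised without network access.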