Merge pull request #45 from CambioML/rt-migration
RT-Migration
Showing 23 changed files with 1,766 additions and 228 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,54 +1,76 @@ | ||
# 🌊 AnyParser | ||
<p align="center"> | ||
<a href="https://pypi.org/project/any-parser/"><img src="https://img.shields.io/pypi/v/any-parser.svg" alt="pypi_status" /></a> | ||
<a href="https://github.com/cambioml/any-parser/graphs/commit-activity"><img alt="Commit activity" src="https://img.shields.io/github/commit-activity/m/cambioml/any-parser?style=flat-square"/></a> | ||
<a href="https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ"><img src="https://badgen.net/badge/Join/Community/cyan?icon=slack" alt="Slack" /></a> | ||
</p> | ||
|
||
AnyParser provides an API to accurately extract your unstructured data (e.g. PDF, images, charts) into structured format. | ||
**AnyParser** provides an API to accurately extract unstructured data (e.g., PDFs, images, charts) into a structured format. | ||
|
||
## :seedling: Set up your AnyParser API key | ||
|
||
You can generate your keys at the [Playground Account Page](https://www.cambioml.com/account) with up to 2 keys and 100 total free pages per account. | ||
To get started, generate your API key from the [Playground Account Page](https://www.cambioml.com/account). Each account comes with **100 free pages**. | ||
|
||
> ⚠️ **Note:** The free API is limited to 10 pages/call. | ||
If you're interested in more AnyParser usage and applications, please reach out at [email protected] for details. | ||
For more information or to inquire about larger usage plans, feel free to contact us at [email protected]. | ||
|
||
To set up your API key (`CAMBIO_API_KEY`), follow these steps: | ||
1. Create a `.env` file in the root directory of your project. | ||
2. Add the following line to the `.env` file: | ||
``` | ||
CAMBIO_API_KEY=0cam************************ | ||
``` | ||
|
||
To set up your API key `CAMBIO_API_KEY`, you will need to : | ||
|
||
1. create a `.env` file in your root folder; | ||
2. add the following one line to your `.env file: | ||
``` | ||
CAMBIO_API_KEY=0cam************************ | ||
``` | ||
|
||
## :computer: Installation | ||
``` | ||
### 1. Set Up a New Conda Environment and Install AnyParser | ||
First, create and activate a new Conda environment, then install AnyParser: | ||
```bash | ||
conda create -n any-parse python=3.10 -y | ||
conda activate any-parse | ||
pip3 install any-parser | ||
``` | ||
### 2. Create an AnyParser Instance Using Your API Key | ||
Use your API key to create an instance of AnyParserRT. Make sure you’ve set up your .env file to store your API key securely: | ||
```python | ||
import os | ||
from dotenv import load_dotenv | ||
from any_parser import AnyParserRT # Import the AnyParserRT class | ||
|
||
If you want to run pdf_to_markdown.ipynb, install the following: | ||
- Mac: | ||
``` | ||
brew install poppler | ||
``` | ||
- Linux: | ||
``` | ||
sudo apt update | ||
sudo apt install poppler-utils | ||
``` | ||
- Windows: | ||
``` | ||
choco install poppler | ||
``` | ||
# Load environment variables | ||
load_dotenv(override=True) | ||
|
||
## :scroll: Examples | ||
# Get the API key from the environment | ||
example_apikey = os.getenv("CAMBIO_API_KEY") | ||
|
||
# Create an AnyParser instance | ||
ap = AnyParserRT(api_key=example_apikey) | ||
``` | ||
|
||
### 3. Run Synchronous Extraction | ||
To extract data synchronously and receive immediate results: | ||
```python | ||
# Extract content from the file and get the markdown output along with processing time | ||
markdown, total_time = ap.extract(file_path="./data/test.pdf") | ||
``` | ||
|
||
### 4. Run Asynchronous Extraction | ||
For asynchronous extraction, send the file for processing and fetch results later: | ||
```python | ||
# Send the file to begin asynchronous extraction | ||
file_id = ap.async_extract(file_path="./data/test.pdf") | ||
|
||
AnyParser can extract text, numbers and symbols from PDF, images, etc. Check out each notebook below to run AnyParser within 10 lines of code! | ||
# Fetch the extracted content using the file ID | ||
markdown = ap.async_fetch(file_id=file_id) | ||
``` | ||
|
||
## :scroll: Examples | ||
Check out these examples to see how you can utilize **AnyParser** to extract text, numbers, and symbols in fewer than 10 lines of code! | ||
|
||
### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/pdf_to_markdown.ipynb) | ||
Are you an AI engineer who need to ACCURATELY extract both the text and its layout (e.g. table of content or markdown headers hierarchy) from a PDF. Check out this notebook demo (3-min read)! | ||
### [Extract all text and layout from PDF into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb) | ||
Are you an AI engineer looking to **accurately** extract both the text and layout (e.g., table of contents or Markdown headers hierarchy) from a PDF? Check out this [3-minute notebook demo](https://github.com/CambioML/any-parser/blob/rt-migration/examples/pdf_to_markdown.ipynb). | ||
|
||
### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/main/examples/extract_table_from_image_to_markdown.ipynb) | ||
Are you a financial analyst who need to extract ACCURATE number from a table in an image or a PDF. Check out this notebook (3-min read)! | ||
### [Extract a Table from an Image into Markdown Format](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb) | ||
Are you a financial analyst needing to **accurately** extract numbers from a table within an image? Explore this [3-minute notebook example](https://github.com/CambioML/any-parser/blob/rt-migration/examples/image_to_markdown.ipynb). | ||
|
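Note on the README usage above: the `extract` implementation added in this commit does not raise on failure. It returns a string starting with `Error:` together with `None` in place of the elapsed time. A minimal sketch of how a caller might turn that convention into an exception follows; `ExtractionError` and `unwrap_extract_result` are hypothetical names used only for illustration and are not part of `any_parser`.

```python
from typing import Optional, Tuple


class ExtractionError(RuntimeError):
    """Hypothetical exception type; not provided by any_parser."""


def unwrap_extract_result(result: Tuple[str, Optional[str]]) -> Tuple[str, str]:
    """Return (markdown, elapsed) or raise if the tuple signals a failure.

    Assumes the convention seen in this commit: on failure the second
    element is None and the first element carries the error message.
    """
    text, elapsed = result
    if elapsed is None:
        raise ExtractionError(text)
    return text, elapsed


# Usage with a hand-written tuple shaped like a successful extract() result.
markdown, elapsed = unwrap_extract_result(("# Title", "Time Elapsed: 1.23 seconds"))
```

With a real client, the call would presumably look like `unwrap_extract_result(ap.extract(file_path="./data/test.pdf"))`, so that failures surface as exceptions instead of being mistaken for Markdown output.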
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,7 @@ | ||
from any_parser.base import AnyParser | ||
"""AnyParser module for parsing data.""" | ||
|
||
from any_parser.any_parser import AnyParser | ||
|
||
__all__ = ["AnyParser"] | ||
|
||
__version__ = "0.0.13" | ||
__version__ = "0.0.14" |
@@ -0,0 +1,222 @@
"""AnyParser RT: Real-time parser for any data format."""

import base64
import json
import time
from pathlib import Path
from typing import Dict, Optional, Tuple

import requests

PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
TIMEOUT = 60
SUPPORTED_FILE_EXTENSIONS = [
    "pdf",
    "doc",
    "docx",
    "ppt",
    "pptx",
    "jpg",
    "jpeg",
    "png",
    "gif",
]


class AnyParser:
    """AnyParser RT: Real-time parser for any data format."""

    def __init__(self, api_key: str, base_url: str = PUBLIC_SHARED_BASE_URL) -> None:
        """Initialize the AnyParser RT object.

        Args:
            api_key (str): The API key for the AnyParser.
            base_url (str): The base URL of the AnyParser RT API.

        Returns:
            None
        """
        self._sync_url = f"{base_url}/extract"
        self._async_upload_url = f"{base_url}/async/upload"
        self._async_fetch_url = f"{base_url}/async/fetch"
        self._api_key = api_key
        self._headers = {
            "Content-Type": "application/json",
            "x-api-key": self._api_key,
        }

    def extract(
        self, file_path: str, extract_args: Optional[Dict] = None
    ) -> Tuple[str, str]:
        """Extract data in real-time.

        Args:
            file_path (str): The path to the file to be parsed.
            extract_args (Optional[Dict]): Additional extraction arguments added to prompt

        Returns:
            tuple(str, str): The extracted data and the time taken.
        """
        file_extension = Path(file_path).suffix.lower().lstrip(".")

        # Check if the file exists
        if not Path(file_path).is_file():
            return f"Error: File does not exist: {file_path}", None

        # Check for valid file extension
        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
            return (
                f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}.",
                None,
            )

        # Encode the file content in base64
        with open(file_path, "rb") as file:
            encoded_file = base64.b64encode(file.read()).decode("utf-8")

        # Create the JSON payload
        payload = {
            "file_content": encoded_file,
            "file_type": file_extension,
        }

        if extract_args is not None and isinstance(extract_args, dict):
            payload["extract_args"] = extract_args

        # Send the POST request
        start_time = time.time()
        response = requests.post(
            self._sync_url,
            headers=self._headers,
            data=json.dumps(payload),
            timeout=TIMEOUT,
        )
        end_time = time.time()

        # Check if the request was successful
        if response.status_code == 200:
            try:
                response_data = response.json()
                response_list = []
                for text in response_data["markdown"]:
                    response_list.append(text)
                markdown_text = "\n".join(response_list)
                return (
                    markdown_text,
                    f"Time Elapsed: {end_time - start_time:.2f} seconds",
                )
            except json.JSONDecodeError:
                return f"Error: Invalid JSON response: {response.text}", None
        else:
            return f"Error: {response.status_code} {response.text}", None

    def async_extract(self, file_path: str, extract_args: Optional[Dict] = None) -> str:
        """Extract data asynchronously.

        Args:
            file_path (str): The path to the file to be parsed.
            extract_args (Optional[Dict]): Additional extraction arguments added to prompt

        Returns:
            str: The file id of the uploaded file.
        """
        file_extension = Path(file_path).suffix.lower().lstrip(".")

        # Check if the file exists
        if not Path(file_path).is_file():
            return f"Error: File does not exist: {file_path}"

        # Check for valid file extension
        if file_extension not in SUPPORTED_FILE_EXTENSIONS:
            supported_types = ", ".join(SUPPORTED_FILE_EXTENSIONS)
            return f"Error: Unsupported file type: {file_extension}. Supported file types include {supported_types}."

        file_name = Path(file_path).name
        # Create the JSON payload
        payload = {
            "file_name": file_name,
        }

        if extract_args is not None and isinstance(extract_args, dict):
            payload["extract_args"] = extract_args

        # Send the POST request
        response = requests.post(
            self._async_upload_url,
            headers=self._headers,
            data=json.dumps(payload),
            timeout=TIMEOUT,
        )

        # Check if the request was successful
        if response.status_code == 200:
            try:
                file_id = response.json().get("fileId")
                presigned_url = response.json().get("presignedUrl")
                with open(file_path, "rb") as file_to_upload:
                    files = {"file": (file_path, file_to_upload)}
                    upload_resp = requests.post(
                        presigned_url["url"],
                        data=presigned_url["fields"],
                        files=files,
                        timeout=TIMEOUT,
                    )
                if upload_resp.status_code != 204:
                    return f"Error: {upload_resp.status_code} {upload_resp.text}"
                return file_id
            except json.JSONDecodeError:
                return "Error: Invalid JSON response"
        else:
            return f"Error: {response.status_code} {response.text}"

    def async_fetch(
        self,
        file_id: str,
        sync: bool = True,
        sync_timeout: int = 60,
        sync_interval: int = 5,
    ) -> str:
        """Fetches extraction results asynchronously.

        Args:
            file_id (str): The ID of the file to fetch results for.
            sync (bool, optional): Whether to wait for the results synchronously.
            sync_timeout (int, optional): Maximum time to wait for results in seconds. Defaults to 60.
            sync_interval (int, optional): Time interval between polling attempts in seconds. Defaults to 5.

        Returns:
            str: The extracted results as a markdown string.
            None: If the extraction is still in progress (when sync is False).
        """
        response = None
        # Create the JSON payload
        payload = {"file_id": file_id}
        if sync:
            start_time = time.time()
            while time.time() < start_time + sync_timeout:
                response = requests.post(
                    self._async_fetch_url,
                    headers=self._headers,
                    data=json.dumps(payload),
                    timeout=TIMEOUT,
                )
                if response.status_code == 202:
                    print("Waiting for response...")
                    time.sleep(sync_interval)
                    continue
                break
        else:
            response = requests.post(
                self._async_fetch_url,
                headers=self._headers,
                data=json.dumps(payload),
                timeout=TIMEOUT,
            )

        if response is None:
            return "Error: timeout, no response received"
        if response.status_code == 200:
            markdown_list = response.json()["markdown"]
            return "\n".join(markdown_list)
        if response.status_code == 202:
            return None
        return f"Error: {response.status_code} {response.text}"
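
Review note: the polling loop in `async_fetch` treats HTTP 202 as "still processing" and retries every `sync_interval` seconds until `sync_timeout` elapses. The sketch below isolates that pattern so it can be exercised without network access. It is an illustration under stated assumptions, not the code in this commit: `poll_until_ready`, `fetch_once`, and the injectable `sleep` are hypothetical names, and unlike `async_fetch` it raises on unexpected status codes and always issues at least one request.

```python
import time
from typing import Callable, Optional, Tuple


def poll_until_ready(
    fetch_once: Callable[[], Tuple[int, str]],
    timeout: float = 60.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch_once until it stops returning 202 or the deadline passes.

    fetch_once is a hypothetical callable returning (status_code, body).
    Returns the body on 200, None if the job is still pending when the
    deadline is reached, and raises RuntimeError on any other status.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, body = fetch_once()
        if status == 200:
            return body
        if status != 202:
            raise RuntimeError(f"Unexpected status {status}: {body}")
        # Stop if the next attempt would start after the deadline.
        if time.monotonic() + interval > deadline:
            return None
        sleep(interval)


# Usage with a fake backend that is ready on the third attempt.
_responses = iter([(202, ""), (202, ""), (200, "# done")])
result = poll_until_ready(lambda: next(_responses), sleep=lambda _seconds: None)
```

Using `time.monotonic()` for the deadline avoids surprises if the wall clock is adjusted during a long wait, and passing `sleep` as a parameter keeps the loop testable.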