feat: add batch api #71
@@ -20,6 +21,10 @@
from any_parser.utils import validate_file_inputs

PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
# TODO: Update this to the correct batch endpoint
PUBLIC_BATCH_BASE_URL = (
    "http://AnyPar-ApiCo-cuKOBXasmUF1-1986145995.us-west-2.elb.amazonaws.com"
let me link this to our DNS.
I just tested http://batch-api.cambio-ai.com, and the domain name forward is working.
### 5. Run Batch Extraction (Beta)
For batch extraction, send the file to begin processing and fetch results later:
```python
# Send the file to begin batch extraction
response = ap.batches.create(file_path="./data/test.pdf")
request_id = response.requestId

# Fetch the extracted content using the request ID
markdown = ap.batches.retrieve(request_id)
```

> ⚠️ **Note:** Batch extraction is currently in beta testing. Processing time may take up to 12 hours to complete.
Good job updating the README.
Also, please note in the README that API keys created on the cambioml.com website are not automatically added to the batch API usage group. For now, users should contact us so that batch processing permission can be added to their API key manually.
Got it. Let me update the readme here.
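Because the README snippet submits a file and fetches the result later, and processing may take up to 12 hours, callers will likely want to poll. The following is a minimal sketch, not part of this PR: `wait_for_batch` is a hypothetical helper, and the terminal status strings `"COMPLETED"` and `"FAILED"` are assumptions, since the thread does not list the actual `requestStatus` values.

```python
import time
from typing import Any, Callable, FrozenSet


def wait_for_batch(
    retrieve: Callable[[str], Any],
    request_id: str,
    # Assumed terminal values; check the real API for the actual strings.
    done_statuses: FrozenSet[str] = frozenset({"COMPLETED", "FAILED"}),
    poll_interval_s: float = 60.0,
    max_wait_s: float = 12 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call retrieve(request_id) until requestStatus is terminal.

    Raises TimeoutError once roughly max_wait_s seconds of sleeping
    have elapsed without reaching a terminal status.
    """
    waited = 0.0
    while True:
        status = retrieve(request_id)
        if status.requestStatus in done_statuses:
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"request {request_id} not finished after {waited:.0f}s")
        sleep(poll_interval_s)
        waited += poll_interval_s
```

With the SDK from this PR, the call would look like `wait_for_batch(ap.batches.retrieve, request_id)`. Passing `retrieve` and `sleep` as arguments keeps the helper independent of the SDK and easy to test.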
@@ -133,6 +138,7 @@ def __init__(self, api_key: str, base_url: str = PUBLIC_SHARED_BASE_URL) -> None
        )
        self._sync_extract_pii = ExtractPIISyncParser(api_key, base_url)
        self._sync_extract_tables = ExtractTablesSyncParser(api_key, base_url)
        self.batches = BatchParser(api_key, PUBLIC_BATCH_BASE_URL)
nit: use _batch to indicate this is a private method.
Also, can we add a separate batch_url input and pass PUBLIC_BATCH_BASE_URL in as its argument to improve readability?
The initial design intent is for users to call ap.batches.retrieve(request_id) to invoke the API, similar to what OpenAI does. Let's sync offline later.
> Also, can we add a separate batch_url input and pass PUBLIC_BATCH_BASE_URL in as its argument to improve readability?

Let me update this one.
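One way to reconcile the two requests in this thread (an explicit batch URL argument, and a public `ap.batches` entry point) could look like the sketch below. It is an illustration only: the `batch_url` parameter name is taken from the review comment, the placeholder batch URL is made up, and `BatchParser` here is a stand-in that just stores its arguments rather than the class added in this PR.

```python
PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
# Placeholder value for illustration only; not the real batch endpoint.
PUBLIC_BATCH_BASE_URL = "https://batch.example.invalid"


class BatchParser:
    """Stand-in for the PR's BatchParser; it only records its arguments."""

    def __init__(self, api_key: str, base_url: str) -> None:
        self.api_key = api_key
        self.base_url = base_url


class AnyParser:
    """Trimmed-down AnyParser showing only the constructor wiring."""

    def __init__(
        self,
        api_key: str,
        base_url: str = PUBLIC_SHARED_BASE_URL,
        batch_url: str = PUBLIC_BATCH_BASE_URL,  # hypothetical new argument
    ) -> None:
        self.base_url = base_url
        # Kept public so callers can write ap.batches.retrieve(request_id).
        self.batches = BatchParser(api_key, batch_url)
```

Defaulting `batch_url` keeps existing call sites such as `AnyParser(api_key)` unchanged, while tests or private deployments can pass their own endpoint.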
class UploadResponse(BaseModel):
    fileName: str
    requestId: str
    requestStatus: str


class UsageResponse(BaseModel):
    pageLimit: int
    pageRemaining: int


class FileStatusResponse(BaseModel):
    fileName: str
    fileType: str
    requestId: str
    requestStatus: str
    uploadTime: str
    completionTime: Optional[str] = None
    result: Optional[List[str]] = Field(default_factory=list)
    error: Optional[List[str]] = Field(default_factory=list)
nit: add a docstring and describe what each attribute means to ease the review process.
Docstring added. Thanks!
# remove "Content-Type" from headers
self._headers.pop("Content-Type")
qq: why do we need to remove this?
The batch API in the AnyParser SDK uploads a file as part of the request, and the Content-Type set in the base parser does not work for that scenario.
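Some background on the answer above, assuming the SDK sends the upload with the `requests` library (the `files=` and `timeout=` arguments in the diff suggest this, but the thread does not confirm it): when `files=` is passed, `requests` builds a multipart body and sets its own `Content-Type` with the boundary, and a preset `Content-Type` header would be kept as-is and lack that boundary. The sketch below shows a variant that drops the header from a copy rather than from the shared dictionary, so repeated calls do not raise `KeyError`. The helper name `upload_headers` and the example header values are hypothetical.

```python
from typing import Dict


def upload_headers(base_headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of base_headers without "Content-Type".

    The original dictionary is left untouched, so other requests that
    rely on the preset Content-Type keep working.
    """
    headers = dict(base_headers)
    # pop with a default does not raise if the key is already absent
    headers.pop("Content-Type", None)
    return headers


# Example header values are assumptions, not taken from the SDK.
base = {"Content-Type": "application/json", "x-api-key": "secret"}
for_upload = upload_headers(base)
```

The result would then be passed as `headers=for_upload` alongside `files=files` in the upload call.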
    files=files,
    timeout=TIMEOUT,
)
print(response.json())
nit: let's not use print, but logger.info
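A minimal sketch of the suggested change, using only the standard `logging` module. The function name `log_response` and the message text are illustrative and not taken from the PR.

```python
import logging

# Module-level logger, named after the module as is conventional
logger = logging.getLogger(__name__)


def log_response(payload: dict) -> None:
    """Log the parsed response body at INFO level instead of printing it."""
    # Lazy %-style formatting: the string is only built if INFO is enabled
    logger.info("Batch upload response: %s", payload)
```

In the upload method this would replace `print(response.json())` with `log_response(response.json())`, leaving the application to decide whether INFO messages are shown.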
Make sure you install all required dependencies and update pyproject.toml.
LGTM to address issues in future PR
User description
Description
This PR introduces a new batch processing API feature to AnyParser, allowing users to process files asynchronously with longer processing times. The batch API includes functionality for file upload, status checking, and usage quota monitoring.
Key additions:
- `BatchParser` class for handling batch processing operations
Type of Change
How Has This Been Tested?
- tests/test_batch_api.py
Checklist
Additional Notes
PR Type
Enhancement, Documentation, Tests
Description
- `BatchParser` class to handle batch processing operations, including file upload, status checking, and usage quota monitoring.
- `BatchParser` integrated into the `AnyParser` class with a dedicated `batches` attribute.
- Unit tests for `BatchParser` functionality, covering file upload, status retrieval, and usage quota checking.
- `PUBLIC_BATCH_BASE_URL` for the batch API endpoint.
Changes walkthrough 📝
any_parser.py: Integrate `BatchParser` into `AnyParser` for batch processing.
any_parser/any_parser.py
- `BatchParser` integration for batch processing.
- `PUBLIC_BATCH_BASE_URL` for batch API endpoint.
- `AnyParser` class to include a `batches` attribute for batch operations.
batch_parser.py: Add `BatchParser` class for batch processing operations.
any_parser/batch_parser.py
- `BatchParser` class for batch file processing.
- [...] checking.
test_batch_api.py: Add unit tests for `BatchParser` functionality.
tests/test_batch_api.py
- [...] `BatchParser` functionality.
- [...] `AnyParser`.
README.md: Update documentation to include batch API usage.
README.md
- [...] extraction.