Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add batch api #71

Merged
merged 2 commits into from
Dec 8, 2024
Merged

feat: add batch api #71

merged 2 commits into from
Dec 8, 2024

Conversation

SeisSerenata
Copy link
Collaborator

@SeisSerenata SeisSerenata commented Dec 5, 2024

User description

Description

This PR introduces a new batch processing API feature to AnyParser, allowing users to process files asynchronously with longer processing times. The batch API includes functionality for file upload, status checking, and usage quota monitoring.

Key additions:

  • New BatchParser class for handling batch processing operations
  • Batch API endpoints integration
  • Usage quota monitoring
  • Documentation updates for batch processing
  • Unit tests for the new batch API functionality

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring
  • Performance improvement

How Has This Been Tested?

  • Added unit tests in tests/test_batch_api.py
  • Tests cover:
    • Batch creation
    • Status retrieval
    • Usage quota checking
  • Tested with sample PDF files

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Additional Notes

  • Batch processing is currently in beta
  • Processing time may take up to 12 hours to complete
  • Added appropriate warning notes in the documentation
  • The batch API uses a separate base URL for processing

PR Type

Enhancement, Documentation, Tests


Description

  • Introduced a new BatchParser class to handle batch processing operations, including file upload, status checking, and usage quota monitoring.
  • Integrated BatchParser into the AnyParser class with a dedicated batches attribute.
  • Added unit tests for the BatchParser functionality, covering file upload, status retrieval, and usage quota checking.
  • Updated the documentation to include examples and notes about the batch API usage.
  • Added a new constant PUBLIC_BATCH_BASE_URL for the batch API endpoint.

Changes walkthrough 📝

Relevant files
Enhancement
any_parser.py
Integrate `BatchParser` into `AnyParser` for batch processing.

any_parser/any_parser.py

  • Added BatchParser integration for batch processing.
  • Introduced a new constant PUBLIC_BATCH_BASE_URL for batch API
    endpoint.
  • Updated the AnyParser class to include a batches attribute for batch
    operations.
  • +6/-0     
    batch_parser.py
    Add `BatchParser` class for batch processing operations. 

    any_parser/batch_parser.py

  • Implemented BatchParser class for batch file processing.
  • Added methods for file upload, status retrieval, and usage quota
    checking.
  • Defined response models for upload, status, and usage responses.
  • +113/-0 
    Tests
    test_batch_api.py
    Add unit tests for `BatchParser` functionality.                   

    tests/test_batch_api.py

  • Added unit tests for the BatchParser functionality.
  • Tested file upload, status retrieval, and usage quota checking.
  • Verified integration with AnyParser.
  • +37/-0   
    Documentation
    README.md
    Update documentation to include batch API usage.                 

    README.md

  • Documented the usage of the new batch processing API.
  • Added example code for batch file upload and status retrieval.
  • Included a note about the beta status and processing time of batch
    extraction.
  • +13/-0   

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link

    github-actions bot commented Dec 5, 2024

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Hardcoded URL
    The batch endpoint URL is hardcoded, which could lead to issues in different environments or during deployment.

    Exception Handling
    The exception handling in the batch parser methods could be improved by using more specific exception types rather than the general Exception.

    Security Concern
    The removal of the "Content-Type" header in requests could lead to security vulnerabilities or incorrect handling of the HTTP requests.

    Copy link

    github-actions bot commented Dec 5, 2024

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Security
    Enhance security by using HTTPS instead of HTTP for the batch endpoint URL

    Replace the HTTP protocol with HTTPS for PUBLIC_BATCH_BASE_URL to ensure secure data
    transmission.

    any_parser/any_parser.py [26]

    -"http://AnyPar-ApiCo-cuKOBXasmUF1-1986145995.us-west-2.elb.amazonaws.com"
    +"https://AnyPar-ApiCo-cuKOBXasmUF1-1986145995.us-west-2.elb.amazonaws.com"
    Suggestion importance[1-10]: 9

    Why: Switching from HTTP to HTTPS is crucial for secure data transmission, especially for endpoints that might handle sensitive data.

    9
    Possible issue
    Prevent runtime errors by validating file paths before file operations

    Validate the file_path parameter to ensure it points to a valid file before
    attempting to open it.

    any_parser/batch_parser.py [54]

    +if not os.path.isfile(file_path):
    +    raise FileNotFoundError("The file path specified does not exist.")
     with open(file_path, "rb") as f:
    Suggestion importance[1-10]: 8

    Why: Validating file paths before attempting to open files is a good practice to prevent runtime errors and enhance the robustness of file handling.

    8
    Improve error handling by using specific exceptions for different failure scenarios

    Handle potential exceptions from file operations and network requests more
    gracefully instead of raising a generic exception.

    any_parser/batch_parser.py [65]

    -raise Exception(f"Upload failed: {response.text}")
    +raise UploadFailedException(f"Upload failed: {response.text}")
    Suggestion importance[1-10]: 7

    Why: Using specific exceptions can help in better understanding and handling of errors, making the code more robust and maintainable.

    7
    General
    Replace print statements with logging for better control over output and diagnostics

    Remove the debug print statement or replace it with a logging statement at an
    appropriate log level.

    any_parser/batch_parser.py [62]

    -print(response.json())
    +logger.debug("Response received: %s", response.json())
    Suggestion importance[1-10]: 6

    Why: Replacing print statements with logging is beneficial for production environments as it allows better control over what is logged and at what level, aiding in diagnostics and debugging.

    6

    @@ -20,6 +21,10 @@
    from any_parser.utils import validate_file_inputs

    PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
    # TODO: Update this to the correct batch endpoint
    PUBLIC_BATCH_BASE_URL = (
    "http://AnyPar-ApiCo-cuKOBXasmUF1-1986145995.us-west-2.elb.amazonaws.com"
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    let me link this to our DNS.

    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    image I just created batch-api domain name forward for batch-api.cambio-ai.com. However, one thing I am not 100% sure is that for API gateway, we used to setup custom domain names in it. This is a directly public elb, so do we have to do anything on the AWS side besides the squarespace domains work that I have already done. I can merge this in first and you can address in a future PR.

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    I just tested http://batch-api.cambio-ai.com, and the domain name forward is working.

    Comment on lines +68 to +79
    ### 5. Run Batch Extraction (Beta)
    For batch extraction, send the file to begin processing and fetch results later:
    ```python
    # Send the file to begin batch extraction
    response = ap.batches.create(file_path="./data/test.pdf")
    request_id = response.requestId

    # Fetch the extracted content using the request ID
    markdown = ap.batches.retrieve(request_id)
    ```

    > ⚠️ **Note:** Batch extraction is currently in beta testing. Processing time may take up to 12 hours to complete.
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Good job in updating the README.

    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Also make sure you update the README that the cambioml.com website created API key will not be added into the batch api usage group and user should contact us to make sure your API key will have batch process permission manually added at this moment.

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Got it. Let me update the readme here.

    @@ -20,6 +21,10 @@
    from any_parser.utils import validate_file_inputs

    PUBLIC_SHARED_BASE_URL = "https://public-api.cambio-ai.com"
    # TODO: Update this to the correct batch endpoint
    PUBLIC_BATCH_BASE_URL = (
    "http://AnyPar-ApiCo-cuKOBXasmUF1-1986145995.us-west-2.elb.amazonaws.com"
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    image I just created batch-api domain name forward for batch-api.cambio-ai.com. However, one thing I am not 100% sure is that for API gateway, we used to setup custom domain names in it. This is a directly public elb, so do we have to do anything on the AWS side besides the squarespace domains work that I have already done. I can merge this in first and you can address in a future PR.

    @@ -133,6 +138,7 @@ def __init__(self, api_key: str, base_url: str = PUBLIC_SHARED_BASE_URL) -> None
    )
    self._sync_extract_pii = ExtractPIISyncParser(api_key, base_url)
    self._sync_extract_tables = ExtractTablesSyncParser(api_key, base_url)
    self.batches = BatchParser(api_key, PUBLIC_BATCH_BASE_URL)
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    nit: use _batch to indicate this is a private method.

    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Also, can we pass another batch_url and input and pass in the PUBLIC_BATCH_BASE_URL as argument to improve readability.

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The initial design is, I want the usage to be such that the user can use ap.batches.retrieve(request_id) to invoke the API, similar to what OpenAI does. Let's sync later offline.

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Also, can we pass another batch_url and input and pass in the PUBLIC_BATCH_BASE_URL as argument to improve readability.

    Let me update this one

    Comment on lines +13 to +32
    class UploadResponse(BaseModel):
    fileName: str
    requestId: str
    requestStatus: str


    class UsageResponse(BaseModel):
    pageLimit: int
    pageRemaining: int


    class FileStatusResponse(BaseModel):
    fileName: str
    fileType: str
    requestId: str
    requestStatus: str
    uploadTime: str
    completionTime: Optional[str] = None
    result: Optional[List[str]] = Field(default_factory=list)
    error: Optional[List[str]] = Field(default_factory=list)
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    nit: add a docstring and describe what each attribute means to ease the review process.

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Docstring added. Thanks!

    Comment on lines +42 to +43
    # remove "Content-Type" from headers
    self._headers.pop("Content-Type")
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    qq: why we need to remove this?

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    We need to upload a file in our request using batch api in any parser SDK, and the Content-Type in base-parser does not work for this scenario.

    files=files,
    timeout=TIMEOUT,
    )
    print(response.json())
    Copy link
    Member

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    nit: let's not use print, but logger.info

    @lingjiekong
    Copy link
    Member

    lingjiekong commented Dec 8, 2024

    Make you you install all required dependency and update pyproject.toml.

    Copy link
    Member

    @lingjiekong lingjiekong left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    LGTM to address issues in future PR

    @lingjiekong lingjiekong merged commit 989734a into main Dec 8, 2024
    5 checks passed
    @SeisSerenata SeisSerenata deleted the seis-dev branch December 10, 2024 16:29
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants