Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support file_content as input #64

Merged
merged 7 commits into from
Nov 19, 2024
Merged

feat: support file_content as input #64

merged 7 commits into from
Nov 19, 2024

Conversation

SeisSerenata
Copy link
Collaborator

Here's a summary of the changes in this PR:

Main Changes

  1. File Content Support

    • Added ability to accept direct file content as base64 encoded strings in addition to file paths
    • Added file type validation when using file content input
  2. Code Restructuring

    • Split parser functionality into separate classes:
      • BaseParser: Common base functionality
      • SyncParser: Synchronous parsing operations
      • AsyncParser: Asynchronous parsing operations
    • Added decorators for common parsing logic
    • Improved input validation
  3. API Changes

    • Modified method signatures to support both file_path and file_content:
      def parse(self, file_path=None, file_content=None, file_type=None, extract_args=None)
    • Added validation for file inputs
    • Improved error handling and messages
  4. New Files Added

    • async_parser.py: Handles async parsing operations
    • sync_parser.py: Handles sync parsing operations
    • base_parser.py: Contains shared base functionality
  5. Testing

    • Added test cases for file content input
    • Updated existing tests to use new method signatures

Impact

  • Backward compatible changes that add support for direct file content input
  • More modular and maintainable code structure
  • Improved input validation and error handling
  • Better separation of concerns between sync and async operations

PARSE = "parse"
PARSE_WITH_OCR = "parse_with_ocr"
PARSE_WITH_LAYOUT = "parse_with_layout"
def handle_parsing(func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This decorator name is very confusing because handle_parsing decorator is used for both parsing and extracting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really raise our bar for naming convention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I urge you to read https://www.notion.so/goldpiggy/Software-Engineering-Best-Practice-2b4c5b4883104ba5877ca8ee36a51133?pvs=4 code complete variables section III to further improve your code quality on variable and method naming first!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe but not take my words for granted that convert_b64_to_url

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now remaned to handle_sync_file_processing and handle_async_file_processing

Comment on lines +23 to +30
def wrapper(
self,
file_path=None,
file_content=None,
file_type=None,
*args,
**kwargs,
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need clean docstring here especially how we are using file_path or file_content. We haven't properly raise our AnyParaser linter bar yet, but please still use the best practice especially for these complicated logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring updated.

**kwargs,
):
# Validate inputs
is_valid, error_message = validate_parser_inputs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this validate is only for parser or it is used for both parser and extractor? If so, we should come up with better name to improve our code readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to validate_file_inputs

file_type = Path(file_path).suffix.lower().lstrip(".")
else:
file_path = NamedTemporaryFile(delete=False, suffix=f".{file_type}").name
print(file_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's use logger instead of print, so we can control the log level from info, warning, and error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it with thanks. I have deleted the print here.


return response, f"{end_time - start_time:.2f} seconds"
class AnyParser:
"""AnyParser RT: Real-time parser for any data format."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there is not RT anymore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

Comment on lines +138 to +140
result = "\n".join(
response_data["markdown"]
) # Using direct extraction instead of extract_key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: is this to combine multiple pages response into a single page? I feel a clean API should be just return list of markdown instead of combining them together into a single response. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that returning a list of markdown is a better practice than joining them together. However, I’m concerned that this would constitute a schema change, and I would prefer to avoid altering the output schema in this PR. Could we create a separate PR for this optimization?

except json.JSONDecodeError:
return f"Error: Invalid JSON response: {response.text}", ""

@handle_parsing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: as I stated above, this is the problem to use a parsing decorator for extract. This will make the code really confusing and hard to read, extend, and maintain in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have updated this naming here

response_data["pii_extraction"],
f"Time Elapsed: {info}",
)
result = response_data["pii_extraction"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking with @Sdddell offline that it can be confusing to use "markdown", "pii_extraction" as the response_data, we should make a TODO in the future to unify our backend to has a key for all response.

Also, it is a very bad idea to access a response_data through dict. After the above change is done, I suggest you to have a dataclass to always convert the response_data into a class and access through variable instead of key. This is dramatically improve the code readability and robustness to avoid suddenly key not in dict issue. TL'DR to access a data structure as dict is the worse idea and you should always try to serialized and deserialized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree, and let's create a new pr later to address this issue.

self._async_upload_url = f"{self._base_url}/async/upload"
self._async_fetch_url = f"{self._base_url}/async/fetch"

def send_async_request(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another big problem of OOP is suddenly introduce new method which is not shown in the base class. Python is a very tolerable language, but we should still follow the best standard.

For example, in the BaseParser, there is neither send_async_request for the async and get_sync_response for the sync. I would suggest you to add a parse method for BaseParser first and raise NotImplemented exception first. Then, you should implement it for both sync and async. This will only only make our code more OOP, but also improved the readability.

Copy link
Collaborator Author

@SeisSerenata SeisSerenata Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your great insights, but I believe the situation is slightly different for asynchronous operations. It resembles performing an upload rather than sending requests for direct parsing. This is why I chose not to create a parse method in the BaseParser, opting instead to implement it in both the synchronous and asynchronous parsers.

)
self._sync_parse_with_ocr = f"{self._base_url}/parse_with_ocr"

def get_sync_response(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as I commented on the async.

@lingjiekong
Copy link
Member

Also, make sure you rebase the latest code.

try:
with open(file_path, "rb") as file:
file_content = base64.b64encode(file.read()).decode("utf-8")
file_type = Path(file_path).suffix.lower().lstrip(".")
except Exception as e:
return f"Error: {e}", ""
else:
# generate a random file path for genrating presigned url
file_path = f"/tmp/{uuid.uuid4()}.{file_type}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE!

Copy link
Member

@lingjiekong lingjiekong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@lingjiekong lingjiekong merged commit ad7cb88 into main Nov 19, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants