-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: support file_content as input #64
Conversation
any_parser/any_parser.py
Outdated
PARSE = "parse" | ||
PARSE_WITH_OCR = "parse_with_ocr" | ||
PARSE_WITH_LAYOUT = "parse_with_layout" | ||
def handle_parsing(func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decorator name is very confusing because handle_parsing decorator is used for both parsing and extracting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should really raise our bar for naming convention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I urge you to read https://www.notion.so/goldpiggy/Software-Engineering-Best-Practice-2b4c5b4883104ba5877ca8ee36a51133?pvs=4 code complete variables section III to further improve your code quality on variable and method naming first!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe but not take my words for granted that convert_b64_to_url
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now remaned to handle_sync_file_processing and handle_async_file_processing
def wrapper( | ||
self, | ||
file_path=None, | ||
file_content=None, | ||
file_type=None, | ||
*args, | ||
**kwargs, | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We really need clean docstring here especially how we are using file_path or file_content. We haven't properly raise our AnyParaser linter bar yet, but please still use the best practice especially for these complicated logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docstring updated.
any_parser/any_parser.py
Outdated
**kwargs, | ||
): | ||
# Validate inputs | ||
is_valid, error_message = validate_parser_inputs( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this validate is only for parser or it is used for both parser and extractor? If so, we should come up with better name to improve our code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to validate_file_inputs
any_parser/any_parser.py
Outdated
file_type = Path(file_path).suffix.lower().lstrip(".") | ||
else: | ||
file_path = NamedTemporaryFile(delete=False, suffix=f".{file_type}").name | ||
print(file_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's use logger instead of print, so we can control the log level from info, warning, and error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it with thanks. I have deleted the print here.
any_parser/any_parser.py
Outdated
|
||
return response, f"{end_time - start_time:.2f} seconds" | ||
class AnyParser: | ||
"""AnyParser RT: Real-time parser for any data format.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: there is not RT anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
result = "\n".join( | ||
response_data["markdown"] | ||
) # Using direct extraction instead of extract_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qq: is this to combine multiple pages response into a single page? I feel a clean API should be just return list of markdown instead of combining them together into a single response. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that returning a list of markdown is a better practice than joining them together. However, I’m concerned that this would constitute a schema change, and I would prefer to avoid altering the output schema in this PR. Could we create a separate PR for this optimization?
any_parser/any_parser.py
Outdated
except json.JSONDecodeError: | ||
return f"Error: Invalid JSON response: {response.text}", "" | ||
|
||
@handle_parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: as I stated above, this is the problem to use a parsing decorator for extract. This will make the code really confusing and hard to read, extend, and maintain in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have updated this naming here
response_data["pii_extraction"], | ||
f"Time Elapsed: {info}", | ||
) | ||
result = response_data["pii_extraction"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was talking with @Sdddell offline that it can be confusing to use "markdown", "pii_extraction" as the response_data, we should make a TODO in the future to unify our backend to has a key for all response.
Also, it is a very bad idea to access a response_data through dict. After the above change is done, I suggest you to have a dataclass to always convert the response_data into a class and access through variable instead of key. This is dramatically improve the code readability and robustness to avoid suddenly key not in dict issue. TL'DR to access a data structure as dict is the worse idea and you should always try to serialized and deserialized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree, and let's create a new pr later to address this issue.
self._async_upload_url = f"{self._base_url}/async/upload" | ||
self._async_fetch_url = f"{self._base_url}/async/fetch" | ||
|
||
def send_async_request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another big problem of OOP is suddenly introduce new method which is not shown in the base class. Python is a very tolerable language, but we should still follow the best standard.
For example, in the BaseParser, there is neither send_async_request
for the async and get_sync_response
for the sync. I would suggest you to add a parse
method for BaseParser first and raise NotImplemented exception first. Then, you should implement it for both sync and async. This will only only make our code more OOP, but also improved the readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your great insights, but I believe the situation is slightly different for asynchronous operations. It resembles performing an upload rather than sending requests for direct parsing. This is why I chose not to create a parse method in the BaseParser, opting instead to implement it in both the synchronous and asynchronous parsers.
) | ||
self._sync_parse_with_ocr = f"{self._base_url}/parse_with_ocr" | ||
|
||
def get_sync_response( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here as I commented on the async.
Also, make sure you rebase the latest code. |
try: | ||
with open(file_path, "rb") as file: | ||
file_content = base64.b64encode(file.read()).decode("utf-8") | ||
file_type = Path(file_path).suffix.lower().lstrip(".") | ||
except Exception as e: | ||
return f"Error: {e}", "" | ||
else: | ||
# generate a random file path for genrating presigned url | ||
file_path = f"/tmp/{uuid.uuid4()}.{file_type}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NICE!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Here's a summary of the changes in this PR:
Main Changes
File Content Support
Code Restructuring
BaseParser
: Common base functionalitySyncParser
: Synchronous parsing operationsAsyncParser
: Asynchronous parsing operationsAPI Changes
New Files Added
async_parser.py
: Handles async parsing operationssync_parser.py
: Handles sync parsing operationsbase_parser.py
: Contains shared base functionalityTesting
Impact