feat: support file_content as input #64

SeisSerenata · 2024-11-18T15:27:08Z

Here's a summary of the changes in this PR:

Main Changes

File Content Support
- Added the ability to accept direct file content as base64-encoded strings, in addition to file paths
- Added file type validation when using file content input
Code Restructuring
- Split parser functionality into separate classes:
  - BaseParser: Common base functionality
  - SyncParser: Synchronous parsing operations
  - AsyncParser: Asynchronous parsing operations
- Added decorators for common parsing logic
- Improved input validation
API Changes
- Modified method signatures to support both file_path and file_content:
```
def parse(self, file_path=None, file_content=None, file_type=None, extract_args=None)
```
- Added validation for file inputs
- Improved error handling and messages
New Files Added
- async_parser.py: Handles async parsing operations
- sync_parser.py: Handles sync parsing operations
- base_parser.py: Contains shared base functionality
Testing
- Added test cases for file content input
- Updated existing tests to use new method signatures

Impact

- Backward-compatible changes that add support for direct file content input
- More modular and maintainable code structure
- Improved input validation and error handling
- Better separation of concerns between sync and async operations
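For readers who want a concrete picture of the input validation described above, here is a minimal sketch. The function name `validate_file_inputs` and the `(is_valid, error_message)` return shape come from the review discussion on this PR; everything else (the `SUPPORTED_FILE_TYPES` allow-list, the exact error strings, and the rule that `file_path` and `file_content` are mutually exclusive) is an assumption for illustration, not the merged implementation.

```python
import base64
from typing import Optional, Tuple

# Hypothetical allow-list; the real set of supported types is not stated in this PR.
SUPPORTED_FILE_TYPES = {"pdf", "png", "jpg", "jpeg", "docx"}


def validate_file_inputs(
    file_path: Optional[str] = None,
    file_content: Optional[str] = None,
    file_type: Optional[str] = None,
) -> Tuple[bool, str]:
    """Return (is_valid, error_message) for a path-or-content input pair."""
    if file_path is None and file_content is None:
        return False, "Error: either file_path or file_content must be provided"
    if file_path is not None and file_content is not None:
        # Assumption: the two inputs are treated as mutually exclusive.
        return False, "Error: provide file_path or file_content, not both"
    if file_content is not None:
        if file_type is None:
            return False, "Error: file_type is required when using file_content"
        if file_type.lower() not in SUPPORTED_FILE_TYPES:
            return False, f"Error: unsupported file type: {file_type}"
        try:
            # validate=True rejects characters outside the base64 alphabet.
            base64.b64decode(file_content, validate=True)
        except (ValueError, TypeError):
            return False, "Error: file_content is not valid base64"
    return True, ""
```

A caller would check the flag first and return the message unchanged on failure, which keeps the error path in one place for both parsing and extraction.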

lingjiekong · 2024-11-18T19:30:34Z

any_parser/any_parser.py

-    PARSE = "parse"
-    PARSE_WITH_OCR = "parse_with_ocr"
-    PARSE_WITH_LAYOUT = "parse_with_layout"
+def handle_parsing(func):


This decorator name is very confusing, because the handle_parsing decorator is used for both parsing and extracting.

We should really raise our bar for naming conventions.

I urge you to read https://www.notion.so/goldpiggy/Software-Engineering-Best-Practice-2b4c5b4883104ba5877ca8ee36a51133?pvs=4 (the Code Complete variables section, section III) to improve your variable and method naming first!

Maybe something like convert_b64_to_url, but don't take my word for it.

Now renamed to handle_sync_file_processing and handle_async_file_processing.

lingjiekong · 2024-11-18T19:46:39Z

any_parser/any_parser.py

+    def wrapper(
+        self,
+        file_path=None,
+        file_content=None,
+        file_type=None,
+        *args,
+        **kwargs,
+    ):


We really need a clean docstring here, especially on how file_path and file_content are used. We haven't properly raised the AnyParser linter bar yet, but please still follow best practices, especially for complicated logic like this.

Docstring updated.
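As an illustration of the kind of docstring requested here, the sketch below documents how the two inputs interact. It is not the merged code: the decorator name `handle_file_processing`, the error strings, and the rule that exactly one of `file_path` or `file_content` must be given are assumptions. The docstring sits on the decorator rather than on `wrapper`, because `functools.wraps` replaces the wrapper's own docstring with that of the decorated method.

```python
import base64
import functools
from pathlib import Path


def handle_file_processing(func):
    """Normalize file inputs before calling the decorated method.

    The decorated method may be called with either:
      * file_path: the file is read and base64-encoded, and file_type is
        derived from the file extension; or
      * file_content: an already base64-encoded string, in which case
        file_type must be passed explicitly.

    On invalid input the wrapper returns an ("Error: ...", "") tuple
    instead of raising.
    """

    @functools.wraps(func)
    def wrapper(self, file_path=None, file_content=None, file_type=None, *args, **kwargs):
        # Assumption: exactly one of the two inputs must be provided.
        if (file_path is None) == (file_content is None):
            return "Error: provide exactly one of file_path or file_content", ""
        if file_path is not None:
            try:
                with open(file_path, "rb") as file:
                    file_content = base64.b64encode(file.read()).decode("utf-8")
            except OSError as e:
                return f"Error: {e}", ""
            file_type = Path(file_path).suffix.lower().lstrip(".")
        elif file_type is None:
            return "Error: file_type is required with file_content", ""
        return func(self, file_path, file_content, file_type, *args, **kwargs)

    return wrapper
```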

lingjiekong · 2024-11-18T19:47:38Z

any_parser/any_parser.py

+        **kwargs,
+    ):
+        # Validate inputs
+        is_valid, error_message = validate_parser_inputs(


Is this validation only for the parser, or is it used for both the parser and the extractor? If the latter, we should come up with a better name to improve code readability.

Updated to validate_file_inputs.

lingjiekong · 2024-11-18T19:48:45Z

any_parser/any_parser.py

+            file_type = Path(file_path).suffix.lower().lstrip(".")
+        else:
+            file_path = NamedTemporaryFile(delete=False, suffix=f".{file_type}").name
+            print(file_path)


nit: let's use a logger instead of print, so we can control the log level (info, warning, error).

Got it, thanks. I have deleted the print here.
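If the path is ever worth surfacing again, a logger-based version could look like the sketch below. The logger name `any_parser` and the helper name `make_temp_path` are assumptions for illustration; only `NamedTemporaryFile(delete=False, suffix=...)` is taken from the diff above.

```python
import logging
from tempfile import NamedTemporaryFile

# Assumed logger name; callers control verbosity via the standard logging config.
logger = logging.getLogger("any_parser")


def make_temp_path(file_type: str) -> str:
    """Create an empty temporary file with the given extension and return its path."""
    with NamedTemporaryFile(delete=False, suffix=f".{file_type}") as tmp:
        file_path = tmp.name
    # Emitted only when the logger level is DEBUG or lower, unlike print().
    logger.debug("Created temporary file %s", file_path)
    return file_path
```

With this shape, `logging.basicConfig(level=logging.DEBUG)` in an application turns the message on, and the default configuration keeps it silent.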

lingjiekong · 2024-11-18T19:49:23Z

any_parser/any_parser.py


-        return response, f"{end_time - start_time:.2f} seconds"
+class AnyParser:
+    """AnyParser RT: Real-time parser for any data format."""


nit: there is no RT anymore.

lingjiekong · 2024-11-18T20:22:21Z

any_parser/any_parser.py

+            result = "\n".join(
+                response_data["markdown"]
+            )  # Using direct extraction instead of extract_key


qq: is this to combine a multi-page response into a single page? I feel a clean API should just return a list of markdown strings instead of combining them into a single response. What do you think?

I agree that returning a list of markdown is a better practice than joining them together. However, I’m concerned that this would constitute a schema change, and I would prefer to avoid altering the output schema in this PR. Could we create a separate PR for this optimization?
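To make the schema difference concrete, here is a small sketch of the two return shapes under discussion. It assumes `response_data["markdown"]` is a list with one markdown string per page, which is what the `"\n".join(...)` call in the diff suggests; the function names are hypothetical.

```python
from typing import Any, Dict, List


def markdown_joined(response_data: Dict[str, Any]) -> str:
    """Current behavior in this PR: pages merged into one string."""
    return "\n".join(response_data["markdown"])


def markdown_pages(response_data: Dict[str, Any]) -> List[str]:
    """Proposed follow-up: one entry per page, page boundaries preserved."""
    return list(response_data["markdown"])
```

Callers that need a single string can still join the list themselves, while the reverse (recovering page boundaries from a joined string) is not reliable, since a page may itself contain newlines.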

lingjiekong · 2024-11-18T20:22:58Z

any_parser/any_parser.py

        except json.JSONDecodeError:
            return f"Error: Invalid JSON response: {response.text}", ""

+    @handle_parsing


nit: as I stated above, this is the problem with using a parsing decorator for extraction. It makes the code confusing and hard to read, extend, and maintain in the future.

I have updated the naming here.

lingjiekong · 2024-11-18T20:26:09Z

any_parser/any_parser.py

-                response_data["pii_extraction"],
-                f"Time Elapsed: {info}",
-            )
+            result = response_data["pii_extraction"]


I was talking with @Sdddell offline: it can be confusing to use "markdown" and "pii_extraction" as keys in response_data. We should add a TODO to unify our backend so that all responses share a single key.

Also, it is a very bad idea to access response_data through dict keys. After the above change is done, I suggest adding a dataclass so that response_data is always converted into a class and accessed through attributes instead of keys. This dramatically improves code readability and robustness, and avoids unexpected key-not-in-dict errors. TL;DR: accessing a data structure as a raw dict is the worst option; you should always serialize and deserialize through a typed class.

Totally agree. Let's create a new PR later to address this issue.
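A rough sketch of what that follow-up could look like, for reference. The class name `ParseResponse` and its `from_dict` constructor are hypothetical; the only detail taken from this PR is the `"markdown"` key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class ParseResponse:
    """Typed view of a parse response (hypothetical; not part of this PR)."""

    markdown: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ParseResponse":
        # Fail at the boundary with a clear message instead of a KeyError
        # somewhere deeper in the calling code.
        if "markdown" not in data:
            raise ValueError("response is missing the 'markdown' key")
        return cls(markdown=list(data["markdown"]))
```

Call sites would then read `response.markdown` instead of `response_data["markdown"]`, so a renamed or missing backend key surfaces in one place.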

lingjiekong · 2024-11-18T20:29:41Z

any_parser/async_parser.py

+        self._async_upload_url = f"{self._base_url}/async/upload"
+        self._async_fetch_url = f"{self._base_url}/async/fetch"
+
+    def send_async_request(


Another big OOP problem is suddenly introducing a new method that does not appear in the base class. Python is a very tolerant language, but we should still follow best practices.

For example, BaseParser has neither send_async_request for async nor get_sync_response for sync. I would suggest adding a parse method to BaseParser first that raises NotImplementedError, and then implementing it for both sync and async. This will not only make our code more OOP, but also improve readability.

I agree with your great insights, but I believe the situation is slightly different for asynchronous operations. It resembles performing an upload rather than sending requests for direct parsing. This is why I chose not to create a parse method in the BaseParser, opting instead to implement it in both the synchronous and asynchronous parsers.
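For context, the pattern suggested in the review could be sketched as follows. This is only an illustration of the suggestion, not what was merged, and it does not settle whether an async upload belongs under `parse`. The constructor argument and the returned placeholder strings are assumptions; the `/parse_with_ocr` and `/async/upload` suffixes are taken from the diffs in this PR.

```python
class BaseParser:
    """Shared interface; subclasses provide the actual request logic."""

    def __init__(self, base_url: str) -> None:
        self._base_url = base_url

    def parse(self, file_path=None, file_content=None, file_type=None, extract_args=None):
        raise NotImplementedError("subclasses must implement parse()")


class SyncParser(BaseParser):
    def parse(self, file_path=None, file_content=None, file_type=None, extract_args=None):
        # Placeholder: a real implementation would send the request and return the result.
        return f"POST {self._base_url}/parse_with_ocr"


class AsyncParser(BaseParser):
    def parse(self, file_path=None, file_content=None, file_type=None, extract_args=None):
        # Placeholder: a real implementation would upload the file and return a handle
        # that is later used to fetch the result.
        return f"POST {self._base_url}/async/upload"
```

The trade-off raised in the reply still applies: if the async path returns an upload handle rather than parsed content, sharing one method name may hide a difference in return type.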

lingjiekong · 2024-11-18T20:30:03Z

any_parser/sync_parser.py

+        )
+        self._sync_parse_with_ocr = f"{self._base_url}/parse_with_ocr"
+
+    def get_sync_response(


Same here as I commented on the async parser.

lingjiekong · 2024-11-18T22:18:37Z

Also, make sure you rebase onto the latest code.

lingjiekong · 2024-11-19T03:48:17Z

any_parser/any_parser.py

            try:
                with open(file_path, "rb") as file:
                    file_content = base64.b64encode(file.read()).decode("utf-8")
                    file_type = Path(file_path).suffix.lower().lstrip(".")
            except Exception as e:
                return f"Error: {e}", ""
+        else:
+            # generate a random file path for generating the presigned url
+            file_path = f"/tmp/{uuid.uuid4()}.{file_type}"


lingjiekong

LGTM.

SeisSerenata added 3 commits November 18, 2024 13:13

feat: support file_content as input

f339176

feat: support file_content as input

5654cac

test: update test cases

1feec68

SeisSerenata requested review from Sdddell, goldmermaid and lingjiekong as code owners November 18, 2024 15:27

fix: fix python 3.9 build issue

e9d302c

lingjiekong reviewed Nov 18, 2024

SeisSerenata and others added 3 commits November 19, 2024 03:14

chore: update naming, docstring and decorators

fcf97b2

Merge branch 'main' into seis-dev

33857a1

Merge branch 'seis-dev' of github.com:CambioML/any-parser into seis-dev

0087248

lingjiekong reviewed Nov 19, 2024

lingjiekong approved these changes Nov 19, 2024

lingjiekong merged commit ad7cb88 into main Nov 19, 2024
5 checks passed
