Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(file): fix OLE-based file-type auto-detection #3437

Merged
merged 2 commits into from
Jul 25, 2024

Conversation

scanny
Copy link
Collaborator

@scanny scanny commented Jul 24, 2024

Summary
A DOC, PPT, or XLS file sent to partition() as a file-like object is misidentified as a MSG file and raises an exception in python-oxmsg (which is used to process MSG files).

Fix
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound File Binary Format (CFBF). These can be reliably distinguished by inspecting magic bytes in certain locations. libmagic is unreliable at this or doesn't try, reporting the generic "application/x-ole-storage" which corresponds to the "container" CFBF format (vaguely like a Microsoft Zip format) that all these document types are stored in.

Unconditionally use filetype.guess_mime() provided by the filetype package that is part of the base unstructured install. Unlike libmagic, this package reliably detects the distinguished MIME-type (e.g. "application/msword") for OLE file subtypes.

Fixes #3364

Fixes mis-identification of DOC, PPT, XLS, and MSG files.
Comment on lines +640 to +672
# ================================================================================================
# Describe `is_json_processable()`
# ================================================================================================


def it_affirms_JSON_is_array_of_objects_from_a_file_path():
assert is_json_processable(example_doc_path("simple.json")) is True


def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_a_file_path():
assert is_json_processable(example_doc_path("not-unstructured-payload.json")) is False


def it_affirms_JSON_is_array_of_objects_from_a_file_like_object_open_for_reading_bytes():
with open(example_doc_path("simple.json"), "rb") as f:
assert is_json_processable(file=f) is True


def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_a_file_like_object_open_for_reading_bytes():
with open(example_doc_path("not-unstructured-payload.json"), "rb") as f:
assert is_json_processable(file=f) is False


def it_affirms_JSON_is_array_of_objects_from_text():
with open(example_doc_path("simple.json")) as f:
text = f.read()
assert is_json_processable(file_text=text) is True


def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_text():
with open(example_doc_path("not-unstructured-payload.json")) as f:
text = f.read()
assert is_json_processable(file_text=text) is False
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new tests have nothing to do with this fix per se, there were just no tests for this interface function in the test suite.

Comment on lines +117 to +119
file_text = _FileTypeDetectionContext.new(
file_path=filename, file=file, encoding=encoding
).text_head
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this change subsuming _read_file_start_for_file_check(), this just got left out of the prior refactoring PR.

@scanny scanny requested a review from Coniferish July 24, 2024 22:02
@@ -351,19 +358,19 @@ def mime_type(self) -> str | None:

A `str` return value is always in lower-case.
"""
file_path = self.file_path
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why declare this instead of using self.file_path down below?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just efficiency, a little more compact too. It's referred to four times in the function. I think more the latter in this case since only two of those dereferencings will happen for any given file.

Needs tests, consider reusing `_FileTypeDetectionContext`.text_head.
Copy link
Collaborator

@Coniferish Coniferish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scanny scanny added this pull request to the merge queue Jul 25, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jul 25, 2024
@scanny scanny added this pull request to the merge queue Jul 25, 2024
Merged via the queue into main with commit 4e61acc Jul 25, 2024
51 checks passed
@scanny scanny deleted the scanny/fix-ole-file-autodetect branch July 25, 2024 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(doc): DOC, PPT, and XLS file-like-objects misidentified as MSG file
2 participants