-
Notifications
You must be signed in to change notification settings - Fork 820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(file): fix OLE-based file-type auto-detection #3437
Conversation
Fixes mis-identification of DOC, PPT, XLS, and MSG files.
# ================================================================================================ | ||
# Describe `is_json_processable()` | ||
# ================================================================================================ | ||
|
||
|
||
def it_affirms_JSON_is_array_of_objects_from_a_file_path(): | ||
assert is_json_processable(example_doc_path("simple.json")) is True | ||
|
||
|
||
def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_a_file_path(): | ||
assert is_json_processable(example_doc_path("not-unstructured-payload.json")) is False | ||
|
||
|
||
def it_affirms_JSON_is_array_of_objects_from_a_file_like_object_open_for_reading_bytes(): | ||
with open(example_doc_path("simple.json"), "rb") as f: | ||
assert is_json_processable(file=f) is True | ||
|
||
|
||
def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_a_file_like_object_open_for_reading_bytes(): | ||
with open(example_doc_path("not-unstructured-payload.json"), "rb") as f: | ||
assert is_json_processable(file=f) is False | ||
|
||
|
||
def it_affirms_JSON_is_array_of_objects_from_text(): | ||
with open(example_doc_path("simple.json")) as f: | ||
text = f.read() | ||
assert is_json_processable(file_text=text) is True | ||
|
||
|
||
def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_text(): | ||
with open(example_doc_path("not-unstructured-payload.json")) as f: | ||
text = f.read() | ||
assert is_json_processable(file_text=text) is False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new tests have nothing to do with this fix per se, there were just no tests for this interface function in the test suite.
file_text = _FileTypeDetectionContext.new( | ||
file_path=filename, file=file, encoding=encoding | ||
).text_head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with this change subsuming _read_file_start_for_file_check()
, this just got left out of the prior refactoring PR.
@@ -351,19 +358,19 @@ def mime_type(self) -> str | None: | |||
|
|||
A `str` return value is always in lower-case. | |||
""" | |||
file_path = self.file_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why declare this instead of using self.file_path
down below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just efficiency, a little more compact too. It's referred to four times in the function. I think more the latter in this case since only two of those dereferencings will happen for any given file.
206a8f0
to
3827678
Compare
Needs tests, consider reusing `_FileTypeDetectionContext`.text_head.
3827678
to
3f87de3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Summary
A DOC, PPT, or XLS file sent to partition() as a file-like object is misidentified as a MSG file and raises an exception in python-oxmsg (which is used to process MSG files).
Fix
DOC, PPT, XLS, and MSG are all Microsoft OLE-based files, aka. Compound File Binary Format (CFBF). These can be reliably distinguished by inspecting magic bytes in certain locations.
libmagic
is unreliable at this or doesn't try, reporting the generic"application/x-ole-storage"
which corresponds to the "container" CFBF format (vaguely like a Microsoft Zip format) that all these document types are stored in.Unconditionally use
filetype.guess_mime()
provided by thefiletype
package that is part of the base unstructured install. Unlikelibmagic
, this package reliably detects the distinguished MIME-type (e.g."application/msword"
) for OLE file subtypes.Fixes #3364