fix(docx): PAPP-35228 Make the parser smart enough to extract IOCs from docx files with embedded HTML #42
Pull Request Type
Bugfix: A customer reported that not all link targets were being extracted from their docx files. On closer examination, the affected files turned out to be built in an uncommon way: rather than text (and links) being written directly into the document body, the document embedded an HTML page. Under the hood, that HTML is stored separately from the main text of a docx file, so the parser connector was simply not parsing it at all.
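To illustrate what "stored separately" means, here is a minimal sketch that uses only the standard library. It relies on the fact that a docx file is a zip archive whose `[Content_Types].xml` declares the type of each part, and it looks for parts declared as HTML. This is not the connector's implementation (the connector uses `python-docx`); the function names, the toy archive, and the part name `word/afchunk.html` are assumptions made for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def find_embedded_html(docx_bytes: bytes) -> dict[str, bytes]:
    """Return {part name: raw bytes} for parts declared as HTML."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # For untrusted files, a hardened XML parser is the safer choice.
        types = ET.fromstring(archive.read("[Content_Types].xml"))
        # Extensions whose default content type is HTML.
        html_exts = {
            d.get("Extension", "").lower()
            for d in types.iter(f"{CT_NS}Default")
            if d.get("ContentType") in HTML_TYPES
        }
        # Individual parts explicitly overridden to an HTML content type.
        html_parts = {
            o.get("PartName", "").lstrip("/")
            for o in types.iter(f"{CT_NS}Override")
            if o.get("ContentType") in HTML_TYPES
        }
        found = {}
        for name in archive.namelist():
            ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
            if name in html_parts or (ext and ext in html_exts):
                found[name] = archive.read(name)
        return found


def make_toy_docx() -> bytes:
    """Build a toy archive (not a valid Word document) with one HTML part."""
    content_types = (
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Default Extension="html" ContentType="text/html"/>'
        "</Types>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("[Content_Types].xml", content_types)
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/afchunk.html", '<a href="http://example.com/x">x</a>')
    return buf.getvalue()


if __name__ == "__main__":
    for part_name in find_embedded_html(make_toy_docx()):
        print(part_name)
```

A parser that reads only `word/document.xml` never sees the link inside the HTML part, which matches the behavior the customer reported.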
Release Notes
PAPP-35228 Fix: Extract Indicators of Compromise from docx files with embedded HTML
What is the current behavior?
The parser connector does not attempt to parse HTML embedded in docx files when running the `extract ioc` action.
What is the new behavior?
The connector now parses both the core document and the embedded HTML sections of docx files.
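As a rough illustration of why the embedded HTML matters for IOC extraction, the sketch below collects link targets from an HTML string with the standard library's `html.parser`. How the connector itself processes the HTML is not shown in this description, so the class and function names here are hypothetical and the approach is only one plausible option.

```python
from html.parser import HTMLParser


class LinkTargetCollector(HTMLParser):
    """Collect href/src attribute values seen while parsing HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.targets: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.targets.append(value)


def link_targets(html: str) -> list[str]:
    """Return link targets in document order."""
    collector = LinkTargetCollector()
    collector.feed(html)
    collector.close()
    return collector.targets


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/a">a</a><img src="http://example.com/b.png"/></p>'
    print(link_targets(sample))
```

Targets gathered this way could then be passed through the same IOC matching that is applied to the core document text.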
Other information
- `TypedDict` and `dataclass` are used to make the connector code significantly clearer.
- The `extract ioc` action was moved into a separate function, as the pre-commit hook flagged the complexity of the original function.
- `python-docx` is now a dependency. I confirmed it has an MIT license. Related to this change, the `defusedxml` dependency was dropped, as it was only used for manual parsing of docx files. I figured the interface provided by `python-docx` was much more ergonomic and less error-prone.
- The `simplejson` dependency was dropped, as the standard library `json` package provides the same functionality.
- Affected code: the `_docx_to_text()` function of `parser_methods.py`.
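For readers unfamiliar with the two typing tools mentioned above, here is a minimal sketch of how `TypedDict` and `dataclass` can make parameter and result shapes explicit. Every name in it (`ExtractIocParams`, `ParsedFile`, `describe`, and the keys) is hypothetical and not taken from the connector.

```python
from dataclasses import dataclass, field
from typing import TypedDict


class ExtractIocParams(TypedDict, total=False):
    # Hypothetical action parameters; the real connector's keys may differ.
    vault_id: str
    file_type: str


@dataclass
class ParsedFile:
    # Hypothetical container for what a parsing step returns.
    text: str
    links: list[str] = field(default_factory=list)


def describe(params: ExtractIocParams, parsed: ParsedFile) -> str:
    """Summarize a parse result; typed inputs document the expected shape."""
    file_type = params.get("file_type", "unknown")
    return f"{file_type}: {len(parsed.links)} link(s), {len(parsed.text)} char(s)"


if __name__ == "__main__":
    print(describe({"file_type": "docx"}, ParsedFile(text="hi", links=["http://example.com"])))
```

Compared with passing bare dictionaries and tuples around, the typed shapes let a type checker and the reader see which fields each step expects.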