fix(docx): PAPP-35228 Make the parser smart enough to extract IOCs from docx files with embedded HTML #42

Merged
merged 10 commits into from
Dec 23, 2024

Contributor

@phantom-jacob phantom-jacob commented Dec 19, 2024

Pull Request Type

Bugfix: A customer reported that not all link targets were being extracted from their docx files. Upon further examination, it became apparent that the docx files for which the parser connector was failing to extract all IOCs were built in an uncommon way: rather than text (and links) being written directly into the docx file, the document had embedded an HTML page. Under the hood, Microsoft stores that HTML separately from the text of a docx file, so we were simply not parsing it at all.
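To illustrate why the embedded HTML was being missed, here is a minimal sketch using only the standard library. It is not the connector's implementation (this PR uses python-docx). The part name `word/afchunk.html`, the sample URLs, and the helpers `build_sample_docx` and `find_urls_by_part` are all hypothetical. The sketch assumes the embedded HTML lives in its own part of the docx zip package, separate from `word/document.xml`.

```python
import io
import re
import zipfile

# Deliberately simple URL pattern for illustration; a real IOC extractor would be stricter.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def build_sample_docx() -> bytes:
    """Build a tiny zip laid out like a docx that embeds an HTML part (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The main document body: this is the only part a body-only parser looks at.
        zf.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>See http://body.example/a"
            "</w:t></w:r></w:p></w:body></w:document>",
        )
        # The embedded HTML page, stored as a separate part (name is an assumption).
        zf.writestr(
            "word/afchunk.html",
            '<html><body><a href="http://embedded.example/b">link</a></body></html>',
        )
    return buf.getvalue()


def find_urls_by_part(docx_bytes: bytes) -> dict[str, list[str]]:
    """Return the URLs found in each XML or HTML part of the package, keyed by part name."""
    found: dict[str, list[str]] = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith((".xml", ".html", ".htm")):
                text = zf.read(name).decode("utf-8", errors="replace")
                urls = URL_RE.findall(text)
                if urls:
                    found[name] = urls
    return found


if __name__ == "__main__":
    for part, urls in find_urls_by_part(build_sample_docx()).items():
        print(part, urls)
```

A parser that reads only `word/document.xml` would report the first URL and silently skip the second, which matches the symptom the customer reported.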

Release Notes

PAPP-35228 Fix: Extract Indicators of Compromise from docx files with embedded HTML

What is the current behavior?

The parser connector does not attempt to parse HTML embedded in docx files when running the extract ioc action, so any IOCs that appear only in that HTML are missed.

What is the new behavior?

The connector now parses both the core document and any embedded HTML sections of docx files.

Other information

  • While I was in the connector, I added type hints to all the functions defined within. I also took advantage of TypedDict and dataclass to make the connector code significantly clearer
  • I had to move parameter validation for the extract ioc action into a separate function, as the pre-commit hook got upset at the complexity of the original function
  • I added a new dependency, python-docx. I confirmed it has an MIT license. Related to this change, the defusedxml dependency was dropped, as it was only used for manual parsing of docx files. The interface provided by python-docx is much more ergonomic and less error-prone
  • I removed the simplejson dependency, as the standard library json package provides equivalent functionality for the connector's needs
  • The only real functional changes in this PR are in the _docx_to_text() function of parser_methods.py
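For readers unfamiliar with the pattern in the first two bullets, the following is a rough sketch of pulling parameter validation into its own function and typing its input and output with TypedDict and dataclass. Every name here (`ExtractIocParams`, `ValidatedParams`, `validate_extract_ioc_params`, `vault_id`, `file_type`, `SUPPORTED_TYPES`) is hypothetical and not taken from the connector; the real action parameters may differ.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


class ExtractIocParams(TypedDict, total=False):
    """Shape of the raw action parameters (field names are assumptions)."""

    vault_id: str
    file_type: str


@dataclass(frozen=True)
class ValidatedParams:
    """Normalized parameters handed to the rest of the action."""

    vault_id: str
    file_type: Optional[str]


# Hypothetical set of supported file types, for illustration only.
SUPPORTED_TYPES = frozenset({"docx", "pdf", "txt", "html"})


def validate_extract_ioc_params(param: ExtractIocParams) -> ValidatedParams:
    """Validate and normalize the parameters in one place, keeping the handler simple."""
    vault_id = param.get("vault_id", "").strip()
    if not vault_id:
        raise ValueError("vault_id is required")

    file_type = param.get("file_type")
    if file_type is not None:
        file_type = file_type.lower()
        if file_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported file_type: {file_type}")

    return ValidatedParams(vault_id=vault_id, file_type=file_type)
```

Keeping the branching in a dedicated function lowers the cyclomatic complexity of the action handler itself, which is the kind of thing a complexity check in a pre-commit hook measures.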

@phantom-jacob phantom-jacob merged commit f78b3a1 into next Dec 23, 2024
8 checks passed
@phantom-jacob phantom-jacob deleted the jacobd-PAPP-35228 branch December 23, 2024 21:00

4 participants