fix(docx): PAPP-35228 Make the parser smart enough to extract IOCs from docx files with embedded HTML #42

phantom-jacob · 2024-12-19T01:17:28Z

Pull Request Type

Bugfix: A customer reported that not all link targets were being extracted from their docx files. Upon further examination, it became apparent that the docx files for which the parser connector was failing to extract all IOCs were built in an uncommon way: rather than text (and links) being written directly into the docx file, the document had embedded an HTML page. Under the hood, Microsoft stores that HTML separately from the text of a docx file, so we were simply not parsing it at all.

Release Notes

PAPP-35228 Fix: Extract Indicators of Compromise from docx files with embedded HTML

What is the current behavior?

The parser connector does not attempt to parse HTML embedded in docx files when running the extract ioc action.

What is the new behavior?

The connector now parses both the core document and any embedded HTML sections of docx files.
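
For illustration only: a docx file is a zip archive, and the embedded HTML sits in its own part rather than in the main document text, which is why it was missed. The sketch below uses only the standard library and is not the python-docx-based code in this PR. The names `LinkCollector` and `embedded_html_links` are hypothetical, and it assumes the embedded part has an `.htm` or `.html` extension.

```python
import io
import zipfile
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def embedded_html_links(docx_bytes: bytes) -> list[str]:
    """Return link targets found in HTML parts stored inside a docx archive."""
    links: list[str] = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            # Assumption: embedded HTML parts end in .htm or .html
            if not name.lower().endswith((".htm", ".html")):
                continue
            collector = LinkCollector()  # fresh parser per part
            collector.feed(archive.read(name).decode("utf-8", errors="replace"))
            collector.close()
            links.extend(collector.links)
    return links
```

A parser that reads only the main document part would never see those links, which matches the behavior the customer reported.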

Other information

While I was in the connector, I added type hints to all the functions defined within. I also took advantage of TypedDict and dataclass to make the connector code significantly clearer.
I had to move parameter validation for the extract ioc action into a separate function, as the pre-commit hook flagged the complexity of the original function.
I added a new dependency, python-docx, and confirmed it has an MIT license. Related to this change, the defusedxml dependency was dropped, as it was only used for manual parsing of docx files. The interface provided by python-docx is much more ergonomic and less error-prone.
I removed the simplejson dependency, as the standard library json package offers an equivalent interface.
The only real functional changes in this PR are in the _docx_to_text() function of parser_methods.py.
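
As a rough sketch of the refactoring pattern described above (a TypedDict for the raw action parameters, a dataclass for the validated result, and validation pulled into its own function), something like the following could apply. Every name here, including `ExtractIocParams`, `ValidatedParams`, `validate_extract_ioc_params`, and the parameter keys and rules, is hypothetical and not taken from the connector.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


class ExtractIocParams(TypedDict, total=False):
    # Hypothetical parameter keys, for illustration only
    vault_id: str
    text: str
    file_type: str


@dataclass(frozen=True)
class ValidatedParams:
    vault_id: Optional[str]
    text: Optional[str]
    file_type: Optional[str]


def validate_extract_ioc_params(param: ExtractIocParams) -> ValidatedParams:
    """Validate raw parameters in one place so the action handler stays simple."""
    vault_id = param.get("vault_id")
    text = param.get("text")
    # Hypothetical rule: the action needs some input to work on
    if not vault_id and not text:
        raise ValueError("Provide either 'vault_id' or 'text'")
    return ValidatedParams(vault_id=vault_id, text=text, file_type=param.get("file_type"))
```

Keeping the checks in a dedicated function lowers the branching inside the action handler, which is what complexity linters typically measure.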

phantom-jacob added 2 commits December 17, 2024 17:24

fix(deps): ran pre-commit which added cffi and cryptography wheels

b1d6573

fix(docx): PAPP-35228 Extract IOCs from docx files with embedded HTML

007d9ce

splunk-soar-connectors-bot added the splunk-supported label Dec 19, 2024

phantom-jacob and others added 8 commits December 19, 2024 10:56

fix(deps): Remove no-longer used defusedxml dependency

1ceb49e

Update README.md

b8c31d2

fix(deps): git add wheels

77d74e4

chore(types): Add type hinting to parser_email.py

ef58ead

chore(ci): Revert CI changes

b1edbaa

revert(main): Revert changes to __main__

dd5d859

fix(debug): Remove debug statements

a89d0f9

chore(docs): Add release notes

c6c9b2e

phantom-jacob commented on parser_methods.py Dec 23, 2024 (resolved)

mnordby-splunk approved these changes Dec 23, 2024

splunk-jessica approved these changes Dec 23, 2024

phantom-jacob merged commit f78b3a1 into next Dec 23, 2024
8 checks passed

phantom-jacob deleted the jacobd-PAPP-35228 branch December 23, 2024 21:00

phantom-jacob mentioned this pull request Dec 23, 2024

chore(docs): Fix release notes #43

Merged
