fix: partition_pdf() removes spaces from the text (#3106)
Closes #2896.

This PR aims to fix `partition_pdf()` to keep spaces in text. The
control character `\t` is now replaced with a space instead of being
removed when merging inferred and embedded elements.
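
As a rough illustration of why tabs were being dropped at all: a filter that discards every character whose Unicode general category starts with `C` treats `\t` the same as any other control character, while a regular space (category `Zs`) survives. The helper name `is_dropped` below is hypothetical and only used for this sketch.

```python
import unicodedata


def is_dropped(ch: str) -> bool:
    """Return True if a category-based control-character filter would drop ch."""
    # Categories starting with "C" cover control, format, surrogate,
    # private-use, and unassigned code points.
    return unicodedata.category(ch)[0] == "C"


# Print the general category of a few whitespace-like characters.
for ch in ["\t", "\n", "\r", "\x0c", " "]:
    print(repr(ch), unicodedata.category(ch), is_dropped(ch))
```

Because tab and newline are both in category `Cc`, either one must be converted to a space before the filter runs if the word boundary it represents is to be kept.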

### Testing
PDF:
[rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
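from unstructured.partition.pdf import partition_pdf

# Partition the attached PDF with the hi_res strategy
elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
)

# Element 20 is the one whose text is compared under "Results"
print(str(elements[20]))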
```
**Results:**
- PR
```
Name of each exchange on which registered New York Stock Exchange
```
- main branch
```
Nameofeachexchangeonwhichregistered NewYorkStockExchange
```
christinestraub authored May 29, 2024
1 parent 3158169 commit f445724
Showing 4 changed files with 5 additions and 4 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-## 0.14.3-dev5
+## 0.14.3

### Enhancements

@@ -10,6 +10,7 @@

### Fixes

+* **Fix `partition_pdf()` to keep spaces in the text**. The control character `\t` is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
* **Turn off XML resolve entities** Sets `resolve_entities=False` for XML parsing with `lxml`
to avoid text being dynamically injected into the XML document.
* **Add backward compatibility for the deprecated pdf_infer_table_structure parameter**.
@@ -347,7 +347,7 @@ def test_annotate_layout_elements_file_not_found_error():

 @pytest.mark.parametrize(
     ("text", "expected"),
-    [("c\to\x0cn\ftrol\ncharacter\rs\b", "control characters"), ("\"'\\", "\"'\\")],
+    [("test\tco\x0cn\ftrol\ncharacter\rs\b", "test control characters"), ("\"'\\", "\"'\\")],
 )
 def test_remove_control_characters(text, expected):
     assert pdf_image_utils.remove_control_characters(text) == expected
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.14.3-dev5" # pragma: no cover
+__version__ = "0.14.3" # pragma: no cover
2 changes: 1 addition & 1 deletion unstructured/partition/pdf_image/pdf_image_utils.py
@@ -427,7 +427,7 @@ def remove_control_characters(text: str) -> str:
     """Removes control characters from text."""

     # Replace newline character with a space
-    text = text.replace("\n", " ")
+    text = text.replace("\t", " ").replace("\n", " ")
     # Remove other control characters
     out_text = "".join(c for c in text if unicodedata.category(c)[0] != "C")
     return out_text
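
To see the effect of the one-line change in isolation, here is a minimal, self-contained sketch that places the old and new behavior side by side. The function names `strip_controls_old` and `strip_controls_new` are hypothetical stand-ins, not part of `unstructured`, and the sample string is invented for illustration (it assumes, as the PR description implies, that words in the extracted text can be separated by tabs).

```python
import unicodedata


def strip_controls_old(text: str) -> str:
    """Stand-in for the pre-fix logic: only newlines become spaces."""
    text = text.replace("\n", " ")
    # Tabs are category "Cc", so they are removed here with no replacement.
    return "".join(c for c in text if unicodedata.category(c)[0] != "C")


def strip_controls_new(text: str) -> str:
    """Stand-in for the post-fix logic: tabs and newlines both become spaces."""
    text = text.replace("\t", " ").replace("\n", " ")
    return "".join(c for c in text if unicodedata.category(c)[0] != "C")


# Invented sample: tab-separated words followed by a newline-separated word.
sample = "New\tYork\tStock\nExchange"

print(strip_controls_old(sample))  # NewYorkStock Exchange
print(strip_controls_new(sample))  # New York Stock Exchange
```

The old variant reproduces the pattern seen on the main branch, where only the newline-derived space survives; the new variant keeps every word boundary.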
