feat: enhance pdfminer element cleanup (#3593)
This PR expands the removal of `pdfminer` elements to cover those nested
inside any non-`pdfminer` element, not just inside tables.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
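
The scope change described in the commit message can be illustrated with a minimal sketch. The `Element` class and the string values `"Table"` and `"pdfminer"` below are hypothetical stand-ins for the library's element objects, `ElementType.TABLE`, and `Source.PDFMINER`; only the two selection conditions are taken from the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a layout element (not the library's class)."""
    type: str
    source: str


def container_indices_before(elements: List[Element]) -> List[int]:
    # Old behaviour: only tables could "swallow" inner pdfminer elements.
    return [i for i, e in enumerate(elements) if e.type == "Table"]


def container_indices_after(elements: List[Element]) -> List[int]:
    # New behaviour: any element that did not come from pdfminer can.
    return [i for i, e in enumerate(elements) if e.source != "pdfminer"]


page = [
    Element("Table", "detection_model"),
    Element("Image", "detection_model"),
    Element("Image", "pdfminer"),
    Element("NarrativeText", "pdfminer"),
]

before = container_indices_before(page)
after = container_indices_after(page)
```

With this toy page, the detected `Image` at index 1 becomes a candidate container only under the new condition, which is consistent with the `Image` entries that disappear from the expected-output fixtures in this commit.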
3 people authored Sep 4, 2024
1 parent d51fb13 commit acd070c
Showing 5 changed files with 9 additions and 176 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,8 @@
-## 0.15.10-dev2
+## 0.15.10-dev3

### Enhancements

+* **Enhance `pdfminer` element cleanup** Expand removal of `pdfminer` elements to include those inside all `non-pdfminer` elements, not just `tables`.
* **Modified analysis drawing tools to dump to files and draw from dumps** If the parameter `analysis` of the `partition_pdf` function is set to `True`, the layout for Object Detection, Pdfminer Extraction, OCR and final layouts will be dumped as json files. The drawers now accept dict (dump) objects instead of internal classes instances.
* **Vectorize pdfminer elements deduplication computation**. Use `numpy` operations to compute IOU and sub-region membership instead of using simply loop. This improves the speed of deduplicating elements for pages with a lot of elements.

@@ -3595,133 +3595,9 @@
}
}
},
-{
-"type": "Image",
-"element_id": "b0197950e1af5c2aac10f5b67d61524a",
-"text": "",
-"metadata": {
-"filetype": "application/pdf",
-"languages": [
-"eng"
-],
-"page_number": 8,
-"data_source": {
-"url": "https://drive.google.com/uc?id=1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV&export=download",
-"record_locator": {
-"file_id": "1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV"
-},
-"date_created": "1718723636.34",
-"date_modified": "1676196572.0",
-"permissions_data": [
-{
-"id": "18298851591250030956",
-"displayName": "[email protected]",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a/ACg8ocJok2KRwwYvrEDkeZVCYosHOMoa52GZa2qIIC1jScCRoFLHaQ=s64",
-"emailAddress": "[email protected]",
-"role": "writer",
-"deleted": false,
-"pendingOwner": false
-},
-{
-"id": "04774006893477068632",
-"displayName": "ryan",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjXeWpu7QcZuYqIl3p1mwqzS8XGFJ4RqA3Xjljfkm1DcFZ9M7A=s64",
-"emailAddress": "[email protected]",
-"role": "writer",
-"deleted": false,
-"pendingOwner": false
-},
-{
-"id": "anyoneWithLink",
-"type": "anyone",
-"kind": "drive#permission",
-"role": "reader",
-"allowFileDiscovery": false
-},
-{
-"id": "09147371668407854156",
-"displayName": "roman",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjWoGrFCgXcF6CtiBIBLnAfM68qUnQaJOcgvg3qzfQ3W8Ch6dA=s64",
-"emailAddress": "[email protected]",
-"role": "owner",
-"deleted": false,
-"pendingOwner": false
-}
-]
-}
-}
-},
-{
-"type": "Image",
-"element_id": "34d2dd4af420ea3fdddc8fc5d581cac2",
-"text": "",
-"metadata": {
-"filetype": "application/pdf",
-"languages": [
-"eng"
-],
-"page_number": 8,
-"data_source": {
-"url": "https://drive.google.com/uc?id=1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV&export=download",
-"record_locator": {
-"file_id": "1m1TUgyLv0hHdlsuL7DOWBAKQtvrhWNiV"
-},
-"date_created": "1718723636.34",
-"date_modified": "1676196572.0",
-"permissions_data": [
-{
-"id": "18298851591250030956",
-"displayName": "[email protected]",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a/ACg8ocJok2KRwwYvrEDkeZVCYosHOMoa52GZa2qIIC1jScCRoFLHaQ=s64",
-"emailAddress": "[email protected]",
-"role": "writer",
-"deleted": false,
-"pendingOwner": false
-},
-{
-"id": "04774006893477068632",
-"displayName": "ryan",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjXeWpu7QcZuYqIl3p1mwqzS8XGFJ4RqA3Xjljfkm1DcFZ9M7A=s64",
-"emailAddress": "[email protected]",
-"role": "writer",
-"deleted": false,
-"pendingOwner": false
-},
-{
-"id": "anyoneWithLink",
-"type": "anyone",
-"kind": "drive#permission",
-"role": "reader",
-"allowFileDiscovery": false
-},
-{
-"id": "09147371668407854156",
-"displayName": "roman",
-"type": "user",
-"kind": "drive#permission",
-"photoLink": "https://lh3.googleusercontent.com/a-/ALV-UjWoGrFCgXcF6CtiBIBLnAfM68qUnQaJOcgvg3qzfQ3W8Ch6dA=s64",
-"emailAddress": "[email protected]",
-"role": "owner",
-"deleted": false,
-"pendingOwner": false
-}
-]
-}
-}
-},
{
"type": "FigureCaption",
-"element_id": "a8ac039aa1d77ac96ecd4c8c14a556d5",
+"element_id": "7803862f2804d04dfe8c38c4a353001d",
"text": "Equally, it is well established that living without access to electricity results in illness and death around the world, caused by everything from not having access to modern healthcare to household air pollution. As of today, 770 million people around the world do not have access to electricity, with over 75% of that population living in Sub-Saharan Africa. The world's poorest 4 billion people consume a mere 5% of the energy used in developed economies, and we need to find ways of delivering reliable electricity to the entire human population in a fashion that is sustainable. Household and ambient air pollution causes 8.7 million deaths each year, largely because of the continued use of fossil fuels. Widespread electrification is a key tool for delivering a just energy transition. Investment in nuclear, has become an urgent necessity. Discarding it, based on risk perceptions divorced from science, would be to abandon the moral obligation to ensure affordable, reliable, and sustainable energy for every community around the world.",
"metadata": {
"filetype": "application/pdf",
@@ -1275,53 +1275,9 @@
}
}
},
-{
-"type": "Image",
-"element_id": "b0197950e1af5c2aac10f5b67d61524a",
-"text": "",
-"metadata": {
-"filetype": "application/pdf",
-"languages": [
-"eng"
-],
-"page_number": 8,
-"data_source": {
-"url": "s3://utic-dev-tech-fixtures/small-pdf-set/recalibrating-risk-report.pdf",
-"version": "e690f37ef36368a509d150f373a0bbe0",
-"record_locator": {
-"protocol": "s3",
-"remote_file_path": "s3://utic-dev-tech-fixtures/small-pdf-set/"
-},
-"date_created": "1676196572.0",
-"date_modified": "1676196572.0"
-}
-}
-},
-{
-"type": "Image",
-"element_id": "34d2dd4af420ea3fdddc8fc5d581cac2",
-"text": "",
-"metadata": {
-"filetype": "application/pdf",
-"languages": [
-"eng"
-],
-"page_number": 8,
-"data_source": {
-"url": "s3://utic-dev-tech-fixtures/small-pdf-set/recalibrating-risk-report.pdf",
-"version": "e690f37ef36368a509d150f373a0bbe0",
-"record_locator": {
-"protocol": "s3",
-"remote_file_path": "s3://utic-dev-tech-fixtures/small-pdf-set/"
-},
-"date_created": "1676196572.0",
-"date_modified": "1676196572.0"
-}
-}
-},
{
"type": "FigureCaption",
-"element_id": "a8ac039aa1d77ac96ecd4c8c14a556d5",
+"element_id": "7803862f2804d04dfe8c38c4a353001d",
"text": "Equally, it is well established that living without access to electricity results in illness and death around the world, caused by everything from not having access to modern healthcare to household air pollution. As of today, 770 million people around the world do not have access to electricity, with over 75% of that population living in Sub-Saharan Africa. The world's poorest 4 billion people consume a mere 5% of the energy used in developed economies, and we need to find ways of delivering reliable electricity to the entire human population in a fashion that is sustainable. Household and ambient air pollution causes 8.7 million deaths each year, largely because of the continued use of fossil fuels. Widespread electrification is a key tool for delivering a just energy transition. Investment in nuclear, has become an urgent necessity. Discarding it, based on risk perceptions divorced from science, would be to abandon the moral obligation to ensure affordable, reliable, and sustainable energy for every community around the world.",
"metadata": {
"filetype": "application/pdf",
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.15.10-dev2" # pragma: no cover
+__version__ = "0.15.10-dev3" # pragma: no cover
8 changes: 4 additions & 4 deletions unstructured/partition/pdf_image/pdfminer_processing.py
@@ -223,7 +223,7 @@ def clean_pdfminer_inner_elements(document: "DocumentLayout") -> "DocumentLayout
     """

     for page in document.pages:
-        table_boxes = [e.bbox for e in page.elements if e.type == ElementType.TABLE]
+        non_pdfminer_element_boxes = [e.bbox for e in page.elements if e.source != Source.PDFMINER]
         element_boxes = []
         element_to_subregion_map = {}
         subregion_indice = 0
@@ -234,10 +234,10 @@ def clean_pdfminer_inner_elements(document: "DocumentLayout") -> "DocumentLayout
             element_to_subregion_map[i] = subregion_indice
             subregion_indice += 1

-        is_element_subregion_of_tables = (
+        is_element_subregion_of_other_elements = (
             bboxes1_is_almost_subregion_of_bboxes2(
                 element_boxes,
-                table_boxes,
+                non_pdfminer_element_boxes,
                 env_config.EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD,
             ).sum(axis=1)
             == 1
@@ -248,7 +248,7 @@ def clean_pdfminer_inner_elements(document: "DocumentLayout") -> "DocumentLayout
             for i, e in enumerate(page.elements)
             if (
                 (i not in element_to_subregion_map)
-                or not is_element_subregion_of_tables[element_to_subregion_map[i]]
+                or not is_element_subregion_of_other_elements[element_to_subregion_map[i]]
             )
         ]

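
The filtering step in the hunks above can be sketched in plain Python. This is an illustrative, loop-based stand-in and not the library's vectorized `bboxes1_is_almost_subregion_of_bboxes2`; the helper names, the `(x1, y1, x2, y2)` box layout, and the definition of "almost subregion" (the share of the inner box covered by the outer box exceeds the threshold) are assumptions. The `hits != 1` test mirrors the `.sum(axis=1) == 1` condition in the diff, under which an element is dropped when it falls inside exactly one other box.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def area(box: Box) -> float:
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a: Box, b: Box) -> float:
    # Overlap rectangle; area() clamps negative extents to zero.
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def is_almost_subregion(inner: Box, outer: Box, threshold: float) -> bool:
    # Assumed definition: covered fraction of `inner` must exceed `threshold`.
    inner_area = area(inner)
    if inner_area == 0:
        return False
    return intersection_area(inner, outer) / inner_area > threshold


def keep_mask(pdfminer_boxes: List[Box], other_boxes: List[Box], threshold: float) -> List[bool]:
    """True where a pdfminer element should be kept."""
    mask = []
    for inner in pdfminer_boxes:
        hits = sum(is_almost_subregion(inner, outer, threshold) for outer in other_boxes)
        mask.append(hits != 1)  # dropped only when inside exactly one other box
    return mask


pdfminer_boxes = [(10, 10, 20, 20), (200, 200, 210, 210)]
other_boxes = [(0, 0, 100, 100)]  # e.g. one detected, non-pdfminer element
mask = keep_mask(pdfminer_boxes, other_boxes, threshold=0.5)
```

Here the first box lies inside the single non-pdfminer box and is dropped, while the second does not overlap it and is kept. The threshold value `0.5` is a placeholder; the diff reads the real value from `env_config.EMBEDDED_TEXT_AGGREGATION_SUBREGION_THRESHOLD`.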