Commit

Merge remote-tracking branch 'origin/main' into potter/update-astradb-naming
potter-potter committed Aug 5, 2024
2 parents 7bd61fc + b749b89 commit 44804d8
Showing 19 changed files with 468 additions and 385 deletions.
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -1,11 +1,13 @@
## 0.15.1-dev7
## 0.15.1-dev9

### Enhancements

* **Improve `pdfminer` embedded `image` extraction to exclude text elements and produce more accurate bounding boxes.** This results in cleaner, more precise element extraction in `pdf` partitioning.

### Features

* **Update partition_eml and partition_msg to capture cc, bcc, and message_id fields** Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning and `Recipient` elements are generated for cc and bcc when `include_headers=True` for email partitioning.
* **Mark ingest as deprecated** Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.

* **Add `pdf_hi_res_max_pages` argument for partitioning, which allows rejecting PDF files that exceed this page number limit, when the `high_res` strategy is chosen.** By default, it will allow parsing PDF files with an unlimited number of pages.

### Fixes
4 changes: 3 additions & 1 deletion example-docs/eml/fake-email-header.eml
@@ -1,13 +1,15 @@
Received: from ABCDEFG-000.ABC.guide (00.0.0.00) by ABCDEFG-000.ABC.guide
([ba23::58b5:2236:45g2:88h2]) with Unstructured TTTT Server (version=ABC0_0,
cipher=ABC_ABCDE_ABC_NOPE_ABC_000_ABC_ABC000) id 00.0.000.0 via Techbox
Transport; Wed, 20 Feb 2023 10:03:18 +1200
Transport; Wed, 20 Feb 2023 10:03:18 +1200
MIME-Version: 1.0
Date: Fri, 16 Dec 2022 17:04:16 -0500
Bcc: Hello <[email protected]>
Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>
Subject: Test Email
From: Matthew Robinson <[email protected]>
To: Matthew Robinson <[email protected]>
Cc: Fake Email <[email protected]>, [email protected]
Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"

--00000000000095c9b205eff92630
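
The `Cc`, `Bcc`, and `Message-ID` headers in this fixture are the fields the changelog entry says are now captured. As a rough illustration of what they contain, the sketch below reads the same three headers with Python's standard `email` package. It is not how `partition_eml` is implemented; the `RAW` string is an abbreviated stand-in for the fixture, and the `example.com` addresses are placeholders, since the real addresses are redacted on this page.

```python
from email import message_from_string
from email.utils import getaddresses

# Abbreviated stand-in for fake-email-header.eml.
# The example.com addresses are placeholders, not the fixture's real values.
RAW = (
    "MIME-Version: 1.0\n"
    "Date: Fri, 16 Dec 2022 17:04:16 -0500\n"
    "Bcc: Hello <hello@example.com>\n"
    "Message-ID: <CADc-_xaLB2FeVQ7mNsoX+NJb_7hAJhBKa_zet-rtgPGenj0uVw@mail.gmail.com>\n"
    "Subject: Test Email\n"
    "Cc: Fake Email <fake@example.com>, other@example.com\n"
    "\n"
    "body\n"
)

msg = message_from_string(RAW)

# getaddresses splits comma-separated recipients into (display name, address) pairs.
cc = getaddresses(msg.get_all("Cc", []))
bcc = getaddresses(msg.get_all("Bcc", []))
message_id = msg["Message-ID"]

print(cc)
print(bcc)
print(message_id)
```

A `Cc` header can carry several recipients, with or without display names, which is why the fixture includes one of each.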
Binary file added example-docs/fake-email-with-cc-and-bcc.msg
Binary file not shown.
5 changes: 5 additions & 0 deletions test_unstructured/partition/pdf_image/test_pdf.py
@@ -1372,14 +1372,18 @@ def test_analysis_artifacts_saved():
        ("pdf/layout-parser-paper-with-empty-pages.pdf", 3, True),
        ("pdf/reliance.pdf", 3, False),
        ("pdf/reliance.pdf", 2, True),
        ("img/DA-1p.jpg", None, False),
        ("img/DA-1p.jpg", 2, False),
    ],
)
def test_pdf_hi_res_max_pages_argument(filename, pdf_hi_res_max_pages, expected_error):
    is_image = not Path(filename).suffix.endswith("pdf")
    if not expected_error:
        pdf.partition_pdf_or_image(
            filename=example_doc_path(filename),
            strategy=PartitionStrategy.HI_RES,
            pdf_hi_res_max_pages=pdf_hi_res_max_pages,
            is_image=is_image,
        )

    else:
@@ -1388,4 +1392,5 @@ def test_pdf_hi_res_max_pages_argument(filename, pdf_hi_res_max_pages, expected_
                filename=example_doc_path(filename),
                strategy=PartitionStrategy.HI_RES,
                pdf_hi_res_max_pages=pdf_hi_res_max_pages,
                is_image=is_image,
            )
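
The new test cases add an image input and derive `is_image` from the file suffix. The sketch below isolates that expression and pairs it with a hypothetical page-limit check to show how the parametrized cases could line up. `is_image_file` and `exceeds_page_limit` are illustrative names, not functions from the library, and treating images as exempt from the limit is an assumption drawn only from the `("img/DA-1p.jpg", 2, False)` case.

```python
from pathlib import Path
from typing import Optional


def is_image_file(filename: str) -> bool:
    """Same expression as the test: true unless the suffix ends in 'pdf'."""
    return not Path(filename).suffix.endswith("pdf")


def exceeds_page_limit(num_pages: int, max_pages: Optional[int], is_image: bool) -> bool:
    """Hypothetical check, not the library's code.

    Assumes None means "no limit" (the changelog's default) and that
    image inputs are never rejected.
    """
    if is_image or max_pages is None:
        return False
    return num_pages > max_pages


# The two reliance.pdf cases (limit 3 passes, limit 2 errors) imply a 3-page file
# under this "reject when pages > limit" reading.
for name, pages, limit in [
    ("pdf/reliance.pdf", 3, 3),
    ("pdf/reliance.pdf", 3, 2),
    ("img/DA-1p.jpg", 1, 2),
]:
    print(name, limit, exceeds_page_limit(pages, limit, is_image_file(name)))
```

The suffix comparison is case-sensitive, so a file named with an uppercase `.PDF` extension would be classified as an image by this expression.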