refactor: restructure PDF/Image example document organization (#3410)
This PR aims to improve the organization and readability of our example
documents used in unit tests, specifically focusing on PDF and image
files.

### Summary
- Created two new subdirectories in the `example-docs` folder:
  - `pdf/`: for all PDF example files
  - `img/`: for all image example files
- Moved relevant PDF files from `example-docs/` to `example-docs/pdf/`
- Moved relevant image files from `example-docs/` to `example-docs/img/`
- Updated file paths in affected unit & ingest tests to reflect the new
directory structure
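
The move described in the list above could be sketched roughly as follows. This is an illustrative sketch only, not the script used for this PR: the function name `reorganize_example_docs` and the extension sets are assumptions, since the PR does not list which extensions were moved.

```python
import shutil
from pathlib import Path

# Assumed extension sets; the PR does not enumerate the extensions it moved.
PDF_EXTS = {".pdf"}
IMG_EXTS = {".bmp", ".heic", ".jpeg", ".jpg", ".png", ".tiff"}


def reorganize_example_docs(root: Path) -> dict[str, list[str]]:
    """Move top-level PDF and image files under `root` into `pdf/` and `img/`.

    Returns the names of the moved files keyed by destination subdirectory.
    Other file types, and anything already inside a subdirectory, stay put.
    """
    moved: dict[str, list[str]] = {"pdf": [], "img": []}
    # sorted() materializes the listing, so moving files does not disturb iteration.
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        if ext in PDF_EXTS:
            subdir = "pdf"
        elif ext in IMG_EXTS:
            subdir = "img"
        else:
            continue
        dest_dir = root / subdir
        dest_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(dest_dir / path.name))
        moved[subdir].append(path.name)
    return moved
```

In a Git working tree, `git mv` would probably be the better tool so the moves are recorded as renames; the sketch only illustrates the target layout.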

### Testing
All unit & ingest tests should be updated and verified to work with the
new file structure.
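
One way to check for stale paths after such a move is sketched below. The tests in this commit resolve fixtures through `example_doc_path` from `test_unstructured.unit_utils`, whose implementation is not shown here; `resolve_example_doc`, `missing_example_docs`, and `EXAMPLE_DOCS_DIR` are hypothetical stand-ins written for illustration.

```python
from pathlib import Path

# Hypothetical root; in the repository this would point at `example-docs/`.
EXAMPLE_DOCS_DIR = Path("example-docs")


def resolve_example_doc(relative_name: str, root: Path = EXAMPLE_DOCS_DIR) -> str:
    """Return the absolute path of `relative_name` under `root` as a string.

    `relative_name` may include a subdirectory, e.g. "pdf/some-file.pdf".
    """
    return str((root / relative_name).resolve())


def missing_example_docs(names: list[str], root: Path = EXAMPLE_DOCS_DIR) -> list[str]:
    """Return the entries of `names` that do not exist as files under `root`."""
    return [n for n in names if not Path(resolve_example_doc(n, root)).is_file()]
```

Running `missing_example_docs` over the file names referenced by the tests would flag any reference that still points at the old top-level location.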

### Notes
Other file types (e.g., office documents, HTML files) remain in the root
of `example-docs/` for now.

### Next Steps
Consider similar reorganization for other file types if this structure
proves to be beneficial.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: christinestraub <[email protected]>
3 people authored Jul 18, 2024
1 parent 5d38703 commit 0eb461a
Showing 80 changed files with 206 additions and 217 deletions.
1 change: 0 additions & 1 deletion CHANGELOG.md
@@ -15,7 +15,6 @@
### Fixes

* **Remedy error on Windows when `nltk` binaries are downloaded.** Work around a quirk in the Windows implementation of `tempfile.NamedTemporaryFile` where accessing the temporary file by name raises `PermissionError`.

* **Move Astra embedded_dimension to write config**

## 0.14.10
38 files renamed without changes.
6 changes: 3 additions & 3 deletions test_unstructured/file_utils/test_filetype.py
@@ -99,9 +99,9 @@ def test_detect_filetype_from_filename_with_extension(
 @pytest.mark.parametrize(
     ("file_name", "expected_value"),
     [
-        ("layout-parser-paper-fast.pdf", [FileType.PDF]),
+        ("pdf/layout-parser-paper-fast.pdf", [FileType.PDF]),
         ("fake.docx", [FileType.DOCX]),
-        ("example.jpg", [FileType.JPG]),
+        ("img/example.jpg", [FileType.JPG]),
         ("fake-text.txt", [FileType.TXT]),
         ("eml/fake-email.eml", [FileType.EML]),
         ("factbook.xml", [FileType.XML]),
@@ -424,7 +424,7 @@ def test_detect_BMP_from_file_path():


 def test_detect_BMP_from_file_no_extension():
-    with open(example_doc_path("bmp_24.bmp"), "rb") as f:
+    with open(example_doc_path("img/bmp_24.bmp"), "rb") as f:
         file = io.BytesIO(f.read())
     assert detect_filetype(file=file) == FileType.BMP

3 changes: 2 additions & 1 deletion test_unstructured/file_utils/test_metadata.py
@@ -7,9 +7,10 @@
 import pytest

 import unstructured.file_utils.metadata as meta
+from test_unstructured.unit_utils import example_doc_path

 DIRECTORY = pathlib.Path(__file__).parent.resolve()
-EXAMPLE_JPG_FILENAME = os.path.join(DIRECTORY, "..", "..", "example-docs", "example.jpg")
+EXAMPLE_JPG_FILENAME = example_doc_path("img/example.jpg")


 def test_get_docx_metadata_from_filename(tmpdir):
9 changes: 5 additions & 4 deletions test_unstructured/metrics/test_table_structure.py
@@ -3,6 +3,7 @@
 import numpy as np
 import pytest

+from test_unstructured.unit_utils import example_doc_path
 from unstructured.metrics.table.table_alignment import TableAlignment
 from unstructured.metrics.table.table_eval import TableEvalProcessor
 from unstructured.metrics.table_structure import (
@@ -14,8 +15,8 @@
 @pytest.mark.parametrize(
     "filename",
     [
-        "example-docs/table-multi-row-column-cells.png",
-        "example-docs/table-multi-row-column-cells.pdf",
+        example_doc_path("img/table-multi-row-column-cells.png"),
+        example_doc_path("pdf/table-multi-row-column-cells.pdf"),
     ],
 )
 def test_image_or_pdf_to_dataframe(filename):
@@ -25,8 +26,8 @@ def test_image_or_pdf_to_dataframe(filename):

 def test_eval_table_transformer_for_file():
     score = eval_table_transformer_for_file(
-        "example-docs/table-multi-row-column-cells.png",
-        "example-docs/table-multi-row-column-cells-actual.csv",
+        example_doc_path("img/table-multi-row-column-cells.png"),
+        example_doc_path("table-multi-row-column-cells-actual.csv"),
     )
     # avoid severe degradation of performance
     assert 0.8 < score < 1
3 changes: 2 additions & 1 deletion test_unstructured/partition/pdf_image/test_chipper.py
@@ -1,13 +1,14 @@
 import pytest

+from test_unstructured.unit_utils import example_doc_path
 from unstructured.partition import pdf
 from unstructured.partition.utils.constants import PartitionStrategy


 @pytest.fixture(scope="session")
 def chipper_results():
     elements = pdf.partition_pdf(
-        "example-docs/layout-parser-paper-fast.pdf",
+        filename=example_doc_path("pdf/layout-parser-paper-fast.pdf"),
         strategy=PartitionStrategy.HI_RES,
         model_name="chipper",
     )