Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/gale local ocr #671

Merged
merged 26 commits into from
Sep 20, 2024
Merged
Show file tree
Hide file tree
Changes from 22 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
9a45cd9
Initial commit
laurejt Sep 18, 2024
58eacbf
Add local settings config for Gale local ocr path
laurejt Sep 18, 2024
be9cabe
Added a local settings parameter for gale local ocr path
laurejt Sep 18, 2024
5ee0a99
Fix failing unit tests by setting GALE_LOCAL_OCR setting
laurejt Sep 19, 2024
973dbbc
Correcting typo to fix unit test
laurejt Sep 19, 2024
1d5fbee
* Fixed typo when setting content to ocrText.
laurejt Sep 19, 2024
d9c73fb
Unit text fix of malformed overriding of settings
laurejt Sep 19, 2024
6170366
Corrected improper calls to get_local_ocr in unit tests
laurejt Sep 19, 2024
f4926cc
Small pathlib fix (join --> join_path)
laurejt Sep 19, 2024
7e2af1b
Small pathlib fix (join_path --> joinpath)
laurejt Sep 19, 2024
c017179
Ficing filename typo in unit test
laurejt Sep 19, 2024
e7e1ee3
Fixed typo of stub dir name in get_local_ocr unit test
laurejt Sep 19, 2024
c84d75b
Added missed mocked input parameter
laurejt Sep 19, 2024
e6c64d6
Fixed typo
laurejt Sep 19, 2024
f16d3d5
Attempt to fix unit test by resetting mock method
laurejt Sep 19, 2024
463dc4d
Fixed typo from last commit
laurejt Sep 19, 2024
b8513ef
Fixed return value error for mocked method (resetting was not enough)
laurejt Sep 19, 2024
1a94dd0
Another typo fixed
laurejt Sep 19, 2024
5f28589
Updated gale label logic (and unit tests accordingly)
laurejt Sep 19, 2024
f377789
Updated gale page indexing test
laurejt Sep 19, 2024
4cbe4bf
One more typo squashed
laurejt Sep 19, 2024
ef6b224
Dealt with state modification in unit test
laurejt Sep 19, 2024
f2e3a11
Apply suggestions from code review
laurejt Sep 20, 2024
debf92e
Additional fix to unit test suggestion update
laurejt Sep 20, 2024
3d70d8c
Fixed path issue and added additional comments
laurejt Sep 20, 2024
c8e4b0a
Added missing module import
laurejt Sep 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions DEPLOYNOTES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,7 @@ Deploy and Upgrade notes
----

* EEBO-TPC import requires configuring **EEBO_DATA** path in local settings.
* To use local OCR for Gale page content, configure **GALE_LOCAL_OCR** path in local settings.

3.12.1
------
Expand Down
33 changes: 27 additions & 6 deletions ppa/archive/gale.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,22 @@
logger = logging.getLogger(__name__)


def get_local_ocr(item_id, page_num):
"""
Get local OCR page text for specified page of a single Gale volume
"""
laurejt marked this conversation as resolved.
Show resolved Hide resolved
ocr_dir = getattr(settings, "GALE_LOCAL_OCR", None)
if not ocr_dir:
raise ImproperlyConfigured(
"GALE_LOCAL_OCR configuration is required for indexing Gale page content"
)

stub_dir = item_id[::3][1:] # Following conventions set in ppa-nlp
ocr_txt_fp = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
with open(ocr_txt_fp) as reader:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to ensure that the constructed file path is contained within a safe root directory. This can be achieved by normalizing the path using os.path.normpath and then checking that the normalized path starts with the root directory. This approach will prevent path traversal attacks by ensuring that the final path does not escape the intended directory.

  1. Normalize the constructed file path using os.path.normpath.
  2. Check that the normalized path starts with the root directory (ocr_dir).
  3. Raise an exception if the check fails.
Suggested changeset 1
ppa/archive/gale.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/ppa/archive/gale.py b/ppa/archive/gale.py
--- a/ppa/archive/gale.py
+++ b/ppa/archive/gale.py
@@ -38,3 +38,5 @@
     stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
-    ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
+    ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
+    if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
+        raise Exception("Access to the specified file is not allowed")
     with open(ocr_txt_fp) as reader:
EOF
@@ -38,3 +38,5 @@
stub_dir = item_id[::3][1:] # Following conventions set in ppa-nlp
ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
raise Exception("Access to the specified file is not allowed")
with open(ocr_txt_fp) as reader:
Copilot is powered by AI and may make mistakes. Always verify output.
Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This copilot suggestion is interesting, I wasn't familiar with normpath. This shouldn't happen in our code the way we're using the new method, although the check shouldn't hurt. I wonder if using os.path.join helps any instead of using the f-string to construct the path.

Copy link
Contributor Author

@laurejt laurejt Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using os.path.join seems like a better idea to me.

return reader.read()


class GaleAPIError(Exception):
"""Base exception class for Gale API errors"""

Expand Down Expand Up @@ -192,14 +208,19 @@
# iterate through the pages in the response
for page in gale_record["pageResponse"]["pages"]:
page_number = page["pageNumber"]
rlskoeser marked this conversation as resolved.
Show resolved Hide resolved
# page label (original page number) should be set in folioNumber,
# but is not set for all volumes; fallback to page number
# converted to integer to drop leading zeroes
page_label = page.get("folioNumber", int(page_number))
tags = []
laurejt marked this conversation as resolved.
Show resolved Hide resolved
try:
ocr_text = get_local_ocr(item_id, page_number)
tags = ["local_ocr"]
except FileNotFoundError as e:
ocr_text = page.get("ocrText") # some pages have no text
logger.warning(f'Local OCR not found for {item_id} {page_number}')

info = {
"page_id": page_number,
"content": page.get("ocrText"), # some pages have no text
"label": page_label,
"content": ocr_text,
"label": page.get("folioNumber"),
laurejt marked this conversation as resolved.
Show resolved Hide resolved
"tags": tags,
# image id needed for thumbnail url; use solr dynamic field
"image_id_s": page["image"]["id"],
# index image url since we will need it when Gale API changes
Expand Down
2 changes: 1 addition & 1 deletion ppa/archive/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -1277,7 +1277,7 @@ def page_index_data(cls, digwork, gale_record=None):
"cluster_id_s": digwork.index_cluster_id, # for grouping with cluster
"order": i,
# make sure label is set;
# fallback to sequence number if no label, but mark with brackets
# fallback to sequence number if no (or null) label, but mark with brackets
"label": page_info.get("label") or f"[{i}]",
"item_type": "page",
}
Expand Down
73 changes: 73 additions & 0 deletions ppa/archive/tests/test_gale.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,28 @@
from ppa.archive.tests.test_models import FIXTURES_PATH


@override_settings()
def test_get_local_ocr(tmp_path):
item_id = "CB0123456789"
page_num = "0001"
content = "Testing...\n1\n2\n3"
# Mock ocr files for testing
ocr_dir = tmp_path.joinpath("147", item_id)
ocr_dir.mkdir(parents=True)
ocr_file = ocr_dir.joinpath(f"{item_id}_{page_num}0.txt")
ocr_file.write_text(content)

with override_settings(GALE_LOCAL_OCR=f"{tmp_path}"):
assert content == gale.get_local_ocr(item_id, page_num)


@override_settings()
laurejt marked this conversation as resolved.
Show resolved Hide resolved
def test_get_local_ocr_config_error():
del settings.GALE_LOCAL_OCR
with pytest.raises(ImproperlyConfigured):
gale.get_local_ocr("item_id", "page_num")


@override_settings(GALE_API_USERNAME="galeuser123")
@patch("ppa.archive.gale.requests")
class TestGaleAPI(TestCase):
Expand Down Expand Up @@ -219,6 +241,57 @@ def test_get_item(self, mockrequests):
with pytest.raises(gale.GaleAPIError):
gale_api.get_item("CW123456")

@patch("ppa.archive.gale.get_local_ocr")
@patch("ppa.archive.gale.GaleAPI.get_item")
def test_get_item_pages(self, mock_get_item, mock_get_local_ocr, mockrequests):
item_id = "CW0123456789"
# Set up API
gale_api = gale.GaleAPI()
test_pages = [
{
"pageNumber": "0001",
"folioNumber": "i",
"image": {"id": "09876001234567", "url": "http://example.com/img/1"}
# some pages have no ocr text
},
{
"pageNumber": "0002",
"image": {"id": "08765002345678", "url": "http://example.com/img/2"},
"ocrText": "more test content",
},
{
"pageNumber": "0003",
"image": {"id": "0765400456789", "url": "http://example.com/img/3"},
"ocrText": "ignored text",
},
]
api_response = {
"doc": {}, # unused for this test
"pageResponse": {"pages": test_pages},
}
mock_get_item.return_value = api_response
# Set up get_local_ocr so that only the 3rd page's text is found
mock_get_local_ocr.side_effect = [FileNotFoundError, FileNotFoundError, "local ocr text"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

page_data = list(gale_api.get_item_pages(item_id))
mock_get_item.called_once()
assert mock_get_local_ocr.call_count == 3
assert len(page_data) == 3
assert [ p["page_id"] for p in page_data ] == ["0001", "0002", "0003"]
assert [ p["content"] for p in page_data ] == [None, "more test content", "local ocr text"]
assert [ p["label"] for p in page_data ] == ["i", None, None]
assert [ p["tags"] for p in page_data ] == [ [], [], ["local_ocr"] ]
assert [ p["image_id_s"] for p in page_data ] == ["09876001234567", "08765002345678", "0765400456789"]
assert [ p["image_url_s"] for p in page_data ] == [f"http://example.com/img/{i+1}" for i in range(3)]

# skip apip call if record is provided
laurejt marked this conversation as resolved.
Show resolved Hide resolved
mock_get_item.reset_mock()
mock_get_local_ocr.reset_mock()
mock_get_local_ocr.side_effect = [FileNotFoundError, FileNotFoundError, "local ocr text"]
page_data = list(gale_api.get_item_pages(item_id, api_response))
mock_get_item.assert_not_called()
assert mock_get_local_ocr.call_count == 3
assert len(page_data) == 3


@override_settings(MARC_DATA="/path/to/data/marc")
@patch("ppa.archive.gale.PairtreeStorageFactory")
Expand Down
1 change: 1 addition & 0 deletions ppa/archive/tests/test_import_util.py
Original file line number Diff line number Diff line change
Expand Up @@ -345,6 +345,7 @@ def test_import_digitizedwork_marc_error(self, mock_gale_api, mock_get_marc_reco
assert importer.results[test_id] == not_found_error

# username is required to init GaleAPI class, but API is not actually used
@override_settings(GALE_LOCAL_OCR="unused")
@override_settings(GALE_API_USERNAME="unused")
@patch("ppa.archive.import_util.get_marc_record")
@patch("ppa.archive.import_util.GaleAPI")
Expand Down
77 changes: 39 additions & 38 deletions ppa/archive/tests/test_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -1151,7 +1151,7 @@ def test_page_index_data_eebotcp(self):
page_data = list(page_data)
assert page_data[0]["label"] == "[1]"
assert "Licensed,\nROBERT MIDGLEY." in page_data[0]["content"]
# eebo-tcp page data logic is tested more thoroughly elsewheer
# eebo-tcp page data logic is tested more thoroughly elsewhere

def test_page_index_data_suppressed(self):
# if item is suppressed - no page data
Expand All @@ -1165,56 +1165,57 @@ def test_page_index_data_nonhathi(self):
assert not list(Page.page_index_data(nonhathi_work))

# username is required to init GaleAPI class, but API is not actually used
@override_settings(GALE_LOCAL_OCR="unused")
@override_settings(GALE_API_USERNAME="unused")
@patch.object(gale.GaleAPI, "get_item")
def test_gale_page_index_data(self, mock_gale_get_item):
@patch.object(gale.GaleAPI, "get_item_pages")
def test_gale_page_index_data(self, mock_gale_get_item_pages):
gale_work = DigitizedWork(source=DigitizedWork.GALE, source_id="CW123456")
test_pages = [
test_page_data = [
{
"pageNumber": "0001",
"folioNumber": "i",
"image": {"id": "09876001234567", "url": "http://example.com/img/1"}
# some pages have no ocr text
"page_id": "0001",
"content": None,
"label": "i",
"tags": [],
"image_id_s": "09876001234567",
"image_url_s": "http://example.com/img/1",
},
{
"pageNumber": "0002",
"image": {"id": "08765002345678", "url": "http://example.com/img/2"},
"ocrText": "more test content",
"page_id": "0002",
"content": "original ocr content",
"label": None,
"tags": [],
"image_id_s": "08765002345678",
"image_url_s": "http://example.com/img/2",
},
{
"page_id": "0003",
"content": "local ocr content",
"label": None,
"tags": ["local_ocr"],
"image_id_s": "0765400456789",
"image_url_s": "http://example.com/img/3",
}
]
api_response = {
"doc": {}, # unused for this test
"pageResponse": {"pages": test_pages},
}
mock_gale_get_item.return_value = api_response
mock_gale_get_item_pages.return_value = test_page_data
page_data = list(Page.page_index_data(gale_work))
assert len(page_data) == 2
assert len(page_data) == 3
for i, index_data in enumerate(page_data):
assert (
index_data["id"]
== f"{gale_work.source_id}.{test_pages[i]['pageNumber']}"
)
assert index_data["id"] == f"{gale_work.source_id}.000{i+1}"
assert index_data["source_id"] == gale_work.source_id
assert index_data["content"] == test_pages[i].get("ocrText")
assert index_data["group_id_s"] == gale_work.index_id()
assert index_data["cluster_id_s"] == gale_work.index_cluster_id
assert index_data["order"] == i + 1
# should use folio number when set
if "folioNumber" in test_pages[i]:
assert index_data["label"] == test_pages[i]["folioNumber"]
else:
assert index_data["label"] == int(test_pages[i]["pageNumber"])
assert index_data["label"] == test_page_data[i]["label"] or f"[{i+1}]"
assert index_data["item_type"] == "page"
assert index_data["image_id_s"] == test_pages[i]["image"]["id"]
assert index_data["image_url_s"] == test_pages[i]["image"]["url"]

# skip api call if item data is passed in
mock_gale_get_item.reset_mock()
page_data = list(Page.page_index_data(gale_work, api_response))
assert mock_gale_get_item.get_item.call_count == 0
assert len(page_data) == 2
assert "page_id" not in index_data # this field is not preserved
assert index_data["content"] == test_page_data[i]["content"]
assert index_data["tags"] == test_page_data[i]["tags"]
assert index_data["image_id_s"] == test_page_data[i]["image_id_s"]
assert index_data["image_url_s"] == test_page_data[i]["image_url_s"]

# limit if page range specified
gale_excerpt = DigitizedWork(
source=DigitizedWork.GALE, source_id="CW123456", pages_digital="2-3"
source=DigitizedWork.GALE, source_id="CW123456", pages_digital="2-4"
)
page_data = list(Page.page_index_data(gale_excerpt, api_response))
assert len(page_data) == 1
page_data = list(Page.page_index_data(gale_excerpt))
assert len(page_data) == 2
3 changes: 3 additions & 0 deletions ppa/settings/local_settings.py.sample
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,9 @@ GALE_API_USERNAME = ''
# local path for cached marc record; needed for Gale/ECCO import
MARC_DATA = ''

# local path for Gale OCR data
GALE_LOCAL_OCR = ''

# local path for importing and indexing selected EEBO-TCP content
# should contain xml and marc files named by TCP id
EEBO_DATA = ""
Expand Down
Loading