Feature/gale local ocr #671

laurejt · 2024-09-19T21:08:30Z

No description provided.

* Added units for local ocr code

rlskoeser · 2024-09-20T15:55:12Z

ppa/archive/gale.py

+
+    stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
+    ocr_txt_fp = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
+    with open(ocr_txt_fp) as reader:


To fix the problem, we need to ensure that the constructed file path is contained within a safe root directory. This can be achieved by normalizing the path using os.path.normpath and then checking that the normalized path starts with the root directory. This approach will prevent path traversal attacks by ensuring that the final path does not escape the intended directory.

Normalize the constructed file path using os.path.normpath.

Check that the normalized path starts with the root directory (ocr_dir).

Raise an exception if the check fails.
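The steps above can be sketched roughly as follows. This is only an illustration, not the code merged in this PR: `safe_ocr_path` is a hypothetical helper, and it compares whole path components with `os.path.commonpath` instead of a string prefix, since a plain `startswith` check would also accept a sibling directory such as `/data/ocr_evil`.

```python
import os


def safe_ocr_path(ocr_dir, *parts):
    """Hypothetical helper: join parts under ocr_dir, refusing paths that escape it."""
    # abspath also normalizes the path (it applies normpath internally)
    root = os.path.abspath(ocr_dir)
    candidate = os.path.abspath(os.path.join(root, *parts))
    # commonpath compares whole path components, unlike str.startswith
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"{candidate} is outside {root}")
    return candidate
```

With this sketch, `safe_ocr_path("/data/ocr", "..", "ocr_evil", "f.txt")` raises `ValueError`, while a path that stays under the root is returned normalized.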

This Copilot suggestion is interesting; I wasn't familiar with normpath. Path traversal shouldn't happen in our code the way we're using the new method, although the check shouldn't hurt. I wonder whether using os.path.join instead of the f-string to construct the path helps at all.

Yeah, using os.path.join seems like a better idea to me.
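On the os.path.join question, a small sketch may help. All values below are hypothetical placeholders, and the behavior shown is for POSIX paths: `os.path.join` builds the same path as the f-string, but it does not remove `..` components, and an absolute component discards everything before it, so on its own it is not a traversal guard.

```python
import os

ocr_dir = "/data/ocr"      # hypothetical OCR root
item_id = "CW0123456789"   # hypothetical item id
page_num = "0001"          # hypothetical page number

# Every third character of the id, minus the first one
stub_dir = item_id[::3][1:]

fstring_path = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
joined_path = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")

# join keeps ".." components as they are
traversal = os.path.join(ocr_dir, "..", "secret.txt")
# an absolute component discards the earlier ones
absolute = os.path.join(ocr_dir, "/etc/passwd")
```

The main gains from `os.path.join` here are readability and portable separators; a containment check is still a separate step.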

rlskoeser

Looks great! The tests are nice, and thanks for catching that inferred page label issue.

I'm suggesting adding a few more comments and I think you should do something about the path issue that the AI code review flagged, but I think that's all that's needed.

rlskoeser · 2024-09-20T15:52:27Z

ppa/archive/tests/test_gale.py

+        }
+        mock_get_item.return_value = api_response
+        # Set up get_local_ocr so that only the 3rd page's text is found
+        mock_get_local_ocr.side_effect = [FileNotFoundError, FileNotFoundError, "local ocr text"]
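For readers unfamiliar with this mock setup, here is a minimal standalone sketch of how a `side_effect` list behaves: each call consumes the next item, and items that are exception classes are raised instead of returned. The `page_text` wrapper and its arguments are hypothetical and are not the code under test.

```python
from unittest.mock import Mock

# Each call consumes the next item; exception classes are raised
get_local_ocr = Mock(
    side_effect=[FileNotFoundError, FileNotFoundError, "local ocr text"]
)


def page_text(page_num):
    """Hypothetical caller: return local OCR text, or None if the file is missing."""
    try:
        return get_local_ocr("item", page_num)
    except FileNotFoundError:
        return None


results = [page_text(n) for n in ("0001", "0002", "0003")]
```

Only the third call yields text, which matches the scenario described in the test comment.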


rlskoeser

Thanks for adding so much detail on the local ocr setup, really helpful to have here.

The other revisions look good.

laurejt added 22 commits September 18, 2024 13:53

Initial commit

9a45cd9

Add local settings config for Gale local ocr path

58eacbf

Added a local settings parameter for gale local ocr path

be9cabe

Fix failing unit tests by setting GALE_LOCAL_OCR setting

5ee0a99

Correcting typo to fix unit test

973dbbc

* Fixed typo when setting content to ocrText.

1d5fbee

* Added units for local ocr code

Unit text fix of malformed overriding of settings

d9c73fb

Corrected improper calls to get_local_ocr in unit tests

6170366

Small pathlib fix (join --> join_path)

f4926cc

Small pathlib fix (join_path --> joinpath)

7e2af1b

Ficing filename typo in unit test

c017179

Fixed typo of stub dir name in get_local_ocr unit test

e7e1ee3

Added missed mocked input parameter

c84d75b

Fixed typo

e6c64d6

Attempt to fix unit test by resetting mock method

f16d3d5

Fixed typo from last commit

463dc4d

Fixed return value error for mocked method (resetting was not enough)

b8513ef

Another typo fixed

1a94dd0

Updated gale label logic (and unit tests accordingly)

5f28589

Updated gale page indexing test

f377789

One more typo squashed

4cbe4bf

Dealt with state modification in unit test

ef6b224

laurejt requested a review from rlskoeser September 19, 2024 21:08

laurejt self-assigned this Sep 19, 2024

github-advanced-security bot found potential problems Sep 19, 2024

rlskoeser approved these changes Sep 20, 2024

laurejt and others added 3 commits September 20, 2024 13:42

Apply suggestions from code review

f2e3a11

Co-authored-by: Rebecca Sutton Koeser <[email protected]>

Additional fix to unit test suggestion update

debf92e

Fixed path issue and added additional comments

3d70d8c

laurejt requested a review from rlskoeser September 20, 2024 18:06

rlskoeser approved these changes Sep 20, 2024

Added missing module import

c8e4b0a

laurejt merged commit 90d193f into develop Sep 20, 2024
10 checks passed

laurejt deleted the feature/gale-local-ocr branch September 20, 2024 18:21

@@ -38,3 +38,5 @@
                 stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
-                ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
+                ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
+                if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
+                    raise Exception("Access to the specified file is not allowed")
                 with open(ocr_txt_fp) as reader:
