Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/gale local ocr #671

Merged
merged 26 commits into from
Sep 20, 2024
Merged

Feature/gale local ocr #671

merged 26 commits into from
Sep 20, 2024

Conversation

laurejt
Copy link
Contributor

@laurejt laurejt commented Sep 19, 2024

No description provided.

@laurejt laurejt self-assigned this Sep 19, 2024

stub_dir = item_id[::3][1:] # Following conventions set in ppa-nlp
ocr_txt_fp = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
with open(ocr_txt_fp) as reader:

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression High

This path depends on a
user-provided value
.

Copilot Autofix AI about 1 month ago

To fix the problem, we need to ensure that the constructed file path is contained within a safe root directory. This can be achieved by normalizing the path using os.path.normpath and then checking that the normalized path starts with the root directory. This approach will prevent path traversal attacks by ensuring that the final path does not escape the intended directory.

  1. Normalize the constructed file path using os.path.normpath.
  2. Check that the normalized path starts with the root directory (ocr_dir).
  3. Raise an exception if the check fails.
Suggested changeset 1
ppa/archive/gale.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/ppa/archive/gale.py b/ppa/archive/gale.py
--- a/ppa/archive/gale.py
+++ b/ppa/archive/gale.py
@@ -38,3 +38,5 @@
     stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
-    ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
+    ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
+    if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
+        raise Exception("Access to the specified file is not allowed")
     with open(ocr_txt_fp) as reader:
EOF
@@ -38,3 +38,5 @@
stub_dir = item_id[::3][1:] # Following conventions set in ppa-nlp
ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
raise Exception("Access to the specified file is not allowed")
with open(ocr_txt_fp) as reader:
Copilot is powered by AI and may make mistakes. Always verify output.
Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This copilot suggestion is interesting, I wasn't familiar with normpath. This shouldn't happen in our code the way we're using the new method, although the check shouldn't hurt. I wonder if using os.path.join helps any instead of using the f-string to construct the path.

Copy link
Contributor Author

@laurejt laurejt Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using os.path.join seems like a better idea to me.

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! The tests are nice, and thanks for catching that inferred page label issue.

I'm suggesting adding a few more comments and I think you should do something about the path issue that the AI code review flagged, but I think that's all that's needed.

ppa/archive/tests/test_gale.py Outdated Show resolved Hide resolved
}
mock_get_item.return_value = api_response
# Set up get_local_ocr so that only the 3rd page's text is found
mock_get_local_ocr.side_effect = [FileNotFoundError, FileNotFoundError, "local ocr text"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

ppa/archive/tests/test_gale.py Outdated Show resolved Hide resolved

stub_dir = item_id[::3][1:] # Following conventions set in ppa-nlp
ocr_txt_fp = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
with open(ocr_txt_fp) as reader:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This copilot suggestion is interesting, I wasn't familiar with normpath. This shouldn't happen in our code the way we're using the new method, although the check shouldn't hurt. I wonder if using os.path.join helps any instead of using the f-string to construct the path.

ppa/archive/gale.py Show resolved Hide resolved
ppa/archive/gale.py Show resolved Hide resolved
ppa/archive/gale.py Show resolved Hide resolved
ppa/archive/gale.py Show resolved Hide resolved
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding so much detail on the local ocr setup, really helpful to have here.

The other revisions look good.

@laurejt laurejt merged commit 90d193f into develop Sep 20, 2024
10 checks passed
@laurejt laurejt deleted the feature/gale-local-ocr branch September 20, 2024 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants