Feature/gale local ocr #671
* Added units for local ocr code
ppa/archive/gale.py
Dismissed
stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
ocr_txt_fp = f"{ocr_dir}/{stub_dir}/{item_id}/{item_id}_{page_num}0.txt"
with open(ocr_txt_fp) as reader:
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
user-provided value
Copilot Autofix AI about 1 month ago
To fix the problem, we need to ensure that the constructed file path is contained within a safe root directory. This can be achieved by normalizing the path using os.path.normpath and then checking that the normalized path starts with the root directory. This approach will prevent path traversal attacks by ensuring that the final path does not escape the intended directory.
- Normalize the constructed file path using os.path.normpath.
- Check that the normalized path starts with the root directory (ocr_dir).
- Raise an exception if the check fails.
@@ -38,3 +38,5 @@
 stub_dir = item_id[::3][1:]  # Following conventions set in ppa-nlp
-ocr_txt_fp = os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
+ocr_txt_fp = os.path.normpath(os.path.join(ocr_dir, stub_dir, item_id, f"{item_id}_{page_num}0.txt"))
+if not ocr_txt_fp.startswith(os.path.normpath(ocr_dir)):
+    raise Exception("Access to the specified file is not allowed")
 with open(ocr_txt_fp) as reader:
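One caveat with the suggested check: str.startswith compares characters, not path components, so a sibling directory whose name merely begins with the same text as ocr_dir (for example ocr_dir plus a suffix) would still pass. A minimal sketch of a component-wise check using os.path.commonpath is shown here. The function name build_ocr_path and the sample item id in the usage note are hypothetical and are not taken from the pull request; only the stub_dir slicing and the file name pattern come from the code under review.

```python
import os


def build_ocr_path(ocr_dir: str, item_id: str, page_num: str) -> str:
    """Build the local OCR text path, refusing anything outside ocr_dir.

    Hypothetical helper for illustration; not the project's actual method.
    """
    # Same slicing as the code under review: every third character of the
    # item id, with the first of those characters dropped.
    stub_dir = item_id[::3][1:]
    root = os.path.realpath(ocr_dir)
    candidate = os.path.realpath(
        os.path.join(root, stub_dir, item_id, f"{item_id}_{page_num}0.txt")
    )
    # commonpath compares whole path components, so "<root>_other" is not
    # mistaken for a location inside root the way a string prefix test would.
    if os.path.commonpath([root, candidate]) != root:
        raise ValueError("OCR path escapes the configured OCR directory")
    return candidate
```

With a made-up item id such as "CW0123456789", the slice item_id[::3] gives "C147" and dropping the first character leaves the stub directory "147". An item id containing ".." segments that resolve above ocr_dir raises ValueError instead of returning a path.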
This Copilot suggestion is interesting; I wasn't familiar with normpath. This shouldn't happen in our code the way we're using the new method, although the check shouldn't hurt. I wonder if using os.path.join helps any instead of using the f-string to construct the path.
Yeah, using os.path.join seems like a better idea to me.
Looks great! The tests are nice, and thanks for catching that inferred page label issue.
I'm suggesting adding a few more comments and I think you should do something about the path issue that the AI code review flagged, but I think that's all that's needed.
}
mock_get_item.return_value = api_response
# Set up get_local_ocr so that only the 3rd page's text is found
mock_get_local_ocr.side_effect = [FileNotFoundError, FileNotFoundError, "local ocr text"]
nice!
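For readers less familiar with this mocking pattern, here is a small standalone sketch of how a list-valued side_effect behaves: each call consumes the next item, and items that are exception classes are raised rather than returned. The loop, the "item-id" argument, and the page numbers are assumptions made for illustration; the real test patches get_local_ocr inside the project's own code.

```python
from unittest.mock import Mock

# Hypothetical stand-in for the patched get_local_ocr in the test above.
mock_get_local_ocr = Mock(
    side_effect=[FileNotFoundError, FileNotFoundError, "local ocr text"]
)

page_texts = []
for page_num in ("0001", "0002", "0003"):
    try:
        page_texts.append(mock_get_local_ocr("item-id", page_num))
    except FileNotFoundError:
        # Exception classes in the side_effect list are raised, not returned,
        # so the first two pages end up with no local text.
        page_texts.append(None)
```

After the loop, only the third entry of page_texts holds text, which mirrors the "only the 3rd page's text is found" setup in the test.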
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
Thanks for adding so much detail on the local ocr setup, really helpful to have here.
The other revisions look good.