Security Vulnerability #3

Open
S1M0N38 opened this issue Nov 21, 2024 · 0 comments
S1M0N38 commented Nov 21, 2024


Description

The formatted datasets (e.g., datasets/formatted_datasets/summarize/data.summarize.xxxx_xx_xx.json) should be JSONL files, where each line of the file is a valid JSON object.

Currently, each line is not JSON but a Python dictionary literal (with str and int keys/values).

When reading these custom “Python/JSONL” files, the built-in eval function is used:

def jsonl_file_read(file_name, max_num: int = -1) -> typing.Iterator:
    assert file_name.endswith(".jsonl")
    data_num = 0
    with open(file_name, encoding="utf-8") as fin:
        for idx, ln in enumerate(fin):
            if max_num >= 0 and idx + 1 > max_num:
                break
            try:
                obj = eval(ln)
                yield obj
                data_num += 1
            except Exception as err:
                print("read errors")

This approach introduces a serious security vulnerability. A user of this framework could unknowingly run scripts on a malicious dataset. For example, a single sample, hidden among a million other data points, could be crafted so that its line contains executable Python code. Because eval evaluates every line as a Python expression, reading that sample would lead to arbitrary code execution on the user’s machine.
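To make the risk concrete, here is a minimal, harmless sketch. The sample line is made up for illustration and only calls os.getcwd(), but any other expression would be evaluated the same way. The same line is rejected outright by json.loads.

```python
import json

# A made-up "dataset line" that looks like a dict but embeds a function call.
# The payload here is harmless (it only returns the current directory).
line = "{'text': 'an ordinary sample', 'label': __import__('os').getcwd()}"

# eval executes the embedded call while "parsing" the line.
obj = eval(line)
print("eval ran the embedded call, label =", obj["label"])

# json.loads only parses data; it never executes anything and rejects the line.
rejected = False
try:
    json.loads(line)
except json.JSONDecodeError as err:
    rejected = True
    print("json.loads rejected the line:", err)
```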

Proposed Fix

To mitigate this risk, I suggest:

  • Saving the lines in proper JSONL format.
  • Using Python’s built-in json library to read and write these files.
  • Completely avoiding the use of eval for parsing input.
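A possible shape for the fix, as a sketch rather than a drop-in patch: jsonl_file_read keeps the signature quoted above but parses with json.loads, while jsonl_file_write and convert_legacy_file are hypothetical helper names that do not exist in the repository. The converter assumes the existing files contain only plain literals and uses ast.literal_eval, which accepts literals but refuses function calls and attribute access.

```python
import ast
import json
import typing


def jsonl_file_write(file_name: str, records: typing.Iterable[dict]) -> None:
    """Write one JSON object per line (hypothetical helper)."""
    with open(file_name, "w", encoding="utf-8") as fout:
        for rec in records:
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")


def jsonl_file_read(file_name: str, max_num: int = -1) -> typing.Iterator[dict]:
    """Yield up to max_num objects from a JSONL file without using eval."""
    if not file_name.endswith(".jsonl"):
        raise ValueError(f"expected a .jsonl file, got {file_name!r}")
    with open(file_name, encoding="utf-8") as fin:
        for idx, ln in enumerate(fin):
            if max_num >= 0 and idx >= max_num:
                break
            try:
                yield json.loads(ln)
            except json.JSONDecodeError as err:
                # Report which line failed instead of a bare "read errors".
                print(f"line {idx + 1}: invalid JSON, skipped ({err})")


def convert_legacy_file(src: str, dst: str) -> None:
    """One-off migration of Python-dict lines to real JSONL (hypothetical helper).

    ast.literal_eval raises ValueError on anything that is not a plain literal,
    so a line containing a function call aborts the conversion instead of running.
    """
    with open(src, encoding="utf-8") as fin:
        records = [ast.literal_eval(ln) for ln in fin if ln.strip()]
    jsonl_file_write(dst, records)
```

Two caveats: JSON object keys are always strings, so any int keys in the current files would come back as str after conversion; and ast.literal_eval should still only be run on files you already trust, since very large or deeply nested input can exhaust memory or the parser stack.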