Extracted text data must be in utf-8 #799

Open
mjordan opened this issue Jul 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments

mjordan commented Jul 9, 2024

Within the create_media() function, extracted text data must be encoded as utf-8:

 5135         # extracted_text media must have their field_edited_text field populated for full text indexing.
 5136         if media_type == "extracted_text":
 5137             if check_file_exists(config, filename):
 5138                 media_json["field_edited_text"] = list()
 5139                 if os.path.isabs(filename) is False:
 5140                     filename = os.path.join(config["input_dir"], filename)
 5141                 extracted_text_file = open(filename, "r", -1, "utf-8")
 5142                 media_json["field_edited_text"].append(
 5143                     {"value": extracted_text_file.read()}
 5144                 )
 5145             else:
 5146                 logging.error("Extracted text file %s not found.", filename)

If the file is not UTF-8, the read() call on line 5143 raises an exception such as UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 922: invalid start byte.

A short-term fix is to catch this error and skip loading the text.
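
A minimal sketch of that short-term fix, assuming the read is moved into a small helper. The name read_extracted_text is hypothetical and not part of the existing code; the caller would populate field_edited_text only when the helper returns text.

```python
import logging


def read_extracted_text(path):
    """Return the file's text, or None if it is not valid UTF-8.

    Hypothetical helper: the name and the None return convention are
    assumptions made for illustration.
    """
    try:
        # The context manager closes the file even if decoding fails.
        with open(path, "r", encoding="utf-8") as extracted_text_file:
            return extracted_text_file.read()
    except UnicodeDecodeError as e:
        # Log the problem and skip the text instead of aborting the run.
        logging.error(
            "Extracted text file %s is not valid UTF-8 (%s); text not loaded.",
            path,
            e,
        )
        return None
```

With this shape, the media would still be created, just without indexed text, and the log entry identifies the file that needs re-encoding.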

@mjordan mjordan added the bug Something isn't working label Jul 9, 2024
mjordan commented Jul 12, 2024

This also applies to media track files.

mjordan commented Sep 13, 2024

It would be good to validate files as UTF-8, both in --check and in non-check runs. Maybe provide a config setting so users can decide which files (based on media use tid?) are validated.
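
One way such a validation pass might look, as a sketch only. The function names find_utf8_error and validate_utf8 are hypothetical, and how the list of paths is chosen (for example by media use tid) is left open, as in the comment above.

```python
def find_utf8_error(path):
    """Return None if the file decodes as UTF-8, else a description of the first bad byte.

    Hypothetical helper for illustration.
    """
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.start is the offset of the first undecodable byte.
        return f"byte 0x{data[e.start]:02x} at position {e.start}: {e.reason}"
    return None


def validate_utf8(paths):
    """Map each path that is not valid UTF-8 to its error description."""
    errors = {}
    for path in paths:
        problem = find_utf8_error(path)
        if problem is not None:
            errors[path] = problem
    return errors
```

Reading in binary mode and decoding explicitly keeps the check independent of the platform's default encoding, and the same function could be called from both the check and the non-check code paths.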

mjordan added a commit that referenced this issue Sep 14, 2024
…t determing which files to check not yet implemented.