Extracted text data must be in utf-8 #799

mjordan · 2024-07-09T10:48:31Z

Within the create_media() function, extracted text data must be encoded as utf-8:

 5135         # extracted_text media must have their field_edited_text field populated for full text indexing.
 5136         if media_type == "extracted_text":
 5137             if check_file_exists(config, filename):
 5138                 media_json["field_edited_text"] = list()
 5139                 if os.path.isabs(filename) is False:
 5140                     filename = os.path.join(config["input_dir"], filename)
 5141                 extracted_text_file = open(filename, "r", -1, "utf-8")
 5142                 media_json["field_edited_text"].append(
 5143                     {"value": extracted_text_file.read()}
 5144                 )
 5145             else:
 5146                 logging.error("Extracted text file %s not found.", filename)

If it is not, line 5143 produces exceptions like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 922: invalid start byte.

Short-term fix is to catch this error and not load the text.
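
A minimal sketch of that short-term fix, assuming the file read is pulled out into a standalone helper. The name `read_extracted_text` is hypothetical and is not taken from the codebase. It returns `None` instead of raising when the file is not valid UTF-8, so the caller can skip populating `field_edited_text`:

```python
import logging


def read_extracted_text(path):
    """Return the file's text, or None if the file is not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        # The context manager also closes the file handle after reading.
        with open(path, "r", encoding="utf-8") as extracted_text_file:
            return extracted_text_file.read()
    except UnicodeDecodeError as e:
        logging.error(
            "Extracted text file %s is not valid UTF-8 (%s); text not loaded.",
            path,
            e,
        )
        return None
```

A caller would then append to `media_json["field_edited_text"]` only when the returned value is not `None`.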


mjordan · 2024-07-12T16:56:36Z

This also applies to media track files.

mjordan · 2024-09-13T16:08:02Z

It would be good to validate files as UTF-8, both with --check and without it. Maybe provide a config setting so users can decide which files (based on media use tid?) are validated.
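
One way such a validation could look, as a sketch only: the function name `file_is_utf8` and its `chunk_size` parameter are assumptions for illustration, not the function added in the referenced commit. It decodes the file in chunks with a strict incremental decoder, so large files are not read into memory at once:

```python
import codecs


def file_is_utf8(path, chunk_size=65536):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper for illustration only.
    """
    # Strict incremental decoder: raises UnicodeDecodeError on invalid bytes.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                decoder.decode(chunk)
        # final=True rejects a multi-byte sequence cut off at end of file.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

The same check could run during --check to report bad files up front, and during a normal run to skip them with a logged error.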


mjordan added the bug label Jul 9, 2024

mjordan added a commit that referenced this issue Sep 14, 2024

WIP on #799; function to check encoding, and unit tests, complete, but determing which files to check not yet implemented.

ca745ac
