Within the create_media() function, extracted text data must be encoded as utf-8:
5135 # extracted_text media must have their field_edited_text field populated for full text indexing.
5136 if media_type == "extracted_text":
5137     if check_file_exists(config, filename):
5138         media_json["field_edited_text"] = list()
5139         if os.path.isabs(filename) is False:
5140             filename = os.path.join(config["input_dir"], filename)
5141         extracted_text_file = open(filename, "r", -1, "utf-8")
5142         media_json["field_edited_text"].append(
5143             {"value": extracted_text_file.read()}
5144         )
5145     else:
5146         logging.error("Extracted text file %s not found.", filename)
If it is not, line 5143 produces exceptions like UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 922: invalid start byte.
A short-term fix is to catch this error and skip loading the text.
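A minimal sketch of that short-term fix, assuming the read is moved into a small helper. The name read_extracted_text() is hypothetical and does not exist in the current code. The caller would populate field_edited_text only when the helper returns a string.

```python
import logging


def read_extracted_text(filename):
    """Return the file's text, or None if the file is not valid UTF-8.

    Hypothetical helper for illustration; not part of the existing code.
    """
    try:
        # The context manager also closes the file handle after reading.
        with open(filename, "r", encoding="utf-8") as extracted_text_file:
            return extracted_text_file.read()
    except UnicodeDecodeError as e:
        # Log the problem and let the caller skip the text instead of crashing.
        logging.error(
            "Extracted text file %s is not valid UTF-8 (%s); not loading its text.",
            filename,
            e,
        )
        return None
```

With this approach the media would still be created, just without its extracted text, so the log entry is the only sign that the file needs re-encoding.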
It would be good to validate files as UTF-8, both with --check and without it. Maybe provide a config setting so users can decide which files (based on media use tid?) are validated.
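One way such a validation step could look, as a rough sketch only. Both function names are hypothetical, and how the list of files would be selected (for example by media use tid) is left open here.

```python
def is_valid_utf8(filename):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    with open(filename, "rb") as f:
        try:
            f.read().decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


def find_non_utf8_files(filenames):
    """Return the subset of filenames that are not valid UTF-8."""
    return [name for name in filenames if not is_valid_utf8(name)]
```

In --check the returned list could be reported as validation errors. Outside of --check the same test could decide whether to skip a file's text.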