
Load-tabby is locale-dependent w.r.t file encoding #112

Open
mslw opened this issue Sep 25, 2023 · 0 comments · May be fixed by #116
Comments

@mslw
Contributor

mslw commented Sep 25, 2023

load_tabby (more specifically, TabbyLoader) reads a file using:

with src.open(newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')

This means that the file is read using the system default encoding (ref). This can cause problems, e.g. when reading an ISO-8859-1-encoded file (presumably generated on Windows) on a Linux machine (where UTF-8 is the likely default).
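The locale dependence is easy to demonstrate in isolation: the byte 0xB5 is "µ" in ISO-8859-1, but it is not a valid UTF-8 start byte, which is exactly the failure mode reported below. A minimal standalone reproduction (not datalad-tabby code):

```python
# The byte 0xB5 encodes "µ" in ISO-8859-1, but is invalid as a UTF-8 start byte.
data = "2 µg/gram of weight".encode("iso-8859-1")

# Decoding with the wrong (UTF-8) codec raises UnicodeDecodeError...
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# ...while naming the right encoding succeeds.
assert data.decode("iso-8859-1") == "2 µg/gram of weight"
```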

Reproducing: I encountered the problem when loading a dataset tabby file that contained the phrase "2 µg/gram of weight" in its description and was saved in ISO-8859-1 encoding (as reported by file -i). Loading crashed with:

  File "/home/mszczepanik/Documents/datalad-tabby/datalad_tabby/io/load.py", line 101, in _load_single
    for row_id, row in enumerate(reader):
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 298: invalid start byte

This happened in a data submission / data curation context. Personally, I don't mind treating this as a user error: either saving in ISO-8859-1 to begin with, or not checking the file encoding before proceeding (in the end, I converted the file with iconv).

I am not sure what the fix would be here, if any. Adding an encoding parameter to the loader and exposing it through the API would complicate loading, and would still require me to check the encoding upfront. Guesswork with tools like chardet or libmagic might be possible, but as far as I understand it cannot be perfect either.

mslw added a commit to mslw/datalad-tabby that referenced this issue Nov 13, 2023
If reading a tsv file with default encoding fails, roll out a
cannon (charset-normalizer) and try to guess encoding to use.

By default, `Path.open()` will use `locale.getencoding()` when reading
a file (which means that we implicitly use UTF-8, at least on
Linux). This would fail when reading files with non-ASCII characters
prepared (with not-uncommon settings) on Windows. There is no perfect
way to learn the encoding of a plain text file, but existing tools
seem to do a good job.

This commit refactors the tabby loader, makes it use a guessed encoding (but
only after the default fails) and closes psychoinformatics-de#112

https://charset-normalizer.readthedocs.io
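The try-default-then-guess strategy the commit describes can be sketched as follows. This is an assumption-laden standalone sketch, not the datalad-tabby code: `load_tsv_rows` is a hypothetical function name, while `from_path(...).best()` is charset-normalizer's guessing entry point.

```python
import csv
from pathlib import Path


def load_tsv_rows(src: Path):
    """Read TSV rows; guess the encoding only if the default fails.

    Hypothetical sketch of the approach in the commit above.
    """
    try:
        # First attempt: the platform default (locale-dependent) encoding.
        with src.open(newline="") as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
    except UnicodeDecodeError:
        # Fallback: let charset-normalizer guess from the raw bytes.
        from charset_normalizer import from_path

        best = from_path(src).best()
        if best is None:
            raise  # no plausible guess; re-raise the original error
        with src.open(newline="", encoding=best.encoding) as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
```

The import of charset-normalizer is deferred into the fallback branch, so files that decode fine with the default never touch the third-party dependency.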
mslw added a commit to mslw/datalad-tabby that referenced this issue Nov 21, 2023
By default, `Path.open()` uses `locale.getencoding()` when opening a
file for reading. This has caused problems when loading files
saved (presumably on Windows) with ISO-8859-1 encoding on Linux (where
UTF-8 is the default), see psychoinformatics-de#112

The default behaviour is maintained with `encoding=None`, and any
valid encoding name can be provided as an argument to `load_tabby`. The
encoding will be used for loading TSV files.

The encoding is stored as an attribute of `_TabbyLoader` rather than
passed as an input to the load functions: since they may end up being
called in several places (when sheet import statements are found), it
would otherwise require too much passing around.

With external libraries it might be possible to guess a file encoding
that produces a correct result based on the file's content, but
success is not guaranteed when there are few non-ASCII characters in
the entire file (think: a list of authors). Here, we do not attempt to
guess, instead expecting the user to know the encoding they need to
use.

Ref:
https://docs.python.org/3/library/pathlib.html#pathlib.Path.open
https://docs.python.org/3/library/functions.html#open
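The store-once design described in the commit can be sketched like this. The class and method names here are illustrative stand-ins, not the actual datalad-tabby implementation:

```python
import csv
from pathlib import Path
from typing import List, Optional


class TabbyLoaderSketch:
    """Illustrative sketch: the encoding is stored once on the loader
    and reused by every sheet-loading call, so it does not have to be
    threaded through nested calls triggered by sheet imports."""

    def __init__(self, encoding: Optional[str] = None):
        # None preserves the previous behavior: Path.open() falls back
        # to the locale-dependent default encoding.
        self._encoding = encoding

    def _load_single(self, src: Path) -> List[List[str]]:
        # The stored encoding is applied wherever a sheet is read.
        with src.open(newline="", encoding=self._encoding) as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
```

With this shape, `TabbyLoaderSketch(encoding="iso-8859-1")` would load the problematic file from the original report without a prior iconv conversion.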
@mslw mslw linked a pull request Nov 21, 2023 that will close this issue