
Load-tabby is locale-dependent w.r.t file encoding #112

Open
mslw opened this issue Sep 25, 2023 · 0 comments · May be fixed by #116
Comments

@mslw
Contributor

mslw commented Sep 25, 2023

load_tabby (more specifically, TabbyLoader) reads a file using:

with src.open(newline='') as tsvfile:
    reader = csv.reader(tsvfile, delimiter='\t')

This means that the file is read using the system default encoding (ref). This can cause problems, e.g. when reading an ISO-8859-1-encoded file (presumably generated on Windows) on a Linux machine (where UTF-8 is the likely default).
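The locale dependence is easy to demonstrate in isolation: the byte 0xB5 is "µ" in ISO-8859-1, but it is not a valid UTF-8 start byte, which is exactly the failure mode reported below. A minimal standalone reproduction (not datalad-tabby code):

```python
# The byte 0xB5 encodes "µ" in ISO-8859-1, but is invalid as a UTF-8 start byte.
data = "2 µg/gram of weight".encode("iso-8859-1")

# Decoding with the wrong (UTF-8) codec raises UnicodeDecodeError...
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)

# ...while naming the right encoding succeeds.
assert data.decode("iso-8859-1") == "2 µg/gram of weight"
```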

Reproducing: I encountered the problem when loading a dataset tabby file that contained the phrase "2 µg/gram of weight" in its description and was saved in ISO-8859-1 encoding (as reported by file -i). Loading crashed with:

  File "/home/mszczepanik/Documents/datalad-tabby/datalad_tabby/io/load.py", line 101, in _load_single
    for row_id, row in enumerate(reader):
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 298: invalid start byte

This happened in a data submission / data curation context. Personally, I don't mind treating this as a user error: either saving in ISO-8859-1 to begin with, or not checking the file encoding before proceeding (in the end, I converted the file with iconv).

I am not sure what the fix would be here, if any. Adding an encoding parameter to the loader and exposing it through the API would complicate loading, and would still require me to check the encoding upfront. Guesswork with tools like chardet or libmagic might be possible, but as far as I understand it cannot be perfect either.

mslw added a commit to mslw/datalad-tabby that referenced this issue Nov 13, 2023
If reading a tsv file with default encoding fails, roll out a
cannon (charset-normalizer) and try to guess encoding to use.

By default, `Path.open()` will use `locale.getencoding()` when reading
a file (which means that we implicitly use UTF-8, at least on
Linux). This would fail when reading files with non-ASCII characters
prepared (with not-uncommon settings) on Windows. There is no perfect
way to learn the encoding of a plain text file, but existing tools
seem to do a good job.

This commit refactors the tabby loader, makes it use a guessed encoding (but
only after the default fails) and closes psychoinformatics-de#112

https://charset-normalizer.readthedocs.io
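The try-default-then-guess strategy the commit describes can be sketched as follows. This is an assumption-laden standalone sketch, not the datalad-tabby code: `load_tsv_rows` is a hypothetical function name, while `from_path(...).best()` is charset-normalizer's guessing entry point.

```python
import csv
from pathlib import Path


def load_tsv_rows(src: Path):
    """Read TSV rows; guess the encoding only if the default fails.

    Hypothetical sketch of the approach in the commit above.
    """
    try:
        # First attempt: the platform default (locale-dependent) encoding.
        with src.open(newline="") as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
    except UnicodeDecodeError:
        # Fallback: let charset-normalizer guess from the raw bytes.
        from charset_normalizer import from_path

        best = from_path(src).best()
        if best is None:
            raise  # no plausible guess; re-raise the original error
        with src.open(newline="", encoding=best.encoding) as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
```

The import of charset-normalizer is deferred into the fallback branch, so files that decode fine with the default never touch the third-party dependency.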
mslw added a commit to mslw/datalad-tabby that referenced this issue Nov 21, 2023
By default, `Path.open()` uses `locale.getencoding()` when opening a
file for reading. This has caused problems when loading files
saved (presumably on Windows) with ISO-8859-1 encoding on Linux (where
UTF-8 is the default), see psychoinformatics-de#112

The default behaviour is maintained with `encoding=None`, and any
valid encoding name can be provided as an argument to `load_tabby`. The
encoding will be used for loading TSV files.

The encoding is stored as an attribute of `_TabbyLoader` rather than
passed as an input to the load functions: since they may end up being
called in several places (when sheet import statements are found), it
would otherwise require too much passing around.

With external libraries it might be possible to guess a file encoding
that produces a correct result based on the file's content, but
success is not guaranteed when there are few non-ASCII characters in
the entire file (think: a list of authors). Here, we do not attempt to
guess, instead expecting the user to know the encoding they need to
use.

Ref:
https://docs.python.org/3/library/pathlib.html#pathlib.Path.open
https://docs.python.org/3/library/functions.html#open
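The store-once design described in the commit can be sketched like this. The class and method names here are illustrative stand-ins, not the actual datalad-tabby implementation:

```python
import csv
from pathlib import Path
from typing import List, Optional


class TabbyLoaderSketch:
    """Illustrative sketch: the encoding is stored once on the loader
    and reused by every sheet-loading call, so it does not have to be
    threaded through nested calls triggered by sheet imports."""

    def __init__(self, encoding: Optional[str] = None):
        # None preserves the previous behavior: Path.open() falls back
        # to the locale-dependent default encoding.
        self._encoding = encoding

    def _load_single(self, src: Path) -> List[List[str]]:
        # The stored encoding is applied wherever a sheet is read.
        with src.open(newline="", encoding=self._encoding) as tsvfile:
            return list(csv.reader(tsvfile, delimiter="\t"))
```

With this shape, `TabbyLoaderSketch(encoding="iso-8859-1")` would load the problematic file from the original report without a prior iconv conversion.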
@mslw mslw linked a pull request Nov 21, 2023 that will close this issue