Load-tabby is locale-dependent w.r.t file encoding #112
mslw added a commit to mslw/datalad-tabby that referenced this issue on Nov 13, 2023:
If reading a tsv file with the default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` will use `locale.getencoding()` when reading a file (which means that we implicitly use utf-8, at least on Linux). This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding of a plain text file, but existing tools seem to do a good job.

This commit refactors the tabby loader, makes it use the guessed encoding (but only after the default fails), and closes psychoinformatics-de#112

https://charset-normalizer.readthedocs.io
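A minimal sketch of the fallback strategy this commit describes, using charset-normalizer's `from_path` API; the helper name is hypothetical and the actual loader code may differ:

```python
from pathlib import Path

from charset_normalizer import from_path


def read_text_with_fallback(path: Path) -> str:
    """Read text with the locale default, guessing the encoding on failure."""
    try:
        # read_text() without an encoding uses the locale default,
        # i.e. locale.getencoding() -- typically utf-8 on Linux
        return path.read_text()
    except UnicodeDecodeError:
        # only roll out the cannon after the default fails
        best = from_path(path).best()
        if best is None:
            raise
        return path.read_text(encoding=best.encoding)
```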
mslw added a commit to mslw/datalad-tabby that referenced this issue on Nov 21, 2023:
By default, `Path.open()` uses `locale.getencoding()` when opening the file for reading. This has caused problems when loading files saved (presumably on Windows) with iso-8859-1 encoding on Linux (where utf-8 is the default), see psychoinformatics-de#112

The default behaviour is maintained with `encoding=None`, and any valid encoding name can be provided as an argument to load_tabby. The encoding will be used for loading tsv files. The encoding is stored as an attribute of `_TabbyLoader` rather than passed as an input to the load functions - since they may end up being called in a few places (when sheet import statements are found), it would be too much passing around otherwise.

With external libraries it might be possible to guess a file encoding that produces a correct result based on the file's content, but success is not guaranteed when there are few non-ascii characters in the entire file (think: a list of authors). Here, we do not attempt to guess, instead expecting the user to know the encoding they need to use.

Ref:
https://docs.python.org/3/library/pathlib.html#pathlib.Path.open
https://docs.python.org/3/library/functions.html#open
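A simplified sketch of the approach in this commit; `_TabbyLoader` and `load_tabby` are real names from the project, but the bodies below are illustrative stand-ins, not the actual datalad-tabby implementation:

```python
import csv
from pathlib import Path


class _TabbyLoader:
    def __init__(self, encoding: str | None = None):
        # encoding=None keeps Path.open()'s locale-dependent default
        self._encoding = encoding

    def _load_tsv(self, src: Path) -> list[list[str]]:
        # the stored encoding is reused wherever a sheet is read,
        # including sheets pulled in via import statements, so it is
        # kept as an attribute instead of being passed around
        with src.open(newline='', encoding=self._encoding) as f:
            return [row for row in csv.reader(f, delimiter='\t')]


def load_tabby(src, encoding: str | None = None):
    # any valid encoding name (e.g. 'iso-8859-1') can be passed through
    return _TabbyLoader(encoding=encoding)._load_tsv(Path(src))
```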
load_tabby (more specifically, `_TabbyLoader`) reads a file using `Path.open()` with no explicit encoding. This means that the file is read using the system default encoding (ref). This can cause problems, e.g. when reading an ISO-8859-1-encoded file (presumably generated on Windows) on a Linux machine (where UTF-8 is the most likely default).
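For illustration (not part of the original report), the default that `open()` and `Path.open()` fall back to can be inspected like this:

```python
import locale

# the locale's preferred encoding is used when no encoding argument is
# given; typically 'utf-8' on Linux and often 'cp1252' on Windows
print(locale.getpreferredencoding(False))
```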
Reproducing: I encountered the problem when loading a dataset tabby file that contained the phrase "2 µg/gram of weight" in its description and was saved in ISO-8859-1 encoding (as reported by `file -i`). Loading crashed with a `UnicodeDecodeError`.

This happened in a data submission / data curation context. Personally, I don't mind treating this as a user error - either saving in ISO-8859-1 to begin with, or not checking the file encoding before proceeding (in the end, I converted the file with `iconv`).

I am not sure what the fix would be here, if any. Adding an encoding option to the loader and exposing it through the API would complicate loading and still require me to check the encoding upfront. Guesswork with tools like chardet or libmagic might be possible, but as far as I understand it can't be perfect either.
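A minimal sketch reproducing the scenario described above, assuming a UTF-8 locale; the file name and contents are made up for illustration:

```python
from pathlib import Path

p = Path('dataset.tsv')
# write the problematic phrase in ISO-8859-1; 'µ' becomes byte 0xb5,
# which is not a valid utf-8 sequence
p.write_bytes('description\t2 µg/gram of weight\n'.encode('iso-8859-1'))

# reading with the default (utf-8) encoding raises UnicodeDecodeError
p.read_text()
```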