Improve detection of UTF-8 source code #2050

Open
mlcollard opened this issue Aug 20, 2024 · 0 comments

In a file, source code can be encoded in ASCII, an extended ASCII such as ISO-8859-1 (Latin-1), UTF-8, UTF-16, or even UTF-32. Detecting the input format is straightforward for UTF-16 and UTF-32, as they typically start with a BOM (Byte Order Mark). UTF-8 may also carry an optional BOM, even though it is not needed; this is especially common with Windows tools.
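As a rough illustration of the BOM check described above, here is a minimal Python sketch. The function name `detect_bom` and the returned labels are hypothetical and are not part of srcml; only the BOM constants come from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# starts with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Note that a UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE by this sketch; that case is unlikely in source code.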

So that leaves ASCII, ISO-8859-1, or UTF-8 without a BOM. ASCII is a subset of UTF-8, so the main problem is distinguishing ISO-8859-1 from UTF-8. Note that you can specify the encoding using the --src-encoding flag of the srcml client. However, when you run srcml on an entire project, that one encoding must be correct for every file in the project. For example, on the Linux kernel, you could use --src-encoding="ISO-8859-1" or --src-encoding="UTF-8". At one point, most of the Linux kernel was in ISO-8859-1. That has changed over time: most files are now in UTF-8 (or ASCII), while some are still encoded in ISO-8859-1, so neither setting is correct for the whole tree.
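To make the ambiguity concrete, the following Python sketch shows the same comment encoded both ways. It is only an illustration: the sample strings and the helper `decodes_as_utf8` are made up for this example and are not srcml code.

```python
# Pure ASCII bytes decode identically under all three encodings.
ascii_src = b"int main() { return 0; }"
same_everywhere = (
    ascii_src.decode("ascii")
    == ascii_src.decode("iso-8859-1")
    == ascii_src.decode("utf-8")
)

# A comment containing U+00E9 (e with acute accent).
comment = "// caf\u00e9"
latin1_bytes = comment.encode("iso-8859-1")  # accent is the single byte E9
utf8_bytes = comment.encode("utf-8")         # accent is the two bytes C3 A9

# Every byte string is valid ISO-8859-1, so decoding UTF-8 bytes as
# ISO-8859-1 never fails; it silently yields the wrong characters.
mojibake = utf8_bytes.decode("iso-8859-1")   # "// caf" + U+00C3 + U+00A9

def decodes_as_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Here `decodes_as_utf8(latin1_bytes)` is false because a lone E9 byte is not a complete UTF-8 sequence, which is the asymmetry a detector can exploit.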

The default input source encoding is ISO-8859-1, and it should probably stay that way. What we need to do is auto-detect any valid UTF-8 multi-byte sequence and, at that point, switch to UTF-8 source encoding. Unlike a BOM, such a sequence can occur at any point in the input stream. Fortunately, we use UTF-8 internally (as does libxml2), so ISO-8859-1 input is already converted to UTF-8 using iconv, and switching to UTF-8 can be done inexpensively. Note that any earlier characters in the range U+0000 to U+007F (ASCII) are the same in all three encodings, so input read before the switch does not need to be converted again.
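One possible shape for this detection, sketched in Python rather than in srcml's own code: scan the input, skip ASCII bytes, and report UTF-8 as soon as a well-formed multi-byte sequence is found. The function names, the `default` parameter, and the choice to keep scanning past high bytes that are not valid UTF-8 are all assumptions made for this sketch, not the project's implementation. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def utf8_sequence_length(data: bytes, i: int) -> int:
    """Length (2 to 4) of a well-formed UTF-8 multi-byte sequence
    starting at data[i], or 0 if there is none. Hypothetical helper."""
    lead = data[i]
    # need = number of continuation bytes; lo..hi = allowed range of the
    # first continuation byte (this excludes overlongs and surrogates).
    if 0xC2 <= lead <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif lead == 0xE0:
        need, lo, hi = 2, 0xA0, 0xBF
    elif lead == 0xED:
        need, lo, hi = 2, 0x80, 0x9F
    elif 0xE1 <= lead <= 0xEF:
        need, lo, hi = 2, 0x80, 0xBF
    elif lead == 0xF0:
        need, lo, hi = 3, 0x90, 0xBF
    elif 0xF1 <= lead <= 0xF3:
        need, lo, hi = 3, 0x80, 0xBF
    elif lead == 0xF4:
        need, lo, hi = 3, 0x80, 0x8F
    else:
        return 0  # ASCII, stray continuation byte, or invalid lead byte
    if i + need >= len(data):
        return 0  # sequence is cut off by the end of the data
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, need + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return need + 1

def detect_source_encoding(data: bytes, default: str = "ISO-8859-1") -> str:
    """Return "UTF-8" once a valid multi-byte sequence is seen, else default."""
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and utf8_sequence_length(data, i):
            return "UTF-8"
        i += 1  # ASCII, or a high byte that is not UTF-8: keep the default
    return default
```

A known limitation of any such heuristic is that ISO-8859-1 text can contain byte pairs that happen to form valid UTF-8 (for example C3 followed by A9), so a stricter variant might require the whole input to be well-formed UTF-8 before switching.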
