Improve detection of UTF-8 source code #2050

mlcollard · 2024-08-20T12:50:39Z

In a file, source code can be encoded in ASCII, an extended ASCII (e.g., ISO-8859-1, also known as Latin-1), UTF-8, UTF-16, or even UTF-32. Detecting the input format is trivial for UTF-16 and UTF-32, as they normally start with a BOM (Byte Order Mark). UTF-8 input occasionally has a BOM as well, even though it is not needed (this is especially common with Windows tools).
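
As a rough illustration of the BOM check, here is a minimal Python sketch. The function name `detect_bom` and the returned encoding labels are hypothetical and not taken from the srcml code base; only the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 BOMs are checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Input without a BOM falls through to `(None, 0)`, which is exactly the case the rest of this issue is about.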

So that leaves ASCII, ISO-8859-1, or UTF-8 without a BOM. ASCII is a subset of UTF-8, so the main problem is distinguishing ISO-8859-1 from UTF-8. Note that you can specify the encoding using the --src-encoding flag for the srcml client. However, when you run srcml on an entire project, that encoding must be correct for every file in the project. For example, on the Linux kernel, you could use --src-encoding="ISO-8859-1" or --src-encoding="UTF-8". At one point, most of the Linux kernel was in ISO-8859-1. That has changed over time: most files are now UTF-8 (or ASCII), but some are still encoded in ISO-8859-1, so neither setting is correct for the whole tree.
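
To make the cost of a wrong project-wide setting concrete, the following Python sketch decodes the same word under both assumptions. It uses only Python's built-in codecs and says nothing about how srcml itself handles these cases; the variable names are illustrative.

```python
# The same text, stored in the two encodings under discussion.
utf8_bytes = "café".encode("utf-8")          # é is two bytes: C3 A9
latin1_bytes = "café".encode("iso-8859-1")   # é is one byte: E9

# UTF-8 bytes read as ISO-8859-1 never fail, because every byte value is a
# valid ISO-8859-1 character. The result is silently wrong text.
misread = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as strict UTF-8 are rejected, because E9 announces a
# three-byte sequence that never arrives.
try:
    latin1_bytes.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The asymmetry matters: a non-ASCII ISO-8859-1 file is usually not valid UTF-8, which is what makes detection by looking for valid multi-byte sequences plausible.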

The default input source encoding is ISO-8859-1, and it should probably remain so. What we need to do is auto-detect any valid UTF-8 multi-byte sequence and, at that point, switch to UTF-8 source encoding. Unlike a BOM, such a sequence can occur at any point in the input stream. Fortunately, we use UTF-8 internally (as does libxml2), so ISO-8859-1 input is already converted to UTF-8 using iconv, and switching to UTF-8 can be done inexpensively. Note that any earlier characters in the range U+0000 to U+007F (ASCII) are the same in all three encodings, so they need no reprocessing after the switch.
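
One way the proposed switch could look is sketched below in Python. This is not the srcml implementation: the names `first_utf8_sequence` and `decode_source` are hypothetical, the sketch works on a whole buffer rather than a stream, and replacing invalid bytes after the switch is an assumption, since the issue does not say how they should be handled.

```python
def first_utf8_sequence(data: bytes) -> int:
    """Return the offset of the first valid UTF-8 multi-byte sequence, or -1."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # ASCII byte: identical in ASCII, ISO-8859-1, and UTF-8.
            i += 1
            continue
        # Sequence length implied by the lead byte. Bytes that cannot start
        # a sequence are rejected by the strict decode below.
        length = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        try:
            # Lean on Python's strict decoder, which also rejects overlong
            # forms, surrogates, and truncated sequences.
            data[i:i + length].decode("utf-8")
            return i
        except UnicodeDecodeError:
            i += 1
    return -1


def decode_source(data: bytes) -> str:
    """Start in ISO-8859-1 and switch to UTF-8 at the first valid sequence."""
    switch = first_utf8_sequence(data)
    if switch < 0:
        return data.decode("iso-8859-1")
    head = data[:switch].decode("iso-8859-1")
    # Assumption: invalid bytes after the switch become U+FFFD.
    tail = data[switch:].decode("utf-8", errors="replace")
    return head + tail
```

A caveat of any such heuristic: some ISO-8859-1 byte pairs, such as C3 A9 ("Ã©"), are also valid UTF-8, so a false switch is possible, although such pairs are rare in real source code.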

mlcollard self-assigned this Aug 20, 2024

z33kz33k mentioned this issue Oct 17, 2024

Unable to change source encoding to UTF-8 via srcml.set_src_encoding() srcML/pylibsrcml#6

Open
