In a file, source code can be encoded in ASCII, an extended ASCII (e.g., ISO-8859-1, also known as Latin-1), UTF-8, UTF-16, or even UTF-32. Detecting the input format is trivial for UTF-16 and UTF-32 when they start with a BOM (Byte Order Mark), as they usually do. UTF-8 may occasionally carry an optional BOM as well, even though it is not needed (this is especially common with Windows tools).
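As an illustration of the BOM check described above, here is a minimal Python sketch. It is not srcml's implementation; the function name `detect_bom` and the returned encoding labels are assumptions made for this example. It relies only on the BOM constants from Python's standard `codecs` module.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One known ambiguity of this ordering: UTF-16 LE input whose first character after the BOM is U+0000 would be reported as UTF-32 LE. That is unlikely for source code, but it is a limitation of the sketch.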
So that leaves ASCII, ISO-8859-1, or UTF-8 without a BOM. ASCII is a subset of UTF-8, so the main problem is distinguishing ISO-8859-1 from UTF-8. Note that you can specify the encoding using the --src-encoding flag for the srcml client. However, when you run srcml on an entire project, that encoding must be correct for every file in the project. For example, on the Linux kernel, you could use --src-encoding="ISO-8859-1" or --src-encoding="UTF-8". At one point, most of the Linux kernel was in ISO-8859-1. That has changed over time: most files are now in UTF-8 (or ASCII), while some are still encoded in ISO-8859-1.
The default input source encoding is ISO-8859-1, and it should probably remain so. What we need to do is auto-detect any valid UTF-8 multi-byte sequence and, at that point, switch to UTF-8 source encoding. Unlike a BOM, such a sequence can occur at any point in the input stream. Fortunately, we use UTF-8 internally (as does libxml2), with ISO-8859-1 input converted to UTF-8 using iconv, so switching to UTF-8 can be done inexpensively. Note that any preceding characters in the range U+0000 to U+007F (ASCII) are the same in all three encodings.
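The detection step proposed above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not srcml code: the names `utf8_sequence_length` and `guess_source_encoding` are hypothetical, and the policy of skipping non-ASCII bytes that do not start a valid sequence is one possible reading of the proposal. The validity ranges follow the well-formed UTF-8 byte sequences defined by the Unicode Standard (no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def utf8_sequence_length(data: bytes, i: int) -> int:
    """Length of a valid UTF-8 multi-byte sequence starting at data[i], else 0."""
    b0 = data[i]
    if 0xC2 <= b0 <= 0xDF:                     # 2-byte sequence
        need, lo, hi = 1, 0x80, 0xBF
    elif 0xE0 <= b0 <= 0xEF:                   # 3-byte sequence
        need = 2
        lo = 0xA0 if b0 == 0xE0 else 0x80      # reject overlong forms
        hi = 0x9F if b0 == 0xED else 0xBF      # reject UTF-16 surrogates
    elif 0xF0 <= b0 <= 0xF4:                   # 4-byte sequence
        need = 3
        lo = 0x90 if b0 == 0xF0 else 0x80      # reject overlong forms
        hi = 0x8F if b0 == 0xF4 else 0xBF      # reject code points > U+10FFFF
    else:
        return 0                               # ASCII, stray continuation, or invalid lead
    if i + need >= len(data):
        return 0                               # sequence truncated by end of input
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, need + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return need + 1


def guess_source_encoding(data: bytes, default: str = "ISO-8859-1") -> str:
    """Return "UTF-8" at the first valid multi-byte sequence, else the default."""
    i = 0
    while i < len(data):
        if data[i] >= 0x80 and utf8_sequence_length(data, i):
            return "UTF-8"
        i += 1                                 # ASCII or non-UTF-8 byte: keep scanning
    return default
```

This remains a heuristic. Some ISO-8859-1 byte pairs, such as 0xC3 0xA9 ("Ã©"), are also valid UTF-8, so a file containing them would be classified as UTF-8. Pure ASCII input yields the default, which is harmless because ASCII bytes mean the same thing in all three encodings.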