If gb18030 is revised, consider aligning the Encoding Standard #27
I disagree. We shouldn't invent yet another new encoding.
I tend to agree with @vyv03354. Since no implementation does this and developers are asked to use utf-8, I don't really see an upside here. This only increases the chance that things break.
Fair, changing an encoding is not inventing a new one. However, it is not clear why we should change it, since implementations mostly agree here.
We don't convert any plane-2 characters in the GBK encoder; they are changed to character references (&#131207;). Japanese users suffered from encoding "improvements" of JIS standards and de-facto industrial standards. Even a one-character change is considered a new encoding in ISO coded character set standards. Such a change will do more harm than good, even if it is made out of good will.
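The numeric-character-reference fallback described above can be reproduced with Python's built-in gbk codec and the xmlcharrefreplace error handler (a rough sketch of the behavior, not the Encoding Standard's exact encoder algorithm):

```python
# U+20087 (𠂇) is a plane-2 ideograph with no two-byte GBK mapping,
# so a form-submission-style encoder falls back to a numeric
# character reference instead of emitting bytes for it.
text = "\U00020087"
encoded = text.encode("gbk", errors="xmlcharrefreplace")
print(encoded)  # b'&#131207;' (131207 == 0x20087)
```

Note that Python's gb18030 codec, by contrast, can represent the character directly as a four-byte sequence, which is exactly why the PUA question only arises for the two-byte mappings.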
@vyv03354 That's really different: JIS never mapped characters to Unicode PUA code points. GBK did so only because, at the time, Unicode didn't have code points for all of GBK; now it does, so the situation is totally different.
Will you add something like a "note" line to the description of gb18030 in the spec mentioning this issue? PUA really brings a lot of issues to users, as using its code points without a common agreement is like inventing a nationwide Unicode dialect. To be frank, I would rather leave the dialect pollution in the legacy encoder/decoder bridge than let it spread into the new world, so please consider adding:
and as a basis for these changes,
See also:
@lygstate Could you please consider reopening this issue if you find my — um — attempt helpful?
@Artoria2e5 the "new world" should use utf-8 exclusively.
@annevk But we still need a way to migrate from the old world.
@annevk It's true that the modern world should use UTF-8 for information exchange, processing and storage. But given that character representations in UTF-8 rely on code points assigned in Unicode, it makes sense to use the formal, universal code point assignments in this universal encoding. As stated previously, by emitting PUA code points in the decoder, you are speaking a Unicode dialect code-point-wise, resulting in a less interchangeable UTF-8 variant, thus contradicting the point of using UTF-8 everywhere. (The use of PUA here cannot be justified by a lack of definition, as these ideographs do have formal assignments.) The encoder part is more about discouraging old PUA usage.
And we need to make sure that the way gives us actual "new world" stuff. By the way, there should be 24 PUA code points in the 2005 standard instead of 14, according to L2/06-394 "Update on GB 18030:2005" by Ken Lunde. An interesting but sad example of this dialect split can be shown using the character U+20087 (𠂇), assigned to PUA code point U+E816 in the mapping. Search engines like Google won't do normalization on PUA forms where several different sets of agreements exist, and you can see it from the search results.
Given that no browser implements gb18030 like that I don't see why we should change this. We could easily break those relying on these bytes mapping to PUA. I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion and those maintaining gb18030 have not decided to care.
The GB18030 mapping is naturally fungible with respect to PUA characters, since Unicode continues to encode Chinese characters. I think this should be recognized by Encoding. I agree that we should not remove the mapping of Unicode PUA -> GB18030 (compatibility). But the problem here is round-tripping of real Unicode code points with GB18030. If I have a U+20087, convert it to GB18030, and then later reserialize the GB data as UTF-8, I will get back U+E816 rather than the original (and correct) code point. That's undesirable and a loss of information. The fact that existing implementations haven't caught up with standardization doesn't mean that we shouldn't make this change. @annevk Under what circumstances would we change? One of the problems with establishing a standard is that implementations are trying hard to be compliant with it...
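The round-trip loss described here can be sketched with a toy two-entry mapping table (Python; the 0xFE51 / U+E816 / U+20087 values are the ones cited elsewhere in this thread, and the dicts are illustrative stand-ins for an encoder and decoder, not a real codec):

```python
# Pre-2022-style mappings: the encoder maps the real ideograph U+20087
# to the two-byte sequence 0xFE51, but the decoder maps those bytes
# back to the PUA code point U+E816.
ENCODE = {"\U00020087": b"\xfe\x51", "\ue816": b"\xfe\x51"}
DECODE = {b"\xfe\x51": "\ue816"}

original = "\U00020087"               # 𠂇, the formally assigned code point
round_tripped = DECODE[ENCODE[original]]
print(round_tripped == original)      # False: U+E816 came back instead
```

The encoder direction is many-to-one, so the decoder has no way to recover which of the two code points was originally written, which is exactly the loss of information being objected to.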
@aphillips what standard are we talking about? The standard for gb18030 has that loss of information and Encoding doesn't modify it (it does modify some other parts).
Newer Pan-CJK font families like Adobe's Source Han Sans (led by @kenlunde) decided to go with Unicode instead of GB 18030-flavored Unicode.
Dr. Ken "Someone" Lunde (again!) is among the editors of the UAX #38 Unihan database, and has participated extensively in many CJK-related standardization processes in Unicode.
The Chinese SAC has decided not to care about a lot of things including their translations of ANSI C (GB/T 15272:1994, ISO/IEC 9899:1990) and UCS (GB 13000:2010, ISO/IEC 10646:2003). But this lag doesn't mean that the Chinese are not using newer revisions of the C language and Unicode. The same should apply to the UCS references in GB 18030:2005. 2016-09-12: Found out that W3C (well, that sounds impractical) has some rules regarding using PUA in i18n specs.
@Artoria2e5: The reasons why Source Han Sans (and the Google-branded Noto Sans CJK) does not support the 24 PUA code points of GB 18030 are because 1) PUA code points should be avoided in general; 2) PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts; 3) a GB 18030 revision is expected to be published soon that will specify the non-PUA code points for these 24 characters, which will effectively lift the PUA requirement; 4) the 24 characters have had non-PUA code points for over a decade; and 5) the "release" branch of Source Han Sans includes a utf32-gb18030pua24.map file that provides the 24 PUA mappings for those developers who need support for these PUA code points.
Hmm, I guess that an encoding spec for dealing with legacy encodings also falls into the scope of "mixing multiple standards". It looks like reasons 1–4 are on my side... |
I guess if gb18030 is actually revised there is a chance web-focused implementations might want to change their mapping. If that happens and implementations indeed want to make a backwards incompatible change someone should raise a new issue.
I raised this issue late last year/early this year in #22 (and https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 ). As I wrote there, the current mapping makes it impossible to display the characters involved [1] on some platforms (Android and Windows 10 [2]) when they're encoded in GB 18030, because there is NO font covering the corresponding PUA code points. This is one of the most serious consequences of the current mapping to me (besides other consequences mentioned earlier). OTOH, if there are multiple fonts covering those PUA points with different interpretations, there's no easy way to pick the right one (if the only information at hand is code points), because the identity of a PUA code point is up to private parties and is indeterminate by definition. (Needless to say, there'd be no such problem if UTF-8 were used with regular code points, and we want everybody to use UTF-8 on the web.) Given all this, removing any mapping to PUA code points (as long as there are regular Unicode characters) is desired. As mentioned in #22, I initially thought that GB18030:2005 had fixed all this up (by 2005, all the characters originally mapped to PUA code points had been encoded in Unicode) in a way similar to what was done for HKSCS. It turned out that that was not the case, which was rather disappointing. As a result (and because the 24 affected characters are rarely used, especially U+FE1x), the change for #22 was minimal (only one code point was fixed, per GB18030:2005). Given that GB18030 will be revised soon (per @kenlunde) to eliminate the canonical mapping to PUA code points, Chromium is more than willing to go ahead with mapping the 24 byte sequences in GB18030 to regular Unicode characters. [1] In addition to the 14 CJK ideographs/radicals listed earlier, there are vertical form variants that are still mapped to PUA code points (though U+FE1x will be virtually unused in gb18030-encoded documents).
[2] Android (at least Google's Nexus devices) does not have any font covering the PUA code points listed in [1].
@kenlunde any updates on gb18030 revisions? Something that can be tracked perhaps?
@annevk: I will ping my contact at CESI in China to get the current status of the GB 18030 revision.
Will this result in two different byte sequences (two-byte and four-byte) decoding to the same code point for some code points? Will the PUA code points that previously had two-byte representations be left without a representation? (I have doubts that changing what a legacy encoding means in terms of mapping to Unicode at this point is a net positive change even if well-intentioned.)
Sufficient time has passed that implementing GB 18030 in any encoding other than Unicode makes no sense. The main benefit of the GB 18030 revision is simply to remove the PUA requirement from the GB 18030 certification process. Font implementations that map from those 24 PUA code points, to be GB 18030-compliant, should already be double-mapping from the corresponding 24 non-PUA code points.
If the purpose is to simplify the Unicode subset support certification aspect of GB18030, why is the legacy encoding aspect being changed also?
@hsivonen: It is a bit premature to know exactly what the forthcoming GB 18030 update will change in the legacy encoding. Consider a couple of prototypical examples from the 24 characters that currently map to PUA code points:
0xA6D9 currently maps to U+E78D, but the non-PUA equivalent is U+FE10. The GB 18030-2005 standard indicates that U+FE10 corresponds to 0x84318236.
0xFE51 currently maps to U+E816, but the non-PUA equivalent is U+20087. The GB 18030-2005 standard indicates that U+20087 corresponds to 0x95329031.
The mapping for one of the characters in GB 18030-2000 was changed in the 2005 update, which gives us a glimpse of what is likely to change in the forthcoming update: 0xA8BC originally mapped to U+E7C7, but the 2005 update changed the mapping to U+1E3F, which originally mapped from 0x8135F437. 0x8135F437 now maps to U+E7C7. Following this precedent, I would expect the two examples to change to the following:
0xA6D9 → U+FE10
0xFE51 → U+20087
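The 2005 precedent described above amounts to swapping the code points assigned to a two-byte and a four-byte sequence. A minimal sketch in Python, using only the 0xA8BC / 0x8135F437 / U+E7C7 / U+1E3F values from the comment (the two-entry table is illustrative, not a complete index):

```python
# GB 18030-2000 mappings for the one pair that changed in 2005:
# the two-byte sequence held the PUA code point, and the four-byte
# sequence held the formally assigned one.
mapping_2000 = {
    b"\xa8\xbc": "\ue7c7",          # two-byte -> PUA
    b"\x81\x35\xf4\x37": "\u1e3f",  # four-byte -> U+1E3F (ḿ)
}

def apply_2005_style_swap(mapping, two_byte, four_byte):
    """Swap the code points assigned to the two sequences (2005 pattern)."""
    new = dict(mapping)
    new[two_byte], new[four_byte] = mapping[four_byte], mapping[two_byte]
    return new

mapping_2005 = apply_2005_style_swap(
    mapping_2000, b"\xa8\xbc", b"\x81\x35\xf4\x37"
)
print(mapping_2005[b"\xa8\xbc"] == "\u1e3f")          # True: non-PUA wins
print(mapping_2005[b"\x81\x35\xf4\x37"] == "\ue7c7")  # True: PUA demoted
```

Applying the same pattern to the remaining 24 characters would promote the non-PUA code points to the two-byte sequences and push the PUA code points onto the four-byte ones.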
I prepared a complete gb-18030-pua-changes.txt datafile that indicates the PUA change that occurred in the 2005 update, and what we can expect for the forthcoming update for the 24 remaining PUA characters by applying the same pattern.
It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA. I don't see how that could be a good thing for any practical interop purpose. (I can see how that could seem appealing to the theory that the GB18030 encoding is a bijective UTF, but that's already not the case as far as the Web is concerned due to U+3000 being double-mapped and U+E5E5 being unmappable.)
Right. I was merely showing one possible way in which China may change GB 18030 to remove the PUA requirement, by applying the pattern that was used in the 2005 update. The single mapping change in the 2005 update may have been one-off-ish enough that China figured it would be harmless, but 24 mapping changes may be a bit much to swallow at once. The history of GB 18030 goes back to GBK, which included significantly more PUA mappings, a little over 100. The ones that could be changed to non-PUA mappings were changed, and only 25 remained in GB 18030-2000, in terms of the "required" portion. The other way to remove the PUA requirement, keeping the mapping stable, is to first remove the requirement to support the following 24 characters:
0xA6D9 -> U+E78D
And second, to require the following 24 characters:
0x82359037 -> U+9FB4
My guess is that the original 24 characters, in terms of supporting their mappings, will be changed from "required" to "optional," and that the additional 24 characters will be changed from "optional" to "required" if the original 24 characters are not supported. Or something to that effect.
In principle, I agree with you. A practical question is which way is more widely used. My guess is that 0xA6D9 has been used a lot more often to represent a character that looks like U+9FB4 (encoded at U+E78D in some fonts) than 0x82359037 in GB 18030 documents. If the 4-byte sequences for the 24 characters in question are extremely rare (virtually non-existent) while the 2-byte sequences are relatively common (still pretty rare), the harm of repeating the 2005 change for the 24 characters is relatively contained. It also has the benefit of being able to display legacy GB18030-encoded documents on Android and elsewhere where there's no font coverage for U+Exxx PUA code points.
All 24 characters in question are either compatibility characters (vertical forms of punctuation that are normally accessible via their horizontal counterparts and the 'vert' GSUB feature) or ideograph components that were encoded as stand-alone ideographs. This means that the chance of encountering them in the wild, in genuine documents, is somewhere between very slim and none. Besides, supporting PUA code points in the context of the Noto CJK and Source Han fonts is a total non-starter, mainly because they are Pan-CJK typefaces, and PUA usage is extremely dangerous in such contexts.
There's no doubt about that! I'm in full agreement with you on that.
What I'm saying is that repeating what was done in 2005 is likely to be slightly better in terms of rendering legacy gb18030 documents on Android and other places where no font support for U+Exxx PUA code points is available (note that newer Windows Chinese fonts do not cover those PUA code points either). However, given what @kenlunde wrote, it may not matter much whichever way we choose.
It may be coming soon:
@lygstate and others: One of my friends at CESI shared with me the text from the final draft a few days ago. This confirmed that the PUA requirement for the 24 characters is being lifted. They also removed the 21 ideographs that are in the CJK Compatibility Ideographs block. While this may sound good at first, there are actually 12 CJK Unified Ideographs among those 21 ideographs, meaning that they do not decompose. Here they are: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨 & U+FA29 﨩. I immediately alerted CESI about this, and it sounds like they will restore those twelve CJK Unified Ideographs. The year of the standard is likely to be 2018, because that is the year when it was finished, though its release is expected to be early next year.
Hello. I am an independent individual who is implementing a GB 18030 encoder purely as a hobby (it is quite challenging to do concisely). Was there ever an update on GB 18030-2018?
@kenlunde Hi, has the new 2018 standard been released?
It appears that it's still under review. See IRGN2496 that was published on 2021-09-09:
(note, this is the same update as IRGN2453, dated 2021-03-05) One thing that could have delayed the publication of the new revision could be related to the new regulations alluded to in IRGN2453:
@kenlunde just told me that during a meeting yesterday(!!), the delegates from China anticipate the revision to be published next year (2022). Here's some more information:
The GB 18030-2022 standard was published in July 2022. The new version includes 196 Chinese characters that are in the Table of General Standard Chinese Characters but not in GB 18030-2005, includes 17,000+ other new Chinese characters, and establishes three implementation levels. This standard is mandatory in Mainland China. All software supporting Chinese information processing and exchange must support level 1; some software, such as operating systems, databases, and middleware, must support level 2; and systems for public services must support level 3. (The higher the level, the more Chinese characters are included.)
Support for the mandatory characters can be achieved via UTF-8, which the Encoding Standard already supports (and, evidently, there are now UTF-8-only software and formats).
Do we implement 18 code point swaps, after all?
https://unicode-org.atlassian.net/browse/ICU-22098 might have implications for what some implementations do, I presume. That raises the importance of resolving #57 somewhat.
GB18030-2022 will take effect on 1 Aug 2023. Compliance criteria include, at a minimum, not emitting PUA characters for the 24 characters in input methods, and not using the 24 PUA code points in fonts. However, most existing products sold on the Chinese market fail these tests, and those old versions will still be expected to be used, even though they will no longer be allowed to be sold after the effective date. Also, there's existing UTF-8 content that uses those PUA code points. To be backwards compatible with older products based on the GB18030-2005 standard, both the PUA and the non-PUA code points should map to the correct GB18030-2022 2-byte sequences. Whether or not the 4-byte sequences should map to the non-PUA code points is less of an issue -- it is not expected that there is data in GB18030 stored in the 4-byte form. However, if keeping the double mapping to U+3000 is deemed web compatible, then keeping the 4-byte sequences mapped to the non-PUA code points should also be web compatible in the same manner.
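The backward-compatible encoder behavior proposed here, with both the PUA code point and its formal replacement encoding to the same two-byte sequence, can be sketched using the 0xA6D9 / U+E78D / U+FE10 triple cited earlier in the thread (Python; the one-entry tables are hypothetical and purely illustrative):

```python
# Encoder: both the legacy PUA code point and the formal code point
# map to the same GB18030-2022 two-byte sequence.
ENCODER = {
    "\ufe10": b"\xa6\xd9",  # U+FE10 PRESENTATION FORM FOR VERTICAL COMMA
    "\ue78d": b"\xa6\xd9",  # legacy PUA alias, kept for old content
}
# Decoder: the two-byte sequence yields only the non-PUA code point.
DECODER = {b"\xa6\xd9": "\ufe10"}

# Old PUA content still encodes, and normalizes to U+FE10 on round-trip.
print(ENCODER["\ue78d"] == ENCODER["\ufe10"])  # True
print(DECODER[ENCODER["\ue78d"]] == "\ufe10")  # True
```

This keeps legacy GB18030-2005-era data encodable while ensuring freshly decoded text never contains the PUA aliases.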
I created #336 which I hope addresses this. Review most welcome!
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030. In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following: 1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030. 2. The second set of 18 mappings (from PUA to bytes) is encoded as an encoder-only table, for both GBK and gb18030. 3. The third set of 18 mappings (from bytes to code points) is ignored, as they are already covered by index gb18030 ranges. (Presumably they are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".) The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (although the encoder-only table is excluded for GBK only at the moment; including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely. Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240. This supersedes #335. This fixes #27 and fixes #312. This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it did not account for until now.
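The merged-index-plus-encoder-only-table structure described in the change can be sketched roughly as follows (Python; the function names and the single sample pair, U+FE10 / U+E78D / 0xA6D9, are illustrative assumptions, not the actual index data):

```python
# Set 1: bidirectional mappings merged into the main index, used by
# both the encoder and the decoder (PUA entries replaced).
INDEX = {"\ufe10": b"\xa6\xd9"}
DECODE_INDEX = {seq: cp for cp, seq in INDEX.items()}

# Set 2: encoder-only table (PUA -> bytes). Consulted only when
# encoding, so legacy PUA content still serializes, but decoding
# never produces a PUA code point.
ENCODER_ONLY = {"\ue78d": b"\xa6\xd9"}

def encode_char(ch: str) -> bytes:
    if ch in INDEX:
        return INDEX[ch]
    if ch in ENCODER_ONLY:
        return ENCODER_ONLY[ch]
    raise ValueError(f"unmapped code point: U+{ord(ch):04X}")

def decode_bytes(seq: bytes) -> str:
    return DECODE_INDEX[seq]  # never yields a PUA code point

print(decode_bytes(encode_char("\ue78d")) == "\ufe10")  # True
```

Keeping the PUA mappings out of the decode index is what makes set 3 of the recommendation unnecessary here: the bytes already decode via the merged index.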
GB18030-2005 already has a one-to-one mapping between Unicode and GB18030, except for the 14 characters that were still mapped to the Unicode PUA as of 2005. Nowadays, all 14 of these characters have corresponding non-PUA mappings in Unicode, so I suggest that the Encoding Standard map those characters to regular Unicode characters rather than PUA characters.
The following 80 characters are the GBK characters that were ever mapped to the Unicode PUA, together with the corresponding non-PUA Unicode characters:
The following 14 characters are the GB18030-2005 characters that are still mapped to the Unicode PUA:
I suggest that the Encoding Standard map those characters to non-PUA Unicode, because we have no need to wait for GB18030 to update its spec just for these 14 characters, and we can be sure that their corresponding non-PUA Unicode characters are settled.
According to these, we can decode all GBK-family encoded strings to non-PUA Unicode. Besides this, we still need to convert all the historical Unicode PUA characters to the proper GBK (GB18030) characters.