If gb18030 is revised, consider aligning the Encoding Standard #27

Closed
lygstate opened this issue Jan 17, 2016 · 45 comments · Fixed by #336
Labels
i18n-clreq Notifies Chinese script experts of relevant issues i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. normative

Comments

@lygstate

GB18030-2005 is already a one-to-one mapping between Unicode and GB18030, except for
14 characters that were still mapped to the Unicode PUA as of 2005.
Nowadays all 14 of these characters have corresponding non-PUA Unicode code points,
so I suggest that the Encoding Standard map them to regular Unicode characters rather than to PUA characters.

The following 80 characters are the GBK characters that were at some point mapped to the Unicode PUA,
along with the corresponding non-PUA Unicode code points:

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE50                E815                2E81
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE54                E819                2E84
                FE55                E81A                3473
                FE56                E81B                3447
                FE57                E81C                2E88
                FE58                E81D                2E8B
                FE59                E81E                9FB4
                FE5A                E81F                359E
                FE5B                E820                361A
                FE5C                E821                360E
                FE5D                E822                2E8C
                FE5E                E823                2E97
                FE5F                E824                396E
                FE60                E825                3918
                FE61                E826                9FB5
                FE62                E827                39CF
                FE63                E828                39DF
                FE64                E829                3A73
                FE65                E82A                39D0
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE68                E82D                3B4E
                FE69                E82E                3C6E
                FE6A                E82F                3CE0
                FE6B                E830                2EA7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE6E                E833                2EAA
                FE6F                E834                4056
                FE70                E835                415F
                FE71                E836                2EAE
                FE72                E837                4337
                FE73                E838                2EB3
                FE74                E839                2EB6
                FE75                E83A                2EB7
                FE76                E83B                2298F
                FE77                E83C                43B1
                FE78                E83D                43AC
                FE79                E83E                2EBB
                FE7A                E83F                43DD
                FE7B                E840                44D6
                FE7C                E841                4661
                FE7D                E842                464C
                FE7E                E843                9FB9
                FE80                E844                4723
                FE81                E845                4729
                FE82                E846                477C
                FE83                E847                478D
                FE84                E848                2ECA
                FE85                E849                4947
                FE86                E84A                497A
                FE87                E84B                497D
                FE88                E84C                4982
                FE89                E84D                4983
                FE8A                E84E                4985
                FE8B                E84F                4986
                FE8C                E850                499F
                FE8D                E851                499B
                FE8E                E852                49B7
                FE8F                E853                49B6
                FE90                E854                9FBA
                FE91                E855                241FE
                FE92                E856                4CA3
                FE93                E857                4C9F
                FE94                E858                4CA0
                FE95                E859                4CA1
                FE96                E85A                4C77
                FE97                E85B                4CA2
                FE98                E85C                4D13
                FE99                E85D                4D14
                FE9A                E85E                4D15
                FE9B                E85F                4D16
                FE9C                E860                4D17
                FE9D                E861                4D18
                FE9E                E862                4D19
                FE9F                E863                4DAE
                FEA0                E864                9FBB

The following 14 characters are the GB18030-2005 characters that are still mapped to the Unicode PUA.
I suggest that the Encoding Standard map these characters to non-PUA Unicode code points: there is no need
to wait for GB18030 to update its spec just for these 14 characters, and we can be sure that their
corresponding non-PUA Unicode code points are settled.

Han Character      GBK              Unicode PUA       Unicode non-PUA
                FE51                E816                20087
                FE52                E817                20089
                FE53                E818                200CC
                FE59                E81E                9FB4
                FE61                E826                9FB5
                FE66                E82B                9FB6
                FE67                E82C                9FB7
                FE6C                E831                215D7
                FE6D                E832                9FB8
                FE76                E83B                2298F
                FE7E                E843                9FB9
                FE90                E854                9FBA
                FE91                E855                241FE
                FEA0                E864                9FBB

With these tables, we can decode all strings in the GBK encoding family to non-PUA Unicode.
Beyond that, we still need to convert all the historical Unicode PUA characters
to the proper GBK (GB18030) characters.

@vyv03354
Collaborator

I disagree. We shouldn't invent yet another new encoding anymore.

@annevk
Member

annevk commented Jan 17, 2016

I tend to agree with @vyv03354. Since no implementation does this and developers are asked to use utf-8, I don't really see an upside here. This only increases the chance that things break.

@lygstate
Author

@vyv03354 @annevk We are not inventing a new encoding, just getting an existing encoding to work.

@annevk
Member

annevk commented Jan 18, 2016

Fair, changing an encoding is not inventing a new one. However, it is not clear why we should change it, since implementations mostly agree here.

@lygstate
Author

@annevk @vyv03354 Please consider the following situation: suppose I have text containing the Unicode character U+20087; when I convert this character to GBK,
what should I do? Emit 0xFE51, or treat it as an invalid character?
So we would just be refining the existing conversion table to its final state.

@vyv03354
Collaborator

when I convert this character to GBK,

We don't convert any plane-2 characters in the GBK encoder. It will be changed to a numeric character reference (&#131207;, which renders as 𠂇).
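For illustration, Python's built-in `gbk` codec behaves in a similar way when asked to substitute character references. It is only a stand-in here, not the Encoding Standard's GBK encoder, so its exact character coverage should be treated as an assumption:

```python
# Python's "gbk" codec is used as a stand-in; it is not the Encoding
# Standard's GBK encoder, but it also has no mapping for plane-2 characters.
text = "\U00020087"  # U+20087

# Characters the codec cannot encode are replaced by decimal
# numeric character references.
encoded = text.encode("gbk", errors="xmlcharrefreplace")
print(encoded)  # b'&#131207;'
```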

Japanese users have suffered from encoding "improvements" to JIS standards and industrial de facto standards. In ISO coded character set standards, even a one-character change is considered a new encoding. Such a change will do more harm than good even if it is made with good intentions.

@lygstate
Author

@vyv03354 That's really different, because JIS doesn't map any characters to PUA code points. GBK did so only because Unicode at that time didn't have enough characters to cover GBK; now it does, so the situation is totally different.

@annevk
Member

annevk commented Jan 20, 2016

@lygstate as with the other issue, I recommend using utf-8 instead. I agree with @vyv03354 that changing implementations at this point is more likely to lead to breakage than happy users.

@annevk annevk closed this as completed Jan 20, 2016
@Artoria2e5

Artoria2e5 commented Sep 5, 2016

Will you add something like a line of "note" to the description for gb18030 in the spec mentioning this issue? PUA really brings a lot of issues to users as using its codepoints without a common agreement is like inventing a nationwide Unicode dialect.

To be frank I would rather leave the dialect pollution in the legacy encoder/decoder bridge than let it spread in the new world, so please consider adding:

  • a flag that instructs the decoder to not emit PUA
  • a flag that instructs the encoder to warn against PUA usage potentially resulting from GB18030-200{0,5} decoding

and as a basis for these changes,

  • a mapping from "old world" PUAs to "new world" Unicode CJK Extensions.
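A rough sketch of what these bullets could look like, written as a post-processing step on already-decoded text rather than as a change to any real decoder. The names `replace_gb18030_pua`, `find_gb18030_pua`, and the `emit_pua` flag are hypothetical; the table covers the 24 GB 18030-2005 PUA code points and their standard equivalents as listed elsewhere in this thread.

```python
# Hypothetical helpers; not part of any real encoder/decoder API.
# PUA code point -> standard Unicode code point for the 24 characters
# that GB 18030-2005 still maps to the PUA.
GB18030_PUA_TO_STANDARD = {
    0xE78D: 0xFE10, 0xE78E: 0xFE12, 0xE78F: 0xFE11, 0xE790: 0xFE13,
    0xE791: 0xFE14, 0xE792: 0xFE15, 0xE793: 0xFE16, 0xE794: 0xFE17,
    0xE795: 0xFE18, 0xE796: 0xFE19,
    0xE816: 0x20087, 0xE817: 0x20089, 0xE818: 0x200CC, 0xE81E: 0x9FB4,
    0xE826: 0x9FB5, 0xE82B: 0x9FB6, 0xE82C: 0x9FB7, 0xE831: 0x215D7,
    0xE832: 0x9FB8, 0xE83B: 0x2298F, 0xE843: 0x9FB9, 0xE854: 0x9FBA,
    0xE855: 0x241FE, 0xE864: 0x9FBB,
}


def replace_gb18030_pua(text: str, emit_pua: bool = False) -> str:
    """Replace the 24 PUA code points unless emit_pua is set."""
    if emit_pua:
        return text
    # str.translate accepts a dict of code point -> code point.
    return text.translate(GB18030_PUA_TO_STANDARD)


def find_gb18030_pua(text: str) -> list[int]:
    """Return the indices of characters an encoder could warn about."""
    return [i for i, ch in enumerate(text) if ord(ch) in GB18030_PUA_TO_STANDARD]


decoded = "a\ue816b"  # as if produced by a PUA-emitting decoder
print(replace_gb18030_pua(decoded) == "a\U00020087b")  # True
print(find_gb18030_pua(decoded))  # [1]
```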

See also:

@lygstate Could you please consider reopening this issue if you find my — um — attempt helpful?

gbk-gb18030-pua.txt

@annevk
Member

annevk commented Sep 6, 2016

@Artoria2e5 the "new world" should use utf-8 exclusively.

@lygstate
Author

lygstate commented Sep 6, 2016

@annevk But we still need a way to migrade from the old world.

@Artoria2e5

Artoria2e5 commented Sep 6, 2016

@annevk It's true that the modern world should use UTF-8 for information exchange, processing and storage. But given that character representations in UTF-8 rely on codepoints assigned in Unicode, it makes sense to use the formal, universal codepoint assignments in this universal encoding.

As stated previously, by emitting PUA codepoints in the decoder, you are speaking in a Unicode dialect codepoint-wise, resulting in a less interchangeable UTF-8 variant, thus contradicting the point of using UTF-8 everywhere. (The use of PUA here cannot be justified by a lack of definition as these ideographs do have formal assignments.) The encoder part is more about discouraging old PUA usage.

But we still need a way to migra[t]e from the old world.

And we need to make sure that the way gives us actual "new world" stuff.


By the way, there should be 24 PUA codepoints in the 2005 standard instead of 14, according to the L2/06-394 "Update on GB 18030:2005" by Ken Lunde.


An interesting but sad example of this dialect split can be shown using the character U+20087 (𠂇), assigned to PUA codepoint U+E816 in the mapping. Search engines like Google won't do normalization on PUA forms where several different sets of agreements exist, and you can see it from the search results.

@annevk
Member

annevk commented Sep 6, 2016

Given that no browser implements gb18030 like that I don't see why we should change this. We could easily break those relying on these bytes mapping to PUA. I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion and those maintaining gb18030 have not decided to care.

@aphillips
Contributor

The GB18030 mapping is naturally fungible wrt PUA characters, since Unicode continues to encode Chinese code points. I think this should be recognized by Encoding.

I agree that we should not remove mapping of Unicode PUA -> GB18030 (compatibility). But the problem here is round-tripping of real Unicode code points with GB18030.

If I have a U+20087, convert it to GB18030, and then later reserialize the GB data as UTF-8, I will get back U+E816 rather than the original (and correct) code point. That's undesirable and a loss of information. The fact that existing implementations haven't caught up with standardization doesn't mean that we shouldn't make this change.

@annevk Under what circumstances would we change? One of the problems with establishing a standard is that implementations are trying hard to be compliant with it...

@annevk
Member

annevk commented Sep 6, 2016

@aphillips what standard are we talking about? The standard for gb18030 has that loss of information and Encoding doesn't modify it (it does modify some other parts).

@Artoria2e5

Artoria2e5 commented Sep 6, 2016

Given that no browser implements gb18030 like that I don't see why we should change this.

Newer Pan-CJK font families like Adobe's Source Han Sans (led by @kenlunde) have decided to go with Unicode instead of GB 18030-flavored Unicode.

I'm also somewhat reluctant to add a note, since as far as I can tell this is just someone's opinion

Dr. Ken "Someone" Lunde (again!) is among the editors of the UAX #38 Unihan database, and has participated extensively in many CJK-related standardization processes in Unicode.

and those maintaining gb18030 have not decided to care.

The Chinese SAC has decided not to care about a lot of things including their translations of ANSI C (GB/T 15272:1994, ISO/IEC 9899:1990) and UCS (GB 13000:2010, ISO/IEC 10646:2003). But this lag doesn't mean that the Chinese are not using newer revisions of the C language and Unicode. The same should apply to the UCS references in GB 18030:2005.


2016-09-12: Found out that W3C (well, that sounds impractical) has some rules regarding using PUA in i18n specs.

@kenlunde

kenlunde commented Sep 6, 2016

@Artoria2e5: Source Han Sans (and the Google-branded Noto Sans CJK) does not support the 24 PUA code points of GB 18030 for these reasons: 1) PUA code points should be avoided in general; 2) PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts; 3) a GB 18030 revision is expected to be published soon that will specify the non-PUA code points for these 24 characters, which will effectively lift the PUA requirement; 4) the 24 characters have had non-PUA code points for over a decade; and 5) the "release" branch of Source Han Sans includes a utf32-gb18030pua24.map file that provides the 24 PUA mappings for those developers who need support for these PUA code points.

@Artoria2e5

Artoria2e5 commented Sep 6, 2016

PUA code points should especially be avoided when mixing multiple standards, which is the case for Pan-CJK fonts;

Hmm, I guess that an encoding spec for dealing with legacy encodings also falls into the scope of "mixing multiple standards". It looks like reasons 1–4 are on my side...

@annevk
Member

annevk commented Sep 7, 2016

I guess if gb18030 is actually revised there is a chance web-focused implementations might want to change their mapping. If that happens and implementations indeed want to make a backwards incompatible change someone should raise a new issue.

@jungshik

This issue was raised by me last year-early this year in #22 (and https://www.w3.org/Bugs/Public/show_bug.cgi?id=28740#c1 ).

As I wrote there, the current mapping makes it impossible to display those characters involved [1] on some platforms (Android and Windows 10 [2]) when they're encoded in GB 18030 because there is NO font covering the corresponding PUA code points. This is one of the most serious consequences of the current mapping to me (besides other consequences mentioned earlier).

OTOH, if there are multiple fonts covering those PUA points with different interpretations, there's no easy way to pick the right one (if the only information at hand is code points) because the identity of a PUA code point is up to private parties and is nondeterministic by definition.

( Needless to say, there'd be no such problem if UTF-8 is used with regular code points and we want everybody to use UTF-8 on the web. )

Given all these, removing any mapping to PUA code points (as long as there are regular Unicode characters) is desired.

As mentioned in #22, I initially thought that GB18030:2005 had fixed all these up (by 2005, all the characters originally mapped to PUA code points had been encoded in Unicode) in a way similar to what's done for HKSCS. It turned out that that was not the case, which was rather disappointing. As a result (and because the 24 affected characters are rarely used, especially the U+FE1x ones), the change for #22 was minimal (only one code point was fixed per GB18030:2005).

Given that GB18030 will be revised soon (per @kenlunde) to eliminate the canonical mapping to PUA code points, Chromium is more than willing to go ahead with mapping the 24 byte-sequences in GB18030 to regular Unicode characters.

[1] In addition to the 14 CJK ideographs/radicals listed earlier, there are vertical form variants that are still mapped to PUA code points. (well, U+FE1x will be virtually unused in gb18030-encoded documents).
\xA6\xD9 U+E78D U+0fe10
\xA6\xDA U+E78E U+0fe12
\xA6\xDB U+E78F U+0fe11
\xA6\xDC U+E790 U+0fe13
\xA6\xDD U+E791 U+0fe14
\xA6\xDE U+E792 U+0fe15
\xA6\xDF U+E793 U+0fe16
\xA6\xEC U+E794 U+0fe17
\xA6\xED U+E795 U+0fe18
\xA6\xF3 U+E796 U+0fe19
\xFE\x51 U+E816 U+20087
\xFE\x52 U+E817 U+20089
\xFE\x53 U+E818 U+200cc
\xFE\x59 U+E81E U+09fb4
\xFE\x61 U+E826 U+09fb5
\xFE\x66 U+E82B U+09fb6
\xFE\x67 U+E82C U+09fb7
\xFE\x6C U+E831 U+215d7
\xFE\x6D U+E832 U+09fb8
\xFE\x76 U+E83B U+2298f
\xFE\x7E U+E843 U+09fb9
\xFE\x90 U+E854 U+09fba
\xFE\x91 U+E855 U+241fe
\xFE\xA0 U+E864 U+09fbb

[2] Android (at least Google's Nexus devices) does not have any font covering the PUA code points listed in [1].
Out of the box (perhaps unless your UI language is Simplified Chinese), Windows 10 does not have Simsun with the PUA code point coverage while it has a newer Chinese font - Microsoft YaHei - with the corresponding regular code point coverage. One can manually add Simsun, though.
At the moment, Chrome OS does have a font covering them (MSung GB18030), but may not in the future.

@annevk
Member

annevk commented Mar 19, 2017

@kenlunde any updates on gb18030 revisions? Something that can be tracked perhaps?

@annevk annevk changed the title from "Remove the last 14 characters PUA of GB18030-2005" to "If gb18030 is revised, consider aligning the Encoding Standard" Mar 19, 2017
@kenlunde

@annevk: I will ping my contact at CESI in China to get the current status of the GB 18030 revision.

@kenlunde

@annevk: My contact at CESI told me that a draft of the GB 18030 revision is expected to be available sometime this year, and is expected to fix known issues, such as this one and the presence of PUA code points when a non-PUA code point is available.

@hsivonen
Member

hsivonen commented Mar 20, 2017

the presence of PUA code points when a non-PUA code point is available.

Will this result in two different byte sequences (two-byte and four-byte) decoding to the same code point for some code points?

Will the PUA code points that previously had two-byte representations be left without a representation?

(I have doubts that changing what a legacy encoding means in terms of mapping to Unicode at this point is a net positive change even if well-intentioned.)

@kenlunde

Sufficient time has passed that to implement GB 18030 in any encoding other than Unicode makes no sense. The main benefit of the GB 18030 revision is to simply remove the PUA requirement from the GB 18030 certification process. Font implementations that map from those 24 PUA code points, to be GB 18030–compliant, should already be double-mapping from the corresponding 24 non-PUA code points.

@hsivonen
Member

If the purpose is to simplify the Unicode subset support certification aspect of GB18030, why is the legacy encoding aspect being changed also?

@kenlunde

@hsivonen: It is a bit premature to know exactly what will change in the legacy encoding in the forthcoming GB 18030 update.

Consider a couple prototypical examples from the 24 characters that currently map to PUA code points:

0xA6D9 currently maps to U+E78D, but the non-PUA equivalent is U+FE10. The GB 18030-2005 standard indicates that U+FE10 corresponds to 0x84318236.

0xFE51 currently maps to U+E816, but the non-PUA equivalent is U+20087. The GB 18030-2005 standard indicates that U+20087 corresponds to 0x95329031.

The mapping for one of the characters in GB 18030-2000 was changed in the 2005 update, which gives us a glimpse about what is likely to change in the forthcoming update:

0xA8BC originally mapped to U+E7C7, but the 2005 update changed the mapping to U+1E3F, which originally mapped from 0x8135F437. 0x8135F437 now maps to U+E7C7. Following this precedent, I would expect the two examples to be changed to the following:

0xA6D9 → U+FE10
0x84318236 → U+E78D

0xFE51 → U+20087
0x95329031 → U+E816
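The swap pattern can be sketched as a small table operation. This is only an illustration of the pattern described above, not a prediction of what the revised standard will contain; `swap_mappings` and the tiny decode table are hypothetical and hold just the two examples.

```python
def swap_mappings(decode_table: dict, two_byte: bytes, four_byte: bytes) -> dict:
    """Return a copy of decode_table with the two sequences' code points exchanged."""
    swapped = dict(decode_table)
    swapped[two_byte] = decode_table[four_byte]
    swapped[four_byte] = decode_table[two_byte]
    return swapped


# GB 18030-2005 decoding for the two example characters.
gb18030_2005 = {
    bytes.fromhex("A6D9"): 0xE78D,
    bytes.fromhex("84318236"): 0xFE10,
    bytes.fromhex("FE51"): 0xE816,
    bytes.fromhex("95329031"): 0x20087,
}

revised = swap_mappings(gb18030_2005, bytes.fromhex("A6D9"), bytes.fromhex("84318236"))
revised = swap_mappings(revised, bytes.fromhex("FE51"), bytes.fromhex("95329031"))

for sequence, code_point in revised.items():
    print(f"0x{sequence.hex().upper()} -> U+{code_point:04X}")
```

Note that after the swap, the four-byte sequences decode to PUA code points, so the table stays one-to-one but existing four-byte data changes meaning.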

@kenlunde

I prepared a complete gb-18030-pua-changes.txt datafile that indicates the PUA change that occurred in the 2005 update, and what we can expect for the forthcoming update for the 24 remaining PUA characters by applying the same pattern.

@hsivonen
Member

0x84318236 → U+E78D
...
0x95329031 → U+E816

It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.

I don't see how that could be a good thing for any practical interop purpose. (I can see how that could seem appealing to the theory that the GB18030 encoding is a bijective UTF, but that's already not the case as far as the Web is concerned due to U+3000 being double-mapped and U+E5E5 being unmappable.)

@kenlunde

Right. I was merely showing one possible way in which China may change GB 18030 to remove the PUA requirement, by applying the pattern that was used in the 2005 update. The single mapping change in the 2005 update may have been one-off–ish enough that China figured it would be harmless, but 24 mapping changes may be a bit much to swallow at once.

The history of GB 18030 goes back to GBK, which included significantly more PUA mappings, a little over 100. The ones that could be changed to non-PUA mappings were changed, and only 25 remained in GB 18030-2000, in terms of the "required" portion.

The other way to remove the PUA requirement while keeping the mapping stable is to first remove the requirement to support the following 24 characters:

0xA6D9 -> U+E78D
0xA6DA -> U+E78E
0xA6DB -> U+E78F
0xA6DC -> U+E790
0xA6DD -> U+E791
0xA6DE -> U+E792
0xA6DF -> U+E793
0xA6EC -> U+E794
0xA6ED -> U+E795
0xA6F3 -> U+E796
0xFE51 -> U+E816
0xFE52 -> U+E817
0xFE53 -> U+E818
0xFE59 -> U+E81E
0xFE61 -> U+E826
0xFE66 -> U+E82B
0xFE67 -> U+E82C
0xFE6C -> U+E831
0xFE6D -> U+E832
0xFE76 -> U+E83B
0xFE7E -> U+E843
0xFE90 -> U+E854
0xFE91 -> U+E855
0xFEA0 -> U+E864

And second, to require the following 24 characters:

0x82359037 -> U+9FB4
0x82359038 -> U+9FB5
0x82359039 -> U+9FB6
0x82359130 -> U+9FB7
0x82359131 -> U+9FB8
0x82359132 -> U+9FB9
0x82359133 -> U+9FBA
0x82359134 -> U+9FBB
0x84318236 -> U+FE10
0x84318237 -> U+FE11
0x84318238 -> U+FE12
0x84318239 -> U+FE13
0x84318330 -> U+FE14
0x84318331 -> U+FE15
0x84318332 -> U+FE16
0x84318333 -> U+FE17
0x84318334 -> U+FE18
0x84318335 -> U+FE19
0x95329031 -> U+20087
0x95329033 -> U+20089
0x95329730 -> U+200CC
0x9536B937 -> U+215D7
0x9630BA35 -> U+2298F
0x9635B630 -> U+241FE

My guess is that the original 24 characters, in terms of supporting their mappings, will be changed from "required" to "optional," and that the additional 24 characters will be changed from "optional" to "required" if the original 24 characters are not supported. Or, something to that effect.
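The supplementary-plane rows of the second list can be reproduced with plain arithmetic, because four-byte GB 18030 sequences for U+10000 and above are assigned linearly. A minimal sketch, assuming the linear index of U+10000 is 189000 as in the Encoding Standard's gb18030 ranges; the function name is hypothetical, and the BMP rows (U+9FB4 to U+9FBB, U+FE10 to U+FE19) are not handled because they need the ranges table.

```python
def gb18030_four_byte_supplementary(code_point: int) -> bytes:
    """Four-byte GB 18030 sequence for a code point in U+10000..U+10FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points are handled")
    # Assumption: U+10000 sits at linear index 189000 in the four-byte space.
    pointer = code_point - 0x10000 + 189000
    # Bytes 1 and 3 take 126 values (0x81..0xFE); bytes 2 and 4 take 10 (0x30..0x39).
    byte1, pointer = divmod(pointer, 10 * 126 * 10)
    byte2, pointer = divmod(pointer, 10 * 126)
    byte3, byte4 = divmod(pointer, 10)
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])


for cp in (0x20087, 0x20089, 0x200CC, 0x215D7, 0x2298F, 0x241FE):
    sequence = gb18030_four_byte_supplementary(cp)
    print(f"0x{sequence.hex().upper()} -> U+{cp:04X}")
```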

@jungshik

It seems harmful, and against the goal of avoiding the PUA, to change byte sequences that previously decoded to non-PUA code points to decode to PUA code points. This means that data out there that previously decoded to (assigned in Unicode) non-PUA code points would start mapping to the PUA.

In principle, I agree with you. A practical question is which way is more widely used,
"0xFE59 -> U+E81E" or "0x82359037 -> U+9FB4" to represent a character whose glyph looks like that of U+9FB4?

My guess is that 0xFE59 has been used a lot more often to represent a character that looks like U+9FB4 (encoded at U+E81E in some fonts) than 0x82359037 in GB 18030 documents.

If 4-byte sequences for the 24 characters in question are extremely rare (virtually non-existent) while 2-byte sequences are relatively common (still pretty rare), the harm of repeating the 2005 change for the 24 characters is relatively contained. It also has the benefit of making it possible to display legacy GB18030-encoded documents on Android and elsewhere where there's no font coverage for U+Exxx PUA code points.

@kenlunde

All 24 characters in question are either compatibility characters (vertical forms of punctuation that are normally accessible via their horizontal counterparts and the 'vert' GSUB feature) or ideograph components that were encoded as stand-alone ideographs. This means that the chance of encountering them in the wild, in genuine documents, is somewhere between very slim and none.

Besides, supporting PUA code points in the context of the Noto CJK and Source Han fonts is a total non-starter, mainly because they are Pan-CJK typefaces, and PUA usage is extremely dangerous in such contexts.

@jungshik

Besides, supporting PUA code points in the context of the Noto CJK and Source Han fonts is a total non-starter, mainly because they are Pan-CJK typefaces, and PUA usage is extremely dangerous in such contexts.

There's no doubt about that ! I'm in full agreement with you on that.

This means that the chance of encountering them in the wild, in genuine documents, is somewhere between very slim and none.

What I'm saying is that repeating what was done in 2005 is likely to be slightly better in terms of rendering legacy gb18030 documents on Android and other places where no font support for U+Exxx PUAs is available. (Note that newer Windows Chinese fonts do not cover those PUA code points either.)

However, given what @kenlunde wrote, it may not matter much whichever way we choose.

@lygstate
Author

It may be coming soon:
http://www.cesi.cn/201810/4436.html

@kenlunde

kenlunde commented Oct 22, 2018

@lygstate and others: One of my friends at CESI shared with me the text from the final draft a few days ago. This confirmed that the PUA requirement for the 24 characters is being lifted. They also removed the 21 ideographs that are in the CJK Compatibility Ideographs block. While this may sound good at first, there are actually 12 CJK Unified Ideographs among those 21 ideographs, meaning that they do not decompose. Here they are: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨 & U+FA29 﨩.

I immediately alerted CESI about this, and it sounds like they will restore those twelve CJK Unified Ideographs. The year of the standard is likely to be 2018, because that is the year when it was finished, though its release is expected to be early next year.

@wswartzendruber

Hello. I am an independent individual who is implementing a GB 18030 encoder purely as a hobby (it is quite challenging to do concisely). Was there ever an update on GB 18030-2018?

@lygstate
Author

lygstate commented Sep 5, 2021

@kenlunde Hi, has the new 2018 standard been released?

@TimothyGu
Member

TimothyGu commented Sep 14, 2021

It appears that it's still under review. See IRGN2496 that was published on 2021-09-09:

1. Progress of the revision of GB 18030 Information technology -- Chinese coded character set

GB 18030 Information technology -- Chinese coded character set is one of the most important mandatory national standards of PRC. It has been revised once and the latest edition is GB 18030-2005. The third edition is now under review by the related authority.

(note, this is the same update as IRGN2453, dated 2021-03-05)

One thing that could have delayed the publication of the new revision could be related to the new regulations alluded to in IRGN2453:

To be noticed, according to the Measures for the Administration of Mandatory National Standards issued by the State Administration for Market Regulation of PRC in 2020, there is a big change that in the third edition the full text of GB 18030 should be all mandatory instead of partially mandatory in its previous edition.

@TimothyGu
Member

@kenlunde just told me that during a meeting yesterday(!!), the delegates from China anticipate the revision to be published next year (2022). Here's some more information:

Note that Implementation Level 1 is the current status quo for GB 18030 support. Implementation Level 2 requires the following 199 additional ideographs from TGH-2013 (通用规范汉字表):

All glyphs

U+9FCD 鿍
U+9FCE 鿎
U+9FCF 鿏
U+20164 𠅤
U+20676 𠙶
U+20CD0 𠳐
U+2139A 𡎚
U+21413 𡐓
U+235CB 𣗋
U+23C97 𣲗
U+23C98 𣲘
U+23E23 𣸣
U+249DB 𤧛
U+24A7D 𤩽
U+24AC9 𤫉
U+25532 𥔲
U+25562 𥕢
U+255A8 𥖨
U+25ED7 𥻗
U+26221 𦈡
U+2648D 𦒍
U+26676 𦙶
U+2677C 𦝼
U+26B5C 𦭜
U+26C21 𦰡
U+27FF9 𧿹
U+28408 𨐈
U+28678 𨙸
U+28695 𨚕
U+287E0 𨟠
U+28B49 𨭉
U+28C47 𨱇
U+28C4F 𨱏
U+28C51 𨱑
U+28C54 𨱔
U+28E99 𨺙
U+29F7E 𩽾
U+29F83 𩾃
U+29F8C 𩾌
U+2A7DD 𪟝
U+2A8FB 𪣻
U+2A917 𪤗
U+2AA30 𪨰
U+2AA36 𪨶
U+2AA58 𪩘
U+2AFA2 𪾢
U+2B127 𫄧
U+2B128 𫄨
U+2B137 𫄷
U+2B138 𫄸
U+2B1ED 𫇭
U+2B300 𫌀
U+2B363 𫍣
U+2B36F 𫍯
U+2B372 𫍲
U+2B37D 𫍽
U+2B404 𫐄
U+2B410 𫐐
U+2B413 𫐓
U+2B461 𫑡
U+2B4E7 𫓧
U+2B4EF 𫓯
U+2B4F6 𫓶
U+2B4F9 𫓹
U+2B50D 𫔍
U+2B50E 𫔎
U+2B536 𫔶
U+2B5AE 𫖮
U+2B5AF 𫖯
U+2B5B3 𫖳
U+2B5E7 𫗧
U+2B5F4 𫗴
U+2B61C 𫘜
U+2B61D 𫘝
U+2B626 𫘦
U+2B627 𫘧
U+2B628 𫘨
U+2B62A 𫘪
U+2B62C 𫘬
U+2B695 𫚕
U+2B696 𫚖
U+2B6AD 𫚭
U+2B6ED 𫛭
U+2B7A9 𫞩
U+2B7C5 𫟅
U+2B7E6 𫟦
U+2B7F9 𫟹
U+2B7FC 𫟼
U+2B806 𫠆
U+2B80A 𫠊
U+2B81C 𫠜
U+2B8B8 𫢸
U+2BAC7 𫫇
U+2BB5F 𫭟
U+2BB62 𫭢
U+2BB7C 𫭼
U+2BB83 𫮃
U+2BC1B 𫰛
U+2BD77 𫵷
U+2BD87 𫶇
U+2BDF7 𫷷
U+2BE29 𫸩
U+2C029 𬀩
U+2C02A 𬀪
U+2C0A9 𬂩
U+2C0CA 𬃊
U+2C1D5 𬇕
U+2C1D9 𬇙
U+2C1F9 𬇹
U+2C27C 𬉼
U+2C288 𬊈
U+2C2A4 𬊤
U+2C317 𬌗
U+2C35B 𬍛
U+2C361 𬍡
U+2C364 𬍤
U+2C488 𬒈
U+2C494 𬒔
U+2C497 𬒗
U+2C542 𬕂
U+2C613 𬘓
U+2C618 𬘘
U+2C621 𬘡
U+2C629 𬘩
U+2C62B 𬘫
U+2C62C 𬘬
U+2C62D 𬘭
U+2C62F 𬘯
U+2C642 𬙂
U+2C64A 𬙊
U+2C64B 𬙋
U+2C72C 𬜬
U+2C72F 𬜯
U+2C79F 𬞟
U+2C7C1 𬟁
U+2C7FD 𬟽
U+2C8D9 𬣙
U+2C8DE 𬣞
U+2C8E1 𬣡
U+2C8F3 𬣳
U+2C907 𬤇
U+2C90A 𬤊
U+2C91D 𬤝
U+2CA02 𬨂
U+2CA0E 𬨎
U+2CA7D 𬩽
U+2CAA9 𬪩
U+2CB29 𬬩
U+2CB2D 𬬭
U+2CB2E 𬬮
U+2CB31 𬬱
U+2CB38 𬬸
U+2CB39 𬬹
U+2CB3B 𬬻
U+2CB3F 𬬿
U+2CB41 𬭁
U+2CB4A 𬭊
U+2CB4E 𬭎
U+2CB5A 𬭚
U+2CB5B 𬭛
U+2CB64 𬭤
U+2CB69 𬭩
U+2CB6C 𬭬
U+2CB6F 𬭯
U+2CB73 𬭳
U+2CB76 𬭶
U+2CB78 𬭸
U+2CB7C 𬭼
U+2CBB1 𬮱
U+2CBBF 𬮿
U+2CBC0 𬯀
U+2CBCE 𬯎
U+2CC56 𬱖
U+2CC5F 𬱟
U+2CCF5 𬳵
U+2CCF6 𬳶
U+2CCFD 𬳽
U+2CCFF 𬳿
U+2CD02 𬴂
U+2CD03 𬴃
U+2CD0A 𬴊
U+2CD8B 𬶋
U+2CD8D 𬶍
U+2CD8F 𬶏
U+2CD90 𬶐
U+2CD9F 𬶟
U+2CDA0 𬶠
U+2CDA8 𬶨
U+2CDAD 𬶭
U+2CDAE 𬶮
U+2CDD5 𬷕
U+2CE18 𬸘
U+2CE1A 𬸚
U+2CE23 𬸣
U+2CE26 𬸦
U+2CE2A 𬸪
U+2CE7C 𬹼
U+2CE88 𬺈
U+2CE93 𬺓

The PingFang, Source Han, and Noto CJK fonts support Implementation Level 2.
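
For implementers, most of the Level 2 code points listed above sit outside the BMP, where gb18030 uses four-byte sequences computed arithmetically rather than looked up in a table. A minimal sketch in Python, assuming the linear supplementary-plane mapping in which U+10000 corresponds to pointer 189000. The function name is illustrative only and not taken from any specification:

```python
def gb18030_four_byte(code_point: int) -> bytes:
    """Encode a supplementary-plane code point as a gb18030 four-byte sequence."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("this sketch only covers U+10000..U+10FFFF")
    # Assumption: U+10000 sits at pointer 189000 and the mapping is linear from there.
    pointer = code_point - 0x10000 + 189000
    byte1, pointer = divmod(pointer, 10 * 126 * 10)
    byte2, pointer = divmod(pointer, 10 * 126)
    byte3, byte4 = divmod(pointer, 10)
    return bytes([byte1 + 0x81, byte2 + 0x30, byte3 + 0x81, byte4 + 0x30])


# Cross-check against Python's built-in codec for a few code points from the list.
for cp in (0x20164, 0x2B127, 0x2CE93):
    assert gb18030_four_byte(cp) == chr(cp).encode("gb18030")
```

The three BMP entries at the top of the list (U+9FCD to U+9FCF) are not covered by this sketch, since BMP code points go through a ranges table instead of a single linear offset.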

@xfq

xfq commented Aug 17, 2022

The GB 18030-2022 standard was published in July 2022. The new version includes 196 Chinese characters that are in the Table of General Standard Chinese Characters but not in GB 18030-2005, adds 17,000+ other new Chinese characters, and establishes three implementation levels.

This standard is mandatory in Mainland China. All software supporting Chinese information processing and exchange must support level 1, some software such as operating systems, databases, and middleware must support level 2, and systems for public services must support level 3. (The higher the level, the more Chinese characters are included.)

@xfq xfq added i18n-tracker Group bringing to attention of Internationalization, or tracked by i18n but not needing response. i18n-clreq Notifies Chinese script experts of relevant issues labels Aug 17, 2022
@hsivonen
Member

The GB 18030-2022 standard was published in July 2022.

More information in English

This standard is mandatory in Mainland China.

Support for the mandatory characters can be achieved via UTF-8, which the Encoding Standard already supports (and, evidently, there are now UTF-8-only software and formats).
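
As a small illustration of that point, any of the Level 2 ideographs listed earlier in the thread round-trips through UTF-8 without a mapping table. A sketch in Python, where the sample code points are taken from that list and the variable names are illustrative:

```python
# Code points taken from the Level 2 list earlier in the thread.
samples = [0x9FCD, 0x20164, 0x2CE93]

for cp in samples:
    encoded = chr(cp).encode("utf-8")
    # BMP code points in this range need three bytes, supplementary ones four.
    assert len(encoded) == (3 if cp < 0x10000 else 4)
    assert encoded.decode("utf-8") == chr(cp)

u20164_utf8 = chr(0x20164).encode("utf-8")
```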

@vyv03354
Collaborator

Do we implement 18 code point swaps, after all?

@annevk
Member

annevk commented Oct 27, 2022

https://unicode-org.atlassian.net/browse/ICU-22098 might have implications for what some implementations do, I presume. That raises the importance of resolving #57 somewhat.

@hfhchan

hfhchan commented Oct 28, 2022

GB18030-2022 will take effect on 1 Aug 2023. Compliance criteria include, at a minimum, not emitting PUA characters for the 24 characters for input methods, and not using the 24 PUA codepoints for fonts.

However, most existing products sold on the Chinese market fail these tests, and those old versions are expected to remain in use even though they can no longer be sold after the effective date. There is also existing UTF-8 content that uses those PUA codepoints.

To be backwards compatible with older products based on the GB18030-2005 standards, both the PUA and the non-PUA codepoints should map to the correct GB18030-2022 2-byte sequences.

Whether or not the 4-byte sequences should map to the non-PUA codepoints is less of an issue -- it is not expected that there be data in GB18030 that are stored in the 4-byte form. However, if keeping the double mapping to U+3000 is deemed web compatible, then keeping the 4-byte sequences mapped to the non-PUA codepoints should also be web compatible in the same manner.
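
The double mapping proposed above can be sketched as follows. This is a minimal Python illustration, not an implementation of any standard: the dictionaries and function names are hypothetical, and the single entry is the FE59 row from the table in the opening comment, used purely as an example.

```python
# Illustrative single entry: the FE59 row from the table in the opening comment.
# The real tables are much larger; these dictionaries are hypothetical stand-ins.
DECODE = {b"\xfe\x59": 0x9FB4}   # bytes -> non-PUA code point only
ENCODE = {
    0x9FB4: b"\xfe\x59",         # non-PUA code point
    0xE81E: b"\xfe\x59",         # legacy PUA code point, same two bytes
}


def encode(text: str) -> bytes:
    """Encode text whose characters are all present in ENCODE."""
    return b"".join(ENCODE[ord(ch)] for ch in text)


def decode(data: bytes) -> str:
    """Decode two-byte sequences only, which is all this sketch models."""
    return "".join(chr(DECODE[data[i:i + 2]]) for i in range(0, len(data), 2))
```

With these tables, content that still carries the PUA code point and content that uses the non-PUA code point both encode to the same two bytes, while decoding never produces the PUA code point.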

@annevk
Member

annevk commented Sep 17, 2024

I created #336 which I hope addresses this. Review most welcome!

annevk added a commit that referenced this issue Sep 18, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) is encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) is ignored, as it is already covered by index gb18030 ranges. (Presumably these mappings are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (the encoder-only table is at the moment excluded for GBK only, but including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. The aim is to complete that with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.
@annevk annevk mentioned this issue Sep 18, 2024
5 tasks
annevk added a commit that referenced this issue Oct 4, 2024
This implements the Unicode Technical Committee recommendation around GB18030-2022 in a manner suitable for this standard, taking into account existing practice and the closeness between GBK and gb18030.

In particular, using the text file attached to https://www.unicode.org/L2/L2023/23003r-gb18030-recommendations.pdf this does the following:

1. Merges the first set of 18 mappings, which are bidirectional, directly into index gb18030, replacing existing PUA entries. This ends up impacting GBK and gb18030.
2. The second set of 18 mappings (from PUA to bytes) is encoded as an encoder-only table, for both GBK and gb18030.
3. The third set of 18 mappings (from bytes to code points) is ignored, as it is already covered by index gb18030 ranges. (Presumably these mappings are included because the recommendation covers the transition from "Previous Mappings" to "Current Mappings" to "Recommended Mappings", whereas we are going directly from "Previous Mappings" to "Recommended Mappings".)

The reason for changing GBK as well is that Chromium and WebKit already have code in the wild that impacts GBK to some degree (the encoder-only table is at the moment excluded for GBK only, but including it would make the most sense compatibility-wise) and no fallout has been recorded. Additionally, GBK is already positioned as a rough subset of gb18030 in this standard, with the decoder being shared completely.

Tests: encoding/legacy-mb-schinese has some GB18030-2022 coverage already. This is completed with web-platform-tests/wpt#48239 and web-platform-tests/wpt#48240.

This supersedes #335. This fixes #27 and fixes #312.

This also updates the description of index gb18030 ranges to account for #22 (the change from GB18030-2000 to -2005), which it had not done until now.
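
The three steps in the commit message can be sketched roughly as below. This is a hedged Python illustration under several assumptions: the function names are hypothetical, the sample data is a single entry built from the FE59 row of the table in the opening comment rather than the actual recommendation file, the two-byte pointer arithmetic follows the Encoding Standard's gb18030 scheme as I understand it, and consulting the encoder-only table before the index is an assumed lookup order.

```python
def two_byte_pointer(lead: int, trail: int) -> int:
    """Pointer into the two-byte index for a lead/trail byte pair."""
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def pointer_to_bytes(pointer: int) -> bytes:
    """Inverse of two_byte_pointer."""
    lead, trail = divmod(pointer, 190)
    offset = 0x40 if trail < 0x3F else 0x41
    return bytes([lead + 0x81, trail + offset])


def apply_recommendation(index, bidirectional, pua_to_pointer):
    """Return (new_index, encoder_only) without mutating the inputs."""
    new_index = dict(index)
    new_index.update(bidirectional)        # step 1: replace PUA entries in the index
    encoder_only = dict(pua_to_pointer)    # step 2: PUA -> pointer, encoder side only
    # Step 3: the bytes -> code point set is ignored (covered by the ranges index).
    return new_index, encoder_only


def encode_code_point(cp, index, encoder_only):
    """Encode one code point, checking the encoder-only table first (assumed order)."""
    if cp in encoder_only:
        return pointer_to_bytes(encoder_only[cp])
    for pointer, mapped in index.items():
        if mapped == cp:
            return pointer_to_bytes(pointer)
    raise ValueError(f"U+{cp:04X} is not in this sketch's tables")


# Hypothetical one-entry sample data, not the real recommendation file.
p = two_byte_pointer(0xFE, 0x59)
old_index = {p: 0xE81E}
new_index, encoder_only = apply_recommendation(old_index, {p: 0x9FB4}, {0xE81E: p})
```

After the merge, decoding through `new_index` yields only the non-PUA code point, while the encoder still accepts the PUA code point through `encoder_only`.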