Structure

GB 18030-2000 encodes characters in sequences of one, two, or four bytes. Valid byte sequences are as follows (byte values are hexadecimal):

  • Single-byte: 00-80 (*)
  • Two-byte: 81-fe | 40-7e, 80-fe
  • Four-byte: 81-fe | 30-39 | 81-fe | 30-39

 

(*) Note: At the time of this writing, it seems that the single byte 0x80 should be treated as valid but unassigned, while the single byte 0xff should be treated as illegal.
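The byte ranges above can be turned into a small structural check. The following is a minimal sketch, not a full decoder: the function name gb18030_sequence_length is hypothetical, and the code only tests whether bytes fall into the listed ranges, not whether the sequence is assigned to a character.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4 if a structurally valid GB 18030 sequence starts
    at data[pos]; return 0 if the bytes there are invalid or truncated.

    Hypothetical helper for illustration. It checks byte ranges only and
    says nothing about whether the sequence is assigned.
    """
    if pos >= len(data):
        return 0
    b1 = data[pos]
    if b1 <= 0x80:
        # Single byte 00-80 (0x80 is treated as valid but unassigned).
        return 1
    if b1 == 0xFF:
        # 0xff is never a valid byte in the first position.
        return 0
    # From here on, b1 is a lead byte in 81-fe.
    if pos + 1 >= len(data):
        return 0
    b2 = data[pos + 1]
    if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
        return 2
    if 0x30 <= b2 <= 0x39:
        # A digit byte in second position announces a four-byte sequence.
        if pos + 3 >= len(data):
            return 0
        b3, b4 = data[pos + 2], data[pos + 3]
        if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
            return 4
    return 0
```

Note that the second byte alone decides between the two-byte and four-byte forms: the ranges 30-39 and 40-7e, 80-fe do not overlap.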

GB 18030 was created with GBK as a basis. The Unicode mapping table for GB 18030 starts with the same mappings for single-byte and double-byte sequences as the Unicode mapping table for GBK, except for a few dozen characters. These characters were not yet assigned in Unicode 2.1 and were therefore mapped in the GBK mapping table to Unicode Private-Use code points. GB 18030 maps them to the code points newly assigned in Unicode 3.0 for the corresponding characters. This keeps the GBK byte sequences the same for these characters, but the Unicode mapping table yields different results for them.

In addition, all Unicode code points that are not mapped by this updated GBK portion are mapped to four-byte sequences, which are new in GB 18030. They are simply enumerated beginning at the lowest such Unicode code point (U+0080) and at the lowest such four-byte sequence (GB+81308130). One such enumeration fills in the 40,000 or so Unicode BMP code points that were not covered by GBK (GB lead bytes 0x81..0x84). Another such enumeration covers the 1 million supplementary Unicode code points (GB lead bytes 0x90..0xe3).
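The enumeration of four-byte sequences can be sketched as a linear index: the first and third bytes each take 126 values (81-fe) and the second and fourth bytes each take 10 values (30-39), so GB+81308130 is index 0. The sketch below applies this to the supplementary range only, which is contiguous. It assumes that U+10000 corresponds to GB+90308130, the first sequence with lead byte 0x90. All function names are hypothetical. The BMP enumeration is not covered, because it skips every code point already mapped by the one- and two-byte portion and therefore needs a range table.

```python
def four_byte_to_linear(seq: bytes) -> int:
    """Linear index of a four-byte sequence; GB+81308130 has index 0."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_byte(index: int) -> bytes:
    """Inverse of four_byte_to_linear."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([0x81 + d1, 0x30 + d2, 0x81 + d3, 0x30 + d4])


# Assumption: the supplementary enumeration starts at GB+90308130 for U+10000.
SUPPLEMENTARY_BASE = four_byte_to_linear(b"\x90\x30\x81\x30")


def supplementary_to_gb(code_point: int) -> bytes:
    """Map a supplementary code point (U+10000..U+10FFFF) to four bytes."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    return linear_to_four_byte(SUPPLEMENTARY_BASE + (code_point - 0x10000))


def gb_to_supplementary(seq: bytes) -> int:
    """Map a four-byte sequence with lead byte 90..e3 back to a code point."""
    code_point = 0x10000 + four_byte_to_linear(seq) - SUPPLEMENTARY_BASE
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("sequence is outside the supplementary range")
    return code_point
```

Under these assumptions, U+10FFFF comes out as GB+E3329A35, which matches the stated lead-byte range 0x90..0xe3. The results can be cross-checked against an existing converter, for example Python's built-in "gb18030" codec.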

One of the biggest changes with the re-released mapping table from November, compared to the initial one, is that all of the 40,000 mappings to BMP code points were changed. This is mainly (but not only) due to starting the BMP enumeration at U+0080 instead of U+0081.

The current Unicode mapping table in the XML format as described in Unicode Technical Report 22 is available on the ICU Web site (see Resources).

The current Unicode mapping table contains only round-trip mappings. The original mapping table contained fallback mappings for the GBK characters that were updated according to Unicode 3.0: Their old GBK Private-Use code points were mapped unidirectionally to the GB codes, while the round-trip mappings were changed (compared to GBK) to be from the GB codes to the new (Unicode 3.0) code points. In the new mapping table, the fallback mappings are removed, and the Private-Use code points instead map to new four-byte sequences with round-trip mappings.

Note: Like some GBK implementations, the original publication of GB 18030-2000 assigned the Euro currency symbol to the single byte 0x80. The updated mapping table from November leaves 0x80 unassigned and instead maps the two-byte sequence 0xa2e3 to U+20ac for the Euro symbol.

GB 18030 has 1.6 million valid byte sequences, but there are only 1.1 million code points in Unicode, so there are about 500,000 byte sequences in GB 18030 that are currently unassigned.
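The figures in this paragraph follow directly from the byte ranges listed at the start of this section. A short sketch of the arithmetic (the variable names are illustrative, and the single byte 0x80 is counted as valid):

```python
# Count structurally valid GB 18030 byte sequences from the byte ranges.
single_byte = 0x80 - 0x00 + 1                         # 00-80: 129
lead_bytes = 0xFE - 0x81 + 1                          # 81-fe: 126
trail_bytes = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)   # 40-7e, 80-fe: 190
digit_bytes = 0x39 - 0x30 + 1                         # 30-39: 10

two_byte = lead_bytes * trail_bytes                                # 23,940
four_byte = lead_bytes * digit_bytes * lead_bytes * digit_bytes    # 1,587,600

total_sequences = single_byte + two_byte + four_byte   # 1,611,669
unicode_code_points = 0x10FFFF + 1                     # 1,114,112

# Sequences that cannot correspond to any Unicode code point.
unassigned = total_sequences - unicode_code_points     # 497,557

print(f"{total_sequences:,} sequences, {unicode_code_points:,} code points, "
      f"{unassigned:,} left over")
```

Nearly all of the capacity comes from the four-byte form, which alone provides about 1.59 million sequences.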


