Challenges for implementations of GB 18030

GB 18030 has some unusual properties that present challenges for an implementation of a codepage converter as well as for in-process use:

  • It is huge: With the encoding structure as described above, there are more than 1.6 million valid byte sequences -- probably the largest codepage.
  • It is similar to a UTF: All 1.1 million Unicode code points U+0000-U+10ffff except for surrogates U+d800-U+dfff map to and from GB 18030 codes. This includes unassigned and "not-a-character" code points.
  • GB 18030 is defined as much with charts of assigned characters as with a mapping table to and from Unicode.
  • The length of a byte sequence cannot always be determined from its first byte alone: a lead byte in 0x81..0xFE can begin either a two-byte or a four-byte sequence, and only the second byte tells the two apart.
  • The four-byte sequences use trail byte values 0x30..0x39, while common ASCII-based multi-byte encodings use trail byte values of 0x40 and above. (0x30..0x39 are the ASCII code values for the decimal digits.) This means that there is an even larger overlap between single-byte values and trail-byte values, which makes random access in GB 18030 text even more difficult than in other multi-byte codepages.
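
The size and sequence-length points above can be illustrated with a short Python sketch. It is not taken from any converter implementation; the names `count_valid_sequences` and `sequence_length` are hypothetical, and the byte ranges assumed are the usual GB 18030 structure (single bytes 0x00..0x7F, lead bytes 0x81..0xFE, two-byte trail bytes 0x40..0x7E and 0x80..0xFE, and digit bytes 0x30..0x39 in the second and fourth positions of four-byte sequences).

```python
# Assumed GB 18030 byte ranges (see the lead-in; names are illustrative only).
SINGLE = range(0x00, 0x80)   # single-byte sequences 0x00..0x7F
LEAD = range(0x81, 0xFF)     # first and third byte: 0x81..0xFE
DIGIT = range(0x30, 0x3A)    # second and fourth byte of a four-byte sequence
# Trail bytes of two-byte sequences: 0x40..0x7E and 0x80..0xFE (0x7F excluded).
TRAIL2 = frozenset(b for b in range(0x40, 0xFF) if b != 0x7F)


def count_valid_sequences() -> int:
    """Count all structurally valid one-, two- and four-byte sequences."""
    one = len(SINGLE)
    two = len(LEAD) * len(TRAIL2)
    four = (len(LEAD) * len(DIGIT)) ** 2
    return one + two + four


def sequence_length(data: bytes, i: int = 0) -> int:
    """Return 1, 2 or 4 for the sequence starting at data[i].

    Raises ValueError for truncated or structurally invalid input.
    """
    b1 = data[i]
    if b1 in SINGLE:
        return 1
    if b1 not in LEAD:
        raise ValueError(f"invalid lead byte 0x{b1:02X}")
    if i + 1 >= len(data):
        raise ValueError("truncated sequence")
    # The first byte alone is ambiguous; the second byte decides 2 vs. 4.
    b2 = data[i + 1]
    if b2 in TRAIL2:
        return 2
    if b2 in DIGIT:
        if i + 3 >= len(data):
            raise ValueError("truncated four-byte sequence")
        if data[i + 2] not in LEAD or data[i + 3] not in DIGIT:
            raise ValueError("invalid four-byte sequence")
        return 4
    raise ValueError(f"invalid second byte 0x{b2:02X}")
```

Under these assumptions the count works out to 128 + 126 × 190 + (126 × 10)² = 1,611,668 byte sequences. The overlap with ASCII digits also shows up directly: the expression `b"0" in b"\x81\x30\x81\x30"` is true even though those four bytes form a single four-byte sequence, so a naive byte-level search for a digit can match in the middle of a character.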
