
Globalize your On Demand Business

Suggestions for dealing with these challenges

An implementation of GB 18030 needs to be able to determine the length of a byte sequence by examining not only the lead byte, but at least the second byte of a multi-byte sequence as well. This could be hard-coded for GB 18030, or could be done in a more general way with a state machine that represents the entire validity structure of this codepage. Such a state machine could be purely data-driven and would be useful for all multi-byte encodings. It provides a general approach for checking that any byte sequence is valid in a given codepage.
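As a rough sketch of the hard-coded variant, the following Python function determines the length of a sequence from its lead byte and, where needed, its second byte. The byte ranges it uses (single bytes 0x00-0x7F, lead bytes 0x81-0xFE, two-byte trail bytes 0x40-0x7E and 0x80-0xFE, four-byte sequences with 0x30-0x39 in the second and fourth positions) come from the GB 18030 structure rather than from this section. The function name is hypothetical, and rejecting 0x80 and 0xFF as lead bytes is an assumption of this sketch.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2 or 4: the length of the GB 18030 sequence starting at pos.

    Raises ValueError for truncated or structurally invalid input.
    """
    if pos >= len(data):
        raise ValueError("no input at this position")
    lead = data[pos]
    if lead <= 0x7F:
        return 1  # single byte, no further inspection needed
    if not 0x81 <= lead <= 0xFE:
        # Assumption of this sketch: 0x80 and 0xFF are treated as invalid.
        raise ValueError(f"invalid lead byte 0x{lead:02X}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence")

    # The lead byte alone cannot distinguish two-byte from four-byte
    # sequences; the second byte decides.
    second = data[pos + 1]
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2
    if 0x30 <= second <= 0x39:
        rest = data[pos + 2:pos + 4]
        if len(rest) < 2:
            raise ValueError("truncated four-byte sequence")
        if 0x81 <= rest[0] <= 0xFE and 0x30 <= rest[1] <= 0x39:
            return 4
        raise ValueError("invalid third or fourth byte")
    raise ValueError(f"invalid second byte 0x{second:02X}")
```

A data-driven state machine would encode the same ranges in a table instead of in `if` statements, so that the same code could validate other multi-byte codepages as well.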

Because GB 18030 is specified with a mapping table that covers all Unicode code points, there are only two options for supporting it fully:

  • GB 18030 can be used directly as an in-process encoding. The application must then be aware of the complex multi-byte structure, which includes four-byte sequences and in which almost all single-byte values are also valid trail bytes.
  • GB 18030 can be converted to and from Unicode without loss because its specification is based on Unicode. The application then needs only a converter and can process text in Unicode. Converting GB 18030 into any non-Unicode encoding can lose some of the text.
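The difference between the lossless Unicode round trip and a lossy conversion to another codepage can be illustrated with the `gb18030` and `gb2312` codecs that ship with Python. This is only an illustration: the helper name `round_trips` and the sample string are made up for this sketch.

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


# Two common ideographs, the euro sign, and a supplementary code point.
sample = "中文 \u20ac \U00020000"

print(round_trips(sample, "gb18030"))  # True
print(round_trips(sample, "gb2312"))   # False

# Forcing the conversion to a non-Unicode codepage replaces what it cannot hold.
lossy = sample.encode("gb2312", errors="replace").decode("gb2312")
print(lossy)
```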

The sheer number of valid byte sequences, of Unicode code points covered, and of mappings defined between them makes it impractical to use a normal, purely mapping-table-based codepage converter directly. With about 1.1 million mappings, a simple mapping table would be several megabytes in size. Most likely, some initial implementations will not support GB 18030 fully, but only some subset of it.

A simple and effective way to handle the large number of defined mappings is to handle most of the four-byte sequences algorithmically. This is possible because the mappings between four-byte GB 18030 sequences and Unicode code points are a result of an enumeration process (see the Structure description above). Large portions of the mapping table contain entries that differ by exactly one position in both Unicode code points and byte sequences. It is possible to extract a small number of such contiguously-enumerated ranges mechanically (for details about how to do this, see this page). The result is that only the remaining mappings need to be stored in an actual mapping table, while the ranges are mapped by special code in a converter.
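One way such an extraction could work is sketched below. It assumes that each four-byte sequence has already been turned into its position in the enumeration (a "linear" value), so that neighbouring sequences differ by exactly one. The function name, the `min_length` threshold and the tuple layout are assumptions of this sketch, not the procedure described on the referenced page.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[int, int]        # (Unicode code point, linear GB 18030 value)
Run = Tuple[int, int, int]    # (first code point, last code point, first linear value)


def extract_ranges(mappings: Iterable[Pair],
                   min_length: int = 2) -> Tuple[List[Run], List[Pair]]:
    """Split mappings into contiguously enumerated ranges and leftovers.

    A run is a stretch in which both the code point and the linear value
    increase by exactly one from each entry to the next. Runs shorter
    than min_length stay in the explicit table (the leftovers).
    """
    ranges: List[Run] = []
    leftovers: List[Pair] = []
    run: List[Pair] = []

    def flush() -> None:
        if len(run) >= min_length:
            ranges.append((run[0][0], run[-1][0], run[0][1]))
        else:
            leftovers.extend(run)

    for pair in sorted(mappings):
        if run and pair[0] == run[-1][0] + 1 and pair[1] == run[-1][1] + 1:
            run.append(pair)
        else:
            flush()
            run = [pair]
    flush()
    return ranges, leftovers
```

Raising `min_length` yields fewer, longer ranges and a larger explicit table; lowering it does the opposite.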

The XML mapping file mentioned above contains 13 such ranges, which cover all but 31,000 mappings. A remaining table of this size is not unusual for mapping tables between Unicode and East Asian codepages. A converter using such a mapping table would first try the explicit mappings. If the result is "unassigned", it would look for a range that contains the input and map algorithmically if such a range exists; otherwise it would treat the input as unassigned. (Of course, illegal sequences must be handled, as usual, according to the application.)
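The lookup order described here might be sketched as follows. The `Range` type, the `convert` function and the numeric values used with it are hypothetical and exist only for illustration. Results are plain integers standing in for whatever the real tables store, and `None` stands for "unassigned".

```python
from bisect import bisect_right
from typing import Dict, List, NamedTuple, Optional


class Range(NamedTuple):
    first: int    # first input value covered by the range
    last: int     # last input value covered by the range
    target: int   # output value that corresponds to `first`


def convert(code: int,
            explicit: Dict[int, int],
            ranges: List[Range]) -> Optional[int]:
    """Map one input value: explicit table first, then the ranges.

    `ranges` must be sorted by `first` and must not overlap.
    """
    mapped = explicit.get(code)
    if mapped is not None:
        return mapped

    # Find the last range that starts at or before the input.
    starts = [r.first for r in ranges]
    i = bisect_right(starts, code) - 1
    if i >= 0 and code <= ranges[i].last:
        r = ranges[i]
        return r.target + (code - r.first)

    return None  # not in the table and not in any range: unassigned
```

With only a handful of ranges, a linear scan would serve equally well; the binary search is a design choice of this sketch, and a real converter would precompute `starts` once.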

Handling the one range for the supplementary Unicode code points algorithmically eliminates all non-BMP Unicode code point mappings from the actual mapping table.
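A minimal sketch of this one range is shown below. The range boundaries (U+10000 maps to the bytes 90 30 81 30 and U+10FFFF to E3 32 9A 35) are taken from the published GB 18030 mapping, not from this section. The function names are hypothetical. The "linear" value is the position of a four-byte sequence in the enumeration, computed from the ten possible second and fourth bytes and the 126 possible first and third bytes.

```python
SUPP_FIRST = 0x10000
SUPP_LAST = 0x10FFFF


def four_bytes_to_linear(seq: bytes) -> int:
    """Position of a four-byte sequence in the enumeration (81 30 81 30 is 0)."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_four_bytes(linear: int) -> bytes:
    """Inverse of four_bytes_to_linear."""
    linear, b4 = divmod(linear, 10)
    linear, b3 = divmod(linear, 126)
    b1, b2 = divmod(linear, 10)
    return bytes((0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4))


# Linear position of 90 30 81 30, the sequence for U+10000.
SUPP_LINEAR_BASE = four_bytes_to_linear(b"\x90\x30\x81\x30")


def supplementary_to_gb18030(code_point: int) -> bytes:
    """Map a supplementary code point to its four-byte sequence, without a table."""
    if not SUPP_FIRST <= code_point <= SUPP_LAST:
        raise ValueError("not a supplementary code point")
    return linear_to_four_bytes(SUPP_LINEAR_BASE + (code_point - SUPP_FIRST))


def gb18030_to_supplementary(seq: bytes) -> int:
    """Map a four-byte sequence in the supplementary range back to Unicode."""
    code_point = SUPP_FIRST + (four_bytes_to_linear(seq) - SUPP_LINEAR_BASE)
    if not SUPP_FIRST <= code_point <= SUPP_LAST:
        raise ValueError("sequence is outside the supplementary range")
    return code_point
```

For example, `supplementary_to_gb18030(0x10FFFF)` yields the bytes E3 32 9A 35, which agrees with `chr(0x10FFFF).encode("gb18030")` in Python's built-in codec.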

In principle, it is possible to handle all mappings involving four-byte sequences algorithmically by extracting all of them as contiguous ranges. Some of these will only contain a single mapping. Doing this would slow down the conversion for four-byte sequences but would allow the remaining mapping table to contain only mappings between single-byte and double-byte GB 18030 sequences and Unicode BMP code points. The remaining mapping table would contain only about 24,000 entries.


