Algorithm for mapping contiguously-enumerated mappings between GB 18030 and Unicode

The following is an example of an algorithm for mapping between GB 18030 and Unicode within a contiguously enumerated range of the mapping specification. The code snippets are pseudo-code. The algorithm can be implemented in a general way by storing the range information alongside the mapping table. Currently, however, GB 18030 is the only codepage for which this algorithm is really useful, if not necessary.

Consider the following example for a range of enumerated mappings from the XML file (this range covers all supplementary Unicode code points):

 <range uFirst="10000" uLast="10FFFF"
           bFirst="90 30 81 30" bLast="E3 32 9A 35"
           bMin="81 30 81 30" bMax="FE 39 FE 39"/>

Note that all byte and code point values in the XML file are hexadecimal.
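
As an illustration of reading such an entry, the following Python sketch parses the example element and converts its hexadecimal attributes to numbers. The helper name parse_range and the dictionary layout are assumptions made for this example; they are not part of the mapping file format or of any particular converter.

```python
import xml.etree.ElementTree as ET

# The <range> element from the example above, written on one logical line.
RANGE_XML = (
    '<range uFirst="10000" uLast="10FFFF" '
    'bFirst="90 30 81 30" bLast="E3 32 9A 35" '
    'bMin="81 30 81 30" bMax="FE 39 FE 39"/>'
)

def parse_range(xml_text):
    """Return the range attributes with hexadecimal text converted to numbers.

    parse_range is a hypothetical helper for this sketch.
    """
    elem = ET.fromstring(xml_text)

    def code_point(name):
        # Code points are single hexadecimal numbers, e.g. "10FFFF".
        return int(elem.get(name), 16)

    def byte_seq(name):
        # Byte sequences are space-separated hexadecimal bytes, e.g. "90 30 81 30".
        return bytes(int(token, 16) for token in elem.get(name).split())

    return {
        "uFirst": code_point("uFirst"),
        "uLast": code_point("uLast"),
        "bFirst": byte_seq("bFirst"),
        "bLast": byte_seq("bLast"),
        "bMin": byte_seq("bMin"),
        "bMax": byte_seq("bMax"),
    }

example_range = parse_range(RANGE_XML)
print(hex(example_range["uFirst"]), example_range["bFirst"].hex())
```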

To handle GB 18030 four-byte sequences algorithmically, one needs to linearize them, i.e., generate a number for each four-byte sequence so that the difference between two such numbers is the same as the lexical difference between the byte sequences:

    int linear(byte bytes[4]) {
        return ((bytes[0]*10+bytes[1])*126+bytes[2])*10+bytes[3];
    }

The factors 10 and 126 are the numbers of byte values in the byte positions according to bMin and bMax: 10 values 0x30..0x39 and 126 values 0x81..0xfe. The result of this function is an ordinal number that follows the lexical order of four-byte sequences.
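
The same linearization can be sketched in Python to see the property at work. The function below is a direct transcription of the pseudo-code; the three sample pairs are chosen for this illustration so that each pair is lexically adjacent and crosses a different byte boundary.

```python
def linear(b):
    """Ordinal of a four-byte sequence; only differences between ordinals matter."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Pairs of lexically adjacent four-byte sequences. In each pair, the lower
# bytes wrap from their maximum (0x39 or 0xFE) back to their minimum.
adjacent_pairs = [
    (bytes([0x81, 0x30, 0x81, 0x39]), bytes([0x81, 0x30, 0x82, 0x30])),
    (bytes([0x81, 0x30, 0xFE, 0x39]), bytes([0x81, 0x31, 0x81, 0x30])),
    (bytes([0x81, 0x39, 0xFE, 0x39]), bytes([0x82, 0x30, 0x81, 0x30])),
]

# Each difference should be exactly 1 if the ordinals follow lexical order.
differences = [linear(hi) - linear(lo) for lo, hi in adjacent_pairs]
print(differences)
```

Note that the function does not zero-base the individual bytes. This is harmless because only differences between linear values are used, and the constant offset cancels out.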

Given a linear value for a byte sequence, the byte sequence itself can be calculated:

    byte[4] unLinear(int lin) {
        byte result[4];
        lin-=linear(bMin); // zero-base the linear value; bMin is 81 30 81 30
        result[3]=0x30+lin%10;  lin/=10;
        result[2]=0x81+lin%126; lin/=126;
        result[1]=0x30+lin%10;  lin/=10;
        result[0]=0x81+lin;
        return result;
    }
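
A minimal Python sketch of the inverse, with a round-trip check, might look as follows. The constant name B_MIN and the use of divmod are choices made for this example; the arithmetic is the same as in the pseudo-code.

```python
def linear(b):
    """Ordinal of a four-byte sequence (same formula as the pseudo-code)."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Lowest possible four-byte sequence (bMin in the XML file).
B_MIN = bytes([0x81, 0x30, 0x81, 0x30])

def un_linear(lin):
    """Rebuild the four-byte sequence from its linear value."""
    lin -= linear(B_MIN)          # zero-base the linear value
    lin, b3 = divmod(lin, 10)     # 10 values 0x30..0x39
    lin, b2 = divmod(lin, 126)    # 126 values 0x81..0xFE
    b0, b1 = divmod(lin, 10)      # 10 values 0x30..0x39; the rest is the lead byte
    return bytes([0x81 + b0, 0x30 + b1, 0x81 + b2, 0x30 + b3])

# Round trip for the boundary sequences of the example range and for bMax.
samples = [
    bytes([0x90, 0x30, 0x81, 0x30]),
    bytes([0xE3, 0x32, 0x9A, 0x35]),
    bytes([0xFE, 0x39, 0xFE, 0x39]),
]
round_trip_ok = all(un_linear(linear(s)) == s for s in samples)
print(round_trip_ok)
```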

For each contiguously enumerated range, the following must be true: uLast-uFirst == linear(bLast)-linear(bFirst)
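
This condition can be checked for the example range of supplementary code points. The following Python sketch hard-codes the values from the XML example above; the variable names are chosen for this illustration.

```python
def linear(b):
    """Ordinal of a four-byte sequence (same formula as the pseudo-code)."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Values from the example <range> element (hexadecimal in the XML file).
u_first, u_last = 0x10000, 0x10FFFF
b_first = bytes([0x90, 0x30, 0x81, 0x30])
b_last = bytes([0xE3, 0x32, 0x9A, 0x35])

unicode_span = u_last - u_first
byte_span = linear(b_last) - linear(b_first)

# Both spans must be equal for the range to be contiguously enumerated.
print(unicode_span, byte_span, unicode_span == byte_span)
```

Both spans come out as 1048575 (0xFFFFF), so the range is consistent.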

Mapping from a GB 18030 four-byte sequence to a Unicode code point:

    int mapToUnicode(byte bytes[4]) {
        int lin=linear(bytes);
        for each range {
            if(linear(bFirst)<=lin<=linear(bLast)) {
                // range found
                return uFirst+(lin-linear(bFirst));
            }
        }
        // the byte sequence is not in any known range
        return error;
    }

Mapping from a Unicode code point to a GB 18030 four-byte sequence:

    byte[4] mapFromUnicode(int u) {
        for each range {
            if(uFirst<=u<=uLast) {
                // range found
                return unLinear(linear(bFirst)+(u-uFirst));
            }
        }
        // code point u is not in any known range
        return error;
    }
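
Putting the pieces together, both mapping directions can be sketched in runnable Python. This is a minimal illustration under stated assumptions, not the ICU implementation: the RANGES table contains only the example range for supplementary code points, the tuple layout is invented for this sketch, and None stands in for the pseudo-code's error result.

```python
def linear(b):
    """Ordinal of a four-byte sequence (same formula as the pseudo-code)."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

B_MIN = bytes([0x81, 0x30, 0x81, 0x30])

def un_linear(lin):
    """Rebuild the four-byte sequence from its linear value."""
    lin -= linear(B_MIN)
    lin, b3 = divmod(lin, 10)
    lin, b2 = divmod(lin, 126)
    b0, b1 = divmod(lin, 10)
    return bytes([0x81 + b0, 0x30 + b1, 0x81 + b2, 0x30 + b3])

# Hypothetical range table: (uFirst, uLast, bFirst, bLast) per entry.
# Only the example range from this section is listed here.
RANGES = [
    (0x10000, 0x10FFFF,
     bytes([0x90, 0x30, 0x81, 0x30]), bytes([0xE3, 0x32, 0x9A, 0x35])),
]

def map_to_unicode(four_bytes):
    """Return the code point for a four-byte sequence, or None if not in a range."""
    lin = linear(four_bytes)
    for u_first, _u_last, b_first, b_last in RANGES:
        if linear(b_first) <= lin <= linear(b_last):
            return u_first + (lin - linear(b_first))
    return None

def map_from_unicode(u):
    """Return the four-byte sequence for a code point, or None if not in a range."""
    for u_first, u_last, b_first, _b_last in RANGES:
        if u_first <= u <= u_last:
            return un_linear(linear(b_first) + (u - u_first))
    return None

print(map_from_unicode(0x1F600).hex())
print(hex(map_to_unicode(bytes([0xE3, 0x32, 0x9A, 0x35]))))
```

A production converter would additionally validate that each input byte lies within the bMin..bMax limits before linearizing, since linear() itself accepts any byte values.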

 

An example implementation of the techniques and algorithms discussed here can be found in ICU's ucnvmbcs.c.
