The following is an example algorithm for mapping between GB 18030 and Unicode within a contiguously enumerated range of the mapping specification. The code snippets are pseudo-code. The algorithm could be implemented in a general way by storing the range information alongside the mapping table. Currently, however, GB 18030 is the only codepage for which this algorithm is really useful, and arguably necessary.
Consider the following example for a range of enumerated mappings from the XML file (this range covers all supplementary Unicode code points):
<range uFirst="10000" uLast="10FFFF"
bFirst="90 30 81 30" bLast="E3 32 9A 35"
bMin="81 30 81 30" bMax="FE 39 FE 39"/>
Note that all byte and code point values in the XML file are hexadecimal.
In order to handle GB 18030 four-byte sequences algorithmically, one needs to linearize them, i.e., generate a number for each four-byte sequence so that the difference between two such numbers is the same as the lexical difference between the byte sequences:
int linear(byte bytes[4]) {
    return ((bytes[0]*10+bytes[1])*126+bytes[2])*10+bytes[3];
}
The factors 10 and 126 are the numbers of byte values in the byte positions according to bMin and bMax: 10 values 0x30..0x39 and 126 values 0x81..0xfe. The result of this function is an ordinal number that follows the lexical order of four-byte sequences.
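As a quick check of the linearization, here is a minimal Python sketch. Python is used only for illustration, and the function name `linear` simply mirrors the pseudo-code. It shows that byte sequences that are adjacent in lexical order receive consecutive numbers, including when a carry moves into the next byte position.

```python
def linear(b):
    """Linearize a GB 18030 four-byte sequence, following the pseudo-code."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Incrementing the last byte increases the linear value by 1.
step = linear(bytes([0x81, 0x30, 0x81, 0x31])) - linear(bytes([0x81, 0x30, 0x81, 0x30]))

# Carry into the third byte: ... 81 39 is followed by ... 82 30.
carry3 = linear(bytes([0x81, 0x30, 0x82, 0x30])) - linear(bytes([0x81, 0x30, 0x81, 0x39]))

# Carry into the second byte: 81 30 FE 39 is followed by 81 31 81 30.
carry2 = linear(bytes([0x81, 0x31, 0x81, 0x30])) - linear(bytes([0x81, 0x30, 0xFE, 0x39]))
```

All three differences are 1, which is the property the mapping algorithm relies on.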
Given a linear value for a byte sequence, the byte sequence itself can be calculated:
byte[4] unLinear(int lin) {
    byte result[4];
    lin-=linear({0x81, 0x30, 0x81, 0x30}); // zero-base the linear value
    result[3]=0x30+lin%10; lin/=10;   // fourth byte: 0x30..0x39
    result[2]=0x81+lin%126; lin/=126; // third byte: 0x81..0xfe
    result[1]=0x30+lin%10; lin/=10;   // second byte: 0x30..0x39
    result[0]=0x81+lin;               // first byte: 0x81..0xfe
    return result;
}
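The inverse can be sketched in Python in the same way. This is an illustrative sketch, not ICU's implementation; the names `unlinear` and `BASE` are assumptions introduced here, and no validation of the input value is attempted.

```python
def linear(b):
    """Linearize a GB 18030 four-byte sequence, following the pseudo-code."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Linear value of the smallest four-byte sequence, 81 30 81 30.
BASE = linear(bytes([0x81, 0x30, 0x81, 0x30]))

def unlinear(lin):
    """Recover the four-byte sequence from its linear value."""
    lin -= BASE                  # zero-base the linear value
    lin, b3 = divmod(lin, 10)    # fourth byte: 10 possible values
    lin, b2 = divmod(lin, 126)   # third byte: 126 possible values
    b0, b1 = divmod(lin, 10)     # second byte: 10 values; the rest is the first byte
    return bytes([0x81 + b0, 0x30 + b1, 0x81 + b2, 0x30 + b3])

seq = bytes([0x90, 0x30, 0x81, 0x30])
round_trip = unlinear(linear(seq))
```

Here `round_trip` equals `seq`, so the two functions are inverses of each other for valid four-byte sequences.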
For each contiguously enumerated range, the following must be true: uLast-uFirst == linear(bLast)-linear(bFirst)
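For the example range quoted above, this condition can be verified directly. The following Python sketch is for illustration only; the variable names are assumptions.

```python
def linear(b):
    """Linearize a GB 18030 four-byte sequence, following the pseudo-code."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# Values taken from the example <range> element.
u_first, u_last = 0x10000, 0x10FFFF
b_first = bytes.fromhex("90 30 81 30")
b_last = bytes.fromhex("E3 32 9A 35")

unicode_span = u_last - u_first                  # distance in code points
byte_span = linear(b_last) - linear(b_first)     # distance in linear values
```

Both spans come out as 0xFFFFF, so the range satisfies the condition.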
Mapping from a GB 18030 four-byte sequence to a Unicode code point:
int mapToUnicode(byte bytes[4]) {
    int lin=linear(bytes);
    for each range {
        if(linear(bFirst)<=lin && lin<=linear(bLast)) {
            // range found
            return uFirst+(lin-linear(bFirst));
        }
    }
    // the byte sequence is not in any known range
    return error;
}
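A minimal Python sketch of this direction follows. It assumes a hypothetical `RANGES` list that holds only the single example range quoted above; a real mapping table would contain further ranges and individually listed mappings. Returning `None` stands in for the pseudo-code's `error`.

```python
def linear(b):
    """Linearize a GB 18030 four-byte sequence, following the pseudo-code."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

# (uFirst, uLast, bFirst, bLast); only the example range is listed here.
RANGES = [
    (0x10000, 0x10FFFF, bytes.fromhex("90308130"), bytes.fromhex("E3329A35")),
]

def map_to_unicode(b):
    """Map a four-byte sequence to a code point, or None if no range matches."""
    lin = linear(b)
    for u_first, u_last, b_first, b_last in RANGES:
        if linear(b_first) <= lin <= linear(b_last):
            # range found: offset within the range is the same on both sides
            return u_first + (lin - linear(b_first))
    return None  # the byte sequence is not in any known range

first = map_to_unicode(bytes.fromhex("90308130"))
last = map_to_unicode(bytes.fromhex("E3329A35"))
outside = map_to_unicode(bytes.fromhex("81308130"))
```

With this single range, `first` is 0x10000, `last` is 0x10FFFF, and `outside` is `None` because 81 30 81 30 lies below the range.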
Mapping from a Unicode code point to a GB 18030 four-byte sequence:
byte[4] mapFromUnicode(int u) {
    for each range {
        if(uFirst<=u && u<=uLast) {
            // range found
            return unLinear(linear(bFirst)+(u-uFirst));
        }
    }
    // code point u is not in any known range
    return error;
}
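The reverse direction can be sketched in Python under the same assumptions: a hypothetical `RANGES` list containing only the example range, illustrative function names, and `None` in place of the pseudo-code's `error`.

```python
def linear(b):
    """Linearize a GB 18030 four-byte sequence, following the pseudo-code."""
    return ((b[0] * 10 + b[1]) * 126 + b[2]) * 10 + b[3]

BASE = linear(bytes([0x81, 0x30, 0x81, 0x30]))  # smallest four-byte sequence

def unlinear(lin):
    """Recover the four-byte sequence from its linear value."""
    lin -= BASE
    lin, b3 = divmod(lin, 10)
    lin, b2 = divmod(lin, 126)
    b0, b1 = divmod(lin, 10)
    return bytes([0x81 + b0, 0x30 + b1, 0x81 + b2, 0x30 + b3])

# (uFirst, uLast, bFirst, bLast); only the example range is listed here.
RANGES = [
    (0x10000, 0x10FFFF, bytes.fromhex("90308130"), bytes.fromhex("E3329A35")),
]

def map_from_unicode(u):
    """Map a code point to a four-byte sequence, or None if no range matches."""
    for u_first, u_last, b_first, b_last in RANGES:
        if u_first <= u <= u_last:
            # range found: apply the code point offset to the linear value
            return unlinear(linear(b_first) + (u - u_first))
    return None  # code point u is not in any known range

low = map_from_unicode(0x10000)
high = map_from_unicode(0x10FFFF)
bmp = map_from_unicode(0x41)
```

Here `low` is the sequence 90 30 81 30, `high` is E3 32 9A 35, and `bmp` is `None` because U+0041 is not covered by the example range.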
An example implementation of the techniques and algorithms discussed here can be found in ICU's ucnvmbcs.c.