GB 18030 has some unusual properties that present challenges for an implementation of a codepage converter as well as for in-process use:
-
It is huge: With the encoding structure as described above, there are more than 1.6 million valid byte sequences, which probably makes it the largest codepage.
-
It is similar to a UTF: All 1.1 million Unicode code points U+0000-U+10ffff except for surrogates U+d800-U+dfff map to and from GB 18030 codes. This includes unassigned and "not-a-character" code points.
-
GB 18030 is defined as much with charts of assigned characters as with a mapping table to and from Unicode.
-
The length of a byte sequence cannot always be determined from its first byte: a lead byte can begin either a two-byte or a four-byte sequence, and only the second byte distinguishes the two.
-
The four-byte sequences use trail byte values 0x30..0x39, while common ASCII-based multi-byte encodings use trail byte values of 0x40 and above. (0x30..0x39 are the ASCII code values for the decimal digits.) This means that there is an even larger overlap between single-byte values and trail-byte values, which makes random access in GB 18030 text even more difficult than in other multi-byte codepages.
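The size and sequence-length properties above can be illustrated with a minimal sketch. The byte ranges it uses are assumptions taken from the GB 18030 encoding structure, not from this section: single bytes 0x00..0x7F; two-byte sequences with lead byte 0x81..0xFE and trail byte 0x40..0x7E or 0x80..0xFE; four-byte sequences with bytes one and three in 0x81..0xFE and bytes two and four in 0x30..0x39. The function names are hypothetical and are not part of any converter API.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2, or 4: the length of the sequence starting at data[pos].

    Only the first two bytes are inspected; bytes three and four of a
    four-byte sequence are not validated here.
    """
    first = data[pos]
    if first <= 0x7F:
        return 1  # single byte: the first byte alone decides
    if not 0x81 <= first <= 0xFE:
        raise ValueError(f"invalid lead byte 0x{first:02X}")
    if pos + 1 >= len(data):
        raise ValueError("truncated sequence: second byte needed")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:
        return 4  # an ASCII digit value as second byte marks four bytes
    if 0x40 <= second <= 0xFE and second != 0x7F:
        return 2
    raise ValueError(f"invalid second byte 0x{second:02X}")


def count_valid_sequences() -> int:
    """Count all structurally valid byte sequences under the assumed ranges."""
    single = 0x7F - 0x00 + 1                          # 128
    lead = 0xFE - 0x81 + 1                            # 126
    trail2 = (0x7E - 0x40 + 1) + (0xFE - 0x80 + 1)    # 63 + 127 = 190
    digits = 0x39 - 0x30 + 1                          # 10
    two_byte = lead * trail2                          # 23,940
    four_byte = lead * digits * lead * digits         # 1,587,600
    return single + two_byte + four_byte


if __name__ == "__main__":
    print(count_valid_sequences())                         # 1611668
    print(gb18030_sequence_length(b"A"))                   # 1
    print(gb18030_sequence_length(b"\xd6\xd0"))            # 2
    print(gb18030_sequence_length(b"\x90\x30\x81\x30"))    # 4
```

Note that the four-byte example contains the byte 0x30 twice. Seen in isolation, that byte is indistinguishable from the ASCII digit "0", which is why a position chosen at random in GB 18030 text cannot be classified without scanning backward or forward.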
Continue to Suggestions for dealing with these challenges