
Globalize your On Demand Business

General structure of Unicode

The Unicode standard defines the range of valid Unicode code points as the values from 0 to 0x10FFFF, conventionally written in hexadecimal.  This range is subdivided into 17 groups of 65,536 (64K) code points each, known as planes.  Code points between 0x0000 and 0xFFFF make up the Basic Multilingual Plane (BMP), or plane 0.  The BMP contains most of the characters in common usage.  Later versions of Unicode also define characters in supplementary planes, including planes 1, 2, and 14.  When choosing a Unicode transformation format, it helps to know which planes most of your character data will reside in.
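As a rough illustration of the plane structure described above, the following Python sketch derives the plane number of a code point and tallies the characters of a string by plane. The function names `plane_of` and `count_by_plane` are hypothetical, chosen only for this example. Because each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits.

```python
def plane_of(code_point: int) -> int:
    """Return the plane (0 to 16) that contains the given code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a valid Unicode code point: {code_point:#x}")
    # Each plane spans 0x10000 (65,536) code points, so the plane
    # number is everything above the low 16 bits.
    return code_point >> 16


def count_by_plane(text: str) -> dict[int, int]:
    """Count how many characters of `text` fall in each plane."""
    counts: dict[int, int] = {}
    for ch in text:
        plane = plane_of(ord(ch))
        counts[plane] = counts.get(plane, 0) + 1
    return counts
```

Running something like `count_by_plane` over a representative sample of your data is one way to estimate how much of it lies outside the BMP before settling on a transformation format.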

Unicode has multiple encoding formats, partly because of how the Unicode standard evolved over time.  The deeper reason is the trade-off that programmers constantly face between simplicity and memory or disk consumption.  The simplest approach would appear to be representing each character as a 4-byte quantity between 0 and 0x10FFFF.  This is exactly what UTF-32 does.  It is a simple and straightforward way of representing the data.  However, the downside to using UTF-32 is that every character consumes 4 bytes in memory or on disk.  In many cases, this means that the total storage required for a given application may be as much as four times that of the historical practice of using 1 byte per character for Latin and Cyrillic based characters, and twice that of using 2 bytes per character for Asian characters (Chinese, Japanese, Korean, etc.).  This additional overhead can make the use of UTF-32 for data storage impractical when memory or disk space is at a premium.
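The storage overhead can be checked with a short Python sketch. It compares the number of bytes a string occupies in a legacy encoding with the number it occupies in UTF-32. The helper name `storage_cost` is hypothetical, and the legacy codecs used here (`latin-1`, `cp1251`, and `shift_jis`) are only assumed stand-ins for the "historical standard" encodings mentioned above. The little-endian UTF-32 codec is used so that no byte order mark is added to the count.

```python
def storage_cost(text: str, legacy_codec: str) -> tuple[int, int]:
    """Return (legacy_bytes, utf32_bytes) for the given text.

    `legacy_codec` is the name of a Python codec that stands in for a
    pre-Unicode encoding; it must be able to encode every character
    in `text`, otherwise UnicodeEncodeError is raised.
    """
    legacy_bytes = len(text.encode(legacy_codec))
    # "utf-32-le" writes exactly 4 bytes per code point, with no BOM.
    utf32_bytes = len(text.encode("utf-32-le"))
    return legacy_bytes, utf32_bytes


# Latin text: 1 byte per character in Latin-1, 4 bytes in UTF-32.
latin = storage_cost("hello", "latin-1")

# Cyrillic text: 1 byte per character in Windows-1251.
cyrillic = storage_cost("Привет", "cp1251")

# Japanese text: 2 bytes per character in Shift_JIS for these kanji.
japanese = storage_cost("日本語", "shift_jis")
```

For these samples the UTF-32 form is four times the size of the single-byte encodings and twice the size of the double-byte encoding, which matches the ratios described above.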




