
Globalize your On Demand Business

Multiple formats

UTF-16 - The Two Byte Alternative
Before the release of Unicode 3.1, all characters defined by the Unicode standard were contained in the BMP and could be represented by a hexadecimal value between 0x0000 and 0xFFFF, that is, a 2-byte quantity. Unicode also defines a mechanism by which characters outside the BMP range can be represented by a combination of two 2-byte values, known as a surrogate pair. This type of character encoding is known as UTF-16. In UTF-16, each BMP character consumes 2 bytes of storage, holding the character value as defined in Unicode, and each non-BMP character consumes 4 bytes of storage. Refer to the Unicode standard for more detailed information about how to represent non-BMP characters in UTF-16 with surrogate pairs.
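The surrogate pair mechanism can be illustrated with a short Python sketch. It follows the arithmetic given in the Unicode standard (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves). The function name to_surrogate_pair is chosen here for illustration only and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point (U+10000..U+10FFFF) into two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> low surrogate
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
    print(f"{high:04X} {low:04X}")
    # Cross-check against Python's built-in UTF-16 encoder.
    print(chr(0x1D11E).encode("utf-16-be").hex(" ").upper())
```

Both print statements should show the same four bytes, because the built-in encoder applies the same rule.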

Endianness Issues
Because UTF-32 and UTF-16 define 4-byte and 2-byte quantities, respectively, programmers may have to deal with the issue of endianness when storing and transferring this type of data. Endianness refers to the order in which a computer architecture stores the bytes of a quantity that is more than one byte long. Systems that store the most significant byte (the one with the highest place value) first in memory are referred to as big endian. Many UNIX-based systems, such as those running on SPARC or POWER processors, are big endian. Systems that store the least significant byte (the one with the lowest place value) first in memory are referred to as little endian. Most personal computers, such as Windows-based systems on Intel x86 processors, use the little endian architecture.

If you know the type of architecture on which the data will be used, you can explicitly specify the data as UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE, where BE and LE stand for big endian and little endian, respectively. Many operating systems can use these designations to perform the appropriate data conversions when these encoding names are specified.

Example: Consider the different ways to encode the letter 'A' in UTF-16 and UTF-32:

UTF-32BE (big endian)      00 00 00 41
UTF-32LE (little endian)   41 00 00 00
UTF-16BE (big endian)      00 41
UTF-16LE (little endian)   41 00
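The same byte sequences can be reproduced with Python's standard codecs, which accept the explicit BE and LE encoding names. This is only a sketch for checking the values above; the helper name encode_hex is made up for this example.

```python
def encode_hex(text, codec):
    """Encode text with the named codec and return the bytes as spaced hex."""
    return text.encode(codec).hex(" ").upper()


if __name__ == "__main__":
    for codec in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le"):
        print(f"{codec:10} {encode_hex('A', codec)}")
```

Because the endianness is named explicitly, these codecs do not write a byte order mark in front of the data.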

Determining Endianness
Unicode provides a method by which programmers can "tag" UTF-16 or UTF-32 data files to designate whether the data is big endian or little endian. This is done by placing a special character, called the byte order mark (BOM), at the beginning of the data. The Unicode value for the byte order mark is 0xFEFF. Unicode also guarantees that the "byte-swapped" version of this value (i.e. 0xFFFE) is NOT a valid character. So if you are reading a file encoded as UTF-16 but know nothing about the endianness of the data, a BOM of FE FF as the first 2 bytes denotes big endian data, while a BOM of FF FE denotes little endian data. Similarly, in UTF-32, a BOM of 00 00 FE FF designates big endian data, while a BOM of FF FE 00 00 designates little endian data.
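A BOM check along these lines might look like the following Python sketch. It uses the BOM constants from the standard codecs module; the function name detect_bom and its return convention are assumptions made for this example. The 4-byte marks are tested first because the UTF-32 little endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 little endian BOM. As a consequence, UTF-16 little endian data whose first character after the BOM is U+0000 would be reported as UTF-32, which this sketch accepts as a limitation.

```python
import codecs

# Ordered so that the longer UTF-32 marks are tested before the UTF-16 marks.
_BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
)


def detect_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    raw = b"\xff\xfe\x41\x00"               # BOM followed by 'A' in UTF-16LE
    name, skip = detect_bom(raw)
    if name is not None:
        print(name, raw[skip:].decode(name))
```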

UTF-8
The last Unicode transformation format that we will examine is UTF-8. UTF-8 defines each character as an ORDERED sequence of bytes. Depending on the Unicode value of the character, the UTF-8 representation consumes 1, 2, 3, or 4 bytes of storage. Because it is an ordered byte sequence, users of UTF-8 do not have to worry about endianness issues. If the data being represented uses primarily the Latin, Greek, Cyrillic, Hebrew or Arabic scripts, each character consumes only 1 or 2 bytes of storage, which can make UTF-8 significantly more compact than UTF-32. Other characters in the BMP use 3 bytes of storage in UTF-8, while non-BMP characters consume 4 bytes. The main disadvantage of UTF-8 is that characters vary in length, so you have to do some additional processing to read and process UTF-8 data efficiently. However, the structure of UTF-8 is such that you can tell how long a character is (i.e. 1, 2, 3, or 4 bytes) by looking at the value of its first byte. The structure of UTF-8 is represented in the following table:

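Table 1: The structure of UTF-8

Unicode value         1st byte   2nd byte   3rd byte   4th byte
0x0000 - 0x007F       0xxxxxxx
0x0080 - 0x07FF       110xxxxx   10xxxxxx
0x0800 - 0xFFFF       1110xxxx   10xxxxxx   10xxxxxx
0x10000 - 0x10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx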

In the table, the x's represent the significant bits from the Unicode value in the 1st column. By looking at the 1st byte of a UTF-8 sequence, you can therefore determine whether the character is 1, 2, 3 or 4 bytes long. Byte values 00 - 7F are 1-byte UTF-8 characters. Byte values C0 - DF are the 1st of a 2-byte UTF-8 quantity. Byte values E0 - EF are the 1st of 3, and byte values F0 - F7 are the 1st of 4. (Within these ranges, C0, C1, and F5 - F7 match the bit patterns but never appear in well-formed UTF-8, because they would encode either an overlong form or a value above 0x10FFFF.) Similarly, a byte value between 80 and BF indicates a continuation byte, that is, the 2nd, 3rd, or 4th byte of a multi-byte UTF-8 character.
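The first-byte rule can be sketched in Python as follows. The function names utf8_sequence_length and split_characters are hypothetical, and the code uses the bit-pattern ranges exactly as stated in the text. It is not a validator: it does not check that continuation bytes fall in the 80 - BF range, nor does it reject overlong or out-of-range sequences.

```python
def utf8_sequence_length(first_byte):
    """Return the length in bytes of the UTF-8 character starting with first_byte."""
    if first_byte <= 0x7F:
        return 1                          # 0xxxxxxx
    if 0xC0 <= first_byte <= 0xDF:
        return 2                          # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3                          # 1110xxxx
    if 0xF0 <= first_byte <= 0xF7:
        return 4                          # 11110xxx
    # 80 - BF is a continuation byte; F8 - FF is never a lead byte.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 character")


def split_characters(data):
    """Split UTF-8 bytes into per-character byte strings using only lead bytes."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


if __name__ == "__main__":
    # 'A', e with acute accent, euro sign, musical G clef: 1, 2, 3, and 4 bytes
    sample = "A\u00e9\u20ac\U0001d11e".encode("utf-8")
    for chunk in split_characters(sample):
        print(len(chunk), chunk.hex(" ").upper())
```

Raising an error on a continuation byte is a design choice for this sketch; a real decoder might instead resynchronize by skipping ahead to the next lead byte.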

