From the perspective of computer technology, East Asian languages are Japanese, Chinese, Korean, and related languages and dialects of East Asia. They are used in populous countries with large and growing internet usage. (See 'Online Language Population' http://www.glreach.com/globstats/index.php3 by Global Reach.) These languages share a common feature: they are written with several thousand characters in common, modern use, and many more of historical and academic interest. Compared with English and other languages that use alphabetic scripts, these large character sets present some challenges for software developers. However, modern industry standards and best practices provide solutions.
Each script also has unique features of its own. The Chinese script uses only ideographic characters, which represent words or units of meaning rather than individual sounds. Japanese uses many of the same Chinese ideographs, but also phonetic characters (with one character per syllable) for suffixes, particles and loan words. Korean is written with an alphabetic script, but the letters of each syllable are arranged to form a square glyph. Computers often handle each syllable as a unit, resulting in thousands of characters even though they are composed of only a few dozen alphabetic components. Korean newspapers also use some Chinese ideographs for proper nouns.
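The composition of Korean syllables out of alphabetic components can be observed with Unicode normalization. The following is a minimal sketch using Python's standard unicodedata module; the function name decompose_hangul and the sample syllable are chosen here for illustration only.

```python
import unicodedata

def decompose_hangul(syllable):
    """Return the alphabetic components (jamo) of a precomposed Hangul syllable."""
    # Normalization Form D splits a precomposed syllable into its conjoining jamo.
    return list(unicodedata.normalize("NFD", syllable))

syllable = "\ud55c"  # the single syllable character U+D55C
jamo = decompose_hangul(syllable)

# One syllable character is built from three alphabetic components.
for component in jamo:
    print("U+%04X %s" % (ord(component), unicodedata.name(component)))

# Normalization Form C recombines the components into the original syllable.
assert unicodedata.normalize("NFC", "".join(jamo)) == syllable
```

Whether text is stored as precomposed syllables or as separate components depends on the system; the sketch only shows that both views describe the same text.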
Text in computers
Computers store text as strings of code numbers, with one or more codes per character. On systems dedicated to one or two alphabetic scripts (such as the Latin script used in English), each character is usually represented by a single code byte (8 bits).
To handle the thousands of characters in East Asian writing systems, each character must be represented by two or more code units, by code units wider than 8 bits (usually 16 or sometimes 32 bits each), or both. Traditional East Asian codepages were designed to cover Latin characters as well as the most common characters of the script. Over time, newer standards and systems were developed to handle many more, increasingly rare characters.
Finally, the Unicode standard (http://www.unicode.org/) and the international and national standards that define the same character set (ISO 10646, GB 13000, JIS X 0221, etc.) include all characters in modern use in East Asian and most other languages, as well as a growing number of historic characters and scripts with small user communities.
Historically, character sets for alphabetic scripts were called single-byte character sets (SBCS), in contrast to early East Asian character sets which were called double-byte character sets (DBCS). These terms became obsolete with newer standards.
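The difference between one code unit per character and several can be made concrete by counting code units. The sketch below uses Python's built-in codecs; Shift_JIS stands in as one example of a traditional double-byte codepage, and the helper name code_units and the UNIT_SIZES table are illustrative assumptions, not part of any standard.

```python
# Size in bytes of one code unit for each encoding used in this sketch.
UNIT_SIZES = {
    "shift_jis": 1,   # traditional Japanese codepage: 1 or 2 bytes per character
    "utf-8": 1,       # Unicode, 8-bit code units
    "utf-16-le": 2,   # Unicode, 16-bit code units
    "utf-32-le": 4,   # Unicode, 32-bit code units
}

def code_units(text, encoding):
    """Return how many code units the given text occupies in the given encoding."""
    return len(text.encode(encoding)) // UNIT_SIZES[encoding]

latin = "A"
ideograph = "\u65e5"       # a common ideograph
rare = "\U0002000b"        # a rarely used ideograph outside the 16-bit range

for name in UNIT_SIZES:
    print(name, code_units(latin, name), code_units(ideograph, name))

# Rare characters need two 16-bit units in UTF-16 but still one 32-bit unit.
print(code_units(rare, "utf-16-le"), code_units(rare, "utf-32-le"))
```

The Latin letter takes one code unit everywhere, while the common ideograph takes two bytes in Shift_JIS, three in UTF-8, and a single unit in UTF-16 and UTF-32.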
For a more thorough discussion of codepages, see "A brief introduction to code pages and Unicode" (http://oss.software.ibm.com/icu/docs/papers/codepages_and_unicode.html). See also Ken Lunde's book "CJKV Information Processing," published by O'Reilly.
Keyboard input and display
Keyboards for alphabetic languages require only one or two shift keys and sometimes a simple 'dead-key' mechanism for diacritics. This obviously does not work for selecting among thousands of characters. For East Asian language keyboard input, computer systems provide so-called Input Method Editors (IMEs). An IME shows a selection of characters in a small window while the user types phonetic syllables or special codes on a regular-size keyboard. Ambiguous input is sometimes resolved by having the user select from a list of final character choices. Only then are one or more characters sent to the application.
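The candidate-selection step can be sketched as a simple table lookup. This is a toy illustration only: the table contents and the names CANDIDATES, lookup_candidates and commit are hypothetical, and a real IME relies on large dictionaries and operating system services rather than a hand-written table.

```python
# Hypothetical toy table: typed phonetic input -> candidate character strings.
CANDIDATES = {
    "nihon": ["\u65e5\u672c", "\u4e8c\u672c"],
    "hon": ["\u672c", "\u307b\u3093"],
}

def lookup_candidates(reading, table=CANDIDATES):
    """Return the list of candidates an IME window might show for the input."""
    return table.get(reading, [])

def commit(reading, choice_index, table=CANDIDATES):
    """Return the text sent to the application once the user has chosen."""
    candidates = lookup_candidates(reading, table)
    if not candidates:
        # No conversion known: pass the typed input through unchanged.
        return reading
    return candidates[choice_index]

print(lookup_candidates("nihon"))  # ambiguous input: two candidates
print(commit("nihon", 0))          # the user picks the first candidate
```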
The display and printing of East Asian text is mostly straightforward. The characters are selected one by one from large fonts, and they do not interact typographically. Text is displayed either in horizontal rows, as in English, or in vertical columns (top-down, with columns progressing from right to left). A further complication is the annotation of ideographic text with a phonetic pronunciation guide, which is printed in a smaller font parallel to the main text.
Software developers rarely need to deal with these issues directly because they are handled by the operating system or other runtime environment (e.g., Java).
Text boundaries
East Asian languages such as Chinese and Japanese are written without spaces between words. This means that word selection, line breaking and similar operations require special algorithms that analyze the text. Such algorithms work best with language-specific dictionaries and by taking grammatical rules into account. While simple heuristic algorithms are readily available in various libraries (such as the Java and ICU BreakIterator at http://oss.software.ibm.com/icu/), more sophisticated ones require substantial development and are only available commercially.
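One simple dictionary-based heuristic is greedy longest matching: at each position, take the longest dictionary word that starts there. The sketch below illustrates that idea only; it is not the algorithm used by the Java or ICU BreakIterator, and the function name segment, the max_word_len parameter and the five-word toy dictionary are assumptions made for the example.

```python
def segment(text, dictionary, max_word_len=4):
    """Split text into words using greedy longest match against a dictionary."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback,
            # so unknown text is still consumed one character at a time.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary for illustration; real dictionaries hold many thousands of words.
TOY_DICTIONARY = {"我们", "喜欢", "北京", "大学", "北京大学"}

print(segment("我们喜欢北京大学", TOY_DICTIONARY))
```

With this toy dictionary the sample sentence is split into three words, with the longer entry preferred over its two shorter parts. Greedy matching fails on genuinely ambiguous text, which is one reason the more sophisticated algorithms also take grammar into account.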