JavaScript String objects also store 16-bit Unicode strings (which do not have
to be well-formed UTF-16). There is a small set of standard operations for them,
such as concatenation and lowercasing. Single characters are simply represented
as short strings, or occasionally as integers with their Unicode code points
as values.
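
As a small illustration of the paragraph above, the following sketch uses only the standard String methods `charAt`, `charCodeAt`, `String.fromCharCode`, and `toLowerCase`. The variable names are chosen for this example only.

```javascript
// A string is a sequence of 16-bit code units; length counts units.
var s = "A\u00E9\u4E2D";              // three BMP characters: A, e-acute, a CJK ideograph
var single = s.charAt(1);             // a "character" is just a one-unit string
var code = s.charCodeAt(1);           // ...or an integer code unit value (0xE9)
var back = String.fromCharCode(code); // integer back to a one-unit string

// Standard operations: concatenation and lowercasing.
var joined = s + "!";
var lower = "\u00C0B".toLowerCase();  // "\u00E0b"

// Strings do not have to be well-formed UTF-16: a lone surrogate is accepted.
var lone = "\uD800";
var loneLength = lone.length;         // 1
```

Note that `length` and all indexes count code units, so they match character counts only as long as the string contains no surrogate pairs.
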
Most "interesting" string manipulations are done with regular expressions,
which are a basic feature of JavaScript. They support prefix/suffix tests,
complicated pattern matches, search and replace, and tests against certain
classes of characters. However, Unicode is supported only at the most basic
level. Outside of the ASCII range, there are no predefined character classes,
so that a script has to define expressions with explicitly listed ranges of
characters. For example, there are more than a thousand uppercase characters
in Unicode; if a script needs to find them in a string, then it needs to define
a regular expression with a character class that lists all of their ranges.
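
The sketch below shows what such an explicitly listed character class can look like. It is deliberately incomplete: it covers only the ASCII, Latin-1, basic Greek, and basic Cyrillic capitals, whereas a full list of uppercase characters would be far longer. The names `UPPER_SUBSET` and `findUppercaseSubset` are hypothetical and exist only for this example.

```javascript
// Hand-written ranges; an assumption-laden subset, not the full set of
// Unicode uppercase characters.
var UPPER_SUBSET = new RegExp(
  "[A-Z" +                        // ASCII capitals
  "\u00C0-\u00D6\u00D8-\u00DE" +  // Latin-1 capitals (skips the multiplication sign U+00D7)
  "\u0391-\u03A1\u03A3-\u03A9" +  // basic Greek capitals (U+03A2 is unassigned)
  "\u0410-\u042F" +               // basic Cyrillic capitals
  "]", "g");

// Returns every match as a one-unit string, or an empty array if none.
function findUppercaseSubset(text) {
  var found = text.match(UPPER_SUBSET);
  return found === null ? [] : found;
}
```

For instance, `findUppercaseSubset("Hello \u00C9t\u00E9")` would return the two strings `"H"` and `"\u00C9"`. Any uppercase character outside the listed ranges is silently missed, which is the maintenance burden described above.
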
The ECMAScript language is somewhat limited in that the specification is written
entirely in terms of 16-bit Unicode code units. Supplementary characters (those
with code point values above 0xffff) are represented in UTF-16 with pairs of
special "surrogate" code units (or with pairs of \uxxxx escapes in a script),
and can be used in strings, but not in identifiers or
in a meaningful way in regular expressions. Historically, this is similar to
other early Unicode implementations because supplementary characters were not
assigned until Unicode 3.1 (and the current edition 3 of ECMAScript predates
that). Only a small minority of texts requires any supplementary characters,
but some of the 45,000 supplementary Chinese characters are increasingly in
demand. A script has to use custom functions to handle them.
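
A minimal sketch of such custom functions follows, assuming only the standard `charCodeAt` and `String.fromCharCode` methods. The helper names `codePointAtIndex`, `stringFromCodePoint`, and `countCodePoints` are hypothetical, and the sketch treats an unpaired surrogate as a code point of its own rather than as an error.

```javascript
// Returns the code point that starts at the given code unit index.
function codePointAtIndex(str, index) {
  var lead = str.charCodeAt(index);
  if (lead >= 0xD800 && lead <= 0xDBFF && index + 1 < str.length) {
    var trail = str.charCodeAt(index + 1);
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      // Combine a lead/trail surrogate pair into one supplementary code point.
      return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
    }
  }
  return lead; // BMP character or unpaired surrogate
}

// Builds a one- or two-unit string from a code point (0..0x10FFFF assumed).
function stringFromCodePoint(cp) {
  if (cp <= 0xFFFF) {
    return String.fromCharCode(cp);
  }
  var offset = cp - 0x10000;
  return String.fromCharCode(0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF));
}

// Counts code points instead of code units.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    if (codePointAtIndex(str, i) > 0xFFFF) {
      i++; // skip the trail surrogate of a pair
    }
    count++;
  }
  return count;
}
```

For example, the supplementary Chinese character U+20000 is stored as the pair `"\uD840\uDC00"`: its `length` is 2, while `countCodePoints` reports 1 and `codePointAtIndex("\uD840\uDC00", 0)` yields `0x20000`.
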