Creating a good character map is an important part of a migration. It builds a complete understanding of the intricacies of the source and target encodings. A well-designed character map ensures lossless conversion by establishing an unambiguous correspondence between each character in the source encoding and its equivalent in the target encoding.
Before designing the character map, it is important to understand the script(s) involved and the encoding models of both the non-Unicode encoding and Unicode. For example, in the Unicode encoding model for Indic scripts, conjuncts are formed using the modifier character ‘virama’ (or ‘halant’). The virama may be subsumed visually within the final conjunct as represented in Unicode, and may not be encoded separately in a non-Unicode encoding.
Other encodings may have similar intricacies that should be well understood before the character map is created. Without knowing these encoding principles, it is easy to make mistakes that compromise the fidelity of data conversion to Unicode.
For a good character map, the non-Unicode and Unicode encodings should satisfy the following considerations, which should also be kept in mind while designing the mapping:
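As a small illustration of the Unicode side of this model, the following sketch lists the code points stored for the conjunct ‘ksha’. It uses only Python's standard `unicodedata` module; the helper name `code_point_names` is introduced here for illustration.

```python
import unicodedata

# The conjunct 'ksha' is stored as three code points in Unicode,
# even though it is normally displayed as a single shape.
ksha = "\u0915\u094D\u0937"


def code_point_names(text):
    """Return (hex code point, Unicode character name) for each code point."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in code_point_names(ksha):
    print(code_point, name)
```

The middle entry is the virama, which a font-based non-Unicode encoding may never store as a separate code point.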
i. Mappability:
- Conversion of data from one encoding to another requires that the recipient code-set of the information have the necessary characters (associated with each code-point in the source) to represent the data. This means that Unicode must have the mechanism to represent the source-data according to its encoding-principles. If this is not so, then character data in the source may not be representable and data may be lost.
- A ‘legitimate’ combination of characters is one in which a particular sequence of code points in the source encoding generates the equivalent recognizable character (or a series of them) in Unicode. The ‘legitimacy’ of the combination, in the source or in Unicode, depends on the encoding principles for that particular character set (see Chapter 2, ‘General Structure’, of the Unicode Standard).
- (If you have any concerns about whether Unicode encodes all characters in a script, you can search for your character here or join the Unicode mailing list and clarify your concerns. Errors in the standard can also be reported for consideration.)
- Establishing mappability ensures that there will be no loss of data and all possible and legitimate character formations will be converted to Unicode.
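A coarse first check of mappability can be automated. The sketch below assumes a single-byte source encoding and a mapping table held as a Python dictionary from source byte sequences to Unicode strings. Both the function name `uncovered_bytes` and the two-entry table are hypothetical. It reports byte values that occur in the data but appear in no mapping entry; it does not verify that the bytes occur in legitimate combinations.

```python
def uncovered_bytes(data, mapping):
    """Return the sorted byte values in data that no mapping key contains."""
    known = {byte for key in mapping for byte in key}
    return sorted(set(data) - known)


# Hypothetical, deliberately incomplete table: source bytes -> Unicode string.
MAPPING = {
    b"\xD6": "\u0915",
    b"\xD7\xD1": "\u0916",
}

sample = b"\xD6\xD7\xD1\xFF"
print([f"{value:02X}" for value in uncovered_bytes(sample, MAPPING)])
```

Any value reported here indicates data that would be lost unless the table is extended.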
ii. Duplicate representations:
- Ideally, any single character in a script should be represented using either a single code point or a unique combination of code-points and no other. Ambiguity arises when a character can be represented in multiple ways – making the conversion process complicated and possibly error-prone.
- Duplicate representations should either be mapped to the same character (or the same set of characters) in Unicode, or be standardized within the source encoding. Standardization may be done either before the migration (preferable) or during it. If done before the conversion, it requires a separate conversion run within the source encoding, from multiple representations to a single representation of each character. If done during the conversion to Unicode, duplicate combinations of source code points are eliminated by mapping them to the single equivalent code point (or sequence) in Unicode. The mapping table should capture all these duplications. The process of removing duplicate representations within Unicode itself is called ‘Normalization’.
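The two ideas in this item can be sketched as follows. The table is hypothetical (in particular, the byte `C5` is an invented duplicate), and the second half uses Python's standard `unicodedata.normalize` to show normalization on the Unicode side, where the letter ‘qa’ can be written either as one code point or as ‘KA’ plus nukta.

```python
import unicodedata

# Hypothetical source encoding in which two byte sequences denote the
# same letter; both entries are mapped to the same Unicode code point.
MAPPING = {
    b"\xD7\xD1": "\u0916",  # half form + danda
    b"\xC5": "\u0916",      # assumed precomposed duplicate (invented)
}


def same_after_normalization(a, b, form="NFC"):
    """True if two Unicode strings are equal once normalized."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


qa_single = "\u0958"          # DEVANAGARI LETTER QA
qa_sequence = "\u0915\u093C"  # KA + NUKTA, canonically equivalent
print(qa_single == qa_sequence)
print(same_after_normalization(qa_single, qa_sequence))
```

Normalizing the target column of the mapping table in one form keeps the converted data free of such duplicates.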
iii. Bidirectional scripts:
- In the case of bidirectional data, i.e. text that includes right-to-left scripts such as Hebrew and Arabic, it is important to ensure that the data is mapped in logical order and not visual order. How right-to-left text is displayed should be considered separately from how it is stored, so that the converted data is suitable for standard converters.
- Some encodings may also contain shaped characters used in Arabic. It is important to identify the right character and retain the shaping information to preserve the encoded character.
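For shaped Arabic characters that Unicode itself encodes as presentation forms, the underlying character and its shaping tag can be read from the Unicode Character Database. The sketch below uses the standard `unicodedata.decomposition` function; the helper name `split_presentation_form` is hypothetical, and a shaped character in a legacy encoding would still need its own entry in the mapping table.

```python
import unicodedata


def split_presentation_form(ch):
    """Return (base characters, shaping tag) for a presentation form.

    Characters without a tagged compatibility decomposition are
    returned unchanged with a tag of None.
    """
    decomposition = unicodedata.decomposition(ch)
    if not decomposition.startswith("<"):
        return ch, None
    tag, *code_points = decomposition.split()
    base = "".join(chr(int(cp, 16)) for cp in code_points)
    return base, tag.strip("<>")


# U+FE8E is ARABIC LETTER ALEF FINAL FORM.
base, shaping = split_presentation_form("\uFE8E")
print(f"{ord(base):04X}", shaping)
```

Recording the tag alongside the base character is one way to retain the shaping information while storing the text in logical order.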
If the above criteria are met, then such a well-designed character-map will satisfy these conditions:
- There will be no loss of data during the conversion.
- There will be no duplicate representations for the same characters (or sets of them).
- Bi-directional data will be represented logically (preserving any accompanying layout information) so that the appropriate layout engines may perform layout correctly.
An example of a Character Map
The form of the character map is a table of corresponding characters in these encodings:

Mapping name:

| Number | Source | Target |
| --- | --- | --- |
| | Non-Unicode characters and code points | Unicode characters and code points |
A map like the above should be created for all legitimate combinations that are intended for use by the source encoding.
For example:
Mapping name: nonUnicodeDevanagari-Unicode

| Number | Non-Unicode code points | Non-Unicode description | Unicode code points | Unicode description |
| --- | --- | --- | --- | --- |
| 1 | 00D6 | Devanagari ‘ka’ | 0915 | Devanagari ‘KA’ |
| 2 | 00D7 + 00D1 | Half formed Devanagari ‘kha’ + Multi-use ‘danda’ | 0916 | Fully formed Devanagari ‘KHA’ |
| 3 | 00FA + 00D1 | Half formed Devanagari ‘khra’ + Multi-use ‘danda’ | 0916 + 094D + 0930 | Fully formed Devanagari ‘KHA’ + ‘HALANT’ (‘VIRAMA’) + Fully formed Devanagari ‘RA’ |
| 4 | 00A3 | Fully formed Devanagari ‘ddddha’ | 0926 + 094D + 0927 | Fully formed Devanagari ‘DA’ + ‘HALANT’ (‘VIRAMA’) + Fully formed Devanagari ‘DHA’ |
| 5 | 00F0 + 00D1 | Half formed Devanagari ‘ksha’ + Multi-use ‘danda’ | 0915 + 094D + 0937 | Fully formed Devanagari ‘KA’ + ‘HALANT’ (‘VIRAMA’) + Fully formed Devanagari ‘SSA’ |
| 6 | 00B7 | Fully formed Devanagari ‘tttta’ | 091F + 094D + 091F | Fully formed Devanagari ‘TTA’ + ‘HALANT’ (‘VIRAMA’) + Fully formed Devanagari ‘TTA’ |
| 7 | 00D2 + 00AA + 00D1 | Full ‘i’ Matra + Half formed Devanagari ‘pa’ + Multi-use ‘danda’ | 092A + 093F | Fully formed Devanagari ‘PA’ + Devanagari ‘I’ matra |
| 8 | 00AA + 00D1 + 00D1 | Half formed Devanagari ‘pa’ + Multi-use ‘danda’ + Multi-use ‘danda’ | 092A + 093E | Fully formed Devanagari ‘PA’ + Devanagari ‘AA’ matra |
As shown above, there may not be a logical or linguistic correspondence between the individual coded characters in the non-Unicode encoding and those in Unicode. The map must be constructed by visual inspection, by someone who understands the script, the character set, and the two encoding models.
The complete table serves as the basis for writing the code for conversion.
If the source encoding is rule-based, so that it is possible to identify the basis on which characters are represented by specific series of code points, then a rule-based converter from it to Unicode can be written.
To learn more about mapping considerations, read ‘Character Mapping Tables’ in the ICU User Guide.
Based on these considerations, the mapping table will contain pairings of all legitimate individual code points or code point sequences between the two encodings. This will allow the encoding-converters to run through streams of legitimate code point sequences and convert them.
See Text conversion From TSCII 1.7 to Unicode for another example of the character map.
Writing the converter
A converter works by decoding the input data into code points (or sequences of code points) according to the rules of the source encoding, matching these against the entries in the source column of the mapping table, and writing out the equivalent code point sequences given in the target column, following the rules of the target encoding.

Figure 1: Flow of data through the converter.
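The lookup procedure described above can be sketched in a few lines. This is a minimal illustration rather than a production converter. It assumes the source is a single-byte encoding, so that each source code point in the example table (such as 00D7) is one byte of input, and it resolves overlapping entries by trying the longest source sequence first. The names `MAPPING` and `convert` are introduced here for illustration.

```python
# Entries 1 to 8 of the example character map: source bytes -> Unicode string.
MAPPING = {
    b"\xD6": "\u0915",
    b"\xD7\xD1": "\u0916",
    b"\xFA\xD1": "\u0916\u094D\u0930",
    b"\xA3": "\u0926\u094D\u0927",
    b"\xF0\xD1": "\u0915\u094D\u0937",
    b"\xB7": "\u091F\u094D\u091F",
    b"\xD2\xAA\xD1": "\u092A\u093F",
    b"\xAA\xD1\xD1": "\u092A\u093E",
}


def convert(data, mapping=MAPPING):
    """Convert source bytes to a Unicode string, longest match first."""
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate sequence first, then shorter ones.
        for length in range(min(max_len, len(data) - i), 0, -1):
            target = mapping.get(data[i:i + length])
            if target is not None:
                out.append(target)
                i += length
                break
        else:
            # No entry matched: fail loudly rather than lose data silently.
            raise ValueError(f"no mapping for byte 0x{data[i]:02X} at offset {i}")
    return "".join(out)


print(convert(b"\xD6\xD2\xAA\xD1"))
```

Raising an error on unmapped input is a design choice made for this sketch; a real migration might instead log the offset and substitute a placeholder for later review.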
The Unicode sequence can be converted to any desired encoding form of Unicode, such as UTF-8. Writing these converters is relatively simple. You could also use one of many converters (see ‘Internationalization Libraries’), some of which are open source, to perform conversions.
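As a brief illustration of encoding forms, the sketch below serializes the single code point U+0915 (Devanagari ‘KA’) with Python's built-in codecs. The big-endian variants are chosen here only so that no byte order mark is added.

```python
text = "\u0915"  # Devanagari KA

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# The same code point occupies 3, 2 and 4 bytes respectively.
print(utf8.hex(), utf16.hex(), utf32.hex())
```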
The steps are:
- Design and write the code (in a programming language), including the definition of the format for the mapping table.
- Transform the mapping table into the format that your converter will use.
- Compile the code.
- Use the compiled code to migrate the data to Unicode.
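The second step, transforming the mapping table into the converter's format, might look like the sketch below. The text format is an assumption made for this example: one entry per line, with source and target code points in hexadecimal, joined by `+` and separated by a tab, and `#` starting a comment line. The function name `parse_mapping_table` is hypothetical, and a single-byte source encoding is assumed.

```python
def parse_mapping_table(lines):
    """Build {source bytes: target string} from 'SOURCE<TAB>TARGET' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split("\t")
        src = bytes(int(cp.strip(), 16) for cp in source.split("+"))
        tgt = "".join(chr(int(cp.strip(), 16)) for cp in target.split("+"))
        if src in mapping and mapping[src] != tgt:
            raise ValueError(f"conflicting entries for source {source!r}")
        mapping[src] = tgt
    return mapping


table_text = [
    "# source\ttarget",
    "00D6\t0915",
    "00D7 + 00D1\t0916",
    "00F0 + 00D1\t0915 + 094D + 0937",
]
table = parse_mapping_table(table_text)
print(len(table), "entries loaded")
```

Rejecting conflicting entries at load time catches one class of mapping-table errors before any data is converted.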
It is possible to write a generic converter and define the format of its mapping tables. With this in place, you can quickly add further non-Unicode encodings to the generic converter. The following section describes one such generic converter.