
Globalize your On Demand Business

Data migration: preconditions and types

Preconditions for data migration
Before an application can handle multilingual data, it must be globalized. Designing and building a global application follows from a strategic decision to extend the reach of an on demand business. It is a separate task from data migration, with its own set of considerations.
A globalized application can be built either from scratch or by retrofitting an existing application. To learn more, read: Developing global Web applications, Developing multilingual Web sites, A new paradigm for creating and globalizing Web applications, A J2EE-based Localization Services Architecture and Retrofit Globalization onto a Web Application.

Kinds of migratable data
Migratable data falls into three kinds, depending on how reliably the non-Unicode encoding used to represent it is identified:

1.      Data that is identified with a reliable encoding identifier:

  • This is data that exists in encodings designed for a specific language or group of languages. The encoding definitions and associated identifiers are clear and unambiguous. Approaches for using data in these encodings/codepages are well known.
  • The identifiers and their definitions are typically registered with IANA or other well-known registries.
  • Such encodings work well for an individual script or a group of scripts but are not universal. They are therefore subject to the problems enumerated earlier and should be migrated to Unicode if the data is to be used in a globalized application.
  • There are many converters for encodings of this type. See ICU or the Java™ site. Converters and mapping tables for some of these encodings may also be found on the Unicode web site.
  • There are also ways for this data to co-exist with Unicode data in a globalized application so that a phased approach may be taken for moving to Unicode over a period of time.
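As a minimal sketch of migrating this first kind of data, the example below uses Python's built-in codecs rather than ICU or Java. The function name `to_unicode` and the Greek sample bytes are illustrative assumptions, not part of the original article.

```python
def to_unicode(data: bytes, charset: str) -> str:
    """Decode legacy bytes using their declared, registered charset name."""
    # strict error handling: fail loudly instead of silently dropping bytes
    return data.decode(charset, errors="strict")


# Illustrative sample: the Greek word "kalimera" encoded in ISO-8859-7.
greek_bytes = b"\xea\xe1\xeb\xe7\xec\xdd\xf1\xe1"

text = to_unicode(greek_bytes, "iso-8859-7")

# Once the text is Unicode, it can be stored in any Unicode encoding form.
utf8_bytes = text.encode("utf-8")
```

Because the identifier is reliable, the conversion is a single lookup of the converter by name. The same call works for any charset the runtime ships a converter for.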

2.      Data that has an encoding identifier, but an unreliable one:

  • This situation arises when the data does not strictly follow the definition of the identified encoding, so the definition behind the identifier is interpreted ambiguously.
  • It occurs when the designers do not have a clear idea of the full scope of the encoding. Sometimes special characters are inserted to represent special glyphs in the character set, but the wrong identifier is applied.
  • Converters for such data are less reliable because the tagging of the source data is inherently unreliable.
  • Mis-tagging is inherently problematic, and no generic converter can be applied. Special care must be taken with encodings of this kind. However, once the correct encoding is identified, either by inspecting the data or from the context of the application, one of the converters listed above may be used for migration.
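The inspection step can be sketched as follows. This is only a heuristic, under the assumption that the data is tagged ISO-8859-1 but was actually produced as windows-1252, a common mis-tagging. The helper `looks_mistagged` and the sample bytes are hypothetical.

```python
def looks_mistagged(data: bytes, declared: str) -> bool:
    """Heuristic check: flag data whose declared charset looks wrong.

    Data is suspicious if it cannot be decoded at all, or if decoding
    yields C1 control characters (U+0080..U+009F), which almost never
    occur in real text.
    """
    try:
        text = data.decode(declared)
    except (UnicodeDecodeError, LookupError):
        return True
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


# Hypothetical sample: curly quotes written as windows-1252 bytes,
# but delivered with an ISO-8859-1 tag.
sample = b"\x93Hello\x94"
declared = "iso-8859-1"

if looks_mistagged(sample, declared):
    # The real encoding was established by inspection or application context.
    text = sample.decode("cp1252")
else:
    text = sample.decode(declared)
```

The check only raises a flag. Choosing the replacement encoding still depends on inspecting the data or knowing the application context.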

3.      Data in which the encoding is not registered:

  • One instance of this type is data used in systems where the processing assumptions of the system dictate the encoding and the data usually never leaves the closed environment. Outside that environment the data could be meaningless. If the encoding used has one of the registered identifiers, such data can be treated as type 1 described above.
  • Another instance is data used to display characters on a platform that does not support the character set; the data masquerades as one of the supported encodings so that some limited stand-alone processing can be carried out. It may be transmitted with font resources that form glyphs in some script when used with select combinations of the underlying encoding scheme.

Unless the font map for that particular encoded character set is available, users cannot view the data. If the user chooses not to download the font resources, or if they are not available locally, only garbled text is visible. Newer dynamic font loading techniques can reduce the need for manual intervention.

For example, the Hindi content in the Devanagari script on this site cannot be viewed unless fonts are downloaded for viewing. A similar site in Hindi implemented using Unicode does not require fonts to be downloaded on operating systems with Unicode support.

Because of the nature of these encodings, there are no generic converters for them. Vendors of these encodings, however, may provide the appropriate converters.

  • To be used in a globalized application, data in encodings with no registered identifiers must be migrated to Unicode in a one-time exercise.
  • If there is a large glyph set, the data may contain font switching to represent all the glyphs needed. Such data consists of multiple pieces, each with its own font selection. It should be separated into parts with a single encoding or font before migration, and reassembled after the migration.
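The separate, convert and reassemble approach for font-switched data can be sketched as below. It assumes the data has already been split into runs of `(font name, bytes)`. The font names `HypoFontA` and `HypoFontB`, their lookup tables, and the function `convert_runs` are all hypothetical.

```python
# Hypothetical per-font lookup tables: legacy byte value -> Unicode string.
# Note that the same byte (0x41) means different things in each font.
FONT_MAPS = {
    "HypoFontA": {0x41: "\u0915", 0x42: "\u092e"},  # KA, MA
    "HypoFontB": {0x41: "\u0932"},                  # LA
}


def convert_runs(runs, font_maps, fallback="\ufffd"):
    """Convert each single-font run separately, then reassemble in order."""
    parts = []
    for font, data in runs:
        table = font_maps[font]  # KeyError signals a font with no known table
        parts.append("".join(table.get(b, fallback) for b in data))
    return "".join(parts)


# A text whose glyphs are spread over two fonts.
runs = [("HypoFontA", b"\x41\x42"), ("HypoFontB", b"\x41")]
result = convert_runs(runs, FONT_MAPS)
```

Keeping the runs in their original order is what makes the reassembly step trivial. Unmapped bytes become U+FFFD so that gaps in a table stay visible in the output.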

In the rest of the article, we explore only the conversion of the third kind of data above, i.e. data with no registered encoding identifiers. We also describe the steps needed to create a converter.

To convert data in these encodings to Unicode, the following needs to be done:

a. Identify the source characters in non-Unicode encodings
b. Find their Unicode equivalents, and
c. Devise a way to convert the data from non-Unicode to Unicode.

A structure called a ‘Character Map’ provides the correspondence between the characters in the non-Unicode encoding and their Unicode equivalents. The map is then used to convert the data programmatically, using either existing tools or custom-made ones.
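A minimal sketch of such a map and a custom converter built on it follows. The legacy byte values in `CHAR_MAP` are invented for illustration and do not describe any real font encoding; the function name `convert` is likewise an assumption.

```python
# Hypothetical character map: legacy byte value -> Unicode equivalent.
CHAR_MAP = {
    0x20: " ",
    0x41: "\u0915",  # DEVANAGARI LETTER KA
    0x42: "\u092e",  # DEVANAGARI LETTER MA
    0x43: "\u0932",  # DEVANAGARI LETTER LA
    0x61: "\u093e",  # DEVANAGARI VOWEL SIGN AA
}


def convert(data: bytes, char_map, fallback="\ufffd") -> str:
    """Convert legacy bytes to Unicode by looking each byte up in the map.

    Bytes with no entry are replaced by U+FFFD so that gaps in the map
    remain visible rather than being silently dropped.
    """
    return "".join(char_map.get(b, fallback) for b in data)


word = convert(b"\x41\x42\x43", CHAR_MAP)
```

This one-byte-to-one-string lookup is the simplest possible case. A real legacy encoding may also need multi-byte sequences or reordering of characters, which this sketch does not attempt.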

NOTE: Encoding data in Unicode does not remove the need for resources that support text presentation. For example, the resources for rendering the different scripts (such as fonts and appropriate layout logic) must be present on the platform for the data to be displayed correctly.



Continue to Character Maps

