Data processing in globalized applications
Data in a globalized application is multi-lingual by nature, which gives rise to additional data-processing tasks. Tools may be needed to automate tasks that would otherwise require intensive manual effort. For example:
- Invoices could be in different scripts – yet they would need to be read by people within and outside the business who probably cannot read more than two scripts. Transliteration tools could be used to convert specific data like proper nouns (names of products, customers, addresses, etc.) from different scripts into one.
- Free-flowing text that makes up the content of web pages may need to be translated into different languages so that it can be reproduced in the appropriate context. Machine translation simplifies the translation of text between languages.
- Structured and unstructured text received as user feedback may need to be analyzed to generate business intelligence. Software needs advanced text-processing and data-mining capabilities to enable this.
Two important attributes of multi-lingual data are its script and its language. To support the data-processing and automation goals listed above, the script of the data must be identified so that the correct script-specific function is applied to it.
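As a rough illustration of script identification, the sketch below guesses the dominant script of a string from Unicode character names. This is only a heuristic under stated assumptions: the function name `guess_script` is hypothetical, and the first word of a character name (for example "CYRILLIC") is used as a stand-in for the script, which is not the same as the formal Unicode Script property.

```python
import unicodedata
from collections import Counter

def guess_script(text):
    """Return the most common script-like name prefix among the letters in text.

    Heuristic only: the first word of the Unicode character name is used as an
    approximation of the script. Returns None if the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1  # e.g. "LATIN", "CYRILLIC", "DEVANAGARI"
    return counts.most_common(1)[0][0] if counts else None

for sample in ("Invoice", "Счёт", "चालान"):
    print(sample, "->", guess_script(sample))
```

A production system would more likely rely on a library that exposes the real Script property, but the idea is the same: identify the script first, then route the data to the matching transliteration or analysis step.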
The manner in which the characters of a script are represented and assigned numeric values (or code-points) is called an encoding. Knowing the encoding is fundamental to parsing textual data correctly during processing.
Over the years, several encodings have been developed and used in computers. Read ‘An overview of Coded character sets’ to understand how different character-sets for the scripts of the world are encoded and identified.
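To make the role of the encoding concrete, the following minimal sketch shows one character, its numeric value, and the different byte sequences it becomes under different encodings. The variable names are illustrative only; the codec names are those shipped with Python's standard library.

```python
text = "é"

# The numeric value (code-point) assigned to the character.
code_point = ord(text)  # 0xE9

# The same character serialized under three different encodings.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-be")

# Parsing bytes with the wrong encoding silently yields the wrong text.
misread = as_utf8.decode("latin-1")

print(hex(code_point), as_latin1, as_utf8, as_utf16, misread)
```

The last line is the important one: the bytes are intact, but without the correct encoding attribute they are parsed into different characters.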
Each encoding scheme specifies the set of valid numbers, or the code space, to which characters can be assigned. Code spaces are often documented as grids of 16 rows and as many columns as needed to represent the script. Each cell in the grid corresponds to one number, and a character is assigned to that cell. The resulting structure is called a Coded Character Set or a Code Page. Each Coded Character Set is typically entered in a registry and assigned a unique identifier, for example the Code Page number.
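The grid layout described above can be sketched for a single-byte code page as follows. The helper `code_page_grid` is a hypothetical name, and the choice of low nibble for the row and high nibble for the column is an assumption based on the common layout of published code charts.

```python
def code_page_grid(encoding, first_col=0x2, last_col=0x7):
    """Lay out a single-byte code page as 16 rows by (last_col - first_col + 1) columns.

    Row index = low 4 bits of the byte, column = high 4 bits.
    Cells with no assigned character are returned as None.
    """
    rows = []
    for low in range(16):
        cells = []
        for high in range(first_col, last_col + 1):
            byte = bytes([(high << 4) | low])
            try:
                cells.append(byte.decode(encoding))
            except UnicodeDecodeError:
                cells.append(None)  # unassigned cell in this code page
        rows.append(cells)
    return rows

grid = code_page_grid("cp1252")
# Byte 0x41 sits in row 1 of column 4, which is index 2 when columns start at 0x2.
print(grid[1][2])
```

Calling the same function with a different codec name, such as "cp1251", fills the same cells with a different repertoire of characters, which is why the Code Page identifier matters.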
Problems in processing global data
Most encodings were designed to meet the script requirements of specific geographical regions and to solve the problems of the day. Encodings for South Asian and East Asian scripts, for example, were created without the expectation that they would ever need to coexist. Consequently, their code-spaces often overlap. This means that unless the identity of a particular character encoding is known, it is impossible to know which script is being represented. Some encodings employ complex techniques to increase the number of characters they can represent, usually to add characters from other scripts.
In a globalized application, where many scripts coexist, this ambiguity is a challenge and adds processing overhead.
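The overlap between code-spaces can be demonstrated by decoding the same bytes under several legacy encodings. This is a minimal sketch; `decode_with_each` is a hypothetical helper, and the byte values were picked only because they are valid in every encoding listed.

```python
def decode_with_each(raw, encodings):
    """Decode the same bytes under several encodings and collect the results."""
    return {enc: raw.decode(enc) for enc in encodings}

# One byte, three single-byte code pages, three different scripts.
single = decode_with_each(b"\xe4", ["latin-1", "cp1251", "iso8859_7"])

# The same two bytes are valid in both a Korean and a Chinese encoding.
double = decode_with_each(b"\xb0\xa1", ["euc_kr", "gb2312"])

print(single)
print(double)
```

Nothing in the bytes themselves says which reading is correct: the byte 0xE4 is a Latin, Cyrillic or Greek letter depending solely on which code page the reader assumes.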
What is the solution?
The answer to the above problems is to use an encoding that accommodates all the scripts of the world in its code space. It should be such that:
- Each script is encoded in a simple, correct and consistent way.
- There is no scope for confusion about whether a character belongs to one script or another.
Unicode, which is kept synchronized with the ISO/IEC 10646 standard, was designed as a Universal Character Set to serve an expanding user base of information technology and to address the problems mentioned above. Under this design, the code-space ranges for different scripts are grouped together to simplify management, and a very large code-space (1,114,112 numbers) ensures that all the scripts of the world can be encoded in a single encoding.
This makes Unicode the ideal choice for encoding textual data, and all other encodings may be considered legacy encodings for globalized applications. Read about Unicode in a previous article: An Introduction to Unicode.
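The size of the code-space and the grouping of scripts can be checked with a short sketch. The figure of 17 planes of 65,536 code-points each is not stated above but is how the 1,114,112 total is arrived at; the `BLOCKS` table and `block_of` helper are illustrative names covering only three of the many Unicode blocks.

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536
CODE_SPACE_SIZE = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112
HIGHEST_CODE_POINT = CODE_SPACE_SIZE - 1  # 0x10FFFF

# A small, hand-picked subset of Unicode blocks (start and end inclusive).
BLOCKS = {
    "Greek and Coptic": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Devanagari": (0x0900, 0x097F),
}

def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for name, (start, end) in BLOCKS.items():
        if start <= cp <= end:
            return name
    return None

print(CODE_SPACE_SIZE, hex(sys.maxunicode))
for ch in "δдक":
    print(ch, hex(ord(ch)), block_of(ch))
```

Because each script occupies its own range, the three letters that shared the byte 0xE4 in legacy code pages now have distinct, unambiguous numbers.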
If you plan to have a globalized application, you should consider migrating non-Unicode data to Unicode to avoid the problems mentioned. In the rest of the article we shall talk about ways to do this.
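In its simplest form, and assuming the source encoding of the legacy data is already known, such a migration amounts to decoding the bytes with the legacy encoding and re-encoding the text in a Unicode encoding such as UTF-8. The helper name `to_utf8` is hypothetical and this is only a sketch of the idea, not a complete migration procedure.

```python
def to_utf8(raw, source_encoding):
    """Convert bytes in a known legacy encoding to UTF-8 bytes.

    Decoding is strict, so bytes that are invalid in source_encoding raise
    UnicodeDecodeError instead of being silently replaced.
    """
    return raw.decode(source_encoding).encode("utf-8")

legacy = b"\xe4"  # the Cyrillic letter "д" in code page 1251
migrated = to_utf8(legacy, "cp1251")
print(migrated, migrated.decode("utf-8"))
```

Strict decoding is a deliberate choice here: during a migration it is usually better to fail on unexpected bytes than to write corrupted text into the new store.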