
Globalize your On Demand Business

LanguageWare is the new generation IBM linguistic platform. It was designed from the ground up to address the demands posed by today's global applications.
Design components

Performance
Because of a common misconception that linguistic solutions cannot provide the performance needed in high-volume unstructured information management applications, performance was a key consideration in the LanguageWare design.

A number of LanguageWare components have a significant impact on performance:

  • The lexical analysis framework streamlines the traditional lexical analysis process.
  • The finite state dictionaries have been highly optimized by investigating the statistical properties of finite-state networks from the point of view of the theory of massive random networks.
  • The study of cache-friendly memory layouts and advanced methods of storing sparse matrices resulted in a very efficient dictionary design.
  • An efficient programming model is combined with a number of programming techniques, such as highly optimized memory management customized for our needs.
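
As a rough illustration of the sparse-storage idea in the list above, the following Python sketch stores a finite-state dictionary as a compressed sparse row table: three contiguous arrays instead of one dense state-by-character matrix. This is not LanguageWare's actual memory layout, which is not described here. The class name SparseAutomaton and its methods are hypothetical, and a plain trie stands in for a minimized automaton.

```python
from array import array
from bisect import bisect_left


class SparseAutomaton:
    """Hypothetical finite-state dictionary stored as a sparse (CSR-style) table."""

    def __init__(self, words):
        # Step 1: build an ordinary trie, one {char: child_state} dict per state.
        trie = [{}]
        final = [False]
        for word in words:
            state = 0
            for ch in word:
                if ch not in trie[state]:
                    trie.append({})
                    final.append(False)
                    trie[state][ch] = len(trie) - 1
                state = trie[state][ch]
            final[state] = True

        # Step 2: flatten the trie into contiguous arrays. Only transitions
        # that exist are stored, so memory grows with the number of edges,
        # not with states x alphabet size.
        self.row_start = array("L", [0])  # row_start[s]..row_start[s+1] = edges of s
        self.labels = array("L")          # code points, sorted within each row
        self.targets = array("L")         # target state of each edge
        for edges in trie:
            for ch in sorted(edges):
                self.labels.append(ord(ch))
                self.targets.append(edges[ch])
            self.row_start.append(len(self.labels))
        self.final = bytes(final)         # 1 where a state ends a dictionary entry

    def step(self, state, ch):
        """Follow one transition; return -1 if the state has no edge for ch."""
        lo, hi = self.row_start[state], self.row_start[state + 1]
        i = bisect_left(self.labels, ord(ch), lo, hi)
        if i < hi and self.labels[i] == ord(ch):
            return self.targets[i]
        return -1

    def contains(self, word):
        state = 0
        for ch in word:
            state = self.step(state, ch)
            if state < 0:
                return False
        return bool(self.final[state])


automaton = SparseAutomaton(["hoch", "schul", "opto", "electronics"])
print(automaton.contains("opto"), automaton.contains("opt"))
```

Keeping each state's edges adjacent in memory means a lookup touches a small, contiguous region, which is the kind of cache-friendly behavior the list refers to.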

The result is a product that we believe can outperform any other solution, linguistic or non-linguistic.

Function                                    Performance (English)
Dictionary Lookup                           ~400,000 words/sec
Lexical Analysis                            ~10 GChars/hour (UTF-16 chars)
(segmentation, normalization, annotation)   ~2 Gtokens/hour
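
To put the hourly lexical analysis figures on the same per-second scale as the dictionary lookup figure, the short Python calculation below converts them. It assumes that "G" means 10^9 and that one UTF-16 char is one 2-byte code unit; neither assumption is stated in the table.

```python
SECONDS_PER_HOUR = 3600

chars_per_hour = 10e9   # ~10 GChars/hour, assuming G = 10^9
tokens_per_hour = 2e9   # ~2 Gtokens/hour, assuming G = 10^9

chars_per_second = chars_per_hour / SECONDS_PER_HOUR
tokens_per_second = tokens_per_hour / SECONDS_PER_HOUR
chars_per_token = chars_per_hour / tokens_per_hour
# Assumption: one UTF-16 char = one 2-byte code unit.
megabytes_per_second = chars_per_second * 2 / 1e6

print(f"{chars_per_second:,.0f} chars/sec")      # 2,777,778 chars/sec
print(f"{tokens_per_second:,.0f} tokens/sec")    # 555,556 tokens/sec
print(f"{chars_per_token:.1f} chars per token")  # 5.0 chars per token
print(f"{megabytes_per_second:.1f} MB/sec")      # 5.6 MB/sec
```

Under these assumptions, the lexical analysis figures correspond to roughly 2.8 million characters and 0.56 million tokens per second, with an implied average of about five characters per token.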

Lexical analysis framework
We analyzed the major usage scenarios in information processing and identified the traditional layers and their bottlenecks. We found that, to get meaningful results from most linguistic solutions, the user must pass through several layers (such as tokenization, lemmatization, and annotation), with each layer adding application overhead. We concluded that the architecture should remove this overhead by merging these layers.

One-pass text processing provides all pre-syntactic processing.

To achieve this, we designed a lexical analysis framework built on the merger of character-based iteration and finite state processing as the key LanguageWare interface. The result is that a single pass through the text provides all levels of pre-syntactic processing.
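
The single-pass idea can be sketched in a few lines of Python. The example below is only an illustration under simplifying assumptions, not the LanguageWare interface: the names build_trie and analyze, the gloss fields, and the whitespace and punctuation boundary rule are all hypothetical. One loop over the characters performs segmentation, normalization (lowercasing), and dictionary annotation at the same time by advancing a trie state with every character.

```python
GLOSS = None  # special trie key under which an entry's annotations are stored


def build_trie(entries):
    """Build a character trie from {surface form: gloss dict}."""
    root = {}
    for surface, gloss in entries.items():
        node = root
        for ch in surface:
            node = node.setdefault(ch, {})
        node[GLOSS] = gloss
    return root


def analyze(text, trie):
    """Segment, normalize, and annotate `text` in a single pass."""
    tokens = []
    start = None        # begin offset of the token being read, if any
    node = trie         # current trie state, or None once off the dictionary
    normalized = []
    for i, ch in enumerate(text + " "):  # trailing space flushes the last token
        if ch.isalnum() or ch == "'":
            if start is None:            # a new token begins here
                start, node, normalized = i, trie, []
            low = ch.lower()             # normalization happens in the same step
            normalized.append(low)
            node = node.get(low) if node is not None else None
        elif start is not None:          # boundary reached: emit the token
            gloss = node.get(GLOSS) if node is not None else None
            tokens.append({
                "begin": start,
                "end": i,
                "normalized": "".join(normalized),
                "gloss": gloss,          # None for out-of-dictionary tokens
            })
            start = None
    return tokens


# Hypothetical dictionary data for illustration only.
trie = build_trie({
    "dogs": {"lemma": "dog", "pos": "noun"},
    "ran": {"lemma": "run", "pos": "verb"},
})
for token in analyze("Dogs ran fast.", trie):
    print(token)
```

Because the dictionary state advances with each character, no separate tokenization, normalization, or lookup pass re-reads the text.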

Cross-linguistic design
After a detailed examination of the linguistic phenomena that differentiate languages, we programmatically represented these phenomena through common algorithms and procedures. To do this we approached the problem from a higher-level linguistic viewpoint and identified key linguistic behaviors that might span several languages. We then identified the most appropriate formal models to process these behaviors, such as state machines, formal rule systems, logic, and statistical tools. This allowed us to create a minimal set of tools that solves many linguistic problems and results in a cross-linguistic architecture that is cost-effective in terms of ongoing development and maintenance of languages and functionality.

Represent the various linguistic phenomena that differentiate languages through common algorithms and procedures.

This clean and modular implementation translates directly to the API, which is simple and transparent. By generalizing linguistic phenomena we made it easy to implement functionality across a broad range of languages. Essentially, we dealt with the complexities of language so you don't have to. We also provide uniform mechanisms through which you can control system behavior. For example, we used the concept of constraint data (stored in the dictionary in a format that is identical across all languages) to affect the behavior of segmentation. Constraint data can determine how contractions are segmented in English, e.g. can't > /can/ /n't/, or how solid compounds are segmented in German, e.g. Universitätsbuchhandlung > /Universitäts/ /Buchhandlung/ or /Universitäts/ /Buch/ /Handlung/ (university bookshop).
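
To make the role of constraint data concrete, here is a small Python sketch of dictionary-driven compound segmentation using the German example. The constraint format is not described on this page, so the "atomic" flag below is a hypothetical stand-in: when set on an element, it stops the segmenter from trying shorter elements at the same position. The function name segmentations is likewise an assumption.

```python
def segmentations(word, elements):
    """Return every split of `word` into dictionary elements.

    `elements` maps each word formation element to its constraint data.
    The "atomic" constraint is a hypothetical example: an atomic element
    blocks shorter alternatives that start at the same position.
    """
    word = word.lower()
    results = []

    def walk(pos, parts):
        if pos == len(word):
            results.append(parts)
            return
        # Elements that match at this position, longest first.
        matches = sorted(
            (e for e in elements if e and word.startswith(e, pos)),
            key=len,
            reverse=True,
        )
        for element in matches:
            walk(pos + len(element), parts + [element])
            if elements[element].get("atomic"):
                break  # constraint data: do not decompose this element further

    walk(0, [])
    return results


elements = {"universitäts": {}, "buchhandlung": {}, "buch": {}, "handlung": {}}
print(segmentations("Universitätsbuchhandlung", elements))
# [['universitäts', 'buchhandlung'], ['universitäts', 'buch', 'handlung']]

elements["buchhandlung"] = {"atomic": True}
print(segmentations("Universitätsbuchhandlung", elements))
# [['universitäts', 'buchhandlung']]
```

The segmentation code is the same in both runs; only the data stored with the dictionary entry changes the result, which is the point of a uniform constraint mechanism.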

Flexible dictionary construction
Finally, we recognized that we could never provide one system that would be perfect for everyone, so we created a highly customizable infrastructure across the entire architecture, for all functions. This means that the average customer can likely use the default configuration, while customers with more specific needs can customize LanguageWare extensively.

LanguageWare uses a simple, uniform and extensible data construction. Our dictionaries, for example, store more than words: they also store word formation elements. These elements may be an orthographic word, such as hoch (German), effettiva (Italian), or electronics (English), or an element from which a valid word can be constructed, such as schul (German), un (Italian), or opto (English). These word formation elements may be combined to form Hochschullehrer, un'effettiva and optoelectronics.

Our dictionaries also store information in glosses for each of these word formation elements. This information might be the lemma (morphological), stem (Porter-like), morphology, a part of speech, or constraints. It can also be extended through user-defined glosses. This design lets customers store any type of data in the dictionary and make it available through LanguageWare. These types of data can range from a medical ontology within a life sciences project to a product database within a large corporate search project.
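
One way to picture such an entry is the Python sketch below. It is a simplified model, not the LanguageWare dictionary format: the Entry and Dictionary classes, the field names, and the catalog_id value are all hypothetical. Each word formation element carries a flag saying whether it is an orthographic word on its own, plus an open-ended set of glosses that can hold both linguistic and user-defined data.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical dictionary entry for one word formation element."""
    surface: str
    is_word: bool                               # False for elements such as "opto"
    glosses: dict = field(default_factory=dict)


class Dictionary:
    """Hypothetical dictionary that stores elements together with their glosses."""

    def __init__(self):
        self._entries = {}

    def add(self, surface, is_word=True, **glosses):
        """Create the entry if needed, then attach or extend its glosses."""
        entry = self._entries.setdefault(surface, Entry(surface, is_word))
        entry.glosses.update(glosses)
        return entry

    def lookup(self, surface):
        return self._entries.get(surface)


d = Dictionary()
# Linguistic glosses: lemma and part of speech.
d.add("electronics", lemma="electronics", pos="noun")
# A word formation element that is not an orthographic word by itself.
d.add("opto", is_word=False)
# A user-defined gloss; the key and value are made up for illustration.
d.add("electronics", catalog_id="EXAMPLE-0001")

print(d.lookup("electronics"))
print(d.lookup("opto"))
```

Because user-defined glosses live in the same structure as the linguistic ones, an application can retrieve them with the same lookup.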

A data-driven model provides a level of transparency never before seen in a linguistic solution.

This concept of complete transparency within the linguistic system is new and empowering for customers:

  • By loading a particular language dictionary, the system automatically behaves in accordance with that language.
  • If a client doesn't like the way an application functions, its behavior can be easily modified.
  • Dictionary customization supports manipulations such as dictionary merging, so that cross-linguistic data and rules (for example, corporate product names or terminology, chemical elements, or gene sequences) can be developed independently and built into many language dictionaries.
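
Dictionary merging can be sketched as follows. This is a minimal Python illustration assuming dictionaries are plain mappings from surface forms to gloss mappings; the function merge_dictionaries and all sample data are hypothetical and do not reflect the actual LanguageWare tooling.

```python
def merge_dictionaries(base, overlay):
    """Return a new dictionary containing `base` plus everything in `overlay`.

    Both arguments map surface forms to gloss dicts. When an entry exists
    in both, the glosses are combined and `overlay` wins on conflicting keys.
    Neither input is modified.
    """
    merged = {surface: dict(glosses) for surface, glosses in base.items()}
    for surface, glosses in overlay.items():
        merged.setdefault(surface, {}).update(glosses)
    return merged


# Language-independent data, developed once (hypothetical sample entries).
shared_terms = {
    "languageware": {"type": "product"},
    "helium": {"type": "chemical element"},
}

# Language-specific dictionaries (hypothetical sample entries).
english = {"helium": {"pos": "noun", "lemma": "helium"}}
german = {"hoch": {"pos": "adjective"}}

english_custom = merge_dictionaries(english, shared_terms)
german_custom = merge_dictionaries(german, shared_terms)

print(english_custom["helium"])
print(sorted(german_custom))
```

The shared data is maintained in one place and folded into each language dictionary at build time, so a change to the shared terminology does not require editing every language separately.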
