HLT is not a new technology; it has been in use for over 30 years. It was conceived from the very beginning for information retrieval, but with the advent of personal computing the focus shifted to applications such as spell checking. The focus has since shifted back to information retrieval, most recently in UIM applications.
This was highlighted by the recent EU-funded EUROMAP benchmark, which puts the current HLT market at around 400 million euros, with projected growth to exceed 2 billion euros for the combined speech/HLT market. According to EUROMAP, while these are respectable market opportunities in themselves, they do not reflect the multiplier effect of embedded HLT: the value added to products and services that employ HLT creates markets worth many times the value of the core technology itself.
While there seems to be a consensus on the market potential of HLT, and many product teams have made progressive use of it, there are still development teams that struggle with the idea of using a linguistic solution. They would prefer a non-linguistic solution, one that does not rely on language features but uses techniques such as character and string manipulation to generate similar results. For example, a non-linguistic solution would treat white space or an apostrophe as a delimiter for tokenizing input text, but it cannot deal with complex issues such as compounds or multi-word expressions. It might also use a stemmer that instructs "strip 's' off the end of the word" to provide the infinitive of some verbs in English, e.g. 'runs' > 'run'. However, it cannot handle irregular forms such as 'ran' > 'run'.
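The following Python fragment is a minimal sketch of the kind of non-linguistic approach described above. The function names `naive_tokenize` and `naive_stem` are hypothetical and chosen only for illustration; they do not come from any particular product.

```python
import re


def naive_tokenize(text):
    """Split on runs of white space or apostrophes, with no language knowledge."""
    return [token for token in re.split(r"[\s']+", text) if token]


def naive_stem(word):
    """Apply the single rule "strip 's' off the end of the word"."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


if __name__ == "__main__":
    # The multi-word expression "ice cream" is split into two unrelated tokens.
    print(naive_tokenize("She runs the ice cream shop"))
    # The regular form is handled; the irregular form 'ran' is left unchanged.
    print(naive_stem("runs"), naive_stem("ran"))
```

The same rule that maps 'runs' to 'run' also turns the noun 'bus' into 'bu', which is the sort of side effect that leads to the layering of exception rules.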
While non-linguistic methods have known limitations, they have advantages that make them attractive to development teams, at least in the short term. The most obvious advantage is that they do not require any linguistic expertise; instead, they involve an incremental process of creating or tweaking rules and then checking the behavior. This puts control of behavior in the hands of developers. In addition, it allows developers to use one method for all phenomena, both linguistic (inflectional variations) and non-linguistic (URLs, email addresses, proper names, etc.). Another perceived advantage is that non-linguistic solutions do not require a dictionary, which is generally considered to be slow and which must be developed by a language expert.
There are also negative aspects to non-linguistic solutions. Incremental development, particularly when the solution must support many languages, results in a layering of rules and exceptions that try to account for all the behaviors of a diverse set of languages. There is no model that can represent the singularities of each language, and the process becomes unwieldy as the number of languages grows. As for dictionaries, speed is no longer a drawback, and dictionary-based linguistic solutions can actually provide higher performance than a hierarchy of non-linguistic rules. Later in this article we will present the performance results of IBM LanguageWare, which can outperform any non-linguistic method.
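To make the contrast concrete, here is a minimal sketch of a dictionary-based lookup in Python. The toy `LEXICON` and the `lemma` function are hypothetical illustrations only; they are not how IBM LanguageWare is implemented, and a real dictionary would be built by language experts and stored in a far more compact structure.

```python
# Hypothetical toy lexicon mapping surface forms to base forms (illustration only).
LEXICON = {
    "run": "run",
    "runs": "run",
    "ran": "run",
    "running": "run",
    "bus": "bus",
}


def lemma(word, lexicon=LEXICON):
    """Return the base form with one hash lookup; unknown words pass through unchanged."""
    return lexicon.get(word.lower(), word)


if __name__ == "__main__":
    # Regular and irregular forms resolve the same way, with no exception rules.
    print(lemma("runs"), lemma("ran"), lemma("bus"))
```

In this sketch each word costs a single lookup regardless of how many forms the dictionary holds, whereas a rule hierarchy may need to test a word against many rules and exceptions in turn.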
With a non-linguistic solution, you have developer control of the behavior of the system and the flexibility to change it as you wish; with a linguistic solution, you can let your developers concentrate on coding and let linguists focus on the languages involved. You could extrapolate that a perfect solution would merge these two approaches to create a linguistic system that is flexible, customizable, and extremely fast.