A solid underlying technology is required to realize the benefits of an off-the-shelf linguistic solution.
Cross-linguistic framework
Many traditional linguistic solutions consider only core linguistic phenomena as part of their remit, such as inflectional variations, morphology, lemmatization, and segmentation. In reality, you must deal with a broader range of phenomena, including:
- Spelling variations (organization/organisation)
- Typographic variations (anti-virus/antivirus/anti virus)
- Orthographic variations (oe/ö)
- Spelling errors (WebSphere/Webpshere)
- Proper names (International Business Machines)
- Abbreviations (IBM)
- URLs (w3.ibm.com)
- Punctuation
Additionally, some behaviors must be consistent across languages, such as the handling of cross-lingual proper names, while others vary by language, such as spelling variations, alphabets, or inflections.
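As a rough illustration of how language-specific and shared behavior can sit side by side, the following minimal Python sketch maps surface variants of a term to a single comparison key. The function name `variant_key` and the two small lookup tables are hypothetical and exist only for this example; a production system would rely on far larger linguistic resources.

```python
import re

# Hypothetical, deliberately tiny per-language tables (assumptions for this sketch).
SPELLING_VARIANTS = {"en": {"organisation": "organization"}}
ORTHOGRAPHIC_FOLDS = {"de": {"ö": "oe", "ä": "ae", "ü": "ue"}}


def variant_key(term: str, lang: str) -> str:
    """Map surface variants of a term to one comparison key."""
    key = term.casefold()
    # Orthographic folding is language-specific (for example, German ö -> oe).
    for src, dst in ORTHOGRAPHIC_FOLDS.get(lang, {}).items():
        key = key.replace(src, dst)
    # Typographic variation: ignore hyphens and whitespace inside the term.
    key = re.sub(r"[-\s]+", "", key)
    # Spelling variation: use a canonical form when the language lists one.
    return SPELLING_VARIANTS.get(lang, {}).get(key, key)
```

With these tables, 'anti-virus', 'antivirus', and 'anti virus' all share one key, while the ö/oe folding applies only when the language is German. The shared steps and the per-language tables are kept separate, which is one simple way to keep common behavior consistent while letting the rest vary.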
To achieve this functionality, you need a linguistic system architected to be cross-linguistic and flexible. Not all linguistic solutions qualify. In some legacy solutions, languages may have been developed in relative isolation by language experts using unique language models. The resulting technology may be a conglomeration of different linguistic capabilities which can vary significantly across languages. This could mean that instead of being isolated from linguistic complexities, you must deal with phenomena that have been complicated by implementation details within the underlying technology. You may have to deal separately with many of the non-linguistic phenomena, which defeats the main purpose of utilizing a linguistic solution.
Customization
It is rare, if not impossible, that an HLT solution will function exactly as each customer would like. You are faced not only with different expectations regarding language support, but also with different desired behaviors within a given language. For example, one application might prefer to treat 'Peter's' as one token, while another might split off the possessive as a separate token, e.g. /John and Peter/ /'s/ room. You may encounter inconsistencies caused by information from different disciplines, such as the biomedical, legal, or information technology sectors. In a biomedical project, users may not want to treat certain punctuation as a delimiter in segmentation, especially since many elements have complex names that contain all types of punctuation, including periods, hyphens, and even white spaces. A German application developer in Life Sciences probably faces problems that have more in common with those of an English-speaking or Japanese-speaking Life Sciences counterpart than with those of a German financial application developer.
The key point is that no linguistic solution will satisfy every customer in every situation, even if you believe you have satisfied the language requirements. You must be able to customize your solution and give your development team control of system behavior.
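To make the idea of controllable segmentation concrete, here is a minimal Python sketch of a tokenizer whose behavior is driven by configuration. The names `TokenizerConfig`, `split_possessive`, and `keep_internal_punct` are hypothetical, invented for this illustration, and the regular expressions cover only the two cases discussed above; names that contain white space are not handled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    # Hypothetical switches for this sketch only.
    split_possessive: bool = False     # emit "'s" as its own token
    keep_internal_punct: bool = False  # keep periods/hyphens inside a token


def tokenize(text: str, config: TokenizerConfig = TokenizerConfig()) -> list[str]:
    """Split text into tokens according to the given configuration."""
    # A "word" optionally keeps internal periods and hyphens (e.g. "IL-2").
    word = r"\w+(?:[.\-]\w+)*" if config.keep_internal_punct else r"\w+"
    if config.split_possessive:
        # The possessive marker becomes a separate token.
        pattern = rf"'s\b|{word}|[^\w\s]"
    else:
        # The possessive marker stays attached to the preceding word.
        pattern = rf"{word}(?:'s\b)?|[^\w\s]"
    return re.findall(pattern, text)
```

A general-purpose application might call `tokenize(text)` with the defaults, while a biomedical one might pass `TokenizerConfig(keep_internal_punct=True)` so that a name such as 'IL-2' stays in one piece. The point of the sketch is only that such choices live in configuration the development team controls, rather than being fixed inside the engine.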