Below are the types of functions provided by IBM LanguageWare.
Language identification
The LanguageWare infrastructure supports simple text categorization, of which language identification (ID) is one implementation. Language ID works through our native lexical analyzer by using a special language ID dictionary, which contains function words (such as determiners, conjunctions, or prepositions) and other linguistically motivated words (such as clitics or contractions) for all supported languages. The dictionary also stores a language identifier and a weighting factor for each word. Because language ID runs on our already lightweight lexical analysis framework, it is extremely fast. It is also easily tunable. Finally, since it only works on linguistically motivated words, which essentially represent the structure of the language, its quality is not affected by junk (HTML tags, etc.) or foreign words.
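The weighted-dictionary idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LANG_ID_DICT` entries, the weights, and the `identify_language` function are hypothetical and are not LanguageWare data or API.

```python
from collections import defaultdict

# Hypothetical language ID dictionary: word -> list of (language, weight).
# Entries and weights are illustrative only, not LanguageWare data.
LANG_ID_DICT = {
    "the": [("en", 1.0)],
    "and": [("en", 0.8)],
    "of": [("en", 0.8)],
    "der": [("de", 1.0)],
    "und": [("de", 0.9)],
    "die": [("de", 0.9), ("en", 0.1)],
    "le": [("fr", 1.0)],
    "et": [("fr", 0.9)],
    "des": [("fr", 0.7), ("de", 0.5)],
}


def identify_language(text):
    """Return the language with the highest summed weight, or None."""
    scores = defaultdict(float)
    for token in text.lower().split():
        word = token.strip(".,;:!?\"'()<>")
        # Tokens missing from the dictionary (markup, content words,
        # foreign words) contribute nothing to any score.
        for lang, weight in LANG_ID_DICT.get(word, []):
            scores[lang] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


print(identify_language("<p> der Hund und die Katze </p>"))
```

Because only dictionary words are scored, the HTML tags and the content words in the sample input have no effect on the result.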
Lexical analysis
This includes three traditionally separate levels of processing:
- Segmentation splits text into lexical items accurately, which may seem trivial if you only consider languages with words separated by whitespace. Segmentation is an important function in languages such as Chinese or Japanese, which are not whitespace delimited, or languages such as German or Korean, which have a high percentage of solid compounds. Even English has solid compounds, such as antivirus or optoelectronics, and contractions, such as Jack's. Our segmentation function also recognizes multi-word expressions, such as International Business Machines, and abbreviations, such as IBM or I.B.M.
- Normalization recognizes different representations of the same word. It is often considered in narrow terms of case normalization (House, house) or inflectional normalization (house, houses), but it can have more extensive implementations. For example, you may want to recognize different representations caused by spelling variations (organization/organisation), typographic variations (anti-virus/antivirus/anti virus), orthographic variations (oe/ö), abbreviations (IBM, International Business Machines), or even spelling errors (WebSphere/Webpshere). This is what we consider to be normalization. Like the rest of our functionality, it is primarily driven by our dictionary data, making it easily customizable.
- Annotation is the association of information with the lexical items that result from segmentation. The information is stored in our dictionary gloss structure and can range from a lemma (either a morphological lemma or a Porter-like stem; both are available), to part-of-speech tags, to user-defined data.
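To make the three levels concrete, here is a minimal Python sketch of a dictionary-driven analyzer. It assumes whitespace-delimited input and a toy dictionary; the `DICTIONARY` contents, the gloss fields, and the `analyze` function are hypothetical and do not reflect the LanguageWare data format. Segmentation prefers the longest dictionary match (so multi-word expressions stay together), normalization is handled by mapping variant surface forms to the same lemma, and annotation is the gloss returned with each token.

```python
# Hypothetical dictionary: lowercased surface form -> gloss.
DICTIONARY = {
    "international business machines": {"lemma": "IBM", "pos": "NNP"},
    "ibm": {"lemma": "IBM", "pos": "NNP"},
    "i.b.m.": {"lemma": "IBM", "pos": "NNP"},
    "anti-virus": {"lemma": "antivirus", "pos": "NN"},
    "anti virus": {"lemma": "antivirus", "pos": "NN"},
    "antivirus": {"lemma": "antivirus", "pos": "NN"},
    "houses": {"lemma": "house", "pos": "NNS"},
    "house": {"lemma": "house", "pos": "NN"},
}
MAX_WORDS = 3  # length of the longest multi-word entry in the toy dictionary


def analyze(text):
    """Return (surface, gloss) pairs; gloss is None for unknown words."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink to a single word.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            surface = " ".join(words[i:i + n])
            gloss = DICTIONARY.get(surface.lower())  # case normalization
            if gloss is not None or n == 1:
                tokens.append((surface, gloss))
                i += n
                break
    return tokens


for surface, gloss in analyze("International Business Machines sells anti virus Houses"):
    print(surface, "->", gloss)
```

A real segmenter for Chinese, Japanese, or German compounds cannot start from whitespace tokens; this sketch only shows how one dictionary can drive all three steps.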
Dictionary lookup
This provides an extremely quick mechanism for traversing our finite-state dictionary and accessing all information associated with the dictionary entry or entries being searched for. The lexical analyzer itself uses dictionary lookup, so a customer using LanguageWare to process text does not need to perform a separate dictionary lookup. All information associated with each token is provided natively within the lexical analyzer results.
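As a rough illustration of why lookup in a finite-state dictionary is fast, the following Python sketch uses a trie, one of the simplest finite-state structures. The `TrieDictionary` class and its gloss format are assumptions made for this example and say nothing about the actual LanguageWare dictionary layout.

```python
class Node:
    """One state: outgoing transitions plus glosses if the state is final."""

    def __init__(self):
        self.next = {}
        self.glosses = []


class TrieDictionary:
    def __init__(self):
        self.root = Node()

    def add(self, word, gloss):
        node = self.root
        for ch in word:
            node = node.next.setdefault(ch, Node())
        node.glosses.append(gloss)

    def lookup(self, word):
        """Follow one transition per character; cost depends on the word
        length, not on the number of entries in the dictionary."""
        node = self.root
        for ch in word:
            node = node.next.get(ch)
            if node is None:
                return []
        return list(node.glosses)


d = TrieDictionary()
d.add("house", {"lemma": "house", "pos": "NN"})
d.add("house", {"lemma": "house", "pos": "VB"})
d.add("houses", {"lemma": "house", "pos": "NNS"})
print(d.lookup("house"))
```

A single traversal returns every gloss stored for the entry, which mirrors the idea that all information for a token comes back from one lookup.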
Text correction
Text correction is most commonly used in applications such as word processors to provide spell checking. IBM has implemented text correction in a more global fashion so that it can be used by other applications such as search engines, optical character recognition (OCR), or any automatic text correction application. From the viewpoint of traditional spelling correction, our text correction can be considered as two distinct functions:
- Spell verify, which takes input text and identifies all words that have been incorrectly spelled.
- Spell aid, which identifies spelling candidates for each incorrect word.
In the case of spell verify, spelling is marked as incorrect if the word does not exist in the dictionary, or if it cannot be correctly decompounded (as might be the case for a German compound, Universitätsbuchhandlung). Spell aid uses three mechanisms for generating candidates: bad speller entry in dictionary, grapheme to grapheme conversion, and fuzzy matching.
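The spell verify rule (a word is accepted if it is in the dictionary or can be decompounded) can be sketched as follows. This is a simplified illustration: the `LEXICON`, the single linking element "s", and the function names are assumptions for the example, and real German decompounding involves more linking elements and constraints.

```python
# Toy lexicon of lowercased simple words (hypothetical).
LEXICON = {"universität", "buch", "handlung", "haus"}
# Optional linking element between compound parts; heavily simplified.
LINKERS = ("", "s")


def can_decompound(word, lexicon=LEXICON):
    """True if word is a lexicon word or a chain of lexicon words,
    optionally joined by a linking element."""
    word = word.lower()
    if word in lexicon:
        return True
    for i in range(1, len(word)):
        if word[:i] not in lexicon:
            continue
        rest = word[i:]
        for linker in LINKERS:
            tail = rest[len(linker):]
            if rest.startswith(linker) and tail and can_decompound(tail, lexicon):
                return True
    return False


def spell_verify(text):
    """Return the words that are neither known nor decompoundable."""
    return [w for w in text.split() if not can_decompound(w)]


print(spell_verify("Universitätsbuchhandlung Buchhaus Universitätbuchhandlumg"))
```

Here Universitätsbuchhandlung is accepted as Universität + s + Buch + Handlung, while the last word is flagged because its final part is not in the lexicon.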
A bad speller entry allows the user to add any string of characters to the dictionary, accompanied by the 'correction' of that string. Traditionally, users only have the ability to add new 'correct' words to a dictionary. With this type of entry, however, you can add any word, valid or not, and define what you want the correction to be. This functionality can be used in a more traditional spelling correction application to essentially 'learn' the most common mistakes of the user, and then train the spelling engine to recognize their usual corrections. However, it could have even more diverse uses.
For example, while a search engine like Google can provide a very primitive mechanism for recognizing possible alternative spellings of a search query ("do you mean ..."), our dictionary-based solution gives you significantly more power. The www.ibm.com sales site could add a number of alternate spellings for its product names so that someone who searched on LWP might get "do you mean Lotus Workplace?", or even "This is part number XXXXX" and "You might also be interested in WebSphere Portal Server and …".
Another example is the standardization of corporate terminology or product names. You can not only store all your product names in the dictionary, but also include bad speller entries to track common misspellings or to correct product names that have changed or become obsolete. The corporate spelling infrastructure would automatically manage naming standardization for you. People would no longer have to search a terminology database to make sure they are getting it right; their spell check engine would tell them. For example:
| Incorrect usage | Correction |
| --- | --- |
| Data Joiner | DataJoiner® |
| websphere | WebSphere® |
| LWP | Lotus Workplace |
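The bad speller entries in the table can be pictured as a simple mapping from an incorrect string to its correction. The Python sketch below applies such a mapping to running text; the `BAD_SPELLER` dictionary and the `correct_terms` function are hypothetical names for this illustration, not part of LanguageWare.

```python
import re

# Bad speller entries taken from the table: incorrect usage -> correction.
BAD_SPELLER = {
    "Data Joiner": "DataJoiner®",
    "websphere": "WebSphere®",
    "LWP": "Lotus Workplace",
}


def correct_terms(text):
    """Replace each bad speller entry that appears as a whole term.
    Matching is case-sensitive, so the already correct 'WebSphere'
    is left alone."""
    for wrong, right in BAD_SPELLER.items():
        pattern = r"(?<!\w)" + re.escape(wrong) + r"(?!\w)"
        text = re.sub(pattern, lambda m, r=right: r, text)
    return text


print(correct_terms("Install websphere and Data Joiner via LWP."))
```

A search application could use the same entries to produce a "do you mean" suggestion instead of rewriting the text.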
Grapheme to grapheme conversion involves applying a series of transformations to the input string. These transformations represent a number of error models, such as OCR (rn 'looks like' m), phonetic (f 'sounds like' ph), or typographic (dropping or transposing letters). This function is useful for applications such as search engines, which may want to suggest alternatives to users who could have mistyped a query, even if the word they entered is valid.
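A minimal Python sketch of the idea follows. It covers only substitution rules; the `RULES` list, the toy lexicon, and the `g2g_candidates` function are assumptions for illustration, and the rule set is far smaller than any practical error model.

```python
# Hypothetical substitution rules: (grapheme in input, grapheme to try).
RULES = [
    ("rn", "m"),   # OCR: rn looks like m
    ("m", "rn"),
    ("f", "ph"),   # phonetic: f sounds like ph
    ("ph", "f"),
]


def g2g_candidates(word, lexicon):
    """Apply each rule at each position, one substitution at a time,
    and keep the results that are dictionary words."""
    candidates = set()
    for src, dst in RULES:
        start = word.find(src)
        while start != -1:
            candidate = word[:start] + dst + word[start + len(src):]
            if candidate in lexicon:
                candidates.add(candidate)
            start = word.find(src, start + 1)
    return sorted(candidates)


lexicon = {"modem", "modern", "phone"}
print(g2g_candidates("modem", lexicon))
print(g2g_candidates("fone", lexicon))
```

Note that "modem" is itself a valid word, yet the OCR rule still proposes "modern" as an alternative, which is the situation the search engine use case relies on.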
Fuzzy matching uses dictionaries to identify words that are very similar to the input word. It does this by measuring how many single-character edits (insertions, deletions, or substitutions) separate the two words, the so-called Levenshtein distance.
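For reference, the standard dynamic-programming computation of Levenshtein distance looks like this in Python. The `fuzzy_match` helper, its `max_distance` threshold, and the linear scan over the lexicon are assumptions for the example; a production engine would search the dictionary far more efficiently.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(word, lexicon, max_distance=2):
    """Return (distance, entry) pairs within max_distance, closest first."""
    scored = ((levenshtein(word, entry), entry) for entry in lexicon)
    return sorted((d, e) for d, e in scored if d <= max_distance)


print(fuzzy_match("webpshere", ["websphere", "portal"]))
```

In plain Levenshtein distance a transposition such as "ps" for "sp" counts as two edits, which is why the example uses a threshold of 2.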
Finally, we have a system of weightings, or costs, which is used to provide ranking of all candidates. These are easily tunable to process very specific types of errors, such as errors arising from OCR.
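One way to picture tunable costs is a per-mechanism weight applied to each candidate's raw cost. The Python sketch below is purely illustrative: the weight tables, the mechanism names, and the `rank` function are hypothetical and do not describe how LanguageWare actually computes its costs.

```python
# Hypothetical weights per candidate-generation mechanism (lower = preferred).
DEFAULT_WEIGHTS = {"bad_speller": 0.5, "grapheme": 1.5, "fuzzy": 1.0}
# Tuned for OCR input: 'looks like' grapheme substitutions become cheap.
OCR_WEIGHTS = {"bad_speller": 0.5, "grapheme": 0.25, "fuzzy": 2.0}


def rank(candidates, weights):
    """candidates: iterable of (word, mechanism, raw_cost).
    Returns (word, cost) pairs, cheapest first, keeping the lowest
    cost when several mechanisms propose the same word."""
    best = {}
    for word, mechanism, raw_cost in candidates:
        cost = weights[mechanism] * raw_cost
        if word not in best or cost < best[word]:
            best[word] = cost
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


candidates = [("modern", "grapheme", 1), ("model", "fuzzy", 1)]
print(rank(candidates, DEFAULT_WEIGHTS))
print(rank(candidates, OCR_WEIGHTS))
```

With the default weights the fuzzy candidate ranks first; with the OCR-tuned weights the grapheme candidate does, which shows how changing the costs changes the ordering without touching the candidate generators.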
Dictionary customization lets you create a dictionary, add and delete entries, create new gloss types, and merge dictionaries. You have complete freedom to customize the dictionaries to suit your particular needs.