Q: How can I find out more about LanguageWare?
A: To find out more, send an e-mail expressing your interest in IBM LanguageWare to global@us.ibm.com.
Q: What is LanguageWare?
A: LanguageWare is the IBM platform on which to build your human language technology (HLT) applications. It provides a language-neutral interface through which HLT functions can be accessed and supports both the C and Java programming languages. LanguageWare itself is implemented using the C++ programming language. The C API is provided directly through the LanguageWare library. The Java API is provided using a wrapper Java library which uses Java Native Interface (JNI) to access LanguageWare.
Q: What is lexical analysis?
A: Our lexical analysis is the chunking of text into sentences and lexical items. This encapsulates what is traditionally considered as three separate steps: segmentation, normalization and annotation. This handles linguistic phenomena, such as compounds, clitics, or contractions, and non-linguistic ones, such as URLs, dates/times, e-mails, abbreviations, etc.
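The three traditional steps named above can be sketched in a few lines of Python. This is a toy illustration only, not the LanguageWare API: the function names (`segment`, `normalize`, `annotate`, `analyze`) and the `GLOSSES` table are hypothetical, and the punctuation-based sentence split is deliberately naive (it would mishandle abbreviations, URLs, and similar cases).

```python
import re

# Hypothetical toy gloss table for illustration; not LanguageWare data.
GLOSSES = {
    "the": {"pos": "determiner"},
    "cat": {"pos": "noun"},
    "sat": {"pos": "verb", "lemma": "sit"},
}

def segment(text):
    """Split text into sentences, then each sentence into raw lexical items."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def normalize(item):
    """Map an item to a canonical form (here: lower-casing only)."""
    return item.lower()

def annotate(item):
    """Attach whatever information the toy table holds for the item."""
    return {"text": item, "gloss": GLOSSES.get(normalize(item))}

def analyze(text):
    """Run segmentation, normalization and annotation as one pass."""
    return [[annotate(item) for item in sentence] for sentence in segment(text)]
```

Calling `analyze("The cat sat. It slept!")` returns two sentences, each a list of items carrying their surface text and any gloss found in the table.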
Q: What is the LanguageWare dictionary?
A: The dictionary is a customizable finite state machine that maps finite sequences of UTF-16 code units (a string of characters representing a word or phrase) to associated information, such as lemma, part of speech, a list of synonyms, word frequencies, or anything else the user wants to define.
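To make the idea concrete, here is a minimal sketch of a dictionary keyed on UTF-16 code units, using a trie as the simplest kind of finite state machine. The internal layout of the LanguageWare dictionary is not described here, so everything below (`utf16_units`, `TrieDictionary`, the sample entry) is an assumption made for illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

class TrieDictionary:
    """Toy trie: each state has outgoing transitions and optional information."""

    def __init__(self):
        self.root = {"next": {}, "info": None}

    def add(self, entry, info):
        """Store info under the code-unit sequence of entry."""
        node = self.root
        for unit in utf16_units(entry):
            node = node["next"].setdefault(unit, {"next": {}, "info": None})
        node["info"] = info

    def lookup(self, entry):
        """Follow one transition per code unit; return None if no entry ends there."""
        node = self.root
        for unit in utf16_units(entry):
            node = node["next"].get(unit)
            if node is None:
                return None
        return node["info"]
```

For example, after `d = TrieDictionary()` and `d.add("mice", {"lemma": "mouse", "pos": "noun"})`, `d.lookup("mice")` returns the stored information while `d.lookup("mic")` returns `None`. Lookup cost depends on the length of the key rather than the number of entries.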
Q: What are the benefits of using LanguageWare instead of a tokenizer?
A: Tokenization is traditionally considered as a process of chunking text based purely on character properties, such as white-space or apostrophe. This simplistic view is not suitable for treatment of more complex linguistic phenomena.
The LanguageWare lexical analyzer provides proper treatment of these linguistic and typographic features:
- Word segmentation for Japanese and Chinese.
- Contractions are split into their component parts, if needed. For example:
  - wouldn't -> would + not
  - Horse's -> Horse + is/'s
- Clitics are split into their component parts, if needed. For example:
  - reparti-lo-emos -> repartir + lo + emos
  - l'avenue -> le + avenue
  - dell'arte -> dello + arte
- Compounds are split into their component parts. For example:
  - Oberschulrat -> Ober + schul + rat
- If a multi-word expression is in the dictionary it will be recognized as one lexical unit. For example, 'International Business Machines', 'tip of the iceberg', 'George W. Bush', and so on.
- If an abbreviation is in the dictionary it will be recognized as one lexical unit. If it is not then it will still be recognized as a lexical item, but will not have any associated gloss information.
- End of sentence (EOS) markers: a basic level of EOS detection is performed against punctuation.
- LanguageWare also allows for uniform treatment of various orthographic and typographic variants, such as anti virus, anti-virus and antivirus, categorization and categorisation, or oe and ö. These are often ignored by other linguistic engines.
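The difference between whitespace tokenization and dictionary-driven analysis can be illustrated with multi-word expressions. The sketch below is not LanguageWare code: it assumes a hypothetical `MULTIWORD` set built from the examples above and uses a simple longest-match strategy over whitespace-separated words.

```python
# Hypothetical multi-word entries, taken from the examples in the text.
MULTIWORD = {
    ("international", "business", "machines"),
    ("tip", "of", "the", "iceberg"),
}
MAX_LEN = max(len(entry) for entry in MULTIWORD)

def lexical_items(text):
    """Group words into lexical items, preferring the longest dictionary match."""
    words = text.split()
    items, i = [], 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if candidate in MULTIWORD:
                items.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word entry starts here; keep the single word.
            items.append(words[i])
            i += 1
    return items
```

A plain `text.split()` on "She joined International Business Machines in 1990" yields seven tokens, whereas `lexical_items` keeps "International Business Machines" together as one lexical unit.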
Q: What sort of performance can I expect to get from LanguageWare?
A: It is a common misconception that dictionary-based solutions tend to be slower than algorithmic ones. This is not the case with LanguageWare. Our dictionary architecture is highly optimized to provide extremely fast dictionary lookup, which in turn improves overall performance. Currently, on a standard Pentium III desktop, we can deliver approx. 10 Gchars/hour of lexical analysis, with approx. 500,000 dictionary lookups/second.
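For a sense of scale, the quoted figures can be converted to per-second and per-lookup terms. This is plain unit arithmetic on the numbers above, reading "G" as 10^9 (an assumption); the variable names are illustrative.

```python
CHARS_PER_HOUR = 10 * 10**9      # "10 Gchars/hour", reading G as 10^9 (assumption)
LOOKUPS_PER_SECOND = 500_000     # quoted dictionary lookup rate

# Characters analyzed per second.
chars_per_second = CHARS_PER_HOUR / 3600

# Time to analyze one million characters, in seconds.
seconds_per_megachar = 10**6 / chars_per_second

# Average time per dictionary lookup, in microseconds.
microseconds_per_lookup = 10**6 / LOOKUPS_PER_SECOND
```

Under that reading, the figures correspond to roughly 2.8 million characters per second, about 0.36 seconds per million characters, and 2 microseconds per lookup.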
Q: Does LanguageWare support Java applications?
A: While LanguageWare itself is implemented using the C++ programming language, it does have a Java API which can be used by Java applications. The API does rely on underlying Java Native Interface (JNI) code to access LanguageWare. However, this code has been optimized and designed to work successfully in a Java environment with the majority of our customer applications being Java applications.
Q: What languages does LanguageWare v3 currently support?
A: Chinese (Simplified/Traditional), Danish, Dutch, English (US/UK), French (National/Canadian), German (Reform/Pre-reform/Swiss), Italian, Japanese, Norwegian (Bokmål/Nynorsk), Portuguese (National/Brazilian), Spanish, Swedish. The following languages are currently in development: Czech, Finnish, Greek, Hungarian, Polish, Russian, and Turkish.
Q: Does LanguageWare provide spelling correction?
A: Spelling correction will be available in the next release of LanguageWare.
Q: Does LanguageWare provide language identification?
A: Language identification will be available in the next release of LanguageWare.