Locale Data Markup Language is the XML format for specifying locale data in the repository. Version 1.0 of this specification was released June 24th, 2003 and is available at http://www.openi18n.org/specs/ldml/
Each locale's data is stored in a separate XML file, for example fr_BE.xml or en.xml, and the top level element is named <ldml>.
Locales and the <identity> element
Locales consist of four parts: the language, the territory, the variant, and finally any locale options. Only the language code is required.
Here are some example locales:
| Locale | Description |
| en | English |
| fr_BE | French in Belgium |
| de_DE | German in Germany |
| sv_FI_AL | Swedish in Finland, Åland region. |
| de_DE@collation=phonebook,currency=pre-euro | German in Germany, with collation according to phonebook order, and currency in pre-Euro form. |
Language and territory codes follow ISO-639 and ISO-3166, respectively. Two-letter codes are used where they exist; otherwise three-letter codes are used. (See also the OpenI18N convention on locale naming, and the RFC 3066 standard for language tagging.)
The variant codes specify particular variants of the locale, typically with special options. For example, the variant "AL" specifies Åland, an autonomous region of Finland.
Options are key-value pairs which request alternate forms of the locale. The currently defined keys are collation, currency, and calendar.
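As an illustration of the four parts described above, the following Python sketch splits a locale ID into language, territory, variant, and options. The function name and the returned dictionary layout are hypothetical; the sketch assumes underscore-separated parts and a single "@" introducing comma-separated key=value options, as in the examples in the table.

```python
def parse_locale_id(locale_id):
    """Split a locale ID such as 'de_DE@collation=phonebook' into its parts.

    Assumption: parts are separated by '_' and options follow a single '@'
    as comma-separated key=value pairs.
    """
    base, _, option_text = locale_id.partition("@")
    parts = base.split("_")

    options = {}
    if option_text:
        for pair in option_text.split(","):
            key, _, value = pair.partition("=")
            options[key] = value

    return {
        "language": parts[0],                              # the only required part
        "territory": parts[1] if len(parts) > 1 else "",
        "variant": "_".join(parts[2:]),                    # empty when absent
        "options": options,
    }


print(parse_locale_id("sv_FI_AL")["variant"])                    # AL
print(parse_locale_id("de_DE@collation=phonebook")["options"])   # {'collation': 'phonebook'}
```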
Below is an example <identity> element, which identifies the locale data as being part of the sv_FI_AL locale (that is, sv_FI_AL.xml).
<ldml>
<identity>
<version number="1.1">Various notes and changes</version>
<generation date="2002-08-28"/>
<language type="sv"/>
<territory type="FI"/>
<variant type="AL"/>
</identity>
</ldml>
Inheritance
Besides taking up space in the Repository, redundant data adds needlessly to the maintenance burden. The Locale Data Markup Language relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and territory locales only need to supply differences.
The parent of all of the language locales is a generic locale known as root. Wherever possible, the resources in the root are language and territory neutral.
Given a particular locale id "en_US_someVariant", the search chain for a particular resource is the following:
en_US_someVariant --> en_US --> en --> root
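The search chain above can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes that any "@" options have already been stripped from the locale ID.

```python
def fallback_chain(locale_id):
    """Return the locales to search, most specific first, ending at root."""
    if locale_id == "root":
        return ["root"]
    parts = locale_id.split("_")
    chain = []
    while parts:
        chain.append("_".join(parts))
        parts.pop()            # drop the most specific part and try again
    chain.append("root")
    return chain


print(fallback_chain("en_US_someVariant"))
# ['en_US_someVariant', 'en_US', 'en', 'root']
```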
In some cases, the searching is done within a resource. For example, with calendars (discussed below), all non-Gregorian calendars inherit their data from the Gregorian class.
Where this inheritance relationship is not supported by a target system, such as with POSIX, the data logically should be fully resolved in converting to a format for use by that system, by adding all inherited data to each locale data set.
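A minimal sketch of such full resolution follows. It assumes, purely for illustration, that each locale's data has been loaded into a flat dictionary of resource names to values; the names `resolve` and `bundles` and the sample values are hypothetical. Real locale data is nested, so a complete implementation would need a recursive merge.

```python
def resolve(locale_id, bundles):
    """Flatten inherited data: start from root, then overlay each more
    specific locale so that its values take precedence."""
    parts = locale_id.split("_")
    chain = ["root"] + ["_".join(parts[:i]) for i in range(1, len(parts) + 1)]
    resolved = {}
    for name in chain:
        resolved.update(bundles.get(name, {}))   # missing locales add nothing
    return resolved


# Hypothetical sample data, for illustration only.
bundles = {
    "root": {"decimal": ".", "group": ",", "infinity": "∞"},
    "es": {"decimal": ",", "group": "."},
    "es_MX": {"decimal": ".", "group": ","},
}
print(resolve("es_MX", bundles))
```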
In addition, the locale data does not contain general character properties that are derived from the Unicode Character Database (UCD). That data being common across locales, it is not duplicated in the repository. Constructing a POSIX locale from the following data therefore also requires the UCD data. POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
Aliasing
The contents of any element can be replaced by an alias, which points to another source for the data. The resource is to be fetched from the corresponding location in the other source.
The following example demonstrates a locale "zh_HK" which has a collation element aliased to "zh_TW". Both locales use Traditional Chinese collation, which has a considerable disk footprint.
<ldml>
<identity>
<language type="zh"/><territory type="HK"/>
</identity>
<collations>
<alias source="zh_TW"/>
</collations>
</ldml>
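A consumer of the data has to notice the alias before reading an element's contents. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name `alias_source` is hypothetical, and actually loading the other locale and fetching the corresponding element is left out.

```python
import xml.etree.ElementTree as ET


def alias_source(element):
    """Return the locale an element is aliased to, or None if it holds its own data."""
    alias = element.find("alias")
    return alias.get("source") if alias is not None else None


doc = ET.fromstring(
    '<ldml>'
    '<identity><language type="zh"/><territory type="HK"/></identity>'
    '<collations><alias source="zh_TW"/></collations>'
    '</ldml>'
)
print(alias_source(doc.find("collations")))   # zh_TW
print(alias_source(doc.find("identity")))     # None
```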
type attribute
Any element may have a type specifier, to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a default element of the form <default type="xxx">. The following example demonstrates multiple elements of different types used to select differing number formats.
<numberFormats>
<default type="scientific"/>
<numberFormatStyle type="decimal">...</numberFormatStyle>
<numberFormatStyle type="percent">...</numberFormatStyle>
<numberFormatStyle type="scientific">...</numberFormatStyle>
</numberFormats>
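The selection logic can be sketched as follows: use the requested type if one is given, otherwise fall back to the type named by the <default> element. The function name `select_by_type` is hypothetical, and the element contents in the sample are placeholders standing in for the elided patterns.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<numberFormats>
  <default type="scientific"/>
  <numberFormatStyle type="decimal">decimal placeholder</numberFormatStyle>
  <numberFormatStyle type="percent">percent placeholder</numberFormatStyle>
  <numberFormatStyle type="scientific">scientific placeholder</numberFormatStyle>
</numberFormats>
"""


def select_by_type(parent, tag, requested=None):
    """Pick the child named `tag` whose type matches the request, or the
    type named by <default> when no type is requested."""
    wanted = requested
    if wanted is None:
        default = parent.find("default")
        wanted = default.get("type") if default is not None else None
    for child in parent.findall(tag):
        if child.get("type") == wanted:
            return child
    return None


formats = ET.fromstring(SAMPLE)
print(select_by_type(formats, "numberFormatStyle").get("type"))             # scientific
print(select_by_type(formats, "numberFormatStyle", "percent").get("type"))  # percent
```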
The currently defined optional key/type combinations include:
| key | type | Description |
| collation | phonebook | For a phonebook-style ordering (used in German). |
| collation | pinyin | Pinyin order for CJK characters |
| collation | traditional | For a traditional-style sort (as in Spanish) |
| collation | stroke | Stroke order for CJK characters |
| collation | direct | Hindi variant |
| collation | posix | A "C"-based locale |
| calendar | gregorian | (default) |
| calendar | arabic | Astronomical Arabic |
| calendar | chinese | Traditional Chinese calendar |
| calendar | civil-arabic | Civil (algorithmic) Arabic calendar |
| calendar | hebrew | Traditional Hebrew Calendar |
| calendar | japanese | Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor) |
| calendar | thai-buddhist | Thai Buddhist Calendar (same as Gregorian except for the year) |
draft and standard attributes
Any element may be marked with draft="true" to indicate data that has not yet been verified. The following example shows an entire locale which is in draft stage:
<ldml draft="true"> … </ldml>
Similarly, the standard= attribute denotes any element with data designed to conform to a particular standard. It may be a single string, or a comma-separated list.
<collation standard="MSA 200:2002">
<dateFormatStyle type="decimal" standard="ISO8601,http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&amp;ICS1=1&amp;ICS2=140&amp;ICS3=30,DIN 5008">
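Reading these two attributes is straightforward. The sketch below, with hypothetical helper names, checks the draft flag on a single element and splits the standard attribute on commas. Whether a draft marking on a parent also applies to its children is not stated here, so the sketch looks only at the element itself.

```python
import xml.etree.ElementTree as ET


def is_draft(element):
    """True when the element itself is marked draft="true"."""
    return element.get("draft") == "true"


def standards_of(element):
    """Return the standards an element claims to conform to, as a list."""
    raw = element.get("standard", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


node = ET.fromstring('<collation standard="MSA 200:2002" draft="true"/>')
print(is_draft(node))        # True
print(standards_of(node))    # ['MSA 200:2002']
```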
<dates> element
This top-level element contains information regarding the formatting and parsing of dates and times. It has three sub-elements: <calendars>, <localizedPatternChars>, and <timeZoneNames>. See the Locale Data Markup Language specification for more details on the latter two.
<localizedPatternChars>
This sub-element contains translated replacements for date format pattern characters (e.g. ‘M’ for month) for display use.
<timeZoneNames>
This sub-element contains translated names of time zones.
<calendars>
This sub-element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar. The month names are identified numerically, starting at 1. The day names are identified with short strings, since there is no universally accepted numeric designation.
Many calendars differ from the Gregorian calendar only in the year and era values. For example, the Japanese calendar has many more eras (one for each Emperor), and the years are numbered within that era. All other calendars inherit from the Gregorian calendar (which must be present), so only the differing data will be present. Calendars are distinguished by the ‘type’ attribute, which identifies which class of calendar it is, such as Gregorian, Japanese, and so on.
The following example shows a condensed Gregorian calendar definition, and a portion of the Japanese calendar definition for comparison:
<dates>
<calendars>
<calendar type="gregorian">
<monthNames>
<month type="1">January</month>
<month type="2">February</month>
</monthNames>
<dayNames>
<day type="sun">Sunday</day>
<day type="mon">Monday</day>
</dayNames>
<eras>
<eraAbbr>
<era type="0">BC</era>
<era type="1">AD</era>
</eraAbbr>
</eras>
<dateFormats>
<default type="medium"/>
<dateFormatLength type="full">
<dateFormat>
<pattern>EEEE, MMMM d, yyyy</pattern>
</dateFormat>
</dateFormatLength>
<dateFormatLength type="medium">
<default type="DateFormatsKey2"/>
<dateFormat type="DateFormatsKey2">
<pattern>MMM d, yyyy</pattern>
<displayName>DIN 5008 (EN 28601)</displayName>
</dateFormat>
<dateFormat type="DateFormatsKey3">
<pattern>MMM dd, yyyy</pattern>
</dateFormat>
</dateFormatLength>
</dateFormats>
<timeFormats>
…
<pattern>h:mm:ss</pattern>
…
</timeFormats>
</calendar>
<calendar type="japanese">
<eras>
<eraAbbr>
<era type="0">Showa</era>
<era type="1">Heisei</era>
</eraAbbr>
</eras>
</calendar>
</calendars>
</dates>
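The inheritance from the Gregorian calendar can be sketched as a two-step lookup: try the requested calendar first, then fall back to Gregorian. The function name `calendar_lookup` is hypothetical, and the sample is a trimmed, well-formed version of the example above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<dates><calendars>
  <calendar type="gregorian">
    <monthNames><month type="1">January</month><month type="2">February</month></monthNames>
    <eras><eraAbbr><era type="0">BC</era><era type="1">AD</era></eraAbbr></eras>
  </calendar>
  <calendar type="japanese">
    <eras><eraAbbr><era type="0">Showa</era><era type="1">Heisei</era></eraAbbr></eras>
  </calendar>
</calendars></dates>
"""


def calendar_lookup(dates, calendar_type, path):
    """Look up `path` in the requested calendar, then in the Gregorian one."""
    for name in (calendar_type, "gregorian"):
        calendar = dates.find(f"calendars/calendar[@type='{name}']")
        if calendar is None:
            continue
        node = calendar.find(path)
        if node is not None:
            return node.text
    return None


dates = ET.fromstring(SAMPLE)
print(calendar_lookup(dates, "japanese", "eras/eraAbbr/era[@type='1']"))   # Heisei
print(calendar_lookup(dates, "japanese", "monthNames/month[@type='1']"))   # January
```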
<numbers> element
This element supplies information for formatting and parsing numbers and currencies. The <symbols> element gives information about the textual representation of individual components of a formatted number, such as digits, separators, and signs.
<symbols>
<decimal>.</decimal>
<group>,</group>
<list>;</list>
<percentSign>%</percentSign>
<nativeZeroDigit>0</nativeZeroDigit>
<patternDigit>#</patternDigit>
<plusSign>+</plusSign>
<minusSign>-</minusSign>
<exponential>E</exponential>
<perMille>‰</perMille>
<infinity>∞</infinity>
<nan>?</nan>
</symbols>
Patterns for formatting and parsing numbers are contained under the <decimalFormats>, <scientificFormats>, <percentFormats>, and <currencyFormats> elements. Each of these elements has a similar structure. For example, <decimalFormats> contains one or more <decimalFormatLength> elements. These are distinguished by the type attribute, which describes a pattern length such as short, medium, or long.
<decimalFormats>
<default type="long"/>
<decimalFormatLength type="long">
<decimalFormat>
<pattern>#,##0.###;-#,##0.###</pattern>
</decimalFormat>
</decimalFormatLength>
<decimalFormatLength type="short">
<decimalFormat>
<pattern>#,##0;-#,##0</pattern>
</decimalFormat>
</decimalFormatLength>
</decimalFormats>
The semicolon ";" separates positive and negative patterns.
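A minimal sketch of splitting such a pattern follows. The function name is hypothetical. The fallback used when no negative pattern is present (prefixing the positive pattern with a minus sign) is an assumption borrowed from common number-formatting practice, not something stated here, and quoted semicolons inside a pattern are not handled.

```python
def split_pattern(pattern):
    """Split a number pattern into (positive, negative) subpatterns."""
    positive, separator, negative = pattern.partition(";")
    if not separator:
        # Assumption: with no explicit negative pattern, use '-' + positive.
        negative = "-" + positive
    return positive, negative


print(split_pattern("#,##0.###;-#,##0.###"))   # ('#,##0.###', '-#,##0.###')
print(split_pattern("#,##0"))                  # ('#,##0', '-#,##0')
```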
<currencyFormats>
<currencyFormatLength type="medium">
<currencyFormat>
<special xmlns:ooo="http://www.openoffice.org">
<ooo:msgid="FixedFormatstype9"/>
<ooo:usage="FIXED_NUMBER" formatindex="4"/>
</special>
<pattern>¤ #,##0.00;(¤ #,##0.00)</pattern>
</currencyFormat>
</currencyFormatLength>
</currencyFormats>
In the currency case, the international currency symbol, ¤, is replaced with the national currency symbol located in the appropriate <currencies> element. Information about which currency is the default for a given locale is not stored in the locale, but is in a separate "supplemental" data component.
<currencies>
<currency type="USD">
<displayName>dollar</displayName>
<symbol>$</symbol>
</currency>
<currency type="JPY">
<displayName>yen</displayName>
<symbol>¥</symbol>
</currency>
</currencies>
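The substitution can be sketched as below, assuming the placeholder in the pattern is the currency sign ¤ (U+00A4). The function name is hypothetical, and falling back to the currency code when no symbol is listed is an assumption made for the sketch.

```python
import xml.etree.ElementTree as ET

CURRENCY_SIGN = "\u00a4"   # the generic currency placeholder

CURRENCIES = ET.fromstring(
    '<currencies>'
    '<currency type="USD"><displayName>dollar</displayName><symbol>$</symbol></currency>'
    '<currency type="JPY"><displayName>yen</displayName><symbol>¥</symbol></currency>'
    '</currencies>'
)


def localize_pattern(pattern, currencies, code):
    """Replace the currency placeholder with the symbol listed for `code`."""
    currency = currencies.find(f"currency[@type='{code}']")
    # Assumption: use the currency code itself when no symbol is available.
    symbol = currency.findtext("symbol", default=code) if currency is not None else code
    return pattern.replace(CURRENCY_SIGN, symbol)


print(localize_pattern("\u00a4 #,##0.00;(\u00a4 #,##0.00)", CURRENCIES, "USD"))
# $ #,##0.00;($ #,##0.00)
```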
<collations> element
The <collations> element contains one or more <collation> elements, and provides information about linguistic collation (sorting) of text. The base (root) locale is defined to have collation behavior according to the Unicode Collation Algorithm (UTS #10), and all other locales have collation rules which are defined in terms of tailorings (deltas) relative to the UCA.
Below is a partial example taken from the Swedish tailorings, which defines characters that sort following ‘Z’.
<collation>
<base UCA='3.1.1'/>
<settings caseLevel="on"/>
<rules>
<reset>Z</reset>
<p>æ</p>
<t>Æ</t>
<t>aa</t>
<t>aA</t>
<t>Aa</t>
<t>AA</t>
...
</rules>
</collation>
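To make the structure of the rules easier to read, the sketch below converts them into a compact one-line rule string. It relies on two assumptions that come from general knowledge of LDML and ICU rather than from this text: that <p>, <s>, <t>, and <i> mark primary, secondary, tertiary, and identical differences, and that these are commonly written as "<", "<<", "<<<", and "=" after an "&" reset. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from rule elements to ICU-style relation operators.
OPERATORS = {"p": "<", "s": "<<", "t": "<<<", "i": "="}


def to_rule_string(rules):
    """Render the children of a <rules> element as a one-line rule string."""
    tokens = []
    for node in rules:
        if node.tag == "reset":
            tokens.append("& " + node.text)
        elif node.tag in OPERATORS:
            tokens.append(OPERATORS[node.tag] + " " + node.text)
    return " ".join(tokens)


rules = ET.fromstring("<rules><reset>Z</reset><p>æ</p><t>Æ</t><t>aa</t></rules>")
print(to_rule_string(rules))   # & Z < æ <<< Æ <<< aa
```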
<special xmlns:yyy="xxx"> element
The <special> element may occur anywhere, and allows for arbitrary additional annotation and data that is platform-specific. It has one required attribute, xmlns, which specifies the unique XML namespace of the special data.
The following example demonstrates the inclusion of transform (transliteration) data, which is used by ICU but is not part of the Locale Data Markup Language spec. The DOCTYPE declaration must be at the top of the locale file, and specifies that the "ldmlICU.dtd" definition must be considered for parsing.
<!DOCTYPE ldml SYSTEM "http://www.openi18n.org/spec/ldml/1.0/ldml.dtd" [
<!ENTITY % icu SYSTEM "http://www.openi18n.org/spec/ldml/1.0/ldmlICU.dtd">
%icu;
]>
<ldml> …
<special xmlns:icu="http://oss.software.ibm.com/icu/">
<icu:transforms>
<icu:transform type="Latin">
α &lt; a ; Α &lt; A ;
β &lt; v ; Β &lt; V ;
</icu:transform>
</icu:transforms>
</special>
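A namespace-aware parser keeps such platform-specific data separate from the standard elements. The sketch below reads the transform types with Python's xml.etree.ElementTree; the function name is hypothetical, and the sample omits the DOCTYPE so that no DTD has to be fetched.

```python
import xml.etree.ElementTree as ET

ICU_NS = {"icu": "http://oss.software.ibm.com/icu/"}

SAMPLE = (
    '<ldml>'
    '<special xmlns:icu="http://oss.software.ibm.com/icu/">'
    '<icu:transforms>'
    '<icu:transform type="Latin">α &lt; a ; Α &lt; A ;</icu:transform>'
    '</icu:transforms>'
    '</special>'
    '</ldml>'
)


def transform_types(ldml):
    """List the type of every ICU transform found in <special> elements
    directly under the root."""
    found = ldml.findall("special/icu:transforms/icu:transform", ICU_NS)
    return [node.get("type") for node in found]


print(transform_types(ET.fromstring(SAMPLE)))   # ['Latin']
```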
Other elements
For more detail about these elements, please see the Locale Data Markup Language specification.
<displayName>
a translated name that can be presented to users when discussing the particular service, for example, in a GUI
<delimiters>
common delimiters for bracketing, such as quotation marks
<characters>
information about the characters commonly used in the locale, and other information helpful in picking among character encodings
<layout>
specifies general document-layout features
<localeDisplayNames>
translated names for scripts, languages, countries, and variants
<measurement>
specifies the measuring system in use, for example, "metric"