The first step in DITA localization is to translate the actual content of your DITA files. The second step is to address DITA localization requirements for your output. This article provides an in-depth explanation of the localization support in the DITA Open Toolkit.
The DITA Open Toolkit (DITA OT) includes several DITA localization features. When you set up your publishing system (and whenever you add new languages), you need to do the following:
- Check the language-specific strings files
- Ensure that language- or locale-specific images are accessible
- Select typefaces for the target language
(Most of the information in this post applies to all versions of the DITA Open Toolkit. Information about specific file paths applies to the DITA OT version 1.8.)
Check the language-specific strings files
When generating output, the DITA OT inserts text strings, such as “Chapter” or “Appendix”, types of admonitions (“Note”, “Warning”, “Caution”), text and slogans on the cover pages, and copyright messages. When the output is intended for a specific language, these pieces of text must match the output language. You want “Chapter 4” to render as “Capítulo 4” in Spanish, as “Chapitre 4” in French, or as “第4章” in Japanese.
To handle this, the strings used by the DITA OT are externalized, that is, they are stored in language-specific files that are separate from the rest of the XSL transforms. Each language (or language and locale) has one or more separate files. Usually, a core plugin provides a base set of strings, then plugins that are built on that core plugin can add their own strings. Within these files, each string has an identifier, which is not translated, and the string itself.
A large number of these strings are provided by the core DITA OT. For HTML-based transforms, the DITA OT supplies strings files for over 50 languages and locales; for PDF, support for 14 languages is included.
The default translated strings may not meet your needs. The wording may not match the word choice, tone, emphasis, or punctuation your organization requires. Also, the PDF strings files are not consistently populated; not all of the strings in the English strings files are translated in the strings files for other languages.
Additionally, there may be some strings for which there are no definitions in the core plugin strings files.
Work with your localization team to check the locale-specific strings files provided by the DITA OT. You may have to do this for the strings used by both the core HTML and the core PDF plugins. If the editor or language checker recommends a change, you (or the localizer) should:
- Identify the strings in the core strings files that you need to change.
- Copy the elements that define those strings to the corresponding plugin strings file.
- Change the string definition in the copied element to the new string.
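The three steps above could be scripted roughly as follows. This is a minimal sketch, assuming the HTML strings layout (a <strings> root element containing <str name="..."> elements); the function name override_strings and the sample Spanish strings are hypothetical and are not part of the DITA OT.

```python
import xml.etree.ElementTree as ET

def override_strings(core_xml, plugin_xml, overrides):
    """Copy selected <str> elements from the core file into the plugin file.

    `overrides` maps a string identifier (name attribute) to its new text.
    Identifiers that do not exist in the core file are ignored.
    """
    core = ET.fromstring(core_xml)
    plugin = ET.fromstring(plugin_xml)
    existing = {s.get("name"): s for s in plugin.findall("str")}
    for source in core.findall("str"):
        name = source.get("name")
        if name not in overrides:
            continue
        target = existing.get(name)
        if target is None:
            # Step 2: copy the definition into the plugin strings file.
            target = ET.SubElement(plugin, "str", {"name": name})
        # Step 3: change the copied definition to the new string.
        target.text = overrides[name]
    return ET.tostring(plugin, encoding="unicode")

# Hypothetical sample data, for illustration only.
CORE = ('<strings xml:lang="es-es">'
        '<str name="Note">Nota</str><str name="Caution">Precaución</str>'
        '</strings>')
PLUGIN = '<strings xml:lang="es-es"></strings>'

result = override_strings(CORE, PLUGIN, {"Caution": "Atención"})
print(result)
```

The core file is only read, never modified, so the override lives entirely in the plugin.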
When generating output for new localizations, check the DITA OT log file for missing string errors. These will be in the target “transform.topic2fo.main” with the task identifier “[xslt]”. If you find that there are missing strings, you’ll need to add them to the plugin strings file, using the English definitions as a basis for the translation.
File structure for HTML strings files
As of DITA OT version 1.8, the language-specific strings files for the core HTML-based transforms are stored in %DITA-OT%/xsl/common. The file names are in the form strings-xx-yy.xml, where xx-yy is the language identifier as defined by IETF RFC 4646 and implemented by the ISO 639-1 language codes (this is the same language code as used in the xml:lang attribute). An additional file strings.xml (in the same folder) lists the language files that are currently in use.
Each HTML strings file has the form:
<?xml version="1.0" encoding="utf-8"?>
<strings xml:lang="xx-yy">
  <str name="identifier">Text inserted into the output</str>
</strings>
Note that the file’s root element (<strings>) contains the xml:lang attribute, which specifies the language (as does the name of the file). Within the root element are one or more <str> elements. Each <str> element has a unique identifier (name attribute); contained in the <str> element is the text that is pushed into your output. The contents of the name attribute should NEVER be translated.
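To make the structure concrete, here is a small sketch that reads an HTML strings file into a lookup table. It assumes only the layout described above; the French sample strings and the function name read_strings are illustrative, not taken from the DITA OT.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample file content, for illustration only.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<strings xml:lang="fr-fr">
  <str name="Chapter">Chapitre</str>
  <str name="Note">Remarque</str>
</strings>"""

def read_strings(xml_bytes):
    """Return (language, {identifier: text}) for one strings file."""
    root = ET.fromstring(xml_bytes)
    table = {s.get("name"): (s.text or "") for s in root.findall("str")}
    return root.get(XML_LANG), table

language, table = read_strings(SAMPLE.encode("utf-8"))
print(language, table)
```

The identifiers (the dictionary keys) stay the same across languages; only the values differ from file to file.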
The file strings.xml has the form:
<?xml version="1.0" encoding="utf-8"?>
<lang xml:lang="xx-yy" filename="strings-xx-yy.xml"/>
The strings.xml file contains one lang element for each supported language.
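Adding a language to strings.xml could look like the sketch below. It assumes that the lang elements sit directly under a root element named langlist (an assumption not stated in this post) and that new files follow the strings-xx-yy.xml naming pattern; register_language is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def register_language(index_xml, lang_code):
    """Add a <lang> entry for lang_code unless one is already present."""
    root = ET.fromstring(index_xml)
    registered = {e.get(XML_LANG) for e in root.findall("lang")}
    if lang_code not in registered:
        ET.SubElement(root, "lang", {
            XML_LANG: lang_code,
            "filename": f"strings-{lang_code}.xml",
        })
    return ET.tostring(root, encoding="unicode")

# Assumed root element name: langlist.
INDEX = ('<langlist>'
         '<lang xml:lang="en-us" filename="strings-en-us.xml"/>'
         '</langlist>')

updated = register_language(INDEX, "ja-jp")
print(updated)
```

Calling the function again with the same language code leaves the list unchanged, so it is safe to run repeatedly.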
File structure for PDF string files
As of DITA OT version 1.8, the strings files for the core PDF-based transforms are stored in %DITA-OT%/plugins/org.dita.pdf2/cfg/common/vars. The file names are in the form xx.xml, where xx is the language identifier as defined by IETF RFC 4646 and implemented by the ISO 639-1 language codes.
Each PDF strings file has the form:
<?xml version="1.0" encoding="UTF-8"?>
Each file contains one or more <variable> elements. Each <variable> element has a unique identifier (id attribute); contained in the <variable> element is the actual string. Some PDF strings may include one or more parameters which allow the transform to insert text into the strings. For example, the Italian strings file contains this entry for a figure title:
<variable id="Figure"> Figura <param ref-name="number"/>: <param ref-name="title"/></variable>
Note that the variable id attribute and the param element’s ref-name attribute should NEVER be translated.
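The following sketch illustrates how parameters fill in a string. It is not how the DITA OT performs the substitution (the toolkit does this in its XSL transforms); it only mimics the effect on a single entry based on the Italian example. The function render_variable and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

def render_variable(variable_xml, params):
    """Replace each <param ref-name="..."/> with the matching value."""
    variable = ET.fromstring(variable_xml)
    parts = [variable.text or ""]
    for child in variable:
        if child.tag == "param":
            parts.append(params[child.get("ref-name")])
        # Text that follows a child element is stored in its tail.
        parts.append(child.tail or "")
    return "".join(parts)

ITALIAN = ('<variable id="Figure">Figura <param ref-name="number"/>: '
           '<param ref-name="title"/></variable>')

print(render_variable(ITALIAN, {"number": "3", "title": "Schema"}))
# prints "Figura 3: Schema"
```

Because the parameters are looked up by ref-name, a translator can reorder them within the string, but renaming them would break the lookup.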
Make sure the translator understands that their job is only to translate the contents of the <str> or <variable> elements. They should not translate the attributes (apart from modifying the contents of the xml:lang attribute), nor should they translate the comments (any text surrounded by “<!--” and “-->”).
Additionally, the strings may contain spaces or non-breaking spaces (usually represented with a character reference such as “&#160;”). These should remain just as they are in the original (as much as possible).
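A returned HTML strings file can be checked against these rules automatically. The sketch below compares a translated file with its source and reports changed identifiers and changed non-breaking space counts. The function names and the English and German samples are hypothetical, and a different non-breaking space count is only a warning, since some languages legitimately need different spacing.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"

def load(xml_text):
    """Map each identifier to its string text."""
    root = ET.fromstring(xml_text)
    return {s.get("name"): (s.text or "") for s in root.findall("str")}

def check_translation(source_xml, translated_xml):
    """Return a sorted list of (identifier, problem) tuples."""
    source, translated = load(source_xml), load(translated_xml)
    problems = []
    for name in source.keys() - translated.keys():
        problems.append((name, "identifier missing from translation"))
    for name in translated.keys() - source.keys():
        problems.append((name, "identifier not in source"))
    for name in source.keys() & translated.keys():
        if source[name].count(NBSP) != translated[name].count(NBSP):
            problems.append((name, "non-breaking space count changed"))
    return sorted(problems)

# Hypothetical samples: the translator renamed "Note" to "Hinweis" and
# replaced a non-breaking space with a regular space.
ENGLISH = ('<strings xml:lang="en-us">'
           '<str name="Figure">Figure&#160;</str><str name="Note">Note</str>'
           '</strings>')
GERMAN = ('<strings xml:lang="de-de">'
          '<str name="Figure">Abbildung </str><str name="Hinweis">Hinweis</str>'
          '</strings>')

for name, problem in check_translation(ENGLISH, GERMAN):
    print(f"{name}: {problem}")
```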
Most strings files contain comments and notes to the translator. In particular, some strings files contain paths to images; most of these are accompanied by a note NOT to translate the paths.
Additionally, the strings files may contain URLs for partner organizations or language- or locale-specific web sites. You may want to examine the contents of the strings files and determine which URLs should be made locale-specific and which should be left untouched.
When the strings files are returned from the translator, add the translated (and renamed) strings file to the plugin folders as described.
For HTML-based plugins you must also:
- Ensure that the translator correctly modified the xml:lang attribute on the <strings> element in the file containing the translated strings.
- Update the plugin-specific strings.xml file so that it contains a reference to the translated strings file. (You should run the integrator after updating this file.)
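Both HTML checks can be automated. The sketch below is one possible approach, assuming the strings-xx-yy.xml naming pattern and a strings.xml whose root element is named langlist (an assumption); check_html_strings is a hypothetical helper, not a DITA OT tool.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def check_html_strings(filename, strings_xml, index_xml):
    """Return a list of problems for one translated HTML strings file."""
    problems = []
    match = re.fullmatch(r"strings-(.+)\.xml", filename)
    expected = match.group(1) if match else None
    declared = ET.fromstring(strings_xml).get(XML_LANG)
    if expected is None:
        problems.append("file name is not in the form strings-xx-yy.xml")
    elif declared is None or declared.lower() != expected.lower():
        problems.append(f"xml:lang is {declared!r}, expected {expected!r}")
    # Collect every file name referenced by a <lang> element.
    listed = {e.get("filename") for e in ET.fromstring(index_xml).iter("lang")}
    if filename not in listed:
        problems.append(f"{filename} is not referenced in strings.xml")
    return problems

# Hypothetical samples: xml:lang was left as en-us and strings.xml was
# not updated.
STALE = '<strings xml:lang="en-us"><str name="Note">Nota</str></strings>'
INDEX = ('<langlist>'
         '<lang xml:lang="en-us" filename="strings-en-us.xml"/>'
         '</langlist>')

print(check_html_strings("strings-es-es.xml", STALE, INDEX))
```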
For PDF-based plugins you must also:
- Ensure that all strings in the English strings file exist in the strings file for your localization. If they don’t, you’ll need to provide these strings in your plugin’s strings files.
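The PDF completeness check amounts to comparing the sets of variable identifiers in the two files. The sketch below is a minimal example; it assumes the variable elements sit under a root element named vars in the namespace shown (both are assumptions, so adjust them to match your files), and the function names are hypothetical. Matching on the local element name lets the code work with or without a namespace.

```python
import xml.etree.ElementTree as ET

def variable_ids(vars_xml):
    """Collect the id of every <variable>, ignoring any XML namespace."""
    root = ET.fromstring(vars_xml)
    return {
        el.get("id")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "variable"
    }

def missing_variables(english_xml, localized_xml):
    """Identifiers defined in English but absent from the localization."""
    return sorted(variable_ids(english_xml) - variable_ids(localized_xml))

# Hypothetical samples; root element and namespace are assumptions.
NS = 'xmlns="http://www.idiominc.com/opentopic/vars"'
ENGLISH = (f'<vars {NS}>'
           '<variable id="Figure">Figure <param ref-name="number"/></variable>'
           '<variable id="Table">Table <param ref-name="number"/></variable>'
           '<variable id="Note">Note</variable>'
           '</vars>')
ITALIAN = (f'<vars {NS}>'
           '<variable id="Figure">Figura <param ref-name="number"/></variable>'
           '<variable id="Note">Nota</variable>'
           '</vars>')

print(missing_variables(ENGLISH, ITALIAN))
# prints ['Table']
```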
Ensure that locale- or language-specific images are available
Just as the DITA OT inserts strings into output when necessary, it can also insert icons and other images as required; for example, icons for admonitions (notes and hazard statements) and company logos in page headers or footers.
Most icons and images are intended for use in all languages, but sometimes a locale or language requires specific icons. Examples include:
- Icons or images that include language-specific text
- Icons or images that are culturally sensitive
What do you have to do?
If you need to substitute images based on the output language, do the following:
- Ensure that locale- or language-specific image files are available in the appropriate artwork folder
- Ensure that the paths to the output location of these image files are saved as strings in the language-specific strings files. Generally, the path to each image will be the same except for the file name.
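Both checks can be combined into a script that reads the image paths from a strings file and confirms that each file exists. This is a sketch under several assumptions: that image paths are stored as the text of <str> elements, that they are relative to a single base folder, and that they can be recognized by their file extension. The string name warning-icon, the file names, and the function missing_images are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

IMAGE_EXTENSIONS = (".png", ".gif", ".jpg", ".svg")

def missing_images(strings_xml, base_dir):
    """Return (identifier, path) for each image path with no file on disk."""
    missing = []
    for s in ET.fromstring(strings_xml).findall("str"):
        value = (s.text or "").strip()
        if not value.lower().endswith(IMAGE_EXTENSIONS):
            continue  # ordinary text string, not an image path
        if not os.path.isfile(os.path.join(base_dir, value)):
            missing.append((s.get("name"), value))
    return missing

# Hypothetical strings file with one image path and one text string.
STRINGS = ('<strings xml:lang="ja-jp">'
           '<str name="warning-icon">artwork/warning-ja.png</str>'
           '<str name="Note">注</str>'
           '</strings>')

with tempfile.TemporaryDirectory() as base:
    before = missing_images(STRINGS, base)
    # Simulate delivering the localized artwork.
    os.makedirs(os.path.join(base, "artwork"))
    open(os.path.join(base, "artwork", "warning-ja.png"), "wb").close()
    after = missing_images(STRINGS, base)

print(before, after)
```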
Select typefaces for the target language
To generate PDF files, the transforms need typeface specifications. The DITA OT allows us to define classes of typefaces (“logical fonts”) that are associated with specific types of text. For instance, you might define that your body text uses a serif font, titles use a heavy-weight sans serif font, and that running heads use a lighter form of that same sans serif font.
Each of the logical fonts is associated with a physical font. The physical fonts are often determined by the style guidelines for your company or organization; they ensure that your information products project a consistent look and feel.
The fonts you select must support all characters used by the target localization.
If you are creating a localization for a language that requires extensive use of a non-Western character set, you may need to:
- Identify typefaces that are associated with your organization’s look and feel in specific locales.
- Specify how those typefaces are to be associated with specific text applications. That is, the fonts that will be used for body text, titles, heads, and so on.
When localizing your DITA content, remember that DITA OT plugins do contain localized information. The strings, images, icons, and fonts that are a part of your final work products must be translated or localized with the same care and cultural sensitivity as your content.
For more information about localizing your plugins, localizing your content, or developing a content strategy to facilitate localization, contact us at www.scriptorium.com/contact-us.