Why publishing architecture matters in localization
“It’s not about the tools.” Except when it’s totally about the tools.
If your content is going to be translated, you need to understand your publishing tool’s localization support. Here are some issues to consider:
Which languages do you need?
Assuming that your first language is English, you can create some broad categories for language translation:
- Western (European) languages, which use the Latin alphabet, such as English and FIGS (French, Italian, German, Spanish). Usually the first group of languages that an organization translates into.
- CJK (Chinese, Japanese, Korean). Languages that use thousands of characters, which require much larger font files. Often referred to as “double-byte” languages because legacy encodings need two bytes per character, rather than the single byte that is enough for Western languages.
- Eastern European languages, including Russian and other Slavic languages, Hungarian, and Turkish. Some require a non-Latin character set such as Cyrillic.
- Other Asian languages, such as Thai and Vietnamese, which use complex scripts. Certain letter combinations change the glyph that is required; similar to the ff or fi ligatures in English, but much more extensive.
- Right-to-left languages, such as Arabic and Hebrew.
XML and HTML can theoretically handle all of these languages, but some authoring and publishing tools cannot.
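As a rough illustration of why these categories behave differently, the Python sketch below checks how many bytes one sample character from each group takes in UTF-8 and UTF-16, and which text direction Unicode assigns to it. The sample characters and the `describe` helper are my own picks for illustration; they are not taken from any publishing tool.

```python
import unicodedata

# One sample character per category (arbitrary picks for illustration).
SAMPLES = {
    "Western": "é",
    "CJK": "日",
    "Cyrillic": "Ж",
    "Thai": "ก",
    "Arabic": "ع",
    "Hebrew": "א",
}

def describe(char):
    """Return (UTF-8 byte count, UTF-16 byte count, bidi class) for one character."""
    utf8_len = len(char.encode("utf-8"))
    utf16_len = len(char.encode("utf-16-le"))  # "-le" avoids counting a byte-order mark
    return utf8_len, utf16_len, unicodedata.bidirectional(char)

for group, char in SAMPLES.items():
    print(group, char, describe(char))
```

Characters whose bidi class comes back as `R` or `AL` belong to right-to-left scripts, so a tool has to handle layout direction as well as fonts and encoding.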
Template-based publishing
In a template-driven workflow, you create a formatting template that spells out page size, fonts, paragraph and character styles, table styles, and so on. Templates are fantastic in a single-language workflow, but as you add languages, you need a copy of the template for each language, and this quickly becomes an overwhelming maintenance problem.
Take a simple example: a note paragraph.
NOTE: This is a note.
For this blog post in WordPress, I have hard-coded “NOTE:” by typing it in and applying bold. But in a template-driven tool, I would create a style called note and specify that the note paragraph should always begin with the word “NOTE:”.
And now the fun begins. In a German template, I need to replace “NOTE” with “HINWEIS”. The rest of the template is largely identical; I’m using the same fonts and most paragraphs have the same definitions in English and German. But because I need to adjust the note, caution, and warning paragraphs (and change the word “Chapter” to “Kapitel”), I have to make a copy of the entire template document.
Basic changes can become unmanageable very quickly.
Localization string files
The current best practice for multiple-language output is to use string files. These are text files, usually XML, that separate the language-specific items from the common formatting. This allows you to create a single formatting specification that references language-dependent information as appropriate.
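Here is a minimal sketch of the idea in Python. The XML layout, the key names (`note`, `chapter`), and the `load_strings` and `render_note` helpers are all hypothetical; they illustrate the separation and do not reproduce the string-file format of any particular tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical string files, one per language. In a real workflow these would
# be separate files on disk; they are inlined here to keep the sketch runnable.
STRING_FILES = {
    "en": '<strings xml:lang="en">'
          '<str name="note">NOTE:</str><str name="chapter">Chapter</str>'
          '</strings>',
    "de": '<strings xml:lang="de">'
          '<str name="note">HINWEIS:</str><str name="chapter">Kapitel</str>'
          '</strings>',
}

# A single formatting specification shared by every language.
NOTE_STYLE = {"label_key": "note", "label_weight": "bold"}

def load_strings(lang):
    """Parse one language's string file into a {name: text} dictionary."""
    root = ET.fromstring(STRING_FILES[lang])
    return {el.get("name"): el.text for el in root.findall("str")}

def render_note(body, lang):
    """Look up the language-specific label and prepend it to the note body."""
    label = load_strings(lang)[NOTE_STYLE["label_key"]]
    return f"{label} {body}"

print(render_note("This is a note.", "en"))
print(render_note("Dies ist ein Hinweis.", "de"))
```

Adding a language in this sketch means adding one small string file; `NOTE_STYLE` and the rendering code stay untouched.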
With string files, localization costs grow much more slowly than with individual template files for each language. There are still additional complications, such as configuring for right-to-left output or “unusual” requirements. (For example, many languages use “Figure 2” or similar, but Hungarian puts the number first: “2. Figure”. Just swapping out the word “Figure” doesn’t work for Hungarian.)
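One way to cope with the word-order problem is to store a whole label pattern per language instead of a single word. The sketch below assumes a hypothetical `FIGURE_PATTERNS` table and `figure_label` helper; the German and Hungarian strings are my own illustrative translations, not taken from a real string file.

```python
# Hypothetical label patterns, as they might appear in per-language string files.
# Each pattern decides for itself where the number goes.
FIGURE_PATTERNS = {
    "en": "Figure {number}",
    "de": "Abbildung {number}",
    "hu": "{number}. ábra",  # the number comes first in Hungarian
}

def figure_label(number, lang):
    """Build a figure label by filling the number into the language's pattern."""
    return FIGURE_PATTERNS[lang].format(number=number)

print(figure_label(2, "en"))  # Figure 2
print(figure_label(2, "hu"))  # 2. ábra
```

Because the placeholder travels with the translated text, the formatting code never needs to know which languages put the number first.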
Gruesome technical details, available in two formats to accommodate different learning styles:
- Webcast on Localization and the DITA Open Toolkit (Simon Bate)
- White paper on Localization and the DITA Open Toolkit (Simon Bate)
Vinish Garg
This post made me recall my days with Basware where I worked with a dedicated localization team. We used Arbortext Editor for authoring and Multilizer for localization, and XML strings were forever part of the day. I am not sure if any CMS offers this kind of localization support using XML support. I guess not.