Full transcript of best practices for localizing DITA content podcast
00:00 Bill Swallow: Welcome to The Content Strategy Experts podcast brought to you by Scriptorium. Since 1997, Scriptorium has helped companies manage, structure, organize and distribute content in an efficient way. In episode 19, we discuss best practices for localizing DITA content.
00:19 BILL: Hi everyone, I’m Bill Swallow and I’m the Director of Operations at Scriptorium.
00:24 Simon Bate: And I’m Simon Bate, I’m Senior Technical Consultant at Scriptorium.
00:28 BILL: And today we’re going to talk a bit about some of the best practices to follow when localizing DITA content. One of the best ways to approach DITA localization is to think about it as a software development project. Even though you’re dealing with content, there are a few things that map over to what software developers do when they are localizing applications. And one of the big things that they do is they internationalize their code base. And what this means is, essentially, they take all of their translatable content and any text input that goes into the application, and they remove it from the source code and place it in what’s called a resource file. And DITA works pretty much the same way where you have a strings file that’s kept outside of your normal DITA content. And that is used by the Open Toolkit during transformations to produce notes, cautions, warnings and other things that have labels of that type. This way that text can be translated outside of the DITA content itself. But there are several considerations for translating the content, the DITA content itself. So I guess, Simon, do you want to talk a little bit about some of the things that authors can do to get their content ready for localization as well?
02:04 SIMON: Yes. One of the first things that authors need to think about when creating content is to make sure that all topics and maps and book maps use the xml:lang attribute. That is xml: “L-A-N-G.” And this attribute specifies the language and locale that your content is being authored in. So, typically, if you’re authoring in US English, you’re going to set xml:lang to en-US. Then later, once this content is sent off to your localizers, it’s their responsibility then to change that to identify the language and locale that’s used for their translations.
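To make the xml:lang point concrete, here is a minimal Python sketch that reads the attribute from a topic's root element and shows what a localizer's change would look like. The sample topic and the function names `root_language` and `set_root_language` are hypothetical, made up for illustration; only the standard library XML parser is used.

```python
import xml.etree.ElementTree as ET

# The reserved "xml" prefix always maps to this namespace URI, so the
# parser exposes xml:lang under this qualified attribute name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample topic authored in US English.
SAMPLE_TOPIC = """<topic id="install" xml:lang="en-US">
  <title>Installing the widget</title>
  <body><p>Unpack the box.</p></body>
</topic>"""


def root_language(dita_source):
    """Return the xml:lang value on the root element, or None if it is missing."""
    root = ET.fromstring(dita_source)
    return root.get(XML_LANG)


def set_root_language(dita_source, lang):
    """Return the document as a string with xml:lang on the root set to lang."""
    root = ET.fromstring(dita_source)
    root.set(XML_LANG, lang)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(root_language(SAMPLE_TOPIC))
    print(root_language(set_root_language(SAMPLE_TOPIC, "de-DE")))
```

A check like `root_language(...) is None` could be run over every topic, map, and bookmap before handoff, so that files with no language set are caught before they reach the translators.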
02:49 BILL: Right. So, for example, if translating to German for Germany, it would be de-DE. But if it’s for German in the US it would be de-US.
03:04 SIMON: Mm-hmm. Or perhaps a more germane example might also be, say, Swiss German which would be de-CH. So, one of the things we found with translated content that we see, sometimes the localizers aren’t aware that they actually do need to set or change the xml:lang attribute. And so, a really good thing you can do as a content creator is you can actually make sure you specify xml:lang. That way when the translators get the content they’re aware that that’s something there that they need to change. What we sometimes see is, a content creator will not use xml:lang and then when the translators get it, they’re unaware that they need to add that to the content. So it comes back translated but it doesn’t specify what the language is. One other thing to say about the xml:lang attribute is it’s… Bill was mentioning the strings file. It’s the xml:lang attribute that drives, in the DITA Open Toolkit, the selection of which set of strings files to use for which language.
04:13 BILL: So the xml:lang attribute really tells us what language a particular file is in. But what types of things are available to the authors within the DITA content itself aside from that setting?
04:29 SIMON: Oh, in addition to xml:lang, there are two other attributes that authors will use, and the most common one is the translate attribute. And essentially, the translate attribute, you can use, you can put on just about any content. And what it does is, it’ll tell the translator whether this content should be translated or not. So you can imagine if you’re doing, say, some document and you have quotes and the quotes are very germane in the language that they’re specified in, you don’t want the translators translating that. Or perhaps, more specifically to the types of things that we as technical writers do, you might want to use the translate attribute and say “translate no code samples”. And that will ensure that the translators know that the code sample should not be translated. The other attribute, and this one actually typically is applied to a topic as a whole, and that’s the dir or D-I-R attribute. And that tells whether the content is a left to right or right to left language.
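As a rough illustration of how a tool might honor the translate attribute, the Python sketch below collects only the text that should go to a translator. The sample topic and the function name `translatable_segments` are hypothetical. The sketch assumes the translate value is inherited by child elements until one of them overrides it, and it ignores any per-element defaults a real translation filter might apply.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample topic: the code sample is marked translate="no".
SAMPLE = """<topic id="api" xml:lang="en-US">
  <title>Calling the service</title>
  <body>
    <p>Run the following command:</p>
    <codeblock translate="no">curl --help</codeblock>
    <p>Then restart.</p>
  </body>
</topic>"""


def translatable_segments(dita_source):
    """Return text segments, in document order, not marked translate="no"."""
    root = ET.fromstring(dita_source)
    segments = []

    def walk(element, inherited):
        # Assumption: the nearest ancestor's setting applies unless overridden.
        flag = element.get("translate", inherited)
        if flag != "no" and element.text and element.text.strip():
            segments.append(element.text.strip())
        for child in element:
            walk(child, flag)
            # Tail text follows the child but belongs to this element.
            if flag != "no" and child.tail and child.tail.strip():
                segments.append(child.tail.strip())

    walk(root, "yes")
    return segments


if __name__ == "__main__":
    for segment in translatable_segments(SAMPLE):
        print(segment)
```

The code sample's text never appears in the result, which is the behavior you would want to confirm with your translation vendor's own filter.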
05:41 BILL: So that would be particularly useful to set, especially if you’re translating into Arabic or Hebrew.
05:47 SIMON: That’s correct.
05:49 BILL: So when we send content out for translation then, with all of these different attributes and so forth, it’s important to make sure that your translators really understand how to work with this content. And a lot of the tools that are available for translators now understand DITA content and have a filter that can basically be applied to this content to make it intelligible for them as they work within their various tools. But there are a couple of things to… I guess, to remember doing before you start sending DITA content out to translators and one of them is to definitely give them a heads-up and make sure that they can handle it. But when handing off the files, it’s often good to make sure that you note any of these attributes that you’re using just in case their system isn’t set up by default to look for them. A lot of times all of these attribute flags are kind of ignored by default. So, they get all the strings and all the content, which allows them to do a solid match on the content that’s been translated before, but you really want to ensure that they turn on understanding of these attributes. So if something says, “translate=no” to make sure to obscure that text when they actually go in to start translating.
07:18 SIMON: Right. And I think it’s important to start your conversation with your translation agency or make sure that they’re aware of DITA and they’re familiar with DITA and what’s involved in translating the DITA content.
07:34 BILL: Right. And it’s important to remember that the translators themselves aren’t working within your DITA files and they’re not working within DITA at all. They’re working within a CAT tool, a computer-aided translation tool, such as Trados, where they input the DITA file and basically it creates a table or a side by side of your source material and the target language area where they type in their translation. So, they’re not necessarily working with tags and they don’t need to know the code, but the software needs to be configured to be able to understand that code and hide things that shouldn’t be translated and show things that should be translated.
08:22 SIMON: What about working with keys in content? I’m thinking particularly keys, but keys and conrefs. The reason I think of keys is that they’re often used for substituting short strings or pieces of content. What will the translators see when they get a DITA file or content from DITA that uses keys?
08:45 BILL: Usually, that information comes across much like your strings file will. So, those values will be presented in short chunks outside of the content itself. There’s no real smart rendering on the translator side that consumes conrefs, or in cases where you’re sucking content into another topic, it’s not going to show those pieces as they’re translating. Those will reside in the files that they reside in and the translator can translate there. So it’s important to communicate with your translators and make sure they understand that some of this content is being used elsewhere and to not freak out if they see missing content, if they’re using a rendered PDF, for example, to proof against. So, it’s important for them to really understand that there are many, many fragmented components to translating DITA content, and that if you’re using keys, that the values for those keys are going to be translated outside of the content in which they’re being used.
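The following Python sketch illustrates why a translator sees key values apart from the sentences that use them. The sample map, the sample topic, the key name `product-name`, and both function names are hypothetical; the point is only that the translatable text for the key sits in the map, while the topic carries an empty reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical map: the key's text is defined here, outside any topic.
SAMPLE_MAP = """<map>
  <keydef keys="product-name">
    <topicmeta><keywords><keyword>WidgetPro</keyword></keywords></topicmeta>
  </keydef>
</map>"""

# Hypothetical topic: the title only points at the key.
SAMPLE_TOPIC = """<topic id="intro">
  <title>About <ph keyref="product-name"/></title>
</topic>"""


def key_values(map_source):
    """Map each key name to the keyword text defined for it in the map."""
    root = ET.fromstring(map_source)
    values = {}
    for keydef in root.iter("keydef"):
        keyword = keydef.find("topicmeta/keywords/keyword")
        if keyword is not None and keyword.text:
            # A single keydef may declare several space-separated key names.
            for key in keydef.get("keys", "").split():
                values[key] = keyword.text
    return values


def unresolved_keyrefs(topic_source):
    """List the key names a topic refers to without containing their text."""
    root = ET.fromstring(topic_source)
    return [el.get("keyref") for el in root.iter() if el.get("keyref")]


if __name__ == "__main__":
    print(key_values(SAMPLE_MAP))
    print(unresolved_keyrefs(SAMPLE_TOPIC))
```

A list like the one `unresolved_keyrefs` returns could accompany the handoff package as context, telling the translator where each short string ends up.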
10:02 SIMON: So the translator doesn’t necessarily see the context in which it is being used?
10:06 BILL: No. So in many cases—and this holds true for developers who are translating software strings as well—it’s very important to provide that context with the translated content or the content for translation to the translators. This way, they have that understanding of where this content resides, how it’s being used, and why it’s in a separate file.
10:33 SIMON: Is there a way to communicate that information? Will comments say, that appear in a keys file, will that get presented to the translators?
10:44 BILL: Yes and no. If you instruct them to show those comments, especially if it’s being held in something like a draft comment, or a translation-based comment, then you can definitely instruct them to turn that on and just set it to “translate=no”, or at least turn the flag back to yes or just not use the flag, and just instruct them that these are just notes for them. There’s really no one particular way to send that information over that’s better than others. It really depends on how your translators prefer to work and what tools they’re using and what those tools can do and not do. So it’s best to have this dialogue with your translators well before you start sending them content for translation to make sure they understand what you require as the customer, and so that they can make sure you understand what they require as that service provider who’s providing that translation. And if they have questions, if they need to do some testing, they can do that ahead of time and not when you’re under the gun to get the translations done.
12:00 BILL: So I know myself, I’ve seen a lot of interesting issues pop up with regard to translation of really any XML-based content. One in particular to watch out for, I think, and maybe we can call this the “gotchas section” of the podcast. But one thing to look out for is just a flat, “Yes, we can handle this file format with no additional questions or need for testing” from your translator. I’ve seen a case where a translator’s… The translation company basically said, “Oh, DITA. No problem, we do this all the time.” And they translated a bunch of DITA content, and when they gave it back to us, it was completely malformed. And we started taking a look at the raw DITA content, and we were seeing a lot of stuff that looked oddly like FrameMaker markup in the XML code.
12:58 BILL: So, we got back in touch with the translator and it turned out that, yes, they were bringing the DITA files that were sent over into structured FrameMaker and then flattening the file to an unstructured format, translating the content, and then popping it back in reverse. I don’t know how they did that but this is the workflow that they chose to do. And they did this, ironically, because they didn’t want to upgrade their translation software. And their translation software couldn’t handle DITA content in raw format, so they just went to another format that their tool could use. That was a big no-no. [chuckle] And that’s since been fixed. [chuckle] So yeah, Simon, have you seen other weird things like this?
13:48 SIMON: Well, I can’t think of any anecdotes like yours right off hand. But one observation I’d like to present is just the… One of the promises of DITA is that it reduces your desktop publishing costs and your desktop publishing costs when going to different languages. And this is true and it fits very well with this whole thing of… It fits very well with the translators not necessarily knowing how to deal with DITA content. But one of the things you do have to keep in mind when dealing with translated DITA content is, how are you going to get your output? And this particularly has to do with PDF files. There are some concerns when you’re generating HTML type output, but some of it has to do with PDF. And one of the real big things in PDF output is fonts. So, one thing I have seen quite a number of times, and this is particularly when going to translating Asian languages, is the customer will start out saying, “Oh, yeah. The translation looks good.” “Oh. Except we’re getting a whole bunch of empty spaces or a whole bunch of boxes and things.” And what’s happening is they have their own corporate font or they’ve selected some ancient, ancient font that has almost no Unicode characters and the translated content comes back in Unicode and does not fit. There are no code points in the font for the characters they’re using.
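The font problem described here can be checked before publishing. The Python sketch below is a minimal illustration that assumes you already have the set of code points a font covers, obtained from some font inspection tool that is not shown. The function name `missing_characters` and the `ASCII_ONLY` coverage set are hypothetical.

```python
def missing_characters(text, covered_code_points):
    """Return characters in text that the font cannot render, in first-seen order."""
    missing = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace does not need a visible glyph
        if ord(ch) not in covered_code_points and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical coverage: a font that only contains printable ASCII.
ASCII_ONLY = set(range(0x20, 0x7F))


if __name__ == "__main__":
    print(missing_characters("Install 設定", ASCII_ONLY))
```

Any character this returns is one that would show up as a box or a blank in the PDF.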
15:28 SIMON: So we’ll see content, it’ll run through the Open Toolkit beautifully, it’ll come out and you look at the pages and you’ll just see holes and things where those characters should be. So, one of the things that’s really important to keep in mind when approaching one of these translation projects, is just making sure you have the right fonts that will support all of your translation. There are several other things about fonts and translation you do need to keep in mind. The first to come to my mind is just that, in Asian languages, using italics and even using bold fonts, is not really an acceptable way of adding emphasis to text. It can actually be seen as doing a bad thing for the language to take that font and then bend it as you would with an italic. So one of the things you do need to consider is, if you’re going to these languages and you have content that needs to be rendered in italics or needs to be presented as, say a variable, you need to consider how do you want that to be rendered when it’s presented in your PDF.
16:40 BILL: Yeah, it’s very important not just in the PDF but in any format and being able to plan for that ahead of time, so that you have a mechanism for being able to toggle between an italic setting for English and for Romance languages and to be able to switch that flag to something else for whatever other language requirements you have.
17:02 SIMON: That’s correct.
17:04 BILL: I mean, how would you best go about doing that? Would you do it with filtering or use keys or what have you seen in the past?
17:15 SIMON: The best way to deal with it is, well, for PDFs specifically, and I don’t want to get really down into the weeds with how FO processing works and things…
17:26 BILL: Oh, God, no. [chuckle]
17:26 SIMON: But essentially, FO works in… We use XSL-FO to generate PDF files. And FO has a formatting language that is very much like CSS. And so, these files are called attribute sets. As we’re setting up the CSS for HTML output, we can also take that same information and move it over to the attribute sets that we set up for the PDF output. In both of these cases, we can modify the CSS or the attribute set that’s being applied to the content based on the language. So that’s one direct way to control it, is if you can pick another typographical means of representing this information, say a change in font color, say adding underscores, there’s a number of different things we can do. You can implement that from the CSS or attribute sets. And that way there’s no actual processing that has to go on.
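The idea of switching the emphasis style by language can be sketched as a simple lookup. The Python below is not how CSS or XSL-FO attribute sets are written; it only models the decision they encode. The table `EMPHASIS_BY_LANGUAGE`, its values, and the function name `emphasis_style` are hypothetical choices for illustration.

```python
# Hypothetical house style: no italics for these languages, use color instead.
EMPHASIS_BY_LANGUAGE = {
    "ja": {"font-style": "normal", "color": "#003366"},
    "zh": {"font-style": "normal", "color": "#003366"},
    "ko": {"font-style": "normal", "color": "#003366"},
}

# Fallback for every language not listed above.
DEFAULT_EMPHASIS = {"font-style": "italic"}


def emphasis_style(xml_lang):
    """Pick emphasis properties from the primary subtag of an xml:lang value."""
    primary = xml_lang.split("-")[0].lower()
    return EMPHASIS_BY_LANGUAGE.get(primary, DEFAULT_EMPHASIS)


if __name__ == "__main__":
    print(emphasis_style("en-US"))
    print(emphasis_style("ja-JP"))
```

Because the choice depends only on the language value, authors keep marking emphasis the same way in every language and the stylesheet decides how it looks.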
18:36 BILL: Excellent. And less for the writer to worry about as well.
18:40 SIMON: Oh, absolutely. They just create their content and all of the worries or the concerns about how it’s handled in output is handled one time only in the transform.
18:56 BILL: Excellent.
18:57 SIMON: So, I’ll refer again to… Bill’s talking about the strings file. The strings files are used in many places in the plugins in the Open Toolkit. And the plugins are what you use to add additional output functionality, like you have a PDF plugin, you’ll have an HTML plugin. And when you develop your own plugins, you can also add strings to the strings files. And in addition to actually adding strings, you can override the existing strings. And this is an important point because the DITA Open Toolkit itself comes with translated strings for about 50 different languages. And they chose to use particular words in each of those languages for the translations of strings. Now those words, say for chapter, figure, table, those things may not be quite so arguable, but there may be other things that appear in the strings file that your translators may require or want to translate a different way.
20:07 SIMON: And so you can actually go through the strings files, you can read those and you can then modify them in the strings files for your plugins. Or that is, you can override them in the strings files for your plugins. So this, of course, brings up one of the next points which is, once you’ve created a plugin, you have strings files and you need to run those strings files through the translators also. So as part of packaging up your DITA files… And of course this is just a one time thing, one time per plugin, you package up your strings files and send them off to the translators and they come back then, and you put them into the plugin and identify the language they’re used with.
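The override behavior described here amounts to a layered lookup: the plugin's strings win over the defaults for the same language. The Python sketch below models that idea only; it is not the Open Toolkit's implementation or file format. The tables `DEFAULT_STRINGS` and `PLUGIN_OVERRIDES`, the sample German labels, and the function name `label` are hypothetical, and the sketch simplifies language matching to the primary subtag of xml:lang.

```python
# Hypothetical default labels, keyed by primary language subtag.
DEFAULT_STRINGS = {
    "en": {"Note": "Note", "Caution": "Caution"},
    "de": {"Note": "Hinweis", "Caution": "Vorsicht"},
}

# Hypothetical plugin overrides: the translators preferred a different word.
PLUGIN_OVERRIDES = {
    "de": {"Caution": "Achtung"},
}


def label(name, xml_lang, defaults=DEFAULT_STRINGS, overrides=PLUGIN_OVERRIDES):
    """Return the generated label for name in the language given by xml_lang."""
    primary = xml_lang.split("-")[0].lower()
    # Later entries win, so plugin overrides replace the defaults.
    merged = {**defaults.get(primary, {}), **overrides.get(primary, {})}
    # Fall back to the untranslated name if the language has no entry.
    return merged.get(name, name)


if __name__ == "__main__":
    print(label("Caution", "de-DE"))
    print(label("Note", "de-CH"))
```

Strings that are not overridden keep their default translation, so a plugin only needs to carry the handful of labels your translators want changed.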
20:51 BILL: Cool. So it’s a one-stop shopping pretty much for translating those elements so that they can be applied over and over and over again.
21:00 SIMON: That’s right, once.
21:00 BILL: Very cool.
21:01 SIMON: Yeah.
21:04 BILL: Well, I think that brings us to a close on this podcast. Before we go, we wanted to let you know about a new free online conference happening early next year called “LearningDITA Live.” This is based on the learningdita.com e-learning resource for DITA. We’ll have four days of sessions for beginner through advanced DITA users. And localization will be one of the focus topics, so if that’s of interest to you, go to learningdita.com and sign up. We hope to see you there. Simon, thanks for joining me.
21:36 SIMON: Sure thing.
21:39 BILL: And thank you for listening to The Content Strategy Experts podcast brought to you by Scriptorium. For more information please visit scriptorium.com or check the show notes for relevant links.