00:00 Sarah O’Keefe: Welcome to the Content Strategy Experts podcast brought to you by Scriptorium. Since 1997, Scriptorium has helped companies manage, structure, organize, and distribute content in an efficient way.
In episode 21 of this podcast, we wonder, is Google Translate good enough? Hi everyone, I’m Sarah O’Keefe. I am hosting this episode and I am here to talk with Bill Swallow. Hey Bill.
00:24 Bill Swallow: Hello.
00:25 SO: So Bill, is Google Translate good enough?
00:30 BS: The normal candid answer would be it depends. And it really comes down to what we’re really talking about here. Is it the Google web form or is it the API? And really, we’re starting to talk about the latter. Years ago when Google just had Translate up as a single web form, it was pretty abysmal. They were still using a very old model for machine translation, lots of specific one-to-one matching going on there. So the translations came out almost laughable. But now, especially as Google has adopted neural processing, which is basically AI or machine learning, the results are starting to improve dramatically. But again, I’ll fall back to my rote answer of it depends, because there are some major things that you need to be aware of when you’re translating content, especially with machine translation.
01:35 SO: Okay, so when we say Google Translate, what we’re really saying is machine translation or maybe on-demand machine translation, particularly the kind of stuff that we’re seeing from Google Translate, and particularly the engines that are based on neural technology, which is sort of the latest and greatest thing, as opposed to the statistical or phrase-based machine translation that was the standard until pretty recently, right?
02:04 BS: Google still uses the statistical model for a lot of the languages, but they have about seven or eight that are now in the neural model. So a lot of that is most of your Romance languages and I believe Chinese is in there. But the model, it really varies as to what you’re going to get based on which Google service you’re using. So for example a lot of them are available on the web form, but you’re kind of limited to whatever you can copy and paste into that form. There is an API that allows you to basically do on-the-fly website translation, and there’s also another API where you can feed content behind the scenes for, say, larger publications, so books or a large online help system or what have you.
03:00 SO: Okay. So let’s assume that you’ve done your research and you’ve figured out that you’re using the latest and greatest engine, and it is neural, and the language pair that you’re dealing with is in that what you’re describing as providing better quality results. What are some of the factors that you need to look at when you’re deciding whether or not a machine translation, Google Translate approach is good enough or not?
03:26 BS: So there are three main factors when producing content for machine translation. One of them is audience. Another one is subject matter and the third is content quality.
03:38 SO: Okay, so let’s look at each of those. Let’s talk about audience. What kinds of audiences would you look at and decide are machine translation appropriate and what kinds of audiences are not?
03:52 BS: So the best audience for machine translation is those who, and I’m going to use a jargony keyword, need gisting, so they just need to get the general idea of what it is you’re trying to convey. And sometimes the audience will be okay with gisting, and sometimes they won’t. So there’s also a cultural factor that comes into play. And third, it really does depend on what types of information they are consuming. So a generic web user looking for information about a product or service probably won’t care that the translation isn’t 100%, so that would be fine. But someone who is looking for very specific information or is trying to complete a task or something where they need very specific information, or if it’s very targeted information that speaks to them personally, like a lot of marketing material, they may not look upon machine translation well. They’ll be able to tell that it’s been machine translated and not done by a human.
05:03 SO: Right. A couple of examples of this that I use are if I’m reading a news article and I stumble upon a Norwegian newspaper that happens to have an article in Norwegian on a topic that I’m interested in, I will happily and gratefully settle for the machine translated version, which will give me the general idea and some comprehension of the article. I don’t expect a Norwegian newspaper to translate… To do high-quality professional translation for me because I’m not their target audience. I’m just trying to get to some information that I happen to be interested in because I care about reindeer or whatever. And then, the other example I’ve seen is an interesting one where you might machine translate a summary or an abstract of a longer document so that I as the professional can read that abstract, for example of a patent, and say, “Oh, I need to know more about this patent that’s in Japan so that I can really understand what that patent is all about.” But I’ll take a machine translation of the abstract or the summary, and then decide whether I’m going to go ask for a professional translation of that entire patent document.
06:21 BS: That second scenario actually, you can give or take there, and it really comes down to I guess the second aspect of content, which is the subject matter. So if your abstract is highly technical, then I would caution against using machine translation just to get that across, because I’ve seen many cases where the terms are translated into something that is kind of right, but not exactly correct. And if you’re particularly looking at medical translation, there are a lot of very similar terms that mean very different things in the greater context of the medical field. So getting that term wrong could potentially throw people off or lead to either the wrong diagnosis or the wrong research.
07:15 SO: Right. But if, for example, as a medical professional, I’m doing just some research, continuing education, I might read that machine translated abstract and decide, “Oh, this is in my area of expertise, or this is something I’m interested in. Now let me go get it translated properly so that I can really understand what’s going on.” But I wouldn’t be terribly upset if then the professional translation came back and I discovered that, “Oh, this wasn’t what I thought it was. Move on to the next one.” But so in terms of subject matter, I think what you’re saying is that some subject matter lends itself or is less risky in machine translation, and some subject matter is more risky.
07:56 BS: Exactly. I mean, the more general the topic, the easier it is to write about it, the easier it is to get that translated because you’re not using heavily technical terms, you’re not using industry jargon, you’re not using corporate jargon. But once you start injecting those things and get down into very specific terminology, the machine translation may or may not have or may not know what the correct term might be, and it might pull something very similar or it might try doing a literal translation of a very specific term and get results that are laughable coming out, if not offensive.
08:34 SO: Right. Now, what about the risk side of things when you get into information that’s related to health and safety?
08:42 BS: Well, the risk there is you could possibly be giving a medical professional the wrong information. And there are a lot of, especially around… In the field of cardiology, for example, there are many, many similar terms, similar sounding terms. They all have the same Latin roots for example, but they all mean something or refer to something completely different. And getting those wrong without at least a check after the translation’s been done by your machine translation, it could literally be a life-or-death situation.
09:21 SO: So information that people are using to make life-or-death decisions, or for that matter, information that if used incorrectly leads to things like electric shock or worse, if you don’t plug things in correctly the result might be unpleasant.
09:38 BS: Unpleasant to, yeah, [laughter] something worse.
09:41 SO: Let’s go with unpleasant. [laughter]
09:44 SO: Okay, so in terms of subject matter, machine translating a game might not be as big a deal because if you get it wrong, well, your character dies, but too bad. But not you.
10:00 BS: True, but in a game you also have a lot of custom terminology, a lot of that which is made up specifically for the game, and making sure that those are translated into something understandable kind of matters. Even little things like currency or things like, I don’t know, items that you might pick up along the way in the game, if they’re translated poorly, people are going to notice that. Now, they may be a little bit more forgiving because they’re primarily interested in the game and they can figure it out as they go. But when you sit down and read some of the hastily translated video game literature, it’s almost laughable.
10:40 SO: Well, and that really takes us to your third point, content quality requirements. So what kinds of audiences require very high quality, and what kinds of audiences are, as you said, more forgiving?
10:57 BS: Well, definitely, the more professional the audience, the more they want to make sure that they’re getting something of quality. Likewise, anything that’s being marketed toward people, generally, you want to have the idea that the product is speaking to you or the company is speaking to you and not just hoping that you understand what they’re talking about. But also on the flipside, if you’re a company trying to go into many different markets, you want to make sure that you’re being understood correctly, and you want to make sure that your information is correct if it absolutely needs to be correct. And you can’t rely on dodgy translation to make or break your entry into a market.
11:41 SO: Is there a distinction that you make between B2B business products, something that people buy and use at work versus something that people… We’ve talked about games, and my feeling about a game is if the translation is bad or if the text in-game is distracting because it’s badly translated, I might just give up on the game and go do something else.
12:08 BS: You could, or it could become one of the first Internet memes of all time. [laughter] “All your base are belong to us.” But yeah, there is a quality of experience there. And if that is something that matters to either the audience or the company producing it, then absolutely you need to be mindful of your content quality. And one thing that, as a company, you need to start doing is managing your terminology well, making sure that the correct term is being used all the time in the correct context, and being able to supply correct translations for those. And part of that, if you’re using a general translation service, so you’re working with people, you generally supply them with some kind of a translation glossary where they have the terms. They have the approved translations of the terms and contextual definitions and so forth. So they understand how they need to translate things as they go.
13:15 BS: But a machine doesn’t have the luxury of referencing something like that. So you need what we call a corpus, a large body of information that feeds the machine learning engine. So you need to make sure that you’re including all this information in your corpus for that machine to basically chug through and learn what these terms mean. So that means your terminology has to be correct, it has to be used correctly in context. It has to repeat over and over again in various contexts that are slightly different, or where a slightly different treatment applies. And then you still need to train this machine over time and hope that it gets it right eventually.
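The glossary idea discussed above can also be applied as a check on machine output. What follows is a minimal sketch, not a description of how Google Translate or any real translation tool works: the `GlossaryEntry` type, the `find_term_violations` function, and the sample English-to-Spanish glossary are all hypothetical names and data made up for illustration. It flags segments where a source term appears but its approved translation does not, using plain substring matching (an assumption that ignores inflection and word order).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryEntry:
    source_term: str      # term as written in the source language
    approved_target: str  # approved translation of that term


def find_term_violations(source, translation, glossary):
    """Return glossary entries whose source term appears in `source`
    but whose approved translation is missing from `translation`.

    Assumption: simple case-insensitive substring matching is enough
    for this illustration; real terminology checks need more than that.
    """
    src = source.casefold()
    tgt = translation.casefold()
    return [
        entry
        for entry in glossary
        if entry.source_term.casefold() in src
        and entry.approved_target.casefold() not in tgt
    ]


# Hypothetical glossary and segments, for illustration only.
glossary = [
    GlossaryEntry("power supply", "fuente de alimentación"),
    GlossaryEntry("circuit breaker", "disyuntor"),
]
source = "Unplug the power supply before resetting the circuit breaker."
machine_output = "Desenchufe la fuente de poder antes de reiniciar el disyuntor."

violations = find_term_violations(source, machine_output, glossary)
for entry in violations:
    print(f"Check term: {entry.source_term!r}, expected {entry.approved_target!r}")
```

A check like this only tells a reviewer where to look; it does not fix the translation or train the engine.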
14:04 SO: So we’ve talked a lot about medical content, the risks that you’re incurring there, and some of the special challenges that it might present due to health and safety issues. We haven’t touched on regulation, but I think probably the fact that it’s usually regulated content is also an issue. So that one seems kind of high on the list of things not so great for machine translation. What are some of the industries or areas where you think machine translation will get a stronger foothold, where the risks aren’t as great, people are more accepting? What does that look like? What kind of industries do you think that would be?
14:51 BS: Definitely consumer products outside of marketing pitches. So looking at information particularly around specifications for products and so forth, schematics, background information, certainly any kind of social content that’s tied to it supplied by readers, so user comments and so forth. Those are fairly safe to machine translate either yourself or… Either as a reader doing it on your own using a web form or whatnot, or supplying some kind of a general gisting for people so that they can kind of follow along what other people are saying about your product. And generally content that’s… [chuckle] I hate to say it but content that’s not going to cause a big problem. It’s not going to cause someone to get injured or killed. It’s not going to create some kind of offensive situation.
15:53 SO: Cause your power grid to go down.
15:56 BS: Power grid to go down, ’cause you’re… [chuckle] If you’re talking about government content, certainly making sure you’re not upsetting the people you’re trying to engage. Yeah, things like that.
16:11 SO: So that’s an interesting point because we touched on audience, but we didn’t really talk about sort of the emotional connection. Is it fair to say that when you machine translate the result you get is not going to really connect with your audience, that that’s one of the things you’re giving up?
16:31 BS: Yes and no. [laughter] We go back to depends.
16:34 SO: It depends.
16:36 BS: Right. The languages that are being folded into neural machine translation, they’re becoming more human as far as the translations go. Those connections are starting to be made. Now are they being made 100%? Probably not, but generally yes. If you’re really trying to connect to someone on some kind of an emotional level or whatnot, marketing content is a perfect example of that. You’re trying to appeal to someone’s emotions, someone’s likes, dislikes and so forth, their appeal. And getting that wrong can turn them completely off. So in that case it’s critical to get that right. And it’s not to say that machine translation isn’t a way to get you there, but you would probably have to build in a lot of proofreading and editing after the fact.
17:37 SO: So actually, in a recent article you created this hierarchy of localization needs and talked about how there’s this concept of minimum viable localization, getting to the point where you have something that is at least good enough.
17:54 BS: Right.
17:56 SO: And I already know that the answer is it depends, [laughter] but can you get there? Can you get to a minimum viable localization with machine translation?
18:06 BS: Minimum viable, yes, because that just assumes that the content is available, so it’s out there, that it’s accurate, which you can get to in some languages, and that it’s generally appropriate. So if your subject matter lends itself to being easy to translate, you can get there with those first three tiers of the hierarchy. But the last two are really hard to get to without some kind of a human touch. We’re talking about tailored content and content that feels organic, that it’s being produced by someone, by an actual person for you specifically. And it’s really hard to get to that point. It’s hard to get to that point writing it from scratch anyway, and to have a machine be able to infer what you mean in one language and convey that in another, I don’t think we’re quite there yet.
19:06 SO: Okay. So if somebody is thinking about adding a machine translation component into their localization strategy, what are sort of your parting words of advice on that?
19:19 BS: I would say definitely take a look at the types of content you have in play and take a look at who’s consuming it and how it’s being consumed and make some decisions right there as far as what will be machine translated 100%, what will be machine translated and cleaned up afterwards, and what absolutely should not go through machine translation. I can’t get more specific than that because everyone’s case is going to be a little bit different, but have those three buckets that you are essentially filing your content or your information architecture into, to figure out what’s going where, what’s safe to translate using a machine, and what’s not safe to translate.
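The three buckets could be sketched as a simple triage function. This is only an illustration under stated assumptions: the `Bucket` and `ContentItem` types, the three yes/no flags, and the rules in `assign_bucket` are hypothetical, loosely based on the audience, subject matter, and content quality factors from the conversation. As noted above, every organization's real criteria will differ.

```python
from dataclasses import dataclass
from enum import Enum


class Bucket(Enum):
    MACHINE_ONLY = "machine translate as-is"
    MACHINE_PLUS_REVIEW = "machine translate, then human cleanup"
    NO_MACHINE = "do not machine translate"


@dataclass(frozen=True)
class ContentItem:
    name: str
    safety_or_regulated: bool  # errors could cause harm or regulatory trouble
    specialized_terms: bool    # heavy technical, industry, or corporate jargon
    emotional_appeal: bool     # e.g. marketing copy meant to connect personally


def assign_bucket(item: ContentItem) -> Bucket:
    """Assign a content item to one of the three buckets.

    The rules are illustrative assumptions, not a recommendation:
    highest-risk content is kept away from machine translation,
    jargon-heavy or emotionally targeted content gets human cleanup,
    and everything else is treated as gisting-friendly.
    """
    if item.safety_or_regulated:
        return Bucket.NO_MACHINE
    if item.specialized_terms or item.emotional_appeal:
        return Bucket.MACHINE_PLUS_REVIEW
    return Bucket.MACHINE_ONLY


# Hypothetical inventory, for illustration only.
inventory = [
    ContentItem("user comments", False, False, False),
    ContentItem("marketing landing page", False, False, True),
    ContentItem("cardiology device instructions", True, True, False),
]

for item in inventory:
    print(f"{item.name}: {assign_bucket(item).value}")
```

Putting the safety check first means a single high-risk flag overrides everything else, which matches the order of concern in the discussion.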
20:05 SO: And I think that makes sense. Okay, so I think we’ll leave it there. Couple of notes to wrap up. Don’t forget about LearningDITA Live, that’s our free online conference happening in late February and early March. We’re going to have four days of sessions for beginner through advanced DITA users. That does include a session from Bill on localization issues in DITA. So if you’re a DITA user and you’re concerned about localization, that would definitely be one to sign up for. Again, it’s free, you can find all the details at learningdita.com and I hope we will see you there.
20:42 SO: Thank you for listening to the Content Strategy Experts podcast brought to you by Scriptorium. For more information, please visit scriptorium.com or check the show notes for relevant links.