Content conversion (podcast)
In episode 52 of the Content Strategy Experts podcast, Gretyl Kinsey talks with Mark Gross of DCL about content conversion. They explore some of the use cases they have seen and what partial conversion looks like.
About special guest, Mark Gross:
Mark Gross, President of DCL, is a recognized authority on XML implementation and document conversion. Mark’s company, DCL, which stands for “Data Conversion Laboratory,” provides data and content transformation services and solutions. Using the latest innovations in artificial intelligence, including machine learning and natural language processing, DCL helps businesses organize and structure data and content for modern technologies and platforms.
Mark has a BS in Engineering from Columbia University and an MBA from New York University. He has also taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML.
Gretyl Kinsey: Welcome to The Content Strategy Experts Podcast brought to you by Scriptorium. Since 1997, Scriptorium has helped companies manage, structure, organize and distribute content in an efficient way. In episode 52, we talk about content conversion with special guest Mark Gross of DCL.
GK: Welcome to The Content Strategy Experts Podcast. I’m Gretyl Kinsey, and I am here with Mark Gross. Mark, how are you?
Mark Gross: I am good today. Thank you.
GK: Thank you so much for joining us on the podcast. And I wanted to just start off by asking you to tell us a little bit about DCL and what you do and kind of what differentiates you from other companies kind of in a similar space?
MG: Well, I guess the name says Data Conversion Laboratory and that’s what we do, but things have changed. We’ve been in business almost 40 years now. When we started, conversion meant mostly one thing. It meant, “I just got this new computer. I’ve got a new system, and I want to move old information from my old system onto the new system,” and that was a lot. When microcomputers first came along, people were coming off mainframes and minicomputers. But today everything runs on data, and that’s not so much a focus anymore, although it certainly is a focus.
MG: We’re living in a world where data, information, analytics, the economy just doesn’t run well without well curated, structured data. There’s so much content, so much data out there that you’re inundated. So today much of what our work is about, 20% or 25% of our work is still moving things from one computer to another. But a lot of the rest of the work is really in structuring the information that’s already out there. And I guess as much of our Scriptorium audience might be working with things like DITA and S1000D. Well, that’s very structured information, but most information out there isn’t. So what we do there is take information and we add the structured information into there. And that’s essentially what we’re doing today.
GK: And you’re definitely right that at Scriptorium a lot of the clients we work with are kind of either going into structure or already in some kind of structure. And so as far as their conversion needs, they may be looking at either something like moving from one structure to another. So for example, something like a homegrown XML to DITA, or they may be looking at something like going from just completely unstructured content into something more structured like XML. That’s kind of what we see as sort of the common needs that companies have for converting their content, especially if they have a large volume of it that’s completely unstructured and they need it to have that structure, then that’s when they would say, “Okay, we’ve got to convert all of this.” And I wanted to get your take on this as well and kind of ask you, what are some of the most common reasons that you see with the companies you’ve worked with? Why they need this kind of conversion?
MG: Okay. So the first things to talk about are the moving computers. But more of the time today is they’ve got information, and they’ve been collecting it, and they’ve been structuring … And it might be structured already, but the needs today are so much different, and the bar has been raised so much that information just isn’t in the form that it needs to be today for modern uses, for artificial intelligence, for transmitting information. It doesn’t have the right metadata. An example is the work we did for Elsevier and the Scopus database. So the Scopus database, it’s called the index to the world’s literature, scientific literature. And for the last 10 or 15 years, everything has been structured very tightly. It’s bibliography information, so you have authors’ names and publishers and the dates and all those things there.
MG: But going back more than 15 years, 15 years ago, the information wasn’t structured like that. It was just plain line information. Bibliography was just the way you see it in the back of a journal or on the back of a book. So that wasn’t good enough to be able to find things as quickly as they needed to be found. So what we did for them is we went back to all the material going back for another 15 years before that, and then used artificial intelligence and a bunch of very sophisticated software to go in and add the structure that should have been there, if it would have been correct in the first place. But 15 or 20 years ago, we didn’t know that we would need this anymore. So we went back and restructured and ordered information.
MG: I think there’s lots of cases like that. A company may have documents going back 20, 30, 40 years, which is still valuable, but the older material is sitting, probably not on paper but it might be, but it’d be sitting on microfilm, or it would be sitting in PDF files, which are not really structured files. So there’s a need to go back and upgrade all those materials to try to work with what’s needed today. The world contains billions of pages of information out there, so it’s a shame not to have access to them, and today we want access to all that information.
GK: Yeah, exactly. And that’s really one of the biggest use cases that we see as well. We’ll see companies saying all of our stuff is locked into an older format like PDF where it’s not as accessible. They can’t really put it on the web except a PDF for download, and it really just kind of restricts what they’re able to provide to their customers. And then they’ve got this demand for content to be able to be kind of parceled up and reused in different ways, and they really just don’t have that flexibility with it when it’s kind of stuck in an older and unstructured format.
MG: Yeah. And I think just one more example that fits this, I think is work that we’re doing for the New York Public Library now. The New York Public Library has the complete collection of copyright records going back to the 1800s when copyrights first started being put together in the United States, but all that was in books. So there’s hundreds and hundreds of books on shelves, and then all those were scanned. So now you have images of all those pages, and then they were OCRed automatically, and so there’s an OCR, but that’s still not very much use because the data itself is not structured. When you look at a copyright record, it contains a lot of information that’s really fielded, separated by commas and semicolons and other things like that. So you really can’t find anything. A full text search doesn’t do you much good.
MG: So what we’ve done now is gone back to all that material that’s already been … Images already exist of everything, but we’ve gone back and we’ve now taken the content out of that, and built that and tagged it and structured it, and built that into a database that a public library can now use and put out on the web. So there’s lots of information out there that can use more structuring.
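To make the idea of "fielded" text concrete, here is a minimal sketch in Python of turning one delimited line of OCR output into a structured record. The field layout (author; title; claimant; year) and the sample line are hypothetical, chosen only for illustration; they are not the actual format of the copyright records or the method DCL used.

```python
from dataclasses import dataclass


@dataclass
class CopyrightEntry:
    # Hypothetical fields; real copyright records carry different ones.
    author: str
    title: str
    claimant: str
    year: int


def parse_entry(line: str) -> CopyrightEntry:
    """Split a semicolon-delimited line into named fields."""
    parts = [part.strip() for part in line.split(";")]
    if len(parts) != 4:
        # Real OCR output is messy; flag such lines for review instead of guessing.
        raise ValueError(f"expected 4 fields, found {len(parts)}: {line!r}")
    author, title, claimant, year = parts
    return CopyrightEntry(author, title, claimant, int(year))


entry = parse_entry("Smith, John; The Long Road; Acme Press; 1923")
print(entry.title, entry.year)
```

Once each line is a record with named fields, a search can target the title or the year directly, which a full-text search over raw OCR output cannot do.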
GK: Absolutely. I want to talk about some more specific use cases that Scriptorium has seen with our clients when it comes to conversion and kind of get your take on some of them. So the first one is the idea of a partial conversion. And a couple examples we’ve seen of this would be something like a company maybe does a content audit, and they realize that some of their content is unstructured, does need to be converted and provided in multiple different formats to their customers. But then maybe they’ve got some other content that’s just out of date. It’s never going to be updated again anyway, and they maybe decide it’s not worth it to convert that content. We’ve seen similar cases where maybe there’s a certain amount of the content that’s the most important, and that’s what they’re updating most frequently, and then maybe they start with that, and then convert the rest later on a schedule that works better for them.
GK: So I wanted to ask you if you’ve seen cases like this where a company either does a partial conversion, or maybe starts with a partial conversion and kind of how common that is, and what kinds of use cases that you’ve seen there?
MG: Right. So certainly there’s a cost to conversion. I think companies have a fiduciary responsibility to think about what the return on investment is going to be in anything they do. So I think we see a lot of cases where there’s partial conversions, either because as you said, they just have a lot of stuff and decide not to do it all because they don’t need it all. And the other is they do want it all, but maybe it doesn’t have to be converted with all the bells and whistles in order to reduce the cost. It all comes down to your return on investment. So I think very often this really starts with a content audit. I think people at the organization have to look at and see what the value of what the material they have is.
MG: For example, an organization might find that only the … They want to convert the product manuals or their repair manuals for products that are 10 years or less and go from there. So that might be good enough. And the rest of it, well, moving it into something like DITA is relatively expensive. It’s dollars per page usually. So they’ll take that, the pieces that are more current and bring them up, while the rest of it can be left as just images and then done on a gradual basis as they go along.
MG: In other cases, it may make sense … A client of ours is the Optical Society of America. For their hundredth birthday, which was a couple of years ago, they wanted to convert the entire corpus of material going back to 1917, which on one hand you think, “Why would you need … Why would a physicist want material that’s a hundred years old?” On the other hand, it turns out that they’re right. There’s very valuable information there that wasn’t at all available before. So they chose to turn everything into top quality XML, everything in it. And they went for that over a period. They had a three-year program, and they did it over that period.
MG: In other cases, you know, there’s a cost to perfection. Sometimes you don’t need perfection. You just want to get 99% of the way there. So an example of that is the work we currently do for the US Patent Office. So that’s not a one-time conversion. That’s continuing information because they get five million pages a month of technical material coming into their facility. And until a few years ago, everything was imaged and everything was scanned, but it was just images. There wasn’t much you could do with it other than flip through them on a computer. But the cost of converting all that into XML the traditional way would have just been prohibitive.
MG: We came up with a completely automated approach. We take documents coming in … OCR on technical documents by itself, without correction, doesn’t work very well usually. But we came up with a computer vision approach that would clean up a page before it ever got to the OCR engine. It took off all the math and the tables and all those things, took it off so the page electronically just had text and white space. And then it went through an OCR engine, which produced better than 99% accuracy right out, without any correction. And then it got converted to XML with the automated tools. And then all the things we took out, the math that was left as images, was pulled back in and produced the document that was XML completely automatically, but it was only … They audited it at 99.6% accuracy. So their take on it was, “We want all these pages, and getting 99.6% at one-twentieth the cost is definitely worth it for us.” It may not be worth it in a publishing organization, but it definitely was worth it for them because the documents were still going to be looked at by patent examiners at some point.
MG: So I think every organization needs to think about what the return on investment is, and there are places in between other than … It doesn’t have to be everything. It doesn’t have to be zero. Some place in between there that really is the right decision.
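The automated approach described for the patent documents (set aside math and tables, run OCR on the text only, then pull the set-aside pieces back in as images) can be sketched roughly as follows in Python. This is a toy illustration under stated assumptions, not DCL's pipeline: the `Region` type, the `stand_in_ocr` function, and the `<page>`, `<p>`, and `<graphic>` element names are all hypothetical, and the region detection that a computer vision step would perform is assumed to have already happened.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Region:
    kind: str     # hypothetical labels: "text", "math", "table"
    payload: str  # stand-in: page text for "text", image file name otherwise


def stand_in_ocr(region: Region) -> str:
    """Placeholder for a real OCR engine; it simply returns the stored text."""
    return region.payload


def page_to_xml(regions: list[Region]) -> str:
    # Step 1: set aside non-text regions, remembering where they were.
    held_back = {i: r for i, r in enumerate(regions) if r.kind != "text"}
    # Step 2: run OCR only on what is left, the plain text regions.
    recognized = {i: stand_in_ocr(r) for i, r in enumerate(regions) if r.kind == "text"}
    # Step 3: rebuild the page in order, reinserting held-back regions as images.
    page = ET.Element("page")
    for i in range(len(regions)):
        if i in recognized:
            ET.SubElement(page, "p").text = recognized[i]
        else:
            region = held_back[i]
            ET.SubElement(page, "graphic", {"kind": region.kind, "href": region.payload})
    return ET.tostring(page, encoding="unicode")


sample = [
    Region("text", "A method comprising:"),
    Region("math", "eq-001.png"),
    Region("text", "wherein x is positive."),
]
print(page_to_xml(sample))
```

The design point is that the OCR step never sees the content it handles badly; those regions travel through the pipeline untouched and are referenced as images in the final XML.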
GK: Absolutely. And another place that we’ve seen clients kind of evaluate that return on investment is when it comes to the idea of rewriting or restructuring some of their content before they convert. And we see this a lot in cases where a company has maybe written their user manual in a very book-like way that doesn’t convert cleanly over to topics, and similar cases like that where maybe they’ve kind of formatted their content such that it doesn’t have any sort of implied structure, or they’ve maybe been misusing their template in Word or FrameMaker or InDesign or whatever they’ve been using. So when it comes to doing an automated conversion process, they realize that the results are going to be pretty messy and that it’s going to require a lot of cleanup on the output side. So I wanted to get your take on that as well and ask if you’ve seen any cases where it made more sense for a company to do some cleanup and restructuring, maybe rewriting on their content before they tried to convert it?
MG: Of course there are cases where it’s just such a mess that it’s not worth it. It’s not worth converting without doing some rewriting, but I think it’s less than you might think. I don’t think people, especially when you have large amounts of content, I think there are tools around that let you insert structure where it wasn’t there before. It’s possible. Many of the cases that I’ve seen over time, we could apply some technology to. So there might be some … I think there’s a triage that needs to take place beforehand to review the items and do an inventory of what’s there and maybe identify those, the 10% or 5% or 20% that really is not going to transfer over.
MG: But I think very often you can get 70 or 80% of the material to be moved over automatically. And I think there’s a lot of benefits, and you could do that. First of all, there is a cost savings to the extent you can do that, but also a lot of times you don’t have to recheck materials that have already been approved and have been used for awhile. You may not have to get recertified, and it doesn’t need … The professionals that are in the organizations don’t have a lot of time for any of this. They can focus on just those pieces that need their attention and the rest of it can be done by others, both by automation and by less trained people who can just do some of the things that need to be done. So I think automation, there’s a lot that automation can do that we underestimate a lot of times. It’s worth doing, taking the effort upfront to see what would happen.
GK: Absolutely. So I want to talk about another common situation that we’ve seen that can kind of pose some interesting challenges to conversion. And that is when a company goes through a merger or acquisition. And then they’ve suddenly got this collection of content that’s coming from maybe two, three, five, ten different sources, and none of it’s consistent with each other, but all of a sudden all of it needs to be made consistent and rebranded and kind of following this one corporate structure. So I wanted to ask if you’ve dealt with any cases like this and what sorts of challenges that you’ve faced in converting content after a merger or acquisition?
MG: All right. Yeah, that’s actually … There’s more and more of that kind of activity going on, so that’s definitely good for business, but I think it’s just a more exaggerated case of the usual trying to normalize information. Whereas when you go into one company, you’re really dealing with the materials that were done over a period of 10, 15, 20, 30 years of trying to normalize it, because people have done things over time. When you’re dealing with multiple companies, they for sure have done things, and every company does things differently. Frequently, one of the companies has better practices than the other. That might be why they did the acquisition or maybe that’s why they were acquired. But I think it’s a very similar process of identifying where you want to go, what is the goal. And I think that’s part of the specification process that takes place upfront. “This is what we want it to look like when it’s done,” and then mapping all that information over. And I think it’s just another … I think computers are very helpful there. There’s a lot of automation that can be applied. Today, a lot of artificial intelligence kinds of software can be applied. I think it’s just an exaggerated case.
MG: But I think it’s more important … In this case it’s even more important to have a good planning process upfront with someone who’s familiar with the various data streams and the data formats and the tools that can be used because that can save years in the process. A lot of times you hear about companies that have merged, and it takes them three years or five years to get their information together, which is a disaster. So I think it behooves the organization to look at that. One of the issues, I think, is the IT groups are usually in charge of trying to merge everything together. And while they have a lot of experience with data streams, I don’t think IT groups usually have a lot of experience with the document streams and how documents come together. So I think it’s even more important to bring in professionals who are familiar with those kinds of materials in order to speed up the process.
GK: Definitely. So you mentioned that you’re kind of seeing more and more of cases with mergers and acquisitions happening. So I wanted to ask if there are any other kind of common patterns like that that you’re seeing with types of conversions that you’re doing and if there are any unique challenges that you face with each of those and sort of how you deal with those?
MG: So I think today … Traditionally, we saw these conversions as projects that go on … They have a start, and they might be six months or a year or a few years. But today, actually most of our work is continuing kind of work, like what I described about the Patent Office. That’s day by day work that needs to be done on a very timely basis. And the timeframes are really squeezed in. I mean where traditionally you’d schedule a six-month process, today things might need to be delivered in 10 minutes or in an hour or in two hours. So the timeframes, I think are really squeezed down a lot of times. There’s no real time to go back. You’ve really got to describe things upfront and make sure you’ve got a machine that really takes care of everything.
MG: So I think more and more of what becomes important is having a process upfront to specify what needs to apply. Define what’s going to need to be done. And that’s really very detailed work. A lot of times people think of it as an ad hoc process. We have a very formalized specification process when we start something where everything gets laid out, all the details are laid out, and we walk a customer through all the steps and the decision points and record the decisions. So I think it becomes very important to do that upfront because of just how large the scale of what we’re doing is and what the time parameters are. There’s also, I think, a matter of prioritizing, which you spoke about already. But if you’re going to … you’re really getting in new systems, decide what’s really needed, what’s the return on investment of various materials. And inventory what’s there so that you make sure you’re doing the most important things first. A lot of times you hear things like, “Well, price is no object. We want everything to be moved over.” That’s never true. Price is always an object, and cost is always an object. So prioritization is very important.
MG: And I think another area that’s become common is this idea of content reuse as the data streams become large, especially like when you’re dealing with technical documentation, which is much of what we’ve talked about. Systems like DITA and S1000D, they are intended to reduce the amount of duplicate content that you’re handling, and I think that’s true in many places, so content reuse and normalizing the information becomes very important. We spoke about if a company merges, those two companies might have similar information, but they’re slightly different or they’re very different. One of the things we’ve done is we’ve built tools that let us examine large collections of information. It’s called Harmonizer, which will go through a collection of information and find all the similar paragraphs, not just that they’re identical, but somebody’s changed a few words. So we can identify those so that we can now pull them out and say, “Well, this thing has repeated a hundred times. Let’s make that a module and just refer to it a hundred times.” I think a lot of those kind of things are common just because you’re dealing with more information and it’s all gotten bigger and faster.
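As a rough illustration of finding paragraphs that are similar but not identical, here is a minimal sketch in Python using the standard library's `difflib.SequenceMatcher`. It is not how Harmonizer works, which the conversation does not describe; the `group_similar` function, the 0.85 threshold, and the sample paragraphs are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def group_similar(paragraphs: list[str], threshold: float = 0.85) -> list[list[int]]:
    """Group paragraph indices whose text is at least `threshold` similar
    to the first paragraph of an existing group (a simple greedy pass)."""
    groups: list[list[int]] = []
    for index, paragraph in enumerate(paragraphs):
        for group in groups:
            representative = paragraphs[group[0]]
            score = SequenceMatcher(
                None, normalize(representative), normalize(paragraph)
            ).ratio()
            if score >= threshold:
                group.append(index)
                break
        else:
            # No existing group was close enough; start a new one.
            groups.append([index])
    return groups


paragraphs = [
    "Turn off the power before servicing the unit.",
    "Turn off the power before servicing this unit.",
    "Store the device in a dry place.",
]
for group in group_similar(paragraphs):
    if len(group) > 1:
        print("Reuse candidate:", [paragraphs[i] for i in group])
```

Each group with more than one member is a candidate for a single shared module. This greedy comparison grows quadratically with the number of paragraphs, so a large collection would need a more scalable technique.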
GK: It really has. And we’ve noticed kind of some similar trends and patterns as well with the clients that we’ve worked with. I definitely agree, reuse continues to be a really large factor, especially with companies that are localizing content to other languages. Because if they can’t reuse their content, then that translation cost is occurring many, many more times than it needs to. So I agree, I think it’s really important to, as you said, to plan and to look at your reuse needs and your reuse potential, and see how that can factor in to your conversion process and in that way really make the most of what you do when you convert.
MG: Just one more point I think that’s important to … just in terms of the patterns.
MG: An area where we’ve had a lot of focus and we spoke about a few times is just the focus on computer intelligence and artificial intelligence. And I think that’s been a major differentiator for DCL, and way before it was a buzzword. In 1982, already we built a conversion tool called Mindreader, which would take ASCII-based files, just plain text files that were coming out of an ancient word processor called the Vydec, to convert it to what was then a very modern word processor that had all tags, and it was automatically inferring all the ideas and the architecture and putting in tags automatically. That’s been a fortunate beginning. It’s become more and more a focus over the last years because first of all, the data streams have become so much larger, and also labor costs are rising internationally, and that’s going to continue to happen. And so I think that focus on intelligence and using computers to do this has become more and more important as we go along. And having people really understand that is a really important part of all this.
GK: Absolutely. Well, thank you again for joining us.
MG: Okay. Well, it’s been a pleasure to be here. And thank you for this session. And these are really good questions. And thank you very much.
GK: And thank you all for listening to The Content Strategy Experts Podcast brought to you by Scriptorium. For more information, visit Scriptorium.com or check the show notes for relevant links.