Taming AI: Using AI for content conversion at scale
AI promises to transform content conversion, but what does it actually look like when you’re processing thousands of documents a day? In this episode, Sarah O’Keefe (Scriptorium) and Rich Dominelli (DCL) dig into the real-world challenges of using AI for large-scale structured content conversion.
Rich Dominelli: If you have millions of articles and you’re asking the AI, ‘What did we do for this project six months ago?’ the AI has to find those articles, pull the relevant information out of those articles, summarize it, and hand it back to you. The best way of doing that is to give extra signals to the AI, structured relevant bits of information, front matter, back matter, publication date, keywords, abstract, that allows the AI to query the corpus and get the relevant chunks out of that corpus in a very quick manner. Then, it can summarize what those chunks are. So the AI almost becomes the user interface over that corpus. But to find that data in the first place, structured content is key. Structured content is key when you’re dealing with big indexes and the web, and it’s the same with AI.
Related links:
- Defeating Nondeterminism in LLM Inference (white paper)
- Data Conversion Laboratory (DCL)
- Scriptorium, Machine experience (MX): Making content work for humans and machines (podcast)
LinkedIn:
- Host: Sarah O’Keefe
- Guest: Rich Dominelli
Transcript:
Disclaimer: This is a machine-generated transcript with edits.
Introduction with ambient background music
Christine Cuellar: From Scriptorium, this is Content Operations, a show that delivers industry-leading insights for global organizations.
Bill Swallow: In the end, you have a unified experience so that people aren’t relearning how to engage with your content in every context you produce it.
Sarah O’Keefe: Change is perceived as being risky; you have to convince me that making the change is less risky than not making the change.
Alan Pringle: And at some point, you are going to have tools, technology, and processes that no longer support your needs, so if you think about that ahead of time, you’re going to be much better off.
End of introduction
Sarah O’Keefe: Hey everyone, I’m Sarah O’Keefe and I’m here today with Rich Dominelli who is a Senior Developer and Architect at DCL. Rich, welcome.
Rich Dominelli: Hi, thank you for having me.
SO: Glad to have you. We were talking before we hit the record button, and you described yourself as a perhaps hopeful AI evangelist.
RD: Yeah, I am well and thoroughly immersed in the AI game at DCL and using it, plus I play with AI assistants at home. I’m enthusiastic about the future of AI, sometimes disappointed about the present.
SO: So DCL, as I think many of our listeners know, is focused on conversion at scale, which to me makes a great use case for AI because ultimately conversion is about edge cases and about inconsistency, right? If everything was 100% consistent, conversion would be pretty easy.
RD: Yeah, no, DCL does a lot of structured content generation out of unstructured data, and the creativity, especially in the academic space, of what that unstructured data looks like is sometimes nightmarish. So the AI does a lot of the heavy lifting for us when it comes to looking for particular items, identifying concrete data points within the documents, and pulling things like authors and affiliations, front matter type information, and back matter type information out of the documents in an automated fashion. It can be painful from time to time, but it’s definitely helped.
SO: Yeah, so this is, I think, you know, the reality of working with AI and working with it in a production environment in order to address all these weird edge cases and what’s going on. So tell us a little bit about how you’re using AI in, you know, these conversion use cases. What does it look like to go in there and start applying some of these tools that we have?
RD: So, I mean, typically our flows work in a way where we’re coming in with a PDF or a Word document or some other unstructured format. We take it, we reformat it into a version that’s more AI-friendly, like Markdown, for example. And that’s usually the first step we’re doing when we’re looking for information to pull out of it like front matter. It’s a very common use case.
If you look at academic papers, the front matter, the authors and the affiliations that are on that paper can be formatted in more ways than I could list out during the course of this podcast. It’s kind of crazy. So what we’ve started doing, and we’ve been doing this for a couple of years now, is we’re using the AI, we’re handing it the Markdown document, and we’re saying we need to list authors and affiliations, please extract it for us.
Now, naively, when we started that process, we assumed that the AI would give us a consistent list of authors and affiliations. And sometimes it does. But every time you do that call, you’ll get it in a different format. So then you have to start tightening things down. So OK, give me a list of authors and affiliations. I want it to be structured exactly like this. And typically, we have a JSON structure that we’re presenting to the AI, along with our prompt, and saying, give it to us. Well, okay, and that gets you a good chunk of the way there. And that was very exciting when we had that working consistently, we were getting things out of the system on a consistent basis. Awesome. But then you start looking at the results, and every once in a while, you get an author that was missed, or there would be too many authors on that paper.
We had one test paper, which I loved, which had 600 collaborative authors in it. And the AI would just choke after about 280-ish. So then you have to start dealing with things like paging through the data and formatting the data. And then you have to figure out, well, did it miss anything? You have 600 authors. Good luck. So now you have to take what the AI did and compare it against your own representation of it and write a program to do that comparison to say, OK, is it good? Is it good?
You have to take a step back and you look at it and you say, okay, we have the information that’s in the non-structured format. We’re handing it to the AI. The AI is gonna give us a structured version of it and we need to validate it. Well, the first validation is very easy. Does that structured version match the schema that we gave it? Yes or no, that’s easy. Well, then you have to say, okay, is everybody there? Well, is there anybody added? Because the nice thing about AI is they occasionally get very creative. Even if you have that temperature dial turned all the way down to zero, it will pull names out of thin air and then come back to you with some random name and stick it in the middle of the data where it’s not obvious, of course, and then hand it back to you. So then you have to start saying, are all the names that appear in this list actually in the document? Are the counts matching? And if it’s not, you go back to the AI and you ask it again, and usually you’ll get a better answer the second or sometimes the third or fourth time.
But you need to be able to catch that, especially if you’re doing this at scale, because if you’re doing a few, it’s easy, you can eyeball it. If you’re doing 1,000 of these a day, you can’t eyeball all of them. You can ask the AI, “OK, give me a confidence level,” but if you can’t trust what it’s returning in the first place, and it says, “Yeah, I’m very confident about what I’m giving you right now. It’s really the truth, I promise you this time,” I don’t know how trustworthy that would be. So you have to write tools to validate what the AI is producing, or you have to use the AI to validate what it’s producing. So coming in the first time, obviously, we did the count, we did the schema validation. We then said, okay, we’re going to check to make sure all the names appear in the document, we’re going to have landmarks in the document that we can refer back to. So if you start with Microsoft Word and you have track changes on, you can have paragraph IDs that are supplied. So you can make sure that you can find all of the authors in that list and they all have a paragraph ID and you can have your landmarks and that’s great. Or you can even hand the results to a separate AI call and say, proofread this. Is this accurate? Is this the best answer that could be for each of these? And it will come back with an answer. And you can use that as a signal to gauge accuracy and to gauge repeatability and make sure it’s correct.
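The checks described above (schema match, every name present in the document, counts matching) could be sketched roughly as follows. This is a minimal illustration in Python, not DCL’s actual tooling: the function names, the JSON shape ({"authors": [{"name": ...}]}), and the idea of passing in an independently obtained expected count are all assumptions made for the example.

```python
import json
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks and spacing do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def validate_authors(raw_json: str, source_text: str,
                     expected_count: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    # Check 1: does the output match the (assumed) schema?
    authors = data.get("authors") if isinstance(data, dict) else None
    if not isinstance(authors, list):
        return ["missing 'authors' list"]
    problems = []
    for i, author in enumerate(authors):
        if not isinstance(author, dict) or not isinstance(author.get("name"), str):
            problems.append(f"author {i} has no string 'name'")
    if problems:
        return problems

    # Check 2: does every returned name actually appear in the document?
    haystack = normalize(source_text)
    for author in authors:
        if normalize(author["name"]) not in haystack:
            problems.append(f"name not found in document: {author['name']}")

    # Check 3: does the count match an independent count, if one is available?
    if expected_count is not None and len(authors) != expected_count:
        problems.append(f"expected {expected_count} authors, got {len(authors)}")
    return problems
```

A non-empty result would be the signal to re-query the model. Because the name check is a literal substring match, it also flags names the model has quietly reformatted, not only names it invented.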
SO: So you’re, let’s see, generating an AI, not a test bed, but an AI environment that’s doing this conversion or that’s processing the files for you for conversion. And then you have to go in and do all this validation to make sure that the output that you’re getting is actually correct. As compared to, I’m gonna say old-fashioned, but you know, as compared to scripting, deterministic, pretty straightforward, if A then B kinds of scripting. What are the differences between that and AI-driven conversion in testing and validation? What are the test plans? How are they different conceptually?
RD: So from our perspective, the frustrating thing sometimes is the AI is completely non-deterministic.
SO: Mm-hmm.
RD: It can give you a name formatted one way today, and then tomorrow, its formatting might be subtly different, where in the paper it has “Richard Dominelli, Junior.” The AI may decide, well, that comma probably shouldn’t be there, or junior should be followed by a period, and it wasn’t in the paper originally. And you can try prompting around that and tell it to prompt around that and make sure that it’s accurate. But it doesn’t always follow your instructions exactly when that’s the case.
SO: And why is that? Why is it non-deterministic?
RD: Because AIs are built on a neural network, and the neural network itself has fuzzy fields within it, mostly due to floating-point arithmetic. So when you’re looking at it, the weight on a particular key might be out to like 16 digits of a number, and it might shift slightly one way or the other. There is a fantastic paper from, I wanna say it’s Anthropic, that goes through the different reasons why AIs are non-deterministic. It goes through repeatedly querying the AI for who Richard Feynman is and getting back a different answer every single time. They’re all correct. However, they’re all slightly different. The other thing that will lean into that is if the AI is being heavily used, the memory and model weights will shift ever so slightly and you’ll get a different result.
So you’ll end up having an issue where today I’m getting accuracy this way and it’s relatively consistent, not perfectly, but close enough. And then tomorrow, it may just give you a dumpster fire of random information and you need to be able to detect that. Okay, the other challenge we hit fairly early on is more and more people are aggressively using AI right now. So we’re actually starting to hit issues where the LLM providers are overwhelmed. So you have to be able to code in failover because you’ll literally get 429 errors, which are basically, I’m too busy. I can’t deal with your request right now. Call me back. And you’ll have to go back and repeatedly query to get around that. I am hoping that someday in the near future, we’ll be able to have in-house AI at scale and have these wonderful models that are so intelligent that we can run on our local hardware. And so I won’t have to deal with that, but right now, that’s not the case.
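One common way to handle those “too busy” responses is retry with exponential backoff, falling over to another provider if the first stays busy. The sketch below assumes that pattern; it is not taken from the episode. `RateLimited` is a hypothetical stand-in for an HTTP 429 response, and each provider is just a callable that takes a prompt, so no real client library is implied.

```python
import random
import time
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 'too many requests' response."""


def call_with_failover(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, backing off exponentially on rate limits."""
    for provider in providers:
        for attempt in range(max_attempts):
            try:
                return provider(prompt)
            except RateLimited:
                # Wait 1x, 2x, 4x ... the base delay, plus random jitter so
                # many workers do not all retry at the same instant.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
        # This provider stayed busy for every attempt; move on to the next one.
    raise RuntimeError("all providers were rate limited")
```

The `sleep` parameter is injectable only so the behavior can be exercised in tests without real waiting.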
SO: So given all of this, I mean, I’ve asked you the leading question about the issues and the negatives, but what then makes an AI-driven conversion appealing versus a sort of scripted, deterministic, if I plug in A, I will always get B output?
RD: So part of it is the type of data we’re dealing with. We’re dealing with unstructured information and the creativity of the unstructured information is rather astonishing. You’ll have people format things, you know, we’ll get papers in where the entire paper is placed in different cells of the table. It’s not tabular information at all. They just, you know, we wanted this particular section to be in this cell and this particular section to be in this cell and this particular section. And the AI, I don’t want to say is immune to that, but it’s a lot more forgiving than having to write those regex or traditional programming or Word interop things to try to extract that information, because the AI can address it in a much more fuzzy fashion. I know approximately what an author’s name looks like. I know approximately what a reference looks like. Even though today they decided to do it in Comic Sans or with Wingdings fonts, I can still read that and move on. So that’s really the wonderful aspect of it, is it gets around a lot of that fuzzy logic coding. You’re not dealing with having to address each of these nuances in a generic switch or state machine to try to figure out, OK, this paper should be classified this way and this approach used. Instead, the AI does a lot of that heavy lifting for you.
SO: Okay, so it gives us that sort of fuzzier, more, I’m gonna say more flexible, I don’t know if that’s exactly the right word. And then the outcome, what you’re describing is you’re ingesting unstructured Word, PDF, those kinds of things, and turning them into structured content, presumably fundamentally XML of some sort, but also some other downstream formats. So I wanted to switch gears a little bit. There’s been a lot of conversation about using structured content as an input for AI. So this, I guess, is the scenario where you’ve already ingested the unstructured content, have remediated it in various ways. We now have structured content, and we’re gonna take that and feed it into, I guess, AI part two, right? So we’re past conversion. And there’s a lot of people saying, you should feed structured content into AI, it will make the AI better. And so my question for you is, you know, is that the case, and also maybe why and what goes into structured content that makes it produce better AI outcomes, potentially, assuming that it does.
RD: So there’s a bunch of guides out there. There are two pieces of conversation. First, there’s a bunch of guides out there for prompting AIs where they suggest using XML or simplified XML tagging to give the AI signals about your prompt that aren’t verbally expressible. So here is my question. Here is an example. Here’s how I want my output to look. And you can put tags around that when you’re actually prompting the AI and the AI will know that those signals mean that it should pay attention to it. Okay, so putting that aside, what I think you’re really asking though, is how does structured content, structured documents, the JATS and the DITAs and the S1000Ds, how does that help the world of AI? And to answer that question, we have to go through two things.
One, we have to go through retrieval augmented generation and context rot. So let’s talk about context rot first, because that’s a really interesting topic and people don’t talk about it enough. You have these large language models that are coming out right now and they’re advertising this sticker shock value of, it can ingest a million tokens and it has this tremendous memory so you can stick the entire Encyclopaedia Britannica in it, and it will be able to ingest it and regurgitate it. There’s a whole lot of academic work out there that basically says that, hold on a second, practically speaking, once you exceed a certain size, even though they can technically hold that million tokens of data in memory, they’re not gonna be answering as accurately as a smaller model.
The most common example, or the easiest test for that, is the needle-in-a-haystack test, where you take a document, you stick a random fact in the middle of it, and you hand the AI the document, and then you ask them for that random fact. Nine times out of 10, it will answer incorrectly. An even easier test is there’s a website which I actually like called A Thousand Names. And all this website is is a thousand randomly generated human names. You take that, you give it to the AI, say, how many names are there? And more often than not, you’ll get, well, when you do 100, you’ll get an accurate answer. 200, accurate answer. 300, things start to break down. You might get 300, or you might get 280, 320. You might get a random answer.
And then it gets progressively worse as it gets bigger and bigger. So if you’re working in the context world, content world, you’re looking at ingesting documents into a corpus of some sort. You’re making these structured documents in such a way for the sole purpose of making them retrievable. You want the AI to be able to retrieve those documents and the relevant documents from the corpus so that it can answer the question. A, because your corpus is probably bigger than that million tokens. And B, because the less data you send the AI, the more accurate the answer is. So the better way of thinking.
SO: And so a token is roughly a character, right?
RD: No, a token is actually roughly a word. It’s less than a word.
SO: Roughly a word. So a million tokens is kind of a lot. It’s a lot of words.
RD: It’s kind of a lot, but it’s still not like a PubMed-sized corpus or anything like that. It’s roughly the size of the New and Old Testament of the Bible, roughly a million words. So just give everybody that mental picture. But that’s just one book.
So if you have millions of articles, and you’re asking the AI, you know, what did we do for this project six months ago that involved JATS in this solution? The AI has to find those articles, and then it has to find the relevant information in those articles to be able to summarize it and hand it back to you. And the best way we know how to do that is by giving extra signals to the AI, giving it those structured, relevant bits of information, front matter, back matter, publication date, keywords, abstract, that allow the AI to query the corpus and get the relevant chunks out of that corpus in a very quick manner, and then summarize what those chunks are. So the AI almost becomes the user interface over that corpus, because it’s going to summarize the data. But to find that data in the first place, structured content is key, for the same reason structured content is key when you’re dealing with big indexes and the web. It’s the same with AI.
SO: So then structured content is potentially helpful. And I guess then circling back, let’s say I’m sitting on a pile of content of varying degrees of structured or unstructured, varying degrees of quality or lack thereof. What kinds of things should be happening before that content gets ingested into some sort of an LLM or some sort of a corpus to be used in AI-generated outputs?
RD: So these are the same type of things you would do to make them easily retrievable ahead of time. So the standard approach that was being espoused about two years ago, a year and a half ago, was something called Naive RAG. You can just take your PDFs and throw them at the AI, and the AI will ingest them into a vector database, and it will do semantic similarity and find the documents that you care about. It’s not the best approach when you start talking about large amounts of documents. And there are issues with semantic similarity, where the AI will have a hard time distinguishing negative cases, will have a hard time peeling out the best documents, and that type of thing. So the best approach to take is you want to take those documents, you want to turn them into structured information in such a way that it’s easy for the AI to ingest. So typically that involves chunking it into topic-level pieces or semantic chunking, coming up with summaries to make them easy for the AI to find, and whatever other information you may want to chase out of those.
So, for example, if I’m handing a PDF to an AI and saying, I want to be able to search this PDF later, well, six months from now, if I get a new version of that PDF and I want to search against the two of them, I really want my answers coming out of the second PDF. That’s metadata, that’s structured information that doesn’t appear in the text of the PDF or may not appear in the text of the PDF. You wanna be able to do things like versioning, you wanna be able to do things like dates, you wanna be able to give these signals to the AI to be able to pull that information back quickly. And that’s really where structured content comes in. So for the purposes of preparing your own corpus, you want to convert them into an easy to ingest format, which typically means Markdown or XML or something that the AI can deal with. You want to give it whatever other signals you can so that it’s easy to find. And then you want to hand it to something that first does chunking and then text embedding, which is basically turning the information into numbers so that you can do those cosine similarity searches. And then you want everything handed off to some kind of object store like a hybrid RAG database or a hybrid vector database or a graph database so that they’re easy to pull out.
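The pipeline described above (chunk, embed, search by cosine similarity, and use metadata such as version to prefer the newer document) can be illustrated with a toy sketch. Everything here is an assumption for illustration: the `Chunk` fields, the blank-line chunking, and especially `embed`, which just counts words and stands in for a real embedding model. A production system would use a real model and a vector or graph store.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    doc_id: str   # which document this chunk came from
    version: int  # metadata that does not appear in the text itself
    text: str


def chunk_by_paragraph(doc_id: str, version: int, text: str) -> List[Chunk]:
    """Split on blank lines; real pipelines would chunk by topic or semantics."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [Chunk(doc_id, version, p) for p in parts]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(chunks: List[Chunk], query: str, top_k: int = 1) -> List[Chunk]:
    """Keep only the newest version of each document, then rank by similarity."""
    latest: Dict[str, int] = {}
    for c in chunks:
        latest[c.doc_id] = max(latest.get(c.doc_id, 0), c.version)
    candidates = [c for c in chunks if c.version == latest[c.doc_id]]
    q = embed(query)
    return sorted(candidates, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:top_k]
```

The version filter is the point of the example: the two versions of a document are nearly identical in wording, so similarity alone cannot tell them apart, and only the metadata ensures the answer comes from the newer one.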
SO: Awesome. So you started this off talking about being the hopeful evangelist, and now having gone through all of this, it sounds as though you’re really thinking about these issues and dealing with them at scale. What are some of the top things that you’re thinking about going forward, whether hopeful or not, the good, the bad, and the ugly?
RD: So one of the interesting aspects of my job is I get to do a lot of interactions with AI from an R&D perspective and do some in-house programming and do some in-house tool use. And what we’re finding is developing our own internal mechanisms for AI to call third-party tools, to be able to call Crossref or GROBID or some of these reference facilities out there through, like, Model Context Protocol or through API calls, so we can execute those calls and get that information back and do validation before it hands back the results, is a very interesting topic for us. That would let us do things like have the AI do the first few rounds of validation before it ever comes back to us, without having it go to the next step, do a validation step and then the next step and then possibly do a round trip. It would be a much faster interaction. We use right now, of course, like most of the world, we’re using a lot of AI coding tools to tighten up our code bases to make sure things are working well, to basically act as a force multiplier when we’re doing development on projects, which is phenomenal.
I can’t say enough good things about Claude Code, you know, because it’s really become an essential tool in my day-to-day life. But I’m also seeing a lot of people out there using these tools to help analyze and improve their own workflow and that day-to-day work. We talked with one of our customers recently, and they use Claude Code, even though the person giving the demo was not a developer; they use Claude Code to answer an RFP. And Claude Code does a tool use call against their document corpus, answers the RFP correctly, and what used to take two or three days of slogging through documents and finding things is now being done in an hour by one person instead of having multiple people working on this project. So it’s great to start seeing that type of stuff in the enterprise just blossom because it’s really exciting.
SO: Well, Rich, I really appreciate your insights on this. I learned a few things and I think that it’s great to hear from people who are actually using this stuff, you know, in a production world, in a high-stakes world where you actually, you know, need to get the content right, get the information right, as opposed to just, you know, well, we’ll play around with it and not worry about it too much. So thank you, and we’ll look forward to hearing more from you and what you’re doing at DCL.
RD: Sounds great. Thanks for having me.
Conclusion with ambient background music
CC: Thank you for listening to Content Operations by Scriptorium. For more information, visit Scriptorium.com or check the show notes for relevant links.
