The future of AI: structured content is key (webinar)
In this episode of our Let’s Talk ContentOps! webinar series, industry experts Sarah O’Keefe and Carrie Hane explore the intersection of structured content and artificial intelligence. Discover how structured content improves the reliability and performance of AI systems by increasing accuracy, reducing hallucinations, and supporting efficient content management.
In this webinar, attendees will learn:
- AI’s capabilities and limitations
- How structured content enhances AI abilities in content management, personalization, and distribution
- Best practices for integrating AI and structured content
Related links
- Looking for the article version of this webinar? Here it is, authored by Carrie Hane: Structured Content in the Age of AI
- AI in the content lifecycle
- Let’s Talk ContentOps! webinar series on YouTube
- Our book, Content Transformation, 3rd Edition. Building an effective content strategy is no small task. This is your guidebook for getting started.
You asked. We’re answering!
Attendees asked a record-breaking number of questions during this webinar. Here, we’ve answered the most frequently asked questions.
Moving from unstructured to structured content
Ready to get started with structured content? The white paper Structured authoring and XML outlines what structured content is and helps you determine content sources, establish content repositories, and implement content reuse.
Headless CMSs and knowledge graphs
Is a headless CMS the best tool for beginning your structured content journey? It depends on what you’re trying to accomplish. The cost of knowledge graphs article addresses how the popularity of knowledge graphs and headless CMSs needs to be balanced with foundational transitions that make structured content successful. Jumping too quickly can cause challenges that prevent your organization from embracing the change. Carrie Hane encourages you to remember that humans are the most important tool for creating content structure.
Improving interactions with LLMs
As Carrie mentioned during the webinar, “Trust but verify.” Sarah O’Keefe adds, “In my daily work, I use LLMs largely to condense existing information, and not to create new information.”
These podcasts give other great examples for producing better interactions with LLMs:
- Strategies for AI in technical documentation. The English version of this podcast was created with the support of AI transcription and translation tools. You can listen to the original German version here.
- Enterprise content operations in action at NetApp. In this episode, Adam Newton shares how NetApp’s enterprise content operations support their new GenAI content tool.
Transcript:
Scott Abel: Hello, if you're here for The Future of Artificial Intelligence: Structured Content is the Key chat between Sarah O'Keefe, and Carrie Hane, you are in the right place, and you're one of over a thousand registrants for today's show. It's a super hot topic, but before we start, let me tell you a few things about our webinar platform. First of all, we can't see, or hear you, which means we don't have access to your camera, or your microphone, so you don't have to worry about that. We are recording this program, as we do with all The Content Wrangler webinars. About 30 minutes after today's show ends, you can use the same URL that you're using to watch today's live show to access an on-demand recording. You can also share that link with others that you think might find some value from today's program, and we hope that you do so. You can ask a question of our panelists at any time. In fact, the point of today's show is to engage with you while the presenters are discussing some things. You can access the Ask a Question tab, which is located directly below your webinar viewing panel. Clicking the tab opens a little window into which you can type a question. We'll queue up as many of those as possible, and try to get answers for you during the time that we have available on today's program. There's also additional content in the attachments section located beneath your webinar viewing panel. Scrolling down a bit, you'll find links to contact information for both our host, and the guest today, as well as some resources they provided for you, and some links for some upcoming events, and sponsored content. Definitely peruse that whenever you get a chance during today's show. We're also going to ask you to take a poll today. In fact, we have two polls, and I'm going to go ahead, and launch the first poll right now just to let you get familiar with how it works. Taking a poll is super easy: one question, multiple choice answers, and you pick the answer that's best for you. Today's question is, are you familiar with structured content? And your answer choices are: I'm very familiar with structured content, or I've heard of it, but I need to know a little bit more, or I don't know that much about structured content, or what is structured content? You pick any of those answers, and that'll help give a little context to the presenters, and let them know a little bit about you, and your knowledge today, so I appreciate you doing that. At the end of the show, we'll ask you to give us a rating. It's a one-through-five-star rating system in which one is a low rating, and five is exceptional. There's also a little field into which you can type some feedback that we share with the presenter, so please feel free to do so on your way out the door. Our next upcoming show with Sarah O'Keefe in her Let's Talk Content Operations series of shows is going to be November the 13th, when Alyssa Fox will be joining her to talk about how to bridge technical, and marketing content. She's got some ideas, strategies, and best practices to share, and given her experience in both technical, and marketing content leadership roles, I think this will be a great show for you to attend. A few things you should know about as a subscriber to the Content Wrangler webinar series: you're also eligible for a free help site assessment from the folks at Heretto.
Heretto will evaluate your help site using best knowledge center criteria from the Software & Information Industry Association's annual CODiE Awards. You'll get a detailed review of your site's strengths, and practical tips for improvement. You can use the link in the attachments section of the webinar viewing area to request your free help site assessment. Also, Heretto is making available a new micro report that reveals how the customer self-service revolution is changing technical communication. It highlights the [inaudible 00:03:40] of technical communicators. You can download a free copy, it's called From Unsung to Unstoppable: How Technical Writers Are Driving the Self-Service Revolution, by using the link in the attachments section below your webinar viewing panel. And just a final note that we are excited to be going back to live conferences, and this year we'll be at the LavaCon Conference on Content Strategy and Technical Documentation Management, October the 27th through the 30th in Portland, Oregon. I know that both Sarah, and I have enjoyed these conferences in the past. There will be 400 to 500 of your peers there, and you can save a little bit of money if you use the discount code TCW at checkout; you'll save 10% on your registration fees. There are instructions in the attachments section located beneath your webinar viewing panel. Of course, Sarah's company, Scriptorium Publishing, is the sponsor of today's show along with Heretto. You can learn more about both of those companies on the web, of course, or in the attachments section. Heretto, for those of you who do not know, is an AI-enabled component content management system platform that's in use by technical documentation teams around the globe to deploy help, and developer portals that delight their customers. All right, before we go on with today's show, let's jump in, and see our guests in person. All right, I'm playing the role of Christine, who is usually the assistant. But I'm not really the host today; that's actually Sarah O'Keefe. So, Sarah, take it away.
Sarah O'Keefe: Well, Scott, thank you, and I appreciate it, and welcome to Carrie, and her co-presenter, Zoe, in the background there. Yep. I'm aware that most of you are just here for Zoe. Sorry, Carrie. So, welcome aboard, and I think we will just jump right in. I wanted to report in on the results of the poll that you just took. Basically two thirds of you are saying, "I'm very familiar with structured content", and then most of the rest are, "I've heard of it, but I need to know more." And then there's a few in the not-really-familiar-with-it category. So, I'm going to end that poll, and actually start the other one, which is just about the same question but around AI. So, what do you know about AI, and where's that going? And while we do that, Carrie, to you: I wanted to start with the question of large language models such as ChatGPT, and sort of your initial reaction to that, and your big picture assessment of where they fall in the content space for us. What do you see there?
Carrie Hane: Yeah, well, my initial reaction was like, ugh. And that was in the early days, but I just was like, "Oh, do we need another tool to make crappy content?" And that's kind of what I saw at first, the little I paid attention, but that hype quickly went down, I won't say away. And six months later we were like, "Okay, well, what could this do?" We started asking better questions. But even though it's been a year, a year and a half, almost two years now since ChatGPT came out, we know that it is definitely not always accurate. It's good for some things like first drafts, and summarization. I know I've been using it myself in my job search to help map to job descriptions, and things like that, make sure I'm getting the right keywords, but overall I still don't trust it. And I have a story from over the last month about how untrustworthy it is. I was visiting my son, who works in Yellowstone National Park, and they have one of the biggest geysers in the world, Steamboat Geyser. And I was asking him like, "Well, can we go see that?" He's like, "We don't know when it's going to go off." And then we ran into a ranger who said, "Oh, one of my colleagues asked ChatGPT when it was going to go off, and it said September 4th." Now this was probably around August 29th, or so, that we were having this conversation. So, we all watched September 4th to see if it went off. It did not. It still has not gone off since July. And so yesterday, in preparation for this, and as a follow-up to that, I asked it, "When will Steamboat Geyser go off?" And it said, "Well, we don't know. But the last eruption was September 3rd."
That is categorically untrue. So, I looked, and it is now providing some sources, which is great. In one of the articles it used as a source, an article from 2019, there was a sentence that said Steamboat Geyser erupted on September 3rd. That was indeed September 3rd, 2019. So, it hallucinated, which is to say it made stuff up: it took that September 3rd as a recent date, appended 2024 to it, and made a categorically untrue statement. So, that's just one story. We all know these things happen over, and over, and over again. So, I see that there's promise with AI, and generative AI in some spaces, but I am still very skeptical about it for unique content generation.
SO: So, it’s confident, and precise, and also wrong?
CH: Yes.
SO: Which is suboptimal. Okay, looking at this poll on AI, the breakdown is a little bit different, but mostly it's half, and half between "I have a lot of AI knowledge, I'm staying up to date" and "I need to catch up." And a few, 10%, 12%, or so, are saying, "I don't know much about AI." So, with that contextualized, let's talk a little bit about structured content before we try to bring those two together. So, what's your quick, and dirty definition of structured content?
CH: It's content that's broken down into reasonable pieces, with meaning attached to it so that it's understandable by humans, and computers. So, it's basically a container that describes the intent of the content we're creating. It has nothing to do with what it looks like. It is semantic. It contains the meaning as part of that intent. And so it allows us to describe what it is we're talking about.
SO: And so you had this great quick little label, or slogan, or whatever you want to call it for this. And can you talk about that a little bit?
CH: Remind me?
SO: Things not strings.
CH: Yes.
SO: Well, I’ve internalized that even if you haven’t.
CH: Context. That's all we need, Sarah: context, and things will get better. So, Google told us in 2012, 12 years ago, when it introduced its knowledge graph, the graph you see at the right side of search results pages, that we need things, not strings. And a string is ambiguous. And it used this example of Taj Mahal. Type the letters T-A-J, space, M-A-H-A-L, and it's a string. The computer doesn't know what it is. But as a thing, you have to describe it, because it could be the building in India, there's an artist called Taj Mahal, there's an Atlantic City casino. There could be an Indian restaurant down the road from you called Taj Mahal. Which one are you looking for? And so structured content allows you to say what this thing is, whether it's a building, or a monument, or a restaurant, and then it can go from there, and help you identify, and provide that meaning, and intent to the content you're creating.
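To make "things, not strings" concrete, here's a minimal sketch using Python dictionaries to mimic the kind of typed records a knowledge graph stores. The @type values echo schema.org conventions, but the records themselves are invented for illustration.

```python
# Invented records: three different "things" that share the string "Taj Mahal".
taj_mahal_things = [
    {"@type": "LandmarksOrHistoricalBuildings", "name": "Taj Mahal",
     "description": "Marble mausoleum in Agra, India"},
    {"@type": "MusicGroup", "name": "Taj Mahal",
     "description": "American blues musician"},
    {"@type": "Restaurant", "name": "Taj Mahal",
     "description": "Indian restaurant down the road"},
]

def disambiguate(name: str, wanted_type: str) -> list[dict]:
    """A bare string matches all three records; a typed 'thing' matches one."""
    return [t for t in taj_mahal_things
            if t["name"] == name and t["@type"] == wanted_type]

print(len(disambiguate("Taj Mahal", "Restaurant")))  # 1: the type removes the ambiguity
```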
SO: Okay. And so as we think about structured content, and where that goes, and where we're going from there: how does the context, and the labeling that structured content provides tie back to AI?
CH: So, AI is a computer, and computers can't implicitly know, or learn things. They can't get the context in the same way humans can. So, they need the context to be explicit, so they know what's relevant to the thing that's being asked. It also allows you to provide connections as well as that meaning. When you're making all of this explicit through the structure of your content, the computer doesn't have to guess. Take, for example, that article that said the last eruption was September 3rd. If you're reading it in 2019, it's just assumed you know that. But there are other pieces of content on the web, and out in the world, that are more structured, that have: what was the last eruption date? What was the eruption date before that? How long did it go off? There's lots of structure we can put around that, so that the content is more reliable, and can lead to more accuracy in creation, and in representation.
SO: But Carrie, this sounds like work. I thought the AI was going to make all the work go away, and that was going to be the end of it.
CH: Well, that was the promise, and we could get there maybe one day, but we are not there yet. Humans have to provide this labeling, this meaning, this intent to the content before computers can take that, and learn from it. So, people are like, "Oh, well, just fine-tune it." Well, okay, that takes a lot of human time, and eventually it can learn the patterns, and it can classify things based on how similar they are to other things, but it doesn't teach the model new information. It's prone to hallucination. It's expensive, and slow, and it doesn't really scale. The same goes for supervised learning, which is very similar to fine-tuning. Again, it relies on humans to supervise it. And so if we do enough of that now, or in the next few years, maybe in 20 years, or 10 years, I don't know, whatever rate we're moving at, we may get to the point where we're not seeing as many hallucinations. I mean, it will be more reliable, more accurate, more trustworthy, and take some of the work off of us humans so we can do the things that we do best better. And we are seeing that with people who are using structured content, who are applying AI tools to smaller data sets, seeing the results, and then building upon them. So, it's already happening. It's just at such a small scale that we're nowhere near the tipping point where this is normal. So, we need to do more to help the artificial intelligence actually become intelligent.
SO: And I think that really… Now we've talked about all these problems around the large language models, and the quality, and accuracy of the output that they're putting out. It always looks great, or it looks plausible, even, which is worse, but in many cases it's not quite right. Or you read it really carefully, and you discover it's not really saying anything, which is also problematic. So, before we lose all the people that are anti-AI, and think this isn't going to happen, let's talk about how you can make AI work. And I think here you're headed towards retrieval augmented generation, right?
CH: Yes. Yeah. So, RAG, or retrieval augmented generation, can provide some of that context. So, it's an extra step, it's an extra tool, but it is what will allow us to move beyond where we are now. And we're seeing a lot of evidence of this as people experiment. So, this slide shows how things work. You put a prompt in, and it goes to the computer, the computer makes a query, and it retrieves information, and it sends it back. That's how it works without RAG. When you augment that, when you put the prompt, and the query in, that goes through to this retrieval system, which enhances it so that there's the context, the meaning, and then it can run that through the LLM, and have a much better response, so that it is more accurate, and contextual. I can't necessarily promise how good it will sound. I think that's another thing we're still seeing: for people in the content space like us, and probably most of our audience, we can tell the difference between something that's been generated by gen AI, and what's been generated by a human. So, that's a different problem, but also related. So, yeah, we have this, and then there are kind of two ways that we can create this augmentation. The first one is vector databases. So, this is going back to math, way, way back to our high school math.
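For readers who want the retrieve-augment-generate flow spelled out, here's a minimal sketch in Python. The toy corpus, the naive word-overlap retrieval, and the call_llm stub are all invented stand-ins; a real system would use a vector index, or a knowledge graph, and a real model API.

```python
# Toy corpus standing in for your own structured, trusted content store.
CHUNKS = [
    "Steamboat Geyser's most recent major eruption was in July.",
    "Steamboat Geyser is in the Norris Geyser Basin of Yellowstone.",
    "Old Faithful erupts roughly every 90 minutes.",
]

def retrieve_chunks(query: str, top_k: int = 2) -> list[str]:
    """Naive retrieval: rank chunks by word overlap with the query.
    A real system would use embeddings, a knowledge graph, or both."""
    words = set(query.lower().split())
    return sorted(CHUNKS, key=lambda c: -len(words & set(c.lower().split())))[:top_k]

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM API you actually use."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    # Retrieve trusted chunks, augment the prompt with them, then generate.
    context = "\n".join(retrieve_chunks(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(retrieve_chunks("When did Steamboat Geyser last erupt?"))
```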
SO: I was told there would be no math.
CH: Just a little bit. So, vectors are connections, and we can say how closely things are related. So, things are assigned numbers, and it helps with making… sorry, my screen just timed out, and I have no idea what I'm looking at.
SO: No, you’re still here.
CH: So, vector databases. These were invented for images, and video, and audio, things that are hard to describe in words. And so it works in some places, but it's really just a proximal closeness match. And then we have knowledge graphs.
SO: Just one thing on the vector databases, and this is going to make all the AI professionals scream, and I don't care. My version of the explanation is that this is basically the same thing that autocomplete does, where it is predicting the next word based on the thing that is the most likely next word. There's way more math, and it's way more sophisticated than that. But if you think about it that way, that's what your LLM is doing. It's like, what's the average next word?
CH: And I think an example is, you take the city Sacramento, and you take the states Washington, and California. It knows that Sacramento is closer to California than it is to Washington, but that's about it.
SO: Because they occur in the same sentence, or close to each other in text more frequently.
CH: Right.
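As a rough sketch of what that "proximal closeness" looks like in code: embeddings are just lists of numbers, and closeness is measured with something like cosine similarity. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds, or thousands of dimensions.

```python
import math

# Invented 3-dimensional "embeddings"; real ones have far more dimensions.
embeddings = {
    "Sacramento": [0.9, 0.1, 0.3],
    "California": [0.8, 0.2, 0.3],
    "Washington": [0.1, 0.9, 0.4],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Sacramento scores much closer to California than to Washington,
# because those words co-occur more often in the text the model saw.
print(cosine_similarity(embeddings["Sacramento"], embeddings["California"]))  # ~0.99
print(cosine_similarity(embeddings["Sacramento"], embeddings["Washington"]))  # ~0.32
```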
SO: Yeah. Okay. And so then, sorry, you were going to move on to-
CH: Knowledge graphs are made up of nodes, which are the entities, and edges, which are the relationships between the entities, and they add more dimensionality, and they label those relationships. So, in here we have Arnold Schwarzenegger at the center. He was the governor of California, and here we can see that Sacramento is the capital of California. It's not just more closely related. And then we can also see that Arnold Schwarzenegger starred in Predator, which was produced by 20th Century Studios, which also produced Die Hard. So, we get this additional context, and awareness, and understanding that allows the computers to do more work. It can work across schemas, it can be more precise in its responses, and it can actually generate some insights. So, I just wanted to go through that, because it took a while for me to understand this, and figure out how to explain it in plain language, because I was not a math major either. But what I have seen is that different tools talk about one, or the other, either being vectors, and using embeddings, or being a knowledge graph, or some sort of graph database. And it's helpful to know what you're looking at, because they don't have the same strengths. So, you need to know what you're using the tool for, so you can know whether embeddings are the right way to go, or if there's a graph that needs to be added to this. So, it's really just helpful in evaluating tools. Even if you're not the one who has to create any of the underlying technology, you have to understand the technology you're using.
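Here's that same slide as a minimal sketch of explicit triples, the node-edge-node statements a knowledge graph is built from; the data mirrors the slide, and the query helper is invented for illustration.

```python
# (subject, relationship, object) triples: nodes plus labeled edges.
TRIPLES = [
    ("Arnold Schwarzenegger", "governor_of", "California"),
    ("Sacramento", "capital_of", "California"),
    ("Arnold Schwarzenegger", "starred_in", "Predator"),
    ("20th Century Studios", "produced", "Predator"),
    ("20th Century Studios", "produced", "Die Hard"),
]

def query(relationship: str, obj: str) -> list[str]:
    """Answer 'what is the capital of California?'-style questions
    definitively, because the relationship is stored explicitly."""
    return [s for s, r, o in TRIPLES if r == relationship and o == obj]

print(query("capital_of", "California"))  # ['Sacramento'], not a statistical guess
```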
SO: And so, again, turning this back to content, what I'm hearing you say is that the vector-based approach is basically predictive math, like, what do we think is going to be next? And the knowledge graph is explicit: if you ask a knowledge graph, what is the capital of X, and then you fill in California, or Washington, or whatever, it knows that relationship. And so it can give you a definitive answer, because it's in the knowledge graph. It's not this, "Let me see what the internet consensus is." It is looking at this collection of boxes that are tied together with relationships, and saying, "Okay." So, now turning this back to content, and why you're saying that structured content matters, how does structured content come into this vector, or knowledge graph scenario?
CH: So, it helps with both, because structured content turns your content into entities, or nodes, or things, or, as we sometimes call it in the content strategy world, chunks. Your content will be turned into chunks by these machines, these robots, but structured content gives you control over the size, and meaning of the chunks. So, you can say, "These are the entities, and these are the relationships", without having to hope that it chunks it up in the right way. I know in the research that I did, it could leave out the word "not" as a connection between two parts of a sentence. Well, "not" is crucial, and if it leaves that out, the vector is very close, but it's also incorrect when you're putting those together. A knowledge graph doesn't do that. And structured content can prevent that from happening, because you're giving it the things it needs, the knowledge it needs, to then use to generate something new.
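A hedged sketch of that difference, with invented content: naive fixed-size chunking puts boundaries wherever the character count lands, while structure-aware chunking keeps each labeled unit, warnings included, intact.

```python
def naive_chunks(text: str, size: int = 40) -> list[str]:
    """Fixed-size chunking: boundaries land wherever the count says,
    sometimes splitting a crucial word like "not" away from its claim."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Structured source: the author, not the chunker, controls the boundaries.
structured = [
    {"type": "warning", "text": "Do not restart the server during a backup."},
    {"type": "step", "text": "Restart the server after the backup completes."},
]

def structured_chunks(units: list[dict]) -> list[str]:
    """Each chunk is one complete, labeled unit, so the meaning stays intact."""
    return [f"[{u['type']}] {u['text']}" for u in units]

print(naive_chunks("Do not restart the server during a backup."))
print(structured_chunks(structured))
```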
SO: So, then what does it look like to combine those? If we go back to retrieval augmented generation, and combine it with structured content, what happens when you put those together?
CH: Good things. So, finally, we're at the good news story. It can reduce the amount of training you need to do on your data, or your content, which means the cost is lower. Humans spend less time adjusting their prompts, verifying results, and cleaning up source data, and the accuracy is greatly improved.
SO: Okay. And so what do we need to do to our content in order to make the content maximally useful to AI? Because, and I’m seeing this in the comments, people are saying “AI is not going away”, and I think we all agree on that, but how can we make it actually work? How can we make it effective?
CH: So, I think we need to apply structure, semantics, metadata, use our taxonomies, use our ontologies, and create these explicit chunks. And that's really about the content. We also need to decide when to use it. Obviously, people are using it to write articles, and there are customer support, and customer service organizations within companies that are finding good uses, because they're training it on smaller data sets that use trusted knowledge, not the internet. And so, know what you're using it for, and what you want to achieve from that. Do you want to produce content faster? Do you want content to be more accurate? Do you want fewer humans in the loop? Whatever that is, you can start with a small subset of your content, do this work to make it explicit, whether it's building a knowledge graph, getting a tool that allows you to do that, or an app that allows you to assign these things, whatever it is, and then try some experiments, and measure them, see how they work, and then learn from that, and go from there. If it was successful, you can expand, and now do more things. Or if it didn't work, go back to the drawing board, and figure out why. Was your hypothesis wrong, or was it a poor use case? I think that's another piece of good news: you don't have to change everything all at once. Like pretty much everything else we do in content, it's starting small, testing, and then growing from there. You'll get better results, both in what you produce, but also in gaining traction within your organization. So, say you have two silos producing content. Nobody has that, I'm sure, or maybe everybody does.
SO: Oh, they have more than two.
CH: If you're the one using structured content, and you're getting amazing results, and another team isn't using structured content, and they're getting poor results, now you can say, "Hey, we can help. This is what we did." And then maybe those people will say, "Oh, let's try that." And then word spreads. I find that, over, and over, this is just another application of: start small, share your successes, and be willing to cross functions, and silos to expand the use.
SO: I mean, what's interesting to me about that is that when we talk about large language models, and generative AI, it's literally, I think, the exact opposite of that, right? It's like, feed the entire internet into it, and see what you get. That's not start small, and really pay attention to the quality of your content. And so it seems to me that what's going to happen, if it hasn't happened already, is that the content world, broadly speaking, is actually going to split. There's the sort of, we're just going to throw an LLM at it, and auto-generate, and not worry about it too much, which might be okay for certain kinds of use cases where it doesn't matter whether you're right, or not. And then there's this other world that's going in the other direction, which is, we're going to fix the underlying structured content, make it really, really good, and then put these tools over the top of that known good universe of content, and work through it that way. I mean, those are just different worlds, right? Because one universe is saying, "We're just going to automate it, and close enough", or maybe not. And if it tells me the geyser is going off, I don't even care whether it did, or not, I just want a plausibly correct answer. All right, so where's this thing going? What's next? What do you think when you think about the future? And you can decide whether this is the next week, or two weeks, or year, or five years, pick your timeframe, it's fine. Where's this going? What do you think is going to happen, given your perspective?
CH: Well, I guess, depending on your point of view, whether this is good, or bad, we're already starting to see what is being called model collapse. And that's when AI models are trained on data that includes content generated by previous versions of what they produced. So, over time it loses accuracy, and instead of improving, AI starts making mistakes that compound, and then it's increasingly inaccurate, and distorted. And we saw this. There's a story I found, if people want the link, I can share it, it's somewhere in my research, where some customer service AI tool started Rickrolling customers, because of this constantly recursive relationship of looking for things. It obviously doesn't understand that it's sending a Rick Astley video to people instead of a training video, but it saw enough references to that on the internet that it did that. And I'm sure some people thought it was funny, but other people were probably really annoyed, and they fixed that. But that's what is starting to happen already. And as we've been talking about, the primary solution to that is ensuring that AI is trained on human-generated data. So, that means your own data, and more organizations are figuring out how to do this too, because it's a security, and privacy concern. ChatGPT uses the internet, Google Gemini uses the internet. All of these tools are using the internet, but you can create your own LLM, you can create your own underlying databases, and only use yours to generate content. And if it's your content that you're using to generate insights, or new customer service answers, FAQs, whatever it is, you know you can rely on it more when you're the one producing it. So, I think we're going to hear more. I think we're just going to start seeing more people sharing their experiments, which they're only able to share now, because it takes a while to get the data to share, and then we will see what the successes are. And are we going to get rid of crappy content spit out based on other crappy content? Probably never. But maybe we can slow that down as more people apply these best practices to their content, and put the right tools in place for the right use cases.
SA: Just jumping in here to let you know that you have about 25 minutes left, and tons of questions from the audience members.
SO: Tons of questions.
SA: All right, all right, I’m jumping back out.
SO: But that's an excellent transition, because that's actually where I did want to go next. I've got all sorts of questions coming in that are just really, really interesting. So, keep them coming. And I can tell you right now, we're not going to get to all of them, and I'm sorry if we don't get to your question. We will address it after, and maybe send you some resources. I have tried to address a few of them as we go. So, Carrie, you get to answer the questions, and I wish you much luck. There's a question here that's kind of a multi-parter, so let's start with the first part: "If AI is a black box, how do we know if its content is accurate?"
CH: Well, I think that just goes back to what I was saying. If you're creating the content, and you know it's accurate, then you can be more sure that it's producing accurate content, but you have to trust but verify. So, you can say, "Okay, I think this is probably correct", but you have to verify it, and see how accurate it is. Again, this is human in the loop. We just can't avoid the human in the loop, at least not yet, and I don't know if I'll ever see that.
SO: And so the follow-on to that was, “Which AI sources are most likely to produce accurate content?” For example, and this is from the person who wrote this, “My understanding is that LLMs are less likely to be accurate than narrow data sets such as those used in medical research.”
CH: Yeah, I think you want a bigger pool that is structured, and rigorous in its creation. I have read, I haven't done extensive research into this, but I have seen that on imaging, and this kind of goes back to those vector embeddings, you can feed a lot of medical images into a database, and get results that are better than humans at detecting cancer. I saw something the other day, I don't know how true it is, I didn't look at the source, or find out what the study actually was, but it was that, potentially, AI can help spot cancer before it starts, especially breast cancer, based on mammograms. So, that's a glimmer of hope, and a way to use AI in the ways it was meant to be used: pattern recognition, anticipation, prediction based on the data. And of course, the more data you have, the more accurate it can be, because there's more "this, not this, and this, and this" to feed it.
SO: All right, what else do we have here? Oh, sorry, I have so, so many questions. Okay, a quick one. There's a question here about structured versus unstructured content, and just a quick example of the difference between the two, before I feed you something horrifyingly more difficult.
CH: So, for unstructured content, say you were writing the About page for a museum, and part of what you were talking about was opening hours. You could narratively describe the opening hours: we're open Monday through Friday, nine to five, except on Thursdays, when we're open until eight. It's all true, and accurate. But you could also structure those opening hours to be very explicit: on Mondays, we open at nine, and close at five. On Tuesday, nine to five. Thursday, nine to eight. And then you have that information, and you can reuse it. It's explicitly an opening time, and it's explicitly a time, which is a type of data value that allows you to sort, and make other connections. You can put specific dates in, like we're closed on Christmas Day, or Thanksgiving Day, or whatever dates you're closed, not just days of the week. So, hopefully that helps. It's taking something like a big blob of body content, and turning it into explicit stuff. That doesn't mean you're not still going to have narrative text, of course you are. But if you can start with what you can structure, and make into explicit entities, I find that that can be about 80% of content in any given corpus, and the rest becomes narrative, and more body content.
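As a minimal sketch of what those explicit hours might look like in practice, here's a Python version; the shape loosely echoes schema.org's OpeningHoursSpecification, and all the values are invented.

```python
from datetime import time

# Explicit, typed opening hours instead of a narrative blob of body text.
opening_hours = [
    {"days": ["Mon", "Tue", "Wed", "Fri"], "opens": time(9, 0), "closes": time(17, 0)},
    {"days": ["Thu"], "opens": time(9, 0), "closes": time(20, 0)},
]
closed_dates = ["Christmas Day", "Thanksgiving Day"]  # specific dates, not weekdays

def is_open(day: str, at: time) -> bool:
    """Opens/closes are real time values, so software can sort, compare,
    and reuse them; a paragraph of narrative text can't do that."""
    return any(day in spec["days"] and spec["opens"] <= at < spec["closes"]
               for spec in opening_hours)

print(is_open("Thu", time(19, 0)))  # True: the late Thursday opening is explicit
```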
SO: So, I'm trying to take these in order from least complex to most complex, which is not actually working that well, because… audience, thank you, you have some great stuff in here. Okay. "Is there a right way to provide structured content to AI in order to teach it? And does the format change anything? Is it better, for example, to provide DITA XML content as PDF, or with a DITA map file, which would be the backend XML?" What do we do with that?
CH: I cannot answer that question.
SO: So, I will say that you're better off the closer you are to the source, because PDF essentially is a rendering. It's an output where everything's been kind of jammed together, and the backend probably has more metadata, and more structure on it, if you're talking specifically about DITA XML versus PDF. So, you probably want to run it against the DITA content. And having said that, I would actually argue that your third alternative might be to run it out to HTML, and process the HTML. I would consider most of those things to be better options than PDF, at a high level. I mean, the actual answer is, it depends, and nobody likes that answer.
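A rough illustration of why the source is friendlier to machines than a rendering: in the XML, the role of each piece of text is explicit, while a PDF flattens everything into styled lines. The topic below is invented, though <topic>, <title>, and <shortdesc> are standard DITA elements.

```python
import xml.etree.ElementTree as ET

# Invented example topic; <topic>, <title>, and <shortdesc> are real DITA elements.
dita_topic = """
<topic id="geyser_safety">
  <title>Geyser viewing safety</title>
  <shortdesc>Stay on the boardwalks at all times.</shortdesc>
  <body><p>Thermal ground is unstable, and can be dangerous.</p></body>
</topic>
"""

root = ET.fromstring(dita_topic)
# The labels survive all the way into the AI pipeline: no guessing
# which line was the title, and which was the summary.
print("title:", root.findtext("title"))
print("summary:", root.findtext("shortdesc"))
```

SO: Okay, I've got more of a businessy question, and I actually have a couple of these. "How do you make a case to senior leaders to invest in the information layer, and content structure? They seem to want the AI chatbots, and apps, but they're not investing in the backend structure. So, data scientists are being asked to solve things with LLMs, rather than information architects, and content strategists being recruited to improve the source content, and the metadata."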
CH: Yes. So, I'll just kind of go back. Part of this is change management. It's not about content strategy, or information architecture, or structured content. It's about what's in it for them. They can hire more data scientists, and do all this work for more money with worse results, or they can do it the way that's going to ultimately save them money, and get better results. Obviously, you would need to tailor your case for your organization, but there's more evidence that this is happening. My feed on LinkedIn, which admittedly is full of a bunch of IA geeks, and structured content geeks, who I love dearly, and learn from every day, is full of, "You can't have good AI without good IA." And they're providing more ways that that's true. So, follow some IAs, and see what they're learning. Follow people who are doing early experiments, and see what that is. And again, start small. If you have control over a project where you can do an experiment to show the value of structured content, do that, so you can use it as part of your evidence. But there's no way you're going to get anyone to pay attention if you go up the chain several steps from wherever you are, and say, "We need to do structured content, or AI won't work." You'll just be pushed aside. So, figure out what matters to those people. Figure out how you can make the case to get them to what they want. It's really hard to overcome shiny object syndrome, and we're definitely in that. But there are going to be stories about bad things happening very soon. And if you're in the US, you probably remember, 10, 12 years ago, whenever it was, the healthcare.gov rollout disaster. Everybody was like, "Oh, we can't be the next healthcare.gov", and then that faded. And so now we need these failure stories to help make the case for avoiding them. So, watch for those as well. I don't wish anyone failure, but it's going to happen. So, that's another thing: if you can see into the future, and say, "This is what's going to happen if we don't change how we work, and I'd like to experiment", that might be your case.
SO: Yeah, I think I would add to that: AI, or not even AI, machines, automated processing. What happens when you put automated processing over the top of not-so-great content is that AI exposes all the technical debt that you have in your content, all the inconsistencies, the missing pieces, the things that weren't quite right. And because you're automating that processing, it just propagates everywhere. It's kind of like translation: if you start with a bad source document, and then you translate it into all these different languages, you just have mistakes everywhere, because it's a derivative. It's never going to be better than what you started with. And so I think it's worth looking at what our core corpus of content is, and what we can do with that. In addition to Carrie's point, you cannot risk being the person who says, "AI is bad, and evil, and we don't want to do it." You can do some great things with AI, and with machine learning, and with these kinds of processing, but you have to get the prerequisites right, and if you don't, some bad stuff is going to happen. The story we're hearing, or we heard last year, was all about Air Canada, and their chatbot that went sideways. Now, the great irony of this is that I don't think it was in fact an AI chatbot. It just had a bad set of data in it that somebody forgot to update, which, by the way, is technical debt once again. So, there were a couple of people that asked a variation of this question: our technical people are saying we can just use gen AI for everything. I wanted to touch on a slightly different question that came in, and this is a topic that we can, and should, cover, and it didn't make it into our plan. How will we ever be confident that there is no bias in the AI response? Obviously this is important, says the questioner, in things like political speech, or religious speech, but it's also important in things like medical care. So, how do we address bias in the AI response?
CH: Better source content. I mean, this is the people problem. It's a people, and content problem. We need more diverse teams, not just building the technology, but creating the content, checking the content, structuring the content, so you're structuring it in a way that is less biased, or shows the bias, so that's explicit as well, because sometimes things just are biased. But that's a huge problem for AI. And again, it's really a people problem, and we haven't figured out how to solve that one.
SO: All right, well, let's throw out another interesting one. The person writing in says, "We are a public body in the UK, not government, arm's length from the government, and we provide financial guidance on helping people manage their money in the public domain. People are accessing our content through ChatGPT, et cetera. We are also developing our own LLM. We want everyone to see our free, and impartial guidance." So, that's their mission. "So, will structuring our content correctly serve both models of AI?" That's the first part of the question. And then the second part is, "Is there any way to protect our information?"
CH: So, is structuring your content going to help? Yes. Protecting it? I don't know. This is also not an area I've dug into, but it is something that is being talked about more, and more. So, stay tuned. It's kind of like when search engines first came out. If you've been around a while, you remember people saying, "Oh, I don't want people coming to my website through a search engine", or whatever, so you set up nofollow directives, and robots.txt files. Of course, that's silly now. For most content, we want search engines to find it. And now it's the same thing with LLMs, and the crawling. So, there are some things, but they're not foolproof the way the nofollow was for search engines. I mean, that's really partly an IT concern of security, and privacy on the content that you have. And partly it's the world. It is part of this evolution. Unfortunately, we didn't have these discussions before these tools came out. We're having them after they're running amok among us. So, that's a bigger tech question, and I think it's one that a lot of people are wrestling with. I know, myself, I have not been producing content lately, and I'm not in any precise medical field, or something where it really matters if I get things completely accurate, but I'm like, "Well, something's going to scrape it. Someone else is going to take credit for it." And I don't know, I think this is something that's continuing to evolve, and will probably continue to evolve until there's a big lawsuit, and there's some regulation in all the various parts of the world. I think the EU is doing more now than any place else so far. But that's a saga that's going to continue to play out.
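For reference, the rough modern equivalent of those robots.txt rules is an opt-out aimed at AI training crawlers. A minimal sketch, using the crawler tokens that OpenAI, and Google have documented (GPTBot, and Google-Extended); as Carrie notes, compliance is voluntary, so this is not foolproof:

```
# robots.txt: ask AI training crawlers to stay out (compliance is voluntary)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Ordinary search crawlers may still index everything
User-agent: *
Allow: /
```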
SO: Yeah. So, the EU did pass something called the EU AI Act, which basically classifies AI systems into various risk levels. And as you might expect, things like medical content are in the highest level of risk, along with facial recognition, those kinds of things that touch on personal aspects. And then the lowest level is sort of the basic advanced-spell-checker kind of thing. Okay, a couple of big picture questions as we attempt to wrap this up in the next minute, or two. There are two here that are kind of related. One is, how can organizations quickly convert unstructured content into a structured model at scale? To which my answer is, how good is your unstructured content? And the secondary part of this is, it probably can't be done quickly. Again, you have technical debt, and structured content is more interesting, or more enriched, than unstructured. Is that accurate from your point of view, Carrie?
CH: Yes. Quickly, and at scale are not things that go together in this realm.
SO: Yeah, and there was a separate question about, "Well, could we maybe use the AI to help us find all the technical debt, and correct it?" Which sounds like a great application, right? It's a pattern: find the patterns, find the outliers, fix the outliers, and then you have a better collection of content. Another question here: "Who is taking care of the structured content in the chart?" The chart showed Arnold Schwarzenegger as the governor of California, which of course he is in fact not anymore. "Is the model learning the new governor by itself, or is a human adding it manually?"
CH: So, underneath all of this is content governance, and your source content should be updated as it changes. And again, this is getting into more of the technology of how this works, which I am not as familiar with as the overall system, how this is all put together. So, first you have to have the governance to make sure the content is updated, and then you have to have a way either to manually alert the systems that it's new, or to recrawl it. This is one of the problems with ChatGPT: it's only up-to-date to a certain date, which is why it said that Steamboat Geyser went off two weeks ago instead of two months ago. And so, yeah, it starts with governance, and then you would have to talk to your IT folks, the people managing the products, to see how that works, and make sure it happens. My understanding is that knowledge graphs can learn, but it just kind of depends on at what point in time they're being accessed, I would think.
SO: You can feed the knowledge graph structured content, and it can pull out those relationships, and perhaps make those updates. But that just pushes the question back to who's updating the structured content. Okay, I have one last question that we can get to before we throw it back to Scott to wrap up. Again, if we didn't get to your question, we will address it via email as a follow-on. There's a question here about the people. "If AI, and the application of AI, is going to reduce the number of humans in the loop, how is the role of", and here they're saying the technical writer specifically, but the content creator, "evolving in the next decade, or three months? What can the current-day technical writer, or content creator, do to keep up?"
CH: Keeping up is the hard part, isn't it? I think, for me, just getting this baseline understanding of how things work was super helpful. It's not a black box to me anymore. So, I think that's one part: understanding the fundamental nature. That's not going to change, and if it does, it will evolve, so it'll be easier to keep up with. And then it's keeping the structure in place, keeping governance in place. That's never going to be a bad thing. It's only going to help you in the future. So, I think that's my answer.
SO: Well, Carrie, thank you so much. This was really, really interesting, and hopefully useful to our audience out there. Scott, I’m going to throw it back to you.
SA: Excellent. Thank you very much, and thank you, audience members. Please, before you go, give Carrie a rating on the quality of the information provided today using our one-through-five-star rating system. You can find that rating tab right below your webinar viewing panel. It's super easy to participate: just click, and give a rating. You can also share some feedback if you'd like. And don't forget that Sarah's next show, November the 13th, is going to feature Alyssa Fox. It's a super interesting topic about how to blend technical, and marketing content, and she's got great strategies, so you don't want to miss that show. You've been watching The Future of AI: Structured Content is Key, with Carrie Hane, and Sarah O'Keefe. Thanks for joining us today, and thanks for being here as well. We really appreciate all your participation, and we look forward to seeing you at an upcoming show in the near future. So, be well, be safe, keep doing great work. We'll see you soon. Thanks for joining us. Thanks, Sarah. Thanks, Carrie.
SO: Thanks Scott.
CH: Thanks.