The misuse of metadata (podcast)
In episode 89 of The Content Strategy Experts podcast, Gretyl Kinsey and Bill Swallow talk about strategies for avoiding the misuse of metadata and DITA XML-based content.
“The more you fine-tune how your content model needs to operate, the easier it’s going to be to move it forward over time. The more you start taking shortcuts and using metadata for purposes other than what it was intended for, the more problems you’re going to have.”
– Bill Swallow
Gretyl Kinsey: Welcome to the Content Strategy Experts podcast, brought to you by Scriptorium. Since 1997, Scriptorium has helped companies manage, structure, organize and distribute content in an efficient way. In this episode, we talk about strategies for avoiding the misuse of metadata and DITA XML-based content.
GK: Hello and welcome, I’m Gretyl Kinsey.
Bill Swallow: I’m Bill Swallow.
GK: And we are going to be talking about all the different ways that we have seen metadata get misused in DITA XML. Before we dive into that subject, Bill, what is metadata in the context of DITA? And how is it used just for anyone who may be unfamiliar?
BS: In the context of DITA, metadata is a series of elements and attributes that are applied to your DITA content in order to give it some meaningful purpose. A lot of times we see it as profiling metadata, so being able to set, for example, an audience on a topic to say, “This is only for beginner people.” This way, when you publish your output, you can turn on or off your beginner audience content and produce either a beginner guide or a more advanced guide without the beginner information in there. Metadata also allows you to do more interesting things with your content. One example that we see with metadata in the standard DITA implementation is around notes and warnings and cautions. They’re all the same root element of note, but you can use a type attribute to set whether it is a note, whether it is a caution, a tip, a warning, a danger flag or what have you. That’s an example of how metadata can influence the type of content that you have in DITA.
GK: Yeah, absolutely. It’s essentially we describe it as data about data. It’s all of the information in your DITA content that does not actually get printed or electronically distributed, I guess you could say, on the page. It’s everything that’s making it run kind of behind the scenes, but it’s not actually part of your published output. It can influence the way that it is produced as Bill described, the way it’s sorted and searched and delivered. But it’s not something that you actually see, like the words on the page. And I think that that can cause a lot of the confusion that makes it get misused because when people are coming into DITA from a mindset from desktop publishing, the way that metadata works there is quite a bit different. And I think that getting into that mindset of the proper use of metadata in DITA is a really big shift. I want to talk about some of the examples of misuse that we have commonly seen and fixed with a lot of different cases.
BS: I think probably one of the most common examples we’ve seen with regard to metadata misuse has been around using metadata or generic metadata buckets for very specific purposes. And one of these is the output class attribute that a lot of people end up using as formatting instruction within DITA. Which kind of breaks the rules of DITA itself because you generally are going to DITA so you can separate your content from its formatting. But here, we often see output class equals red or output class equals 16 points, where they’re adding instruction that wasn’t built into the transform itself in order to tell the transform how to render a piece of content. And it blows my mind, but it is one of the most common things we’ve seen.
GK: Yeah. And that just, again, it comes from that mindset of working in something like desktop publishing, where you do get to control all the little bits and pieces of the formatting at an individual level. And when you go into something like DITA, where your formatting is separated from the content and is automated, it can be really, really difficult to get your mind past that shift. And so then we see a lot of instances where people go, “Oh, I’m really limited. I can’t make this one piece of text bold anymore or I can’t turn this green anymore.” They just find the workaround of putting an output class on it. And what happens over time is that that misuse of the output class attribute ends up completely defeating the purpose of having automated formatting, because you’ve put all these overrides everywhere. And if you actually had a more legitimate use of output class, then that’s kind of ruined too by the fact that you have misused it in all of these places.
BS: Another bit of metadata misuse that we’ve seen is using one metadata element or attribute for another purpose, a purpose it wasn’t designed for. One example could be that you might be setting audience to let’s say a different country, which kind of makes sense if you want to be able to filter on certain types of content for certain geographies, but really that level of metadata should be held up in the xml:lang attribute where you’re describing which language and which country this content is aimed toward. If you’re labeling something for a German audience, regardless of whether it’s in English or in German, you really should be using the xml:lang attribute as opposed to profiling it for a German audience. Now, there are some differences that you can get into as to whether you want to include or omit certain types of information for a particular audience but in general, you have to be clear about which elements and which attributes you’re going to use for which specific purpose.
GK: Yeah. And audience is a really interesting one because I’ve seen a few different cases where if the audience that a company is delivering for is really complex and there are maybe a lot of different ways that they need to kind of parcel out the content for different chunks of that audience, that just the default metadata for audience in DITA isn’t really enough for them. And some of the workarounds that I’ve seen them do are they’ll use audience for kind of one facet of their total audience and then they’ll go in and pick another metadata element or attribute for another facet of their audience, when it’s really not designed for that. And so that points to the fact that if you’ve got complexity that’s not really built in, that you need to start looking at a more effective way to handle that than just shoving it into a metadata element where it doesn’t actually fit and where it’s not designed for it.
GK: Because then what happens down the road is if they actually did need to use a different metadata element that they had designated for a part of their audience and later they need to use that for its intended purpose then it’s already taken up with however they’ve described it for that piece of their audience. And then they have to do a lot of reworking. It really is important to kind of think about this. And I know we’ve talked about this in some of our other podcasts about planning out a taxonomy and thinking about your metadata as a whole before you go in and just start assigning it to the DITA elements and attributes.
BS: Right. And then even if you’re not misusing an attribute, a lot of times we do see cases where other meta is just used throughout an entire content set, where you’re essentially defining custom metadata, which is good, but you’re doing it in a very generic way that usually requires a lot of, well, it usually involves a lot of user error because everything is hand-typed at that point.
GK: Yeah. I’ve seen a lot of instances where that’s just kind of used as a catch all or a place to shove anything that doesn’t fit into all of the other existing metadata categories that there are. And then what happens is later when you need to organize that better, everything has just been shoved into other meta and there’s not really an easy way to kind of parse that back out and define it without doing a lot of work. It really is, like we said, helpful to plan this out ahead of time, think about all the different metadata that you’re going to need and figure out where and how it fits into DITA’s metadata structure.
BS: Absolutely. The more you fine-tune exactly how your content model needs to operate, the easier it’s going to be to move it forward over time. The more you start taking shortcuts at the beginning and using metadata for purposes other than what it was absolutely intended for, you’re going to have a lot of problems unwinding that as your content set grows, as your publication breadth grows. You’re just going to run into problem after problem after problem so it’s best to do it the right way. Rather than shoehorning a bunch of metadata into random elements and attributes and using other meta wherever you want to and output class and all these other things, probably want to talk about what might be a better approach?
GK: Sure. Of course one is specialization and that is the ability that DITA has to create custom elements and attributes based on the structure of existing ones. And this is absolutely something that you can and should do with metadata if you have that need. And this is really, I think one of the more common areas where we see specialization among our clients. That a lot of times the actual topic and map structure is fine, but there are metadata requirements that they have that just do not fit within what’s available by default in DITA. Coming up with some sort of a taxonomy before you start putting everything into those default DITA elements and attributes can help you see where, okay, maybe we might need a specialized element or set of elements or attributes for the specific type of metadata. And that really gives you a roadmap for how that might work.
GK: And I think one thing to look out for is if you do start with a kind of set metadata structure and then things change over time and you start noticing a pattern, that you may be are using other meta a lot or output class a lot for the same kind of thing over and over because it just doesn’t fit, that can oftentimes kind of be a little red flag to you that, hey, maybe we need to go back and take another look at this and think about specialization for that kind of information.
BS: Right. A lot of people are really hesitant to look at specialization because it means customizing the DITA model and really doing some very high tech and difficult things. And from a metadata perspective, it really is the best way to get in there and make the model work for your content and for your needs. And the beauty of the specialization approach is that once you’ve implemented it, it carries forward. You can update DITA from version to version, to version, to version and your specialized content will work. Your specialized elements and attributes will just work. It’s not divorced from the model. You’re not dealing with some FrankenDITA thing that’s never going to be able to be updated again. It’s really the ideal approach to wrestling with metadata and making sure that you have the right buckets for the right types of data about your data.
GK: Yeah, absolutely. I want to talk about another feature that can easily be misused but also it can be really helpful if it’s used correctly and that is subject scheme. And subject scheme is basically a special type of map that’s available in DITA that allows you to bind specific values or sets of values to attributes. And this can kind of work sometimes as an alternative to specialization if you don’t really have a compelling enough case for specialization yet, but you still need some sort of a custom set of values for your attributes.
GK: And some examples we’ve seen are again, if we go back to the example we talked about for audience and you want to define a list of different pieces of your audience that is kind of more complex than maybe something like beginner, intermediate, advanced, then you can set up that list in your subject scheme, you can set up hierarchical lists of values and it really just makes it a lot easier for your writers to avoid mistyping something because they have a pick list that comes from that subject scheme. But again, it’s also something that can really easily be misused.
BS: Absolutely. And the other piece about a subject scheme is that you can set it up so that you do have that finite list of values and you only have those values available. And that really allows you to only provide the values that you have handling built in for. If you are experimenting with different types of metadata and you don’t necessarily want them in a production mode, you can exclude those from some of the lists, depending on the authors that are working on it. You might have an experimental batch of authors that are working on the next latest and greatest batch of content, but for a lot of the content, that’s more in a maintenance mode or is using the existing publishing workflows that you have established.
BS: You can limit the metadata values to just what’s in the subject scheme, that’s all they have available so they don’t accidentally create something or mistype something that is not going to be handled. Because usually the publishing instruction will basically say, “I don’t understand this value. I’m just going to throw it away and just go with the content that’s there.” And that could be very detrimental if you’re publishing something that has metadata applied to it that says, “Do not publish this in this scenario,” in which case, then you get content that you didn’t intend in your output.
GK: Yeah. I think kind of one of the other ways I’ve seen people misuse subject scheme is that when they start to approach it as kind of a stop gap between having no specialization and eventually getting into specialization for their metadata, that over time, it starts to become really unwieldy and they’re trying to kind of shove, I think, too much complexity into some of those lists of values and really try to make it a substitute for specialization when really it’s not. And I think that’s another one of those things to look out for as kind of a red flag. Is that just like in your content, if you find that you are using output class excessively for the same thing, if you’re shoving into other meta too much, if you get into a subject scheme and you realize that it’s not actually helping with the complexity of everything that you need to capture, then that’s another one of those red flags that says, “Hey, we should look at specialization.”
BS: And likewise it hearkens right back to your taxonomy as well, because at the point where you’re using subject scheme, it should be reflecting what’s in your taxonomy. If there are additional things that you need to add that aren’t available in your pick list for the subject scheme, chances are they’re also missing from your taxonomy, which means you have some more thinking to do on exactly how you are categorizing your content.
GK: Absolutely. We’ve talked about taxonomy as kind of a way to make sure that you avoid that in this use of metadata. Another one I wanted to bring up is also not just to defining a taxonomy, but defining your formatting and your presentation needs as much as you can upfront as well, because that’s also going to play a role in where you might need some custom elements or attributes that can drive things a lot better than just using output class all over the place.
BS: That’s a good point. And I think the final thing that we want to mention is that you want to future-proof your content model as much as possible so that these needs are either expected or at least your model can grow as expectations grow from others of that content model. Being able to have specific metadata that’s specialized for your exact content is going to make it a lot easier to be able to introduce new values, to be able to constrain against specific values for that metadata and also having that mature taxonomy model as well will help you in that regard.
GK: Yeah. If you think about future-proofing and think about planning your taxonomy, planning, your publishing needs, planning your distribution around that, then that will really kind of help shape the way that you think about your metadata use and make sure that you allow for that growth and that scaling that should happen in your company if you’re going in a successful direction. Before you just start going into DITA and building the metadata, it really requires that level of forethought to make sure that you’re not going to misuse anything.
GK: And we’re going to wrap things up there. Thank you so much, Bill.
BS: Thank you.
GK: And thank you for listening to the Content Strategy Experts podcast, brought to you by Scriptorium. For more information, visit scriptorium.com or check the show notes for relevant links.