Death and tax-onomies: Metadata with minimal pain
To a lot of people, the word “taxonomy” sounds intimidating. But what I want content professionals to know is that metadata doesn’t have to be painful. You don’t need to start with a blank page, and you don’t need to reinvent the wheel. The framework already exists, and it’s Dublin Core.
To understand why modern cataloging is so powerful, we have to look back. And we can look back pretty far. Basically, metadata isn’t new–metadata existed for thousands of years before computers. For example, one of the original metadata schemas was the quipu: record-keeping devices made of knotted cords, used several centuries ago in the central Andes, particularly the Inca Empire. Quipus are complex systems of cords where information is categorized by dimensions like color, order, number fiber type, and cord attachments. In each category, a decimal positional system of knots stores values so that knot clusters represent digits in a base-10 number system.

Image from Wikimedia Commons, Inca Quipu
Move from then to now, and we have innumerable standards: MARC, METS, RDF, Dublin Core, and so on. Metadata standards are a bit of a turtles-all-the-way-down situation, but tend to diverge based on context. For example, PBCore was developed to catalog public broadcasting media, Darwin Core for the biological sciences, and ISO 19115 for geographic information.
In digital library science, Dublin Core is the universal standard because of its flexibility. Quipus were only decipherable by a special class of officials in the empire, but Dublin Core is purposefully designed to be accessible and understandable to anyone, not just taxonomists and librarians.
While we can be very thankful that metadata has matured over the past several centuries, we now face a different problem: information silos. Maybe your marketing team has one set of terms, engineering has another, and training has a third.
This is where the Dublin Core Metadata Initiative (DCMI) comes in. Established in 1995 in Dublin, Ohio, this standard was designed for one thing: resource discovery. The standard is designed to be simple, extensible, interoperable, and, most importantly for the context of content operations, domain agnostic. It works for a PDF manual just as well as it works for a 19th-century novel or a piece of art or any object you want to describe.
Using Dublin Core as the basis for your organizational taxonomy can translate across various information silos to build a unified system for faceted search, information retrieval (yes, including for AI purposes), and futureproofing your content operations.
Dublin Core and DITA
Dublin Core consists of 15 basic elements divided into three parent categories:
- Content: Title, Subject, Description, Source, Language, Relation, Coverage
- Intellectual Property: Creator, Publisher, Contributor, Rights
- Instantiation: Date, Type, Format, Identifier
Metadata in DITA was explicitly modeled after Dublin Core. Mapping between the two is very straightforward. For example, a Dublin Core Description maps directly to the <shortdesc> element. The Creator and Contributor elements map to <author type=”creator”> and <author type=”contributor”> in DITA.
To prove its extensibility, I use this crosswalking practice across wildly different fields…from mapping a METS record for a 1970s Sol LeWitt conceptual art installation to a task topic in DITA and a booksellers’ record for a specific edition of Madame Bovary to metadata encoded in a DITA bookmap.
Crosswalking your own taxonomy
If you are trying to handle a replatforming project or trying to align multiple departments in how they describe and organize content, the following process might help deliver you to the interoperable, functional content classification system of your dreams.
- Map what you already have
Look at the organizational and descriptive systems your company already uses. Maybe it’s a product hierarchy on your website or an enterprise taxonomy living in marketing. Take those existing categories and map them to the 15 basic Dublin Core elements.
- Establish a controlled vocabulary
Consistency is your friend. A controlled vocabulary is a restricted list of preferred terms that ensures everyone describes the exact same thing the exact same way. If half your writers tag a topic as “setup” and the other half tag it as “installation,” your end-stage faceted search fails. You can still use synonymous or alternative terms and cross-listing to keep peace between departments, but your core system must use defined, preferred terms.
- Specialize where the base is too broad
Dublin Core is intentionally basic. Once you’ve mapped your categories into Dublin Core, identify the gaps. Where is the core standard too broad for your content? Build your custom, granular values downward from that high-level parent framework to create the true value of having a custom content taxonomy.
As you move through this process, helpful questions you can ask yourself are:
What descriptive and organizational system does my organization already have?
Your organization probably (or hopefully) already has an enterprise taxonomy of some kind. Maybe it lives in the marketing department, maybe it’s a product hierarchy used to organize product listings on your website, maybe it lives somewhere else. Use that information as a starting point instead of starting with a blank page.
What metadata is automatically tracked by my CCMS?
When you’re designing a taxonomy for your organization’s content, keep in mind that not all of the metadata has to live in DITA. You can leverage the standard things tracked by your CCMS, like when and who last edited a file or when the file was created or released.
Which standard DITA metadata operators are relevant to my content?
To keep things higher level, use the 15 basic Dublin Core elements to quantify how you need to categorize your content and then crosswalk that model into a higher level of granularity in DITA metadata.
What is unique about my content?
Value comes from combining the standard stuff, like the 15 basic Dublin Core elements or metadata automatically tracked by your CCMS, with industry-specific concepts or other things unique to your business. For example, drivetrains for a car manufacturer or flaps for an airplane manufacturer. These unique concepts will inform how and what you specialize metadata for in the DITA implementation of your taxonomy.
On tracking too much metadata…
A common trap companies fall into is tracking too much metadata. Bloated, overly granular tagging systems cause chaos. If a subvalue only applies to one tiny edge case, don’t alter your entire schema to accommodate it. Track only what is essential. Let your Component Content Management System (CCMS) automatically track instantiation data (like dates and file formats) so your authors don’t have to manually fill out a ton of fields.
Ultimately, robust metadata is a massive boon to digital accessibility, information retrieval, and even AI initiatives. If you feed standardized, cleanly tagged content into an LLM, it will be exponentially more effective in making sense of your content and knowing what content to serve for what prompts.
Questions about taxonomies? Contact us!
"*" indicates required fields
