Perils of DITA publishing, part 3: Indexing
In which we are boxed in by the limitations of DITA indexing support.
Dilemmas abound when you’re indexing DITA content. This third installment in our Perils of DITA publishing series explains how Sarah O’Keefe and I handled some indexing challenges while developing our latest book, Content Strategy 101.
Where should I place index entries in my source DITA files?
Advice here and there on the web recommends against putting indexterm elements within the body of a topic where the referenced content occurs. Instead, guidelines suggest placing the indexterm elements in one of two locations:
- In the prolog of a topic
- In the topicref in a ditamap
Those recommendations make sense for maximizing topic reuse and streamlining localization. They are problematic, however, if PDF/print is a primary deliverable (as the previously mentioned industry advice also points out). Putting indexterm elements in the prolog or topicref creates entries pointing to the start of a topic in PDF output. If that topic breaks across pages, an index entry intended for information at the end of the topic points to the top of the topic on the previous page.
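For reference, the two recommended locations look roughly like this in DITA source. This is only a minimal sketch: the topic ID, file name, titles, and index terms are made up for illustration.

```xml
<!-- Option 1: in the prolog of the topic (hypothetical file sample_topic.dita) -->
<concept id="sample_topic">
  <title>Planning for localization</title>
  <prolog>
    <metadata>
      <keywords>
        <!-- Nested indexterm creates a secondary entry under "localization" -->
        <indexterm>localization<indexterm>planning</indexterm></indexterm>
      </keywords>
    </metadata>
  </prolog>
  <conbody>
    <p>Topic content goes here.</p>
  </conbody>
</concept>

<!-- Option 2: in the topicref that pulls the topic into the ditamap -->
<map>
  <title>Sample book</title>
  <topicref href="sample_topic.dita">
    <topicmeta>
      <keywords>
        <indexterm>localization<indexterm>planning</indexterm></indexterm>
      </keywords>
    </topicmeta>
  </topicref>
</map>
```

In both cases the entry is attached to the topic as a whole, which is why the page reference in PDF output lands on the page where the topic starts.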
We opted to place our indexterm elements within the body of topics so the PDF file we sent to the printer would have precise index entries. Reuse and localization were not our priorities for this book. Happy index users were.
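A minimal sketch of in-body placement follows. The topic and terms are invented for illustration, and putting the indexterm directly after the start tag of the block it describes is one common convention, not necessarily how every entry in the book was coded.

```xml
<concept id="sample_topic">
  <title>Planning for localization</title>
  <conbody>
    <p>Opening paragraph, which may land on one page.</p>
    <p>More paragraphs follow.</p>
    <!-- The entry sits in the paragraph it refers to, so the page number
         in the PDF index reflects where this paragraph actually falls -->
    <p><indexterm>localization<indexterm>vendors</indexterm></indexterm>A later
      paragraph about localization vendors, which may land on the next page.</p>
  </conbody>
</concept>
```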
Because Content Strategy 101 is narrative, the topics are a bit longer than you might see in technical content; topic length played a big part in our decision. Even so, a short DITA topic with three or four paragraphs could split across pages in a PDF file. Therefore, when considering indexterm element placement, you have to balance the needs of your readers against your localization and reuse requirements. If you need precisely placed index entries but still want source content that’s streamlined for localization, talk with your localization vendor about the best placement of index entries within the body of topics.
Note: Even though we decided to place indexterm elements within the body of topics, we still ran up against a few problems in regard to where indexterm elements are allowed. For example, you can’t place an indexterm element in the elements of a definition list (dt, dd) without wrapping the indexterm element in a ph element. I’m sure there are reasons the DITA specification doesn’t allow indexterm within dt and dd elements, but I don’t know what they are.
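The workaround described in the note can be sketched as follows. The definition list content is hypothetical; the only point is the ph wrapper around each indexterm.

```xml
<dl>
  <dlentry>
    <!-- indexterm is wrapped in ph rather than placed directly in dt -->
    <dt><ph><indexterm>style guides</indexterm></ph>Style guide</dt>
    <!-- Same wrapper used in dd, following the note above -->
    <dd><ph><indexterm>editorial standards</indexterm></ph>A document that
      records the editorial standards a team has agreed on.</dd>
  </dlentry>
</dl>
```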
I put indexterm elements in my topics, so why does my PDF output have no index entries?
After you spend time adding index entries to DITA source files, it’s very annoying to generate PDF output through the DITA Open Toolkit and get no index entries in your PDF file. Yep, that’s right. If you use the Apache FOP processor that comes with the Open Toolkit, PDF output generated with the default PDF plugins will contain no index entries.
It’s enough to make anyone feel downright stabby.
To get index entries in your PDF output, you have a few options:
- Recode the index processing in the PDF plugin to work with the FOP processor. (I can hear your screams. Writing XSL-FO code isn’t my cup of tea, either. I prefer to let Simon Bate do the dirty work.)
- Buy a plugin that includes index processing FOP can understand. (Here’s where I shamelessly plug Scriptorium’s PDF plugin, which we adapted for Content Strategy 101. Simon recently updated the plugin for the 1.6.2 release of the DITA Open Toolkit.)
- Buy a proprietary FO formatter (such as the Antenna House Formatter) that will render the index information generated by the PDF plugin. For the record, we used the Antenna House Formatter for Content Strategy 101 and other Scriptorium Press titles authored in DITA.
None of those options is inexpensive, and each perfectly encapsulates how DITA is free but not cheap.
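If you go the proprietary-formatter route, the formatter is typically selected through an Ant property when you call the toolkit. The build file below is a rough sketch, not our production build: the install path, map name, and output folder are hypothetical, and the pdf.formatter parameter and its values are stated from memory of the toolkit documentation, so check them against your toolkit version.

```xml
<project name="build-book" default="pdf" basedir=".">
  <!-- Hypothetical location of the DITA Open Toolkit installation -->
  <property name="dita.dir" location="C:/DITA-OT1.6.2"/>

  <target name="pdf">
    <!-- Hand off to the toolkit's own build file -->
    <ant antfile="${dita.dir}/build.xml" dir="${dita.dir}">
      <!-- Hypothetical map file and output folder -->
      <property name="args.input" location="${basedir}/book.ditamap"/>
      <property name="output.dir" location="${basedir}/out"/>
      <property name="transtype" value="pdf2"/>
      <!-- Assumed values: fop (default), ah (Antenna House), xep (RenderX) -->
      <property name="pdf.formatter" value="ah"/>
    </ant>
  </target>
</project>
```

The formatter itself still has to be installed and licensed separately, and the toolkit usually needs to be told where to find it.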
After using FrameMaker for years, I’m accustomed to typing colons to separate primary and secondary index entries. How difficult is it to break that habit?
Oh, it’s very hard. My first pass at the index code was full of colons because of FrameMaker muscle memory. I would type
<indexterm>hello:world</indexterm>
instead of
<indexterm>hello<indexterm>world</indexterm></indexterm>
Because so many technical authors have used FrameMaker and are primed to type colons while indexing, the DITA Open Toolkit lets you type colons to create nested index entries for PDF output generated from the pdf2 plugin. Starting with toolkit version 1.5.4, a toggle was added to control whether FrameMaker indexing syntax is supported; by default, support is turned off.
In version 1.6.2 of the Open Toolkit, the toggle is the org.dita.pdf2.index.frame-markup property in the DITA-OT1.6.2\lib\configuration.properties file. From what I can tell, FrameMaker syntax support works only with the pdf2 plugin in the Open Toolkit. Therefore, if you generate outputs with the other default toolkit plugins, you’ll still have problems with the colons unless you update the transforms to handle FrameMaker syntax in indexing. (Ugh.)
Next week, you can read more about PDF output when our Perils of DITA Publishing series continues with the curious case of the PDF plugin. Stay tuned!
Yves Barbion
Another option to get index entries in your PDF output is to buy FrameMaker with Leximation’s DITA-FMx plugin, which has very advanced index options:
http://docs.leximation.com/dita-fmx/1.1/ > Working with Indexterms
Alan Pringle
More proof DITA is free but not cheap!
Leigh White
Oh, not so blasphemous. I wouldn’t disagree with anything here. I’ve had the same discussion many times with writers about placement of the indexterm elements. I strongly favor the prolog or the topicref, and my justification (aside from reuse and localization) is that it’s not always likely that a reader can turn to the exact location of an indexed term and understand the reference without reading back at least a few sentences or a paragraph and possibly even returning to the beginning of the topic anyway, so why not take them there to begin with? But different kinds of subject matter might vary widely in this respect. Aside from that, you gotta wonder why FOP doesn’t have index support after all this time. Maybe it’s an Antenna House/XEP conspiracy! 🙂
Alan Pringle
I think we’re coming from the same place on this, Leigh. Placement of indexterm elements depends on the type of content, the kinds of output, etc.: the sort of stuff that should be in the requirements a team develops before choosing and implementing a new process.
I do wonder, though, how many people know they won’t get index entries in their PDF files from the Open Toolkit before starting down the DITA path.
Jarno Elovirta
The reason DITA-OT doesn’t generate an index with FOP is that FOP doesn’t support XSL 1.1, nor does it have an extension to remove duplicate page number references. We could enable index generation for FOP in DITA-OT, but if the same index term appears multiple times on the same page, the same page number reference would appear multiple times.
AXF supports XSL 1.1 which has a feature to remove the duplicates, and XEP has an extension for it.
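To illustrate the XSL 1.1 feature in question, here is a rough XSL-FO sketch. It is hand-written for this example rather than copied from DITA-OT output, and the key name and page geometry are arbitrary. A formatter that supports XSL 1.1 indexing is expected to collect the pages for the key and list each page once.

```xml
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page" page-height="11in"
                           page-width="8.5in" margin="1in">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <!-- Two occurrences of the same key, both on the same page -->
      <fo:block><fo:inline index-key="localization"/>First mention.</fo:block>
      <fo:block><fo:inline index-key="localization"/>Second mention.</fo:block>

      <!-- Index entry: the formatter resolves the key to page numbers -->
      <fo:block break-before="page">localization
        <fo:index-page-citation-list>
          <fo:index-key-reference ref-index-key="localization"
                                  page-number-treatment="link"/>
        </fo:index-page-citation-list>
      </fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

Without these formatting objects, the stylesheet has to emit one page number citation per occurrence, which is where the duplicates come from.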
Alan Pringle
Jarno, thanks for clarifying why the default OT doesn’t support indexing. People need this kind of information before diving into DITA.
Joe Pairman
This is a useful article, Alan. It seems that many people still think the DITA-OT will be more like a DTP tool than it is. The example of indexing is a good reminder that the details of a DITA publishing solution (or any structured content publishing solution for that matter) may take more work than they’d expected.
I do agree with Leigh’s general recommendation for indexterm positioning, though. It often makes sense to land people at the beginning of a topic, where they get a bit more context. And by putting the indexterm in the prolog or topicref, it also shows up in the meta tags in your HTML-based outputs, which can be quite useful as synonym sources for internal site searches or other customized search applications.
Regarding the danger of linking to the top of a topic where most of the topic is on the next page, it may make sense to create stylesheet rules that tend to keep the top few block elements in a topic together anyway, thus pulling the topics you mentioned onto the next page. Of course it needs a lot of tweaking to get right, but the results can be easier to read. (The resulting increased whitespace might not be economical if the PDF is for print, but it seems fewer and fewer PDF docs are printed these days.)
However, where it makes more sense to use index terms for specific blocks in a topic, it should still work fine for localization, if the advice in the indexing best practices white paper is followed: “Insert block-level index tags immediately following the start tag of the applicable containing block element”. As you say, the best thing is to work with the translators on structure changes like this.
Regarding the actual authoring of index entries, indeed it can be a bit of a pain. (Did you get into index-see at all? That’s an interesting thing on its own, and not entirely intuitive.) I think we might end up creating a tool to manage index entries in the style of InDesign.
Alan Pringle
Joe,
I did create a few “see” and “see also” entries in the index; I think cross-indexing terms is vital for a good index. Can’t say I enjoyed coding those entries in DITA, however.
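For anyone who hasn't coded these entries, a minimal sketch of the markup follows. The terms are invented for illustration.

```xml
<!-- "See" entry: sends readers from a variant term to the preferred term -->
<indexterm>l10n<index-see>localization</index-see></indexterm>

<!-- "See also" entry: the term keeps its own page references and
     additionally points to a related term -->
<indexterm>localization<index-see-also>translation</index-see-also></indexterm>
```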
Because Content Strategy 101 is narrative and the PDF was for the print edition, we wanted our index entries to take readers to the exact page. We think our decision was “best practice” for this particular kind of content, but what you and Leigh have stated about placement up top makes a great deal of sense for traditional tech comm.
Sarah O'Keefe
see and see also are a piece of cake compared to ranges, which are highly problematic in online editions. I think we took the approach of outlawing them, which the indexer in me finds appalling. The only other solution I could come up with was to remap them as follows:
Print edition:
- range entry becomes a range: example 6-8
Online editions:
- start range entry becomes an entry: example 6
- end range entry is discarded
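In DITA source, a range is marked with matching start and end attributes on two indexterm elements. A minimal sketch, with a hypothetical range name and placeholder content:

```xml
<!-- Start of the range: carries the entry text and a range name -->
<p><indexterm start="example-range">example</indexterm>First paragraph
  covered by the range.</p>

<p>Content in between, possibly spanning several pages.</p>

<!-- End of the range: empty element whose end value matches the start value -->
<p>Last paragraph covered by the range.<indexterm end="example-range"/></p>
```

Under the mapping above, the first element would yield "example 6-8" in print and "example 6" online, and the second element would be dropped from online output.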
Tim Slager
Thank you for addressing ranges. I like your solution. I’m still at the point of not being able to get ranges to display in my (PDF) index.