September 24, 2012

Perils of DITA publishing, part 4: PDF acrobatics

In which we bend PDF publishing to our will. Eventually.

The PDF transform for our Content Strategy 101 book had many requirements beyond what you get with the default PDF plugin in the DITA Open Toolkit. They included:

  • Detailed control of common layout elements
  • Treatment of section elements as subordinate topics
  • Correct handling of part elements in bookmaps, with support for a unique image for each part
  • Improved appearance of admonitions (note, caution, warning, and so on)
  • Repeating table titles (with “continued”) on page breaks
  • Improved widow and orphan handling in tables and bulleted lists
  • Extensive table customization
  • Options for formatting definition lists (dl element) as a table or as a list
  • Support for two-column output of simple lists
  • Automatic generation of cover and copyright pages from bookmap metadata

When you create a PDF using the DITA Open Toolkit, the source DITA files are converted into an intermediate language, known as XSL-FO (eXtensible Stylesheet Language-Formatting Objects). XSL-FO describes how the output pages are laid out and how text is formatted. An FO processor generates the PDF from the input XSL-FO file.

When formatting text, XSL-FO is very similar to XHTML. Where HTML uses the div tag, FO uses block; where HTML uses span, FO uses inline. Another similarity is in how formatting is applied: XHTML uses CSS, and the stylesheets that generate XSL-FO use XSLT attribute sets. Like CSS rules, attribute sets can be layered upon each other, and the names of the attributes they set are very similar to the corresponding CSS properties.
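To make the CSS comparison concrete, here is a minimal Python sketch of how layered attribute sets resolve and how the result lands on a block that contains an inline. It is not the plugin's XSLT; the set names and values are hypothetical. It assumes the XSLT rule that a set first picks up the attributes of the sets it builds on, then applies its own, so its own values win.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Hypothetical attribute sets; names and values are illustrative only.
# "uses" plays the role of layering one set on top of another.
ATTRIBUTE_SETS = {
    "base-font": {"attrs": {"font-family": "serif", "font-size": "10pt"}, "uses": []},
    "body": {"attrs": {"line-height": "12pt"}, "uses": ["base-font"]},
    "note": {"attrs": {"font-size": "9pt"}, "uses": ["body"]},
}


def resolve(name):
    """Merge the sets that `name` builds on, then let its own attributes win."""
    entry = ATTRIBUTE_SETS[name]
    merged = {}
    for used in entry["uses"]:
        merged.update(resolve(used))
    merged.update(entry["attrs"])
    return merged


def fo_block(set_name, text):
    """Build an fo:block carrying the resolved attributes of one set."""
    block = ET.Element(f"{{{FO_NS}}}block", resolve(set_name))
    block.text = text
    return block


# fo:block is the rough equivalent of div, fo:inline of span.
block = fo_block("note", "Body text with ")
inline = ET.SubElement(block, f"{{{FO_NS}}}inline", {"font-style": "italic"})
inline.text = "emphasis"
print(ET.tostring(block, encoding="unicode"))
```

The "note" set inherits the font family from "base-font" and the line height from "body", but overrides the font size, much as a more specific CSS rule would.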

Controlling common layout elements

The XSL stylesheets that define attribute sets are not the easiest things to read; the information is not as compact as in a CSS file. Additionally, because the PDF plugin uses a number of different files for defining the attribute sets (for very good reasons), it’s difficult to get a good picture of the formatting imparted by the attribute sets.

To work around this problem, I created a single file in which we define many common formatting properties using XSL variables. This basic-settings file (we started with the basic-settings.xsl file from the PDF plugin) is organized much like an actual book-design style sheet. Thus, we can define formatting for the elements on the cover page, the TOC, preface, parts, chapters, body, and index. The basic-settings file also defines a number of default values, so that, for example, all body text elements can use a consistent set of attributes. Where fonts can be specified, the basic-settings file uses a consistent set of attributes for family, size, leading, weight, style (italics), and color. Where required, lines are defined with a consistent set of attributes for style, weight, and color.

This portion of basic-settings specifies the figure title attributes.

To support the commercial version of this plugin, we send a form to our customers in which they specify their typographical requirements. We generate this form from code comments in the basic-settings file.

As we fine-tuned the heading sizes for Content Strategy 101 and made other typographical changes, it was easy to go to the basic-settings file, make a few quick changes, and generate an updated PDF.
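The central-settings idea can be sketched in a few lines of Python. This is only an illustration, not the contents of the real basic-settings file (which uses XSL variables): the element names and values are hypothetical. The six font keys follow the list given earlier (family, size, leading, weight, style, color), and each maps to a standard FO property.

```python
# Hypothetical defaults; every element starts from these values.
DEFAULT_FONT = {
    "family": "serif",
    "size": "10pt",
    "leading": "12pt",
    "weight": "normal",
    "style": "normal",
    "color": "black",
}

# Hypothetical per-element overrides, organized like a book-design style sheet.
SETTINGS = {
    "body": {},
    "figure-title": {"family": "sans-serif", "size": "9pt", "weight": "bold"},
    "chapter-title": {"family": "sans-serif", "size": "24pt", "leading": "28pt"},
}

# The consistent font keys and the FO property each one sets.
FO_PROPERTY = {
    "family": "font-family",
    "size": "font-size",
    "leading": "line-height",
    "weight": "font-weight",
    "style": "font-style",
    "color": "color",
}


def fo_attributes(element_name):
    """Combine the defaults with one element's overrides, as FO properties."""
    font = {**DEFAULT_FONT, **SETTINGS[element_name]}
    return {FO_PROPERTY[key]: value for key, value in font.items()}


print(fo_attributes("figure-title"))
```

Changing a heading size then means editing one value in one place, which is the property that made the fine-tuning quick.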

Part elements

Although the part element is part (ha!) of the bookmap specialization, the default PDF plugin has little part processing (just as there’s not much for dealing with section elements). In normal processing, when the plugin encounters a topic title, it counts the number of ancestor topics. The presumption is that a topic with zero ancestor topics should have a chapter title; a topic with one ancestor topic should be a first-level head; two ancestor topics, a second-level head, and so on. However, if part elements are added to the map, that throws the numbering off.

To format the topic titles correctly (and to get them to behave correctly in the TOC and bookmarks), we had to modify the way the OT counts ancestor topics. If a topic’s ancestors include a part element, subtract one from the level count.

This solution worked even with an edge case: the introductory chapter stands by itself in Content Strategy 101. The first chapter in Part I is Chapter 2.
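The level rule can be sketched as follows. This is a simplified Python model, not the plugin's XSLT template: it walks a small hypothetical bookmap rather than the merged topics the toolkit actually processes, and the IDs are made up. It includes the edge case of a chapter that sits outside any part.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookmap: one standalone chapter, then a part with a chapter.
BOOKMAP = """
<bookmap>
  <chapter id="intro"/>
  <part id="part1">
    <chapter id="ch2">
      <topicref id="t1">
        <topicref id="t2"/>
      </topicref>
    </chapter>
  </part>
</bookmap>
"""

TOPIC_TAGS = {"part", "chapter", "topicref"}


def heading_level(root, target_id):
    """Return 0 for a chapter title, 1 for a first-level head, and so on."""

    def walk(node, ancestors):
        if node.get("id") == target_id:
            return ancestors
        # Only topic-like elements count as ancestors of the children.
        nxt = ancestors + [node.tag] if node.tag in TOPIC_TAGS else ancestors
        for child in node:
            found = walk(child, nxt)
            if found is not None:
                return found
        return None

    ancestors = walk(root, [])
    if ancestors is None:
        raise KeyError(target_id)
    level = len(ancestors)
    if "part" in ancestors:
        level -= 1  # a part ancestor must not push the headings down a level
    return level


root = ET.fromstring(BOOKMAP)
for topic_id in ("intro", "ch2", "t1", "t2"):
    print(topic_id, heading_level(root, topic_id))
```

Both the standalone introductory chapter and the chapter inside the part come out at level 0, so they receive the same chapter-title formatting.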

Section elements

When writing concept topics, the DITA section element can be quite useful, as it allows you to subdivide a topic into smaller-than-topic units.

However, to the DITA-OT, the existence of the section element raises a big question: when a bookmap includes a topic that uses sections, should those section elements be treated differently than subordinate topics in the bookmap? After some wrestling with the matter, we decided that a section title inside a topic should be treated just like a topic that is subordinate to the current topic in the ditamap.

Because I had already made changes to the template that counted ancestor topics (as part of the part element modifications), it was easy to go in and make similar allowances for section elements. There was one other wrinkle, though: the ancestor-counting template assumed that the starting topic element had an ID; unfortunately, section elements are not required to have IDs.

The default PDF transform doesn’t include a lot of code for dealing with section elements, so I had to borrow from some of the topic element handling logic. However, when I implemented my changes for title elements in sections, I named the attribute set differently for flexibility.
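As an illustration of counting levels without relying on IDs, here is a small Python sketch that starts from the title node itself and walks up through its ancestors. It is one possible approach under stated assumptions, not the template we actually wrote: the sample topic is hypothetical, and the sketch counts section and topic ancestors alike, which matches the decision to treat a section like a subordinate topic.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic: a section without an ID next to a nested child topic.
TOPIC = """
<topic id="concept">
  <title>Concept</title>
  <body>
    <section><title>No ID here</title></section>
  </body>
  <topic id="child"><title>Child</title></topic>
</topic>
"""

LEVEL_TAGS = ("topic", "section")


def title_level(root, title):
    """Count the topic and section ancestors above the element owning `title`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = parents[title]  # the topic or section that owns this title
    level = 0
    while node in parents:
        node = parents[node]
        if node.tag in LEVEL_TAGS:
            level += 1
    return level


root = ET.fromstring(TOPIC)
for title in root.iter("title"):
    print(title.text, title_level(root, title))
```

The section title and the child topic title both land one level below the parent topic, and no ID lookup is involved.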

Improved appearance of admonitions

The formatting of admonitions in the PDF transform contributes to the complaints about ugly DITA PDF files. We dropped the cheesy images but were left with another issue: the PDF transform writes the admonition label (“Note”, “Caution”, “Tip”) as a separate block in FO. We wanted the admonition label to be a part of the first paragraph (that is, an inline element in the first block). This required reordering some of the labeling logic for note elements, along with some additional behavior in the p element transform.

The admonition label is a part of the first paragraph.

And then we discovered that one of the authors (ahem) couldn’t decide whether to use p elements inside note elements. We solved that by creating an implied p if one didn’t exist.
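Here is a rough Python sketch of the two behaviors just described: the label becomes an inline at the start of the first block, and a note with no p element gets an implied one. It is not the plugin's XSLT. The label strings, the colon after the label, and the bold styling are assumptions, and the sketch ignores inline markup inside paragraphs to stay short.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

# Assumed label text per note type; a real transform would localize these.
LABELS = {"note": "Note", "caution": "Caution", "tip": "Tip"}


def note_to_fo(note):
    """Return a list of fo:block elements for one DITA note element."""
    label = LABELS.get(note.get("type", "note"), "Note")
    paragraphs = list(note.findall("p"))
    if not paragraphs:
        # Implied p: wrap the note's bare text so the rest of the code
        # can treat every note the same way.
        implied = ET.Element("p")
        implied.text = (note.text or "").strip()
        paragraphs = [implied]

    blocks = []
    for index, p in enumerate(paragraphs):
        block = ET.Element(f"{{{FO_NS}}}block")
        if index == 0:
            # The label is an inline inside the first block, not its own block.
            inline = ET.SubElement(
                block, f"{{{FO_NS}}}inline", {"font-weight": "bold"}
            )
            inline.text = f"{label}: "
            inline.tail = p.text or ""
        else:
            block.text = p.text or ""
        blocks.append(block)
    return blocks


for source in (
    '<note type="caution">Mind the gap.</note>',
    "<note><p>One</p><p>Two</p></note>",
):
    for block in note_to_fo(ET.fromstring(source)):
        print(ET.tostring(block, encoding="unicode"))
```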

Repeating table titles across page breaks

One popular FrameMaker feature is its multipage table handling: if a table breaks across pages, FrameMaker displays the table’s title on the next page (and even optionally adds text such as “continued” to the title). This feature is not implemented in the standard PDF plugin, but Sarah and Alan wanted this handling. Because FO processors repeat column heads across page breaks, I created an initial row in the table header that contains the table title. After formatting this first row to make it look like it was outside of the table, voilà! Repeating table titles.

If a table is split across pages, the table title is repeated.
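The header-row trick can be sketched like this in Python. It builds the FO structure directly rather than through the plugin's XSLT, and the function name, sample data, and styling values are hypothetical. The title goes into the first row of the table header, spanning every column with its borders turned off, so an FO processor that repeats the header after a page break repeats the title too. The sketch does not attempt the “continued” text.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(parent, tag, attrs=None, text=None):
    """Append an FO child element with optional attributes and text."""
    element = ET.SubElement(parent, f"{{{FO_NS}}}{tag}", attrs or {})
    element.text = text
    return element


def table_with_repeating_title(title, column_heads, rows):
    """Build an fo:table whose header carries the title as its first row."""
    table = ET.Element(f"{{{FO_NS}}}table")
    header = fo(table, "table-header")

    # Title row: one cell spanning all columns, styled to look as if it
    # sits outside the table.
    title_row = fo(header, "table-row")
    title_cell = fo(
        title_row,
        "table-cell",
        {"number-columns-spanned": str(len(column_heads)), "border-style": "none"},
    )
    fo(title_cell, "block", {"font-weight": "bold"}, title)

    # Ordinary column heads follow the title row inside the same header.
    head_row = fo(header, "table-row")
    for head in column_heads:
        fo(fo(head_row, "table-cell"), "block", text=head)

    body = fo(table, "table-body")
    for row in rows:
        table_row = fo(body, "table-row")
        for value in row:
            fo(fo(table_row, "table-cell"), "block", text=value)
    return table


table = table_with_repeating_title(
    "Table 1: Widgets", ["Name", "Size"], [["A", "1"], ["B", "2"]]
)
print(ET.tostring(table, encoding="unicode"))
```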

But wait, there’s more! Next week, the PDF acrobatics continue in Part 5!