XML

Handling XSL-FO’s memory issue with large page counts

Formatting Object (FO) processors (FOP, in particular) often fail with memory errors when processing very large documents for PDF output. Typically in XSL-FO, the body of a document is contained in a single fo:page-sequence element. When an FO document is converted to PDF, the FO processor holds the entire fo:page-sequence in memory so that it can adjust pagination across the whole sequence. Very large page counts can therefore exhaust memory and produce Java heap space errors.

Content strategy

The State of Structure

In early 2009, Scriptorium Publishing conducted a survey to measure how and why technical communicators are adopting structured authoring.

Of the 616 responses:

  • 29 percent of respondents indicated that they had already implemented structured authoring.
  • 16 percent indicated that they do not plan to implement structured authoring.
  • 14 percent were in the process of implementing structured authoring.
  • 20 percent were planning to do so.
  • 21 percent were considering it.

This report summarizes our findings on topics including the reasons for implementing structure, the adoption rate for DITA and other standards, and the selection of authoring tools.

Download PDF file (2 MB, 56 pages)

Discuss this document in our forum

Tools

Adding a DOCTYPE declaration on XSL output

In a posting a few weeks ago I discussed how to ignore the DOCTYPE declaration when processing XML through XSL. What I left unaddressed was how to add the DOCTYPE declaration back to the files. Several people have told me they’re tired of waiting for the other shoe to drop, so here’s how to add a DOCTYPE declaration.

First off: the easy solution. If the documents you are transforming always use the same DOCTYPE, you can use the doctype-public and doctype-system attributes in the <xsl:output> directive. When you specify these attributes, XSL inserts the DOCTYPE automatically.
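
As a minimal sketch of the easy case, assume every file you transform is a DITA topic. The public and system IDs below are the topic IDs used later in this post; substitute your own. The identity template is only there to make the example a complete stylesheet that copies its input.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The serializer writes this DOCTYPE ahead of the root element. -->
  <xsl:output method="xml"
      doctype-public="-//OASIS//DTD DITA Topic//EN"
      doctype-system="topic.dtd"/>

  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because both attributes are fixed when the stylesheet is written, every output file gets the same DOCTYPE, which is exactly the limitation described next.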

However, if the DOCTYPE varies from file to file, you’ll have to insert the DOCTYPE declaration from your XSL stylesheet. In DITA files (and in many other XML architectures), the DOCTYPE is directly related to the root element of the document being processed. This means you can detect the name of the root element and use standard XSL to insert a new DOCTYPE declaration.

Before you charge ahead and drop a DOCTYPE declaration into your stylesheet, understand that a DOCTYPE declaration in the middle of a template is not well-formed XML. If you try to write it literally, your XSL processor will complain. Instead, you’ll have to:

  • Use entities for the less-than (“<” – “&lt;”) and greater-than (“>” – “&gt;”) signs, and
  • Disable output escaping so that the entities are actually emitted as less-than or greater-than signs (output escaping will convert them back to entities, which is precisely what you don’t want).

There are at least two possible approaches for adding DOCTYPE to your documents: use an <xsl:choose> statement to select a DOCTYPE, or construct the DOCTYPE using the XSL concat() function.

To insert the DOCTYPE declaration with an <xsl:choose> statement, use the document’s root element to select which DOCTYPE declaration to insert. Note that the “&lt;” and “&gt;” entities inside the <xsl:text> elements aren’t HTML errors in this post; they are what you need to use. Also note that the DOCTYPE statement text in this template is left-aligned so that the output DOCTYPE declarations will be left-aligned. Most parsers seem to tolerate whitespace before the DOCTYPE declaration, but I prefer to err on the side of caution:


<xsl:template match="/">
<!-- name(*) is the name of the root element, even if a comment
or processing instruction comes before it. -->
<xsl:choose>
<xsl:when test="name(*) = 'topic'">
<xsl:text disable-output-escaping="yes">
&lt;!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd"&gt;
</xsl:text>
</xsl:when>
<xsl:when test="name(*) = 'concept'">
<xsl:text disable-output-escaping="yes">
&lt;!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"&gt;
</xsl:text>
</xsl:when>
<xsl:when test="name(*) = 'task'">
<xsl:text disable-output-escaping="yes">
&lt;!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd"&gt;
</xsl:text>
</xsl:when>
<xsl:when test="name(*) = 'reference'">
<xsl:text disable-output-escaping="yes">
&lt;!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd"&gt;
</xsl:text>
</xsl:when>
</xsl:choose>
<xsl:apply-templates select="node()"/>
</xsl:template>

The preceding example contains statements for the topic, concept, task, and reference topic types; if you use other topic types, you’ll need to add additional statements. Rather than write a statement for each DOCTYPE, a more general approach is to process the name of the root element and construct the DOCTYPE declaration using the XSL concat() function.


<xsl:variable name="ALPHA_UC" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="ALPHA_LC" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="NEWLINE" select="'&#x0A;'"/>

<xsl:template match="/">
<xsl:call-template name="add-doctype">
<xsl:with-param name="root" select="name(*)"/>
</xsl:call-template>
<xsl:apply-templates select="node()"/>
</xsl:template>

<!-- Create a doctype based on the root element -->
<xsl:template name="add-doctype">
<xsl:param name="root"/>
<!-- Create an init-cap version of the root element name. -->
<xsl:variable name="initcap_root">
<xsl:value-of
select="concat(translate(substring($root,1,1),$ALPHA_LC,$ALPHA_UC),
translate(substring($root,2),$ALPHA_UC,$ALPHA_LC))"
/>
</xsl:variable>
<!-- Build the DOCTYPE by concatenating pieces.
Note that XSL syntax requires you to use the &quot; entities for
quotation marks ("). -->

<xsl:variable name="doctype"
select="concat('!DOCTYPE ',
$root,
' PUBLIC &quot;-//OASIS//DTD DITA ',
$initcap_root,
'//EN&quot; &quot;',
$root,
'.dtd&quot;')"/>
<xsl:value-of select="$NEWLINE"/>
<!-- Output the DOCTYPE surrounded by < and >. -->
<xsl:text disable-output-escaping="yes">&lt;</xsl:text>
<xsl:value-of select="$doctype"/>
<xsl:text disable-output-escaping="yes">&gt;</xsl:text>
<xsl:value-of select="$NEWLINE"/>
</xsl:template>

The one caveat about this approach is that it depends on every public ID sharing the same form (“-//OASIS//DTD DITA ” followed by the init-cap root element name). If the public IDs for your various DOCTYPE declarations differ, those differences may complicate the template.
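
For XSLT 2.0 processors such as Saxon 9, a third route is worth knowing about, although it is not the method used above. In XSLT 2.0, the doctype-public and doctype-system attributes of <xsl:result-document> are attribute value templates, so they can be computed per file without disable-output-escaping. The following is a sketch under the same assumption as the concat() template (every public ID follows the “-//OASIS//DTD DITA Xxx//EN” pattern); it simply copies the input, so treat the <xsl:copy-of> as a placeholder for your own processing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <!-- Name of the root element, for example "concept". -->
    <xsl:variable name="root" select="name(*)"/>
    <!-- Init-cap version of the name, for example "Concept". -->
    <xsl:variable name="initcap_root"
        select="concat(upper-case(substring($root, 1, 1)),
                       lower-case(substring($root, 2)))"/>
    <!-- With no href attribute, the result goes to the principal output.
         The braces are attribute value templates. -->
    <xsl:result-document
        doctype-public="-//OASIS//DTD DITA {$initcap_root}//EN"
        doctype-system="{$root}.dtd">
      <!-- Placeholder: swap in xsl:apply-templates once you add
           templates of your own. -->
      <xsl:copy-of select="node()"/>
    </xsl:result-document>
  </xsl:template>

</xsl:stylesheet>
```

This only works when nothing else is written to the principal output, and it is not available in XSLT 1.0 processors.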

So there you have it: DOCTYPEs in a flash. Just remember to use disable-output-escaping="yes" and use entities where appropriate and you’ll be fine.

Opinion

Would you use just a gardening trowel to plant a tree?

As technical communicators, our ultimate goal is to create accessible content that helps users solve problems. Focusing on developing quality content is the priority, but you can take that viewpoint to an extreme by saying that content-creation tools are just a convenience for technical writers:

The tools we use in our wacky profession are a convenience for us, as are the techniques we use. Users don’t care if we use FrameMaker, AuthorIt, Flare, Word, AsciiDoc, OpenOffice.org Writer, DITA or DocBook to create the content. They don’t give a hoot if the content is single sourced or topic based.

Sure, end users probably don’t know or care about the tools used to develop content. However, users do have eagle eyes for spotting inconsistencies in content, and they will call you out for conflicting information in a heartbeat (or worse, just abandon the official user docs altogether for being “unreliable”). If your department has implemented reuse and single-sourcing techniques that eliminate those inconsistencies, your end users are going to have a lot more faith in the validity of the content you provide.

Also, a structured authoring process that removes the burden of formatting content from the authoring process gives tech writers more time to focus on providing quality content to the end user. Yep, the end user doesn’t give a fig that the PDF or HTML file they are reading was generated from DITA-based content, but because the tech writers creating that content focused on just writing instead of writing, formatting, and converting the content, the information is probably better written and more useful.

Dogwood // flickr: hlkljgk


All this talk about tools makes me think about the implements I use for gardening. A few years ago, I planted a young dogwood tree in my back yard. I could have used a small gardening trowel to dig the hole, but instead, I chose a standard-size shovel. Even though the tree had no opinion on the tool I used (at least I don’t think it did!), it certainly benefited from my tool selection. Because I was able to dig the hole and plant the tree in a shorter amount of time, the tree was able to develop a new root system in its new home more quickly. Today, that tree is flourishing and is about four feet taller than it was when I planted it.

The same applies to technical content. If a tool or process improves the consistency of content, gives authors more time to focus on the content, and shortens the time it takes to distribute that content, then the choice and application of a tool are much more than mere “conveniences.”

Conferences Webinar

Coming attractions for October and November

On October 22, join Simon Bate for a session on delivering multiple versions of a help set without making multiple copies of the help:

We needed to generate a help set from DITA sources that applied to multiple products. However, serious space constraints prevented us from using standard DITA conditional processing to create multiple, product-specific versions of the help; there was only room for one copy of the help. Our solution was to create a single help set in which select content would be displayed when the help was opened.
In this webcast, we’ll show you how we used the DITA Open Toolkit to create a help set with dynamic text display. The webcast introduces some minor DITA Open Toolkit modifications and several client-side JavaScript techniques that you can use to implement dynamic text display in HTML files. Minimal programming skills necessary.

Register for dynamic text display webcast

I will be visiting New Orleans for LavaCon. This event, organized by Jack Molisani, is always a highlight of the conference year. I will be offering sessions on XML and on user-generated content. You can see the complete program here. In addition to my sessions, I will be bringing along a limited number of copies of our newest publication, The Compass. Find me at the event to get your free copy while supplies last. (Otherwise, you can order online Real Soon Now for $15.95.)

Register for LavaCon (note, early registration has been extended until October 12)

And last but certainly not least, we have our much-anticipated session on translation workflows. Nick Rosenthal, Managing Director, Salford Translations Ltd., will deliver a webcast on cost-effective document design for a translation workflow on November 19 at 11 a.m. Eastern time:

In this webcast, Nick Rosenthal discusses the challenges companies face when translating their content and offers some best practices for managing your localization budget effectively, including XML-based workflows and ways to integrate localized screen shots into translated user guides or help systems.

Register for the translation workflow webcast

As always, webcasts are $20. LavaCon is just a bit more. Hope to see you at all of these events.

Opinion

A strident defense of mediocre formatting

In addition to a gratuitous (and entertaining) swipe at “noisome” DITA “fanboys,” Roger Hart argues that we need to reconsider the disadvantages of automated formatting:

The thing is, [separation of content and formatting has] all been taken rather stridently to heart in certain quarters, leading to a knee jerk reaction whenever author-controlled formatting/pagination/lineation is mentioned as anything other than bleak, sulphurous devilry. This is twaddle. […]

Uncertainty in meaning is anathema to user intelligibility. If we’re going to make sure we’re not writing poetry, there’s definitely value in having poetry’s level of control over semantic blocks.

Of course, it’s fully possible that this is an expensive distraction.

Possible? It’s definitely expensive. It’s possible that it’s a distraction.

I think Hart perhaps unintentionally put his finger on the real issue: value. How much value (in the form of improved comprehension) is added to a technical document when you are able, in the words of commenter Brian Harris, to “lovingly handcraft” each page?

How much value (in the form of cost avoidance) is added to an organization when you are able to spit out a reasonably formatted document in a few minutes?

Actually, I have a different question. How far should we take this argument? Here’s an example of the pinnacle of handcrafting:

Book of Kells image

Can we all agree that this might perhaps take handcrafting a little too far?

Compared to the Book of Kells (above), the Gutenberg Bible looks quite pedestrian:

Gutenberg Bible image

You can just imagine the scribes with their quills, lapis, gold leaf, and other implements muttering, “That Gutenberg and his noisome fanboys. He can’t even render two colors without our help. Poser. It’ll never last.”

Formatting automation removes cost from the process of creating and delivering content. For technical documents that change often and are perhaps delivered in multiple languages, it removes a lot of cost. Let’s assume that handcrafted pages can improve ease of reading and comprehension with careful copy-fitting and adjusted spacing (Hart’s article mentions “headings, line breaks, intra-word, etc”). This increases the cost of the content.

What happens when content is expensive? Fewer people get to see it.

The number of books in Europe went from 50,000 before Gutenberg to 12 million 50 years later.

I think we can all agree that e-books offer none of the typographic sophistication in question here. Bill Gates (yes, that Bill Gates) wrote in 1999:

It is hard to imagine today, but one of the greatest contributions of e-books may eventually be in improving literacy and education in less-developed countries. Today people in poor countries cannot afford to buy books and rarely have access to a library. 

Essentially, we can produce documents inexpensively and give more people access to them as a direct result of lower cost, or we can climb on our typographic high horse and whine about word spacing.

I’m with the noisome fanboys.

Tools

Ignoring DOCTYPE in XSL Transforms using Saxon 9B

Recently I had to write some XSL transforms in which I wanted to ignore the DOCTYPE declarations in the source XML files. In one case, I didn’t have access to the DTD (and the files wouldn’t have validated even if I did). In the other case, the XML files were DITA files, but I had no need or interest in validating the files; I simply needed to run a transform that modified some character data in the files.

In the first case, I ended up writing a couple of sed scripts that removed and re-inserted the DOCTYPE declaration. By the time I encountered the second case, I wanted to do something less ham-fisted, so I started investigating how to direct Saxon to ignore the DOCTYPE declaration.

My first thought was to use the -x switch in Saxon. Perhaps I didn’t use it correctly, but I couldn’t get it to work. Even though I was using a non-validating parser (Piccolo), Saxon kept telling me that the DTD couldn’t be found.

I went back to the drawing board (aka Google) and found a note from Michael Kay that said, “to ignore the DTD completely, you need to use a catalog that redirects the DTD reference to some dummy DTD.” Michael provided a link to a very useful page in the Saxon Wiki that discussed using a catalog with Saxon. After a bit of experimentation, I got it working correctly. In this blog post, I’ve distilled the information to make it useful to others who need to ignore the DOCTYPE in their XSL.

Before I describe the catalog implementation, I’d like to point out a simple solution. This solution works best when a set of XML files are in a single directory and all files use the same DOCTYPE declaration in which the system ID specifies a file:

<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">

In this case, you don’t need a catalog. It’s easier to create an empty file named “topic.dtd” (a dummy DTD) and save it in the same directory as the XML files. The XML parser looks first for the system ID; if it finds a DTD file, it uses it. Case closed.

However, there are many cases in which this simple solution doesn’t work. The system ID (“topic.dtd” in the previous example) might specify a path that cannot be reproduced on your machine…or the XML files could be spread across multiple directories…or there could be many different DOCTYPEs…or…

In these cases, it makes more sense to set up a catalog file. To specify a catalog with Saxon, you must use the XML Commons Resolver from Apache (resolver.jar). You can download the resolver from SourceForge. The good thing is, if you have the DITA Open Toolkit installed on your machine, you already have a copy of the resolver.jar file. The file is in %DITA-OT%\lib\resolver.jar. You specify the class path for the resolver in the Java command using the -cp switch (shown below).

The resolver requires you to specify a catalog.xml file, in which you map the public ID (or system ID) in the DOCTYPE declaration to a local DTD file. The catalog.xml file I created looks like this:

<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DITA Topic//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Concept//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Task//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Reference//EN" uri="dummy.dtd"/>
</catalog>

Note that the uri attribute in each entry points to a dummy DTD (an empty file). The file path used for the dummy.dtd file is relative to the location of the catalog file.
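
If a DOCTYPE declaration has no public ID, or you would rather match on the system ID, the OASIS catalog format also provides a system entry. The following is a sketch only; the system ID shown is made up for illustration and must be replaced with the exact string that appears in your DOCTYPE declaration.

```xml
<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Hypothetical system ID: use the exact string from your DOCTYPE. -->
  <system systemId="http://example.com/dtd/topic.dtd" uri="dummy.dtd"/>
</catalog>
```

As with the public entries, the uri value is resolved relative to the location of the catalog file.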

Putting it all together, I created a DOS batch file to run Java and invoke Saxon:

java -cp c:\saxon9\saxon9.jar;C:\DITA-OT1.4.3\lib\resolver.jar ^
-Dxml.catalog.files=catalog.xml ^
net.sf.saxon.Transform ^
-r:org.apache.xml.resolver.tools.CatalogResolver ^
-x:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-y:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-xsl:my_transform.xsl ^
-s:my_content.xml

The Java -cp switch adds class paths for the saxon9.jar and resolver.jar files. The -D switch sets the system property xml.catalog.files to the location of the catalog.xml file.

The switches following the Java class (net.sf.saxon.Transform) are Saxon switches.

  • -r – class of the resolver
  • -x – class of the source file parser
  • -y – class of the stylesheet parser

Note, I’m using Windows (DOS) syntax here. If you are using Unix (Linux, Mac), separate the paths in the class path with a colon (:) and use the backslash (\) as a line continuation character.

When you run Saxon this way, you’ll notice two things: first, Saxon doesn’t complain about the DTD (yay!), but secondly, there is no DOCTYPE declaration in the output. I’ll address how to add the DOCTYPE declaration back to the output XML file in my next blog post.

Conferences

Got plans for May 2010?

After my summer of complaints and criticism of STC and its various issues, I was more than a little surprised to be asked to manage the Design, Architecture, and Publishing track for next year’s STC Summit.

Hoist on my own petard (my obsession with Wordnik continues)…what could I do but agree. Or, go into exile.

Several of the other conference organizers are people I know quite well:

  • The author of Managing Writers: A Real World Guide To Managing Technical Documentation, Richard Hamilton, is the track manager for Managing People, Projects, and Business. He knows his stuff.
  • The principal of UserAid, Paul Mueller, is track manager for three (THREE!) tracks: Education and Training, Web Technologies, and Emerging Technologies. He’s also the Deputy Chair of the conference. (private note to Paul: I take it you were not able to retrieve the goat pictures. Sorry about that.) Another excellent choice.
  • Ant Davey of the UK and Ireland chapter has the Communication and Interpersonal Skills and Professional Development tracks. I’ve worked on STC-related matters with Ant, and he’s a great choice for these tracks.
  • Rachel Houghton, Program Chair. She did great work on last year’s conference.
  • Alan Houser, conference manager. You may remember him as the guy who retrieved David Pogue from a poorly timed bathroom break during the opening session. I’ve known Alan for many years, and I expect another well-organized event, in which he solves the inevitable emergencies with typical aplomb.

(I’m sure that the other track managers are excellent as well, but I don’t know them personally.)

Here is the description of the Design, Architecture, and Publishing track:

Choice of appropriate design and architectures can improve the efficiency, usability, and quality of an organization’s technical publishing. This track explores issues in information design and system architectures for publishing, with particular emphasis on systems and solutions for organization-wide publishing. Suggested session topics include:

  • Visual communication, integrating text and graphics, page layout
  • Single-source publishing, for multiple delivery formats, multiple purposes, and multiple audiences
  • Methodologies and solutions for content management
  • Comparing and selecting delivery formats
  • Issues in structured authoring and publishing, including migration, design, and deployment
  • XML-based publishing
  • Using industry-standard publishing architectures, such as DITA
  • Accommodating localization workflows in the publishing process
  • Moving unstructured content to structure

And now I need your help in two areas:

  1. Submit your proposals. The quality of the conference is determined by the quality of the presentations. And that, of course, is determined by the quality of the proposals submitted. Please send in your best stuff. I suppose you can look into the other tracks if you must.
  2. Help review proposals. I need two or three people to help out in reviewing conference proposals in this track. I’ve done this in the past; it’s a relatively limited time commitment. You will be asked to read lots of proposals and evaluate them, probably in mid-October. Along with reviewers, I will eventually generate a list of recommendations for which proposals to accept. If you have significant expertise in topics in this track, and especially if you do not intend to submit a proposal of your own, please consider volunteering to help with this effort.

Some notes on this year’s process:

  • The deadline for proposal submission is October 5, 2009 at 10 a.m. Eastern time.
  • This is a direct quote from the conference page: “With the smaller number of sessions (for the most part) only one proposal per speaker will be accepted.” (You can still submit multiple proposals, but do not expect to have more than one accepted.)
  • Two speaker references are required (unless you have presented at this conference in the past four years, in which case we will review your evaluations). I personally intend to put a significant weighting on previous highly rated speaking experience.
  • In 2009, sessions were recorded. I assume this will happen again.
  • The conference is May 2-5, 2010, in Dallas, Texas.

Get started with a proposal

If you have questions, leave a comment or contact me. I look forward to seeing lots of compelling proposals.

News

Liberated type

(or should that be “Liberated typoes?”)

We have opened up free access to two of our white papers:

  • Hacking the DITA Open Toolkit, available in HTML or PDF (435 KB, 19 pages)
  • FrameMaker 8 and DITA Technical Reference, available in PDF (5 MB, 55 pages)

These used to be paid downloads.

Why the change of heart? Most of our business is consulting. To get consulting, we have to show competence. These white papers are one way to demonstrate our technical expertise.

(By this logic, our webcasts should also be free, but I’m not ready to go there. Why? We have fixed costs associated with the webcast hosting platform. Plus, once we schedule a webcast, we have to deliver it at the scheduled time, even if we’d rather be doing paying work. By contrast, we can squeeze in white paper development at our convenience.)

What are your thoughts? We are obviously not the only organization dealing with this issue…

Webinar

Webinar mania!

I have several webinar-related updates to share:

Next week, the State of Structure

You probably know that Scriptorium conducted an industry survey on structured authoring earlier this year. The report, The State of Structure in Technical Communication, is available in our online store for $200.

There is a cheaper option to get the highlights. On Tuesday, June 16, at 1 p.m. Eastern time, I’ll be delivering a one-hour webinar that highlights the most important findings.

Coming in July and August

Expect to see additional webinars in cooperation with our TechComm Alliance partners, Cherryleaf and HyperWrite. We are also welcoming Jack Molisani of ProSpring, who will offer excellent and candid career development advice. Watch this space for details about these upcoming events. Scriptorium consultants will also be offering additional content.

Recorded events

Two of our recent webinars are now available for download:

  • Hacking the DITA Open Toolkit
  • Documentation as Conversation

Each webinar lasts about one hour and is $20, either live or recorded. You can register for the Tuesday webcast and download recordings in our online store.

(Warning: The recorded webcast files are quite large.)
