Author: Simon Bate
When I first started importing DITA and other XML files into structured FrameMaker, I was surprised by the excessive whitespace that appeared in the files. Even more surprising (in FrameMaker 8.0) were the red comments displayed via the EDD that said that some whitespace was invalid (these no longer appear in FrameMaker 9).
The whitespace was visible because of an odd decision by Adobe to handle all XML whitespace as if it were significant. (XML divides the world into significant and insignificant whitespace; most XML tools treat whitespace as insignficant except where necessary…think <codeblock> elements). This approach to whitespace exists in both FrameMaker and InDesign.
At first I handled the whitespace on a case-by-case basis, removing it by hand or through regular expressions. Eventually, I realized this was a more serious problem and created an XSL transform to eliminate the white space as a part of preprocessing. By using XSL that was acceptable to Xalan (not that hard), the transform can be integrated into a FrameMaker structured application.
I figured this whitespace problem must be affecting (and frustrating) more than a few of you out there,
so I made the stylesheet available on the Scriptorium web site. I also wrote a white paper “Removing XML whitespace in structured FrameMaker documents” that describes describes the XSL that went into the stylesheet and how to integrate it with your FrameMaker structured applications.
The white paper is available on the Scriptorium web site. Information about how to download the stylesheet is in the white paper.
If the stylesheet and whitepaper are useful to you, let us know!
In a posting a few weeks ago I discussed how to ignore the DOCTYPE declaration when processing XML through XSL. What I left unaddressed was how to add the DOCTYPE declaration back to the files. Several people have told me they’re tired of waiting for the other shoe to drop, so here’s how to add a DOCTYPE declaration.
First off: the easy solution. If the documents you are transforming always use the same DOCTYPE, you can use the doctype-public and doctype-system attributes in the <xsl:output> directive. When you specify these attributes, XSL inserts the DOCTYPE automatically.
However, if the DOCTYPE varies from file to file, you’ll have to insert the DOCTYPE declaration from your XSL stylesheet. In DITA files (and in many other XML architectures), the DOCTYPE is directly related to the root element of the document being processed. This means you can detect the name of the root element and use standard XSL to insert a new DOCTYPE declaration.
Before you charge ahead and drop a DOCTYPE declaration into your files, understand that the DOCTYPE declaration is not valid XML. If you try to emit it literally, your XSL processor will complain. Instead, you’ll have to:
- Use entities for the less-than (“<” – “<”) and greater-than (“>” – “>”) signs, and
- Disable output escaping so that the entities are actually emitted as less-than or greater-than signs (output escaping will convert them back to entities, which is precisely what you don’t want).
There are at least two possible approaches for adding DOCTYPE to your documents: use an <xsl:choose> statement to select a DOCTYPE, or construct the DOCTYPE using the XSL concat() function.
To insert the DOCTYPE declaration with an <xsl:choose> statement, use the document’s root element to select which DOCTYPE declaration to insert. Note that the entities “>” and “<” aren’t HTML errors in this post, they are what you need to use. Also note that the DOCTYPE statement text in this template is left-aligned so that the output DOCTYPE declarations will be left aligned. Most parsers seem to tolerate whitespace before the DOCTYPE declaration, but I prefer to err on the side of caution:
<xsl:when test="name(node()) = 'topic'">
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<xsl:when test="name(node()) = 'concept'">
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<xsl:when test="name(node()) = 'task'">
<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">
<xsl:when test="name(node()) = 'reference'">
<!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">
The preceding example contains statements for the topic, concept, task, and reference topic types; if you use other topic types, you’ll need to add additional statements. Rather than write a statement for each DOCTYPE, a more general approach is to process the name of the root element and construct the DOCTYPE declaration using the XSL concat() function.
<xsl:variable name="ALPHA_UC" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
<xsl:variable name="ALPHA_LC" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="NEWLINE" select="'&#x0A;'"/>
<xsl:with-param name="root" select="name(node())"/>
<span style="color: green;"><-- Create a doctype based on the root element --></span>
<span style="color: green;"><-- Create an init-cap version of the root element name. --></span>
<span style="color: green;"><-- Build the DOCTYPE by concatenating pieces.</span>
<span style="color: green;">Note that XSL syntax requires you to use the &quot; entities for</span>
<span style="color: green;">quotation marks ("). --></span>
' PUBLIC &quot;-//OASIS//DTD DITA ',
<span style="color: green;"><-- Output the DOCTYPE surrounded by < and >. --></span>
The one caveat about this approach is that it depends on a consistent portion of the public ID form (“-//OASIS//DTD DITA “). If there are differences in the public ID for your various DOCTYPE declarations, those differences may complicate the template.
So there you have it: DOCTYPEs in a flash. Just remember to use disable-output-escaping=”yes” and use entities where appropriate and you’ll be fine.
Recently I had to write some XSL transforms in which I wanted to ignore the DOCTYPE declarations in the source XML files. In one case, I didn’t have access to the DTD (and the files wouldn’t have validate even if I did). In the other case, the XML files were DITA files, but I had no need or interest in validating the files; I simply needed to run a transform that modified some character data in the files.
In the first case, I ended up writing a couple of SED scripts that removed and re-inserted the DOCTYPE declaration. By the time I encountered the second case, I wanted to do something less ham-fisted, so I started investigating how to direct Saxon to ignore the DOCTYPE declaration.
My first thought was to use the -x switch in Saxon. Perhaps I didn’t use it correctly, but I couldn’t get it to work. Even though I was using a non-validating parser (Piccolo), Saxon kept telling me that the DTD couldn’t be found.
I went back to the drawing board (aka Google) and found a note from Michael Kay that said, “to ignore the DTD completely, you need to use a catalog that redirects the DTD reference to some dummy DTD.” Michael provided a link to a very useful page in the Saxon Wiki that discussed using a catalog with Saxon. After a bit of experimentation, I got it working correctly. In this blog post, I’ve distilled the information to make it useful to others who need to ignore the DOCTYPE in their XSL.
Before I describe the catalog implementation, I’d like to point out a simple solution. This solution works best when a set of XML files are in a single directory and all files use the same DOCTYPE declaration in which the system ID specifies a file:
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
In this case, you don’t need a catalog. It’s easier to create an empty file named “topic.dtd” (a dummy DTD) and save it in the same directory as the XML files. The XML parser looks first for the system ID; if it finds a DTD file, it uses it. Case closed.
However, there are many cases in which this simple solution doesn’t work. The system ID (“topic.dtd” in the previous example) might specify a path that cannot be reproduced on your machine…or the XML files could be spread across multiple directories…or there could be many different DOCTYPEs…or…
In these cases, it makes more sense to set up a catalog file. To specify a catalog with Saxon, you must use the XML Commons Resolver from Apache (resolver.jar). You can download the resolver from SourceForge. The good thing is, if you have the DITA Open Toolkit installed on your machine, you already have a copy of the resolver.jar file. The file is in %DITA-OT%libresolver.jar. You specify the class path for the resolver in the Java command using the -cp switch (shown below).
The resolver requires you to specify a catalog.xml file, in which you map the the public ID (or system ID) in the DOCTYPE declaration to a local DTD file. The catalog.xml file I created looks like this:
<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DITA Topic//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Concept//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Task//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Reference//EN" uri="dummy.dtd"/>
Note that the uri attribute in each entry points to a dummy DTD (an empty file). The file path used for the dummy.dtd file is relative to the location of the catalog file.
Putting it all together, I created a DOS batch file to run Java and invoke Saxon:
java -cp c:saxon9saxon9.jar;C:DITA-OT1.4.3libresolver.jar ˆ
The Java -cp switch adds class paths for the saxon.jar and resolver.jar files. The -D switch sets the system property xml.catalog.files to the location of the catalog.xml file.
The switches following the Java class (net.sf.saxon.Transform) are Saxon switches.
- -r – class of the resolver
- -x – class of the source file parser
- -y – class of the stylesheet parser
Note, I’m using Windows (DOS) syntax here. If you are using Unix (Linux, Mac), separate the paths in the class path with a colon (:) and use the backslash () as a line continuation character.
When you run Saxon this way, you’ll notice two things: first, Saxon doesn’t complain about the DTD (yay!), but secondly, there is no DOCTYPE declaration in the output. I’ll address how to add the DOCTYPE declaration back to the output XML file in my next blog post.
Scriptorium and JustSystems are announcing a three-webinar series on preparing to use DITA.
The first two webinars in the series describe the age-old problem of converting legacy content into DITA. Because a great deal of unstructured content is in either Adobe FrameMaker and Microsoft Word, we’re dedicating one webinar to converting Unstructured FrameMaker to DITA and the other to converting Microsoft Word to DITA.
The third webinar describes various re-use strategies you can apply to your DITA content.
The dates and times for the conversion webinars are:
- Converting Unstructured FrameMaker to DITA – August 25, 2:00pm Eastern time.
- Converting Microsoft Word to DITA – September 1, 2:00pm Eastern time.
The date and time for the third webinar (DITA reuse strategies) will be announced toward the end of August.
All of the webinars in the series are free, but you do have to register before attending. To sign up, follow this link to the JustSystems web site:
I spend a good deal of time with a Windows cmd.exe window open on my desktop. If I’m not running the DITA OT, I’m testing some Perl script, or Ant, or Python, or who knows.
A few years ago (in the Windows 98 days), I discovered a nifty cmd window trick. People are consistently amazed when I demonstrate it to them. Now I’m going to share it with you.
Say you need to change directory to some long and gnarly path name. You could type the whole thing in. Or, if you have Windows Explorer open on your desktop, you can:
- Type “cd ” in the cmd window (the space is important).
- Go to Windows Explorer and find the folder you want to navigate to.
- Drag and drop the folder from Windows Explorer to the cmd window.
Hey presto! The path name is copied to the cmd window. What’s more, if there are spaces in the path, the path is automatically quoted.
Now you can click in the cmd window and press Enter to perform the command.
Cool! No more typing long path names for this ToolSmith.
This works for filenames too. If I’m running a Perl script that needs to work on a file way down my directory tree, I type “perl myScriptName.pl “, then drag and drop the file name from Windows Explorer into my cmd window.
I’ll keep adding more ToolSmith’s Tricks as I use them. What’s your favorite trick?
Have you ever been really scared? I don’t mean just the Halloween kinda scared, but really scared. That’s how I felt at the Burlington Marriott when the hotel employee delivered the box containing the workbooks for my Introduction to XMetaL and DITA workshop. He stood in the doorway, smiled, and handed me a very beat up, bent, folded, spindled, and mutilated FedEx box.
The box looked like the driver had had a flat on Route 128 and used it to prevent the truck from rolling back while jacking up the front end. It was nice and damp too. With much trepidation, I opened the box and — to my relief — found that the materials were undamaged. Whew.
Following that, Wednesday’s all-day workshop on XMetaL and DITA was smooth sailing. OK, we had a bit of a problem with powerstrips, but the helpful DocTrain folks got that taken care of. In retrospect, many of the questions I fielded in the workshop weren’t so much about DITA or XMetaL itself. Instead many of the questions were about generating output. The fact is that unless you’re willing to spend some quality time with CSS and the DITA Open Toolkit, your output from DITA will look very generic. XMetaL has a number of hooks that ease some of the pain in generating XHTML output. But even those hooks won’t save you from FO issues if you want to generate PDF output.
In my presentation on Thursday comparing XMetaL and FrameMaker support in DITA, the questions returned once again to output. Of course, this time the focus was on using FrameMaker 8.0 as a PDF engine. In workflows where content is created and maintained in XML, but then has to be delivered in PDF (or print), FrameMaker 8.0 looks like an attractive possibility. There are a few flaws in this solution (such as translating xref elements for intra-document links into live links in PDF), but users are closer to a solution than they were six months ago.
We’ve posted PDFs of the slides from both sessions on SlideShare.
You can find the Introduction to XMetaL and DITA workshop slides at:
The slides for the session on DITA Support in FrameMaker and XMetaL are at:
When you’re done browsing the slides, take a look on our site for information about how we can help you with your FrameMaker, XMetaL, OT, PDF problems.
It’s not that scary.
A recent post on the dita-users Yahoo group asked how to customize the DITA OT stylesheets in view of the fact that there isn’t much documentation available.
From my work customizing and otherwise perverting the DITA OT, I can sympathize with these frustrations. When I started investigating OT customizations, I found many well-crafted tutorials on how to customize and specialize the OT. These were a great starting point, but they only got me so far. In its current state, the documentation is an incomplete jigsaw puzzle; the trees and buildings are filled in nicely, but the sky is still waiting for someone with patience. (Block that metaphor!)
Because there is no documentation available at the individual template level, you need to reconsider the task at hand. I look on it as debugging, decoding, or sleuthing. With that in mind, I find the following to be very useful:
- Find a good visual grep-like utility. I use AgentRansack, a free version of FileLocator Pro (it’s free and amazing). This enables me to locate all files that contain a particular class identifier. The visual aspect of the tool allows me to see the context quickly.
- Use a programmer’s editor that supports XML and XSL. We use Oxygen. Not only does it help check validity and closes tags automatically, but it also provides a handy sidebar that lists the templates and their modes.
- Liberally spread <xsl:comment> or <xsl:message> directives through the stylesheets you’re examining. That helps figure out where you are. Use <xsl:value-of> or <xsl:copy-of> to figure out what you’ve got.
- Once you’ve figured out what happens in one of the OT templates, add comments. Now the next time you come back to it, you won’t waste time.
Probably the best form of documentation that the OT could provide here is additional comments in the stylesheets, particularly about the order of processing. I find I add many comments about where to find the template that handles nodes from an <xsl:apply-templates> directive.
One further note. On Tuesday, September 23, I’ll be presenting the third of our “Best Practices in Structured Authoring and Publishing” joint Webinar series with JustSystems. In this presentation I’ll describe a number of approaches you can use to customize DITA OT output. For more information, visit the JustSystems web site.