Tools

Ignoring DOCTYPE in XSL Transforms using Saxon 9B

Recently I had to write some XSL transforms in which I wanted to ignore the DOCTYPE declarations in the source XML files. In one case, I didn’t have access to the DTD (and the files wouldn’t have validated even if I did). In the other case, the XML files were DITA files, but I had no need or interest in validating the files; I simply needed to run a transform that modified some character data in the files.

In the first case, I ended up writing a couple of SED scripts that removed and re-inserted the DOCTYPE declaration. By the time I encountered the second case, I wanted to do something less ham-fisted, so I started investigating how to direct Saxon to ignore the DOCTYPE declaration.

My first thought was to use the -x switch in Saxon. Perhaps I didn’t use it correctly, but I couldn’t get it to work. Even though I was using a non-validating parser (Piccolo), Saxon kept telling me that the DTD couldn’t be found.

I went back to the drawing board (aka Google) and found a note from Michael Kay that said, “to ignore the DTD completely, you need to use a catalog that redirects the DTD reference to some dummy DTD.” Michael provided a link to a very useful page in the Saxon Wiki that discussed using a catalog with Saxon. After a bit of experimentation, I got it working correctly. In this blog post, I’ve distilled the information to make it useful to others who need to ignore the DOCTYPE in their XSL.

Before I describe the catalog implementation, I’d like to point out a simple solution. This solution works best when a set of XML files are in a single directory and all files use the same DOCTYPE declaration in which the system ID specifies a file:

<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">

In this case, you don’t need a catalog. It’s easier to create an empty file named “topic.dtd” (a dummy DTD) and save it in the same directory as the XML files. The XML parser looks first for the system ID; if it finds a DTD file, it uses it. Case closed.
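If you have more than a handful of files, the dummy-DTD trick can be scripted. The following is only a rough sketch in Python, not part of my original workflow: the function name create_dummy_dtds is made up for illustration, and the regular expression assumes a double-quoted DOCTYPE like the one shown above. It creates an empty file for each system ID that is a bare file name and leaves anything with a path or URL for the catalog approach.

```python
import re
from pathlib import Path

# Matches the system ID in a DOCTYPE such as:
# <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
# Assumption: identifiers are double-quoted; single quotes are not handled.
DOCTYPE_SYSTEM_ID = re.compile(
    r'<!DOCTYPE\s+\S+\s+(?:PUBLIC\s+"[^"]*"|SYSTEM)\s+"([^"]+)"'
)


def create_dummy_dtds(directory):
    """Create an empty DTD next to the XML files for each bare-filename system ID.

    Returns the names of the files that were created. Existing files are
    never overwritten.
    """
    directory = Path(directory)
    created = []
    for xml_file in sorted(directory.glob("*.xml")):
        # The DOCTYPE sits near the top of the file, so the first 2 KB is enough.
        head = xml_file.read_text(encoding="utf-8", errors="replace")[:2048]
        match = DOCTYPE_SYSTEM_ID.search(head)
        if not match:
            continue
        system_id = match.group(1)
        # Skip system IDs that contain a path or URL; those need a catalog.
        if "/" in system_id or "\\" in system_id or ":" in system_id:
            continue
        dummy = directory / system_id
        if not dummy.exists():
            dummy.touch()
            created.append(dummy.name)
    return created
```

Calling create_dummy_dtds(".") in the directory that holds the XML files would create topic.dtd for the example above.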

However, there are many cases in which this simple solution doesn’t work. The system ID (“topic.dtd” in the previous example) might specify a path that cannot be reproduced on your machine…or the XML files could be spread across multiple directories…or there could be many different DOCTYPEs…or…

In these cases, it makes more sense to set up a catalog file. To specify a catalog with Saxon, you must use the XML Commons Resolver from Apache (resolver.jar). You can download the resolver from SourceForge. The good thing is, if you have the DITA Open Toolkit installed on your machine, you already have a copy of the resolver.jar file. The file is in %DITA-OT%\lib\resolver.jar. You specify the class path for the resolver in the Java command using the -cp switch (shown below).

The resolver requires you to specify a catalog.xml file, in which you map the public ID (or system ID) in the DOCTYPE declaration to a local DTD file. The catalog.xml file I created looks like this:

<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DITA Topic//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Concept//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Task//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Reference//EN" uri="dummy.dtd"/>
</catalog>

Note that the uri attribute in each entry points to a dummy DTD (an empty file). The file path used for the dummy.dtd file is relative to the location of the catalog file.
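If you have many DOCTYPEs to cover, you could generate the catalog rather than type it by hand. Here is a minimal sketch in Python using the standard library; the build_catalog function and its parameters are my own illustration, not anything the resolver provides. It produces a catalog of the same shape as the one above, with every public ID mapped to the same dummy DTD.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the catalog.xml shown in this post.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(public_ids, dummy_uri="dummy.dtd"):
    """Return catalog XML that maps each public ID to one dummy DTD.

    dummy_uri is resolved relative to the catalog file's location.
    """
    # Serialize the catalog namespace as the default namespace (no prefix).
    ET.register_namespace("", CATALOG_NS)
    catalog = ET.Element(f"{{{CATALOG_NS}}}catalog", {"prefer": "public"})
    for public_id in public_ids:
        ET.SubElement(
            catalog,
            f"{{{CATALOG_NS}}}public",
            {"publicId": public_id, "uri": dummy_uri},
        )
    # ET.indent is only available in Python 3.9 and later.
    if hasattr(ET, "indent"):
        ET.indent(catalog)
    return ET.tostring(catalog, encoding="unicode")


if __name__ == "__main__":
    ids = [
        "-//OASIS//DTD DITA Topic//EN",
        "-//OASIS//DTD DITA Concept//EN",
        "-//OASIS//DTD DITA Task//EN",
        "-//OASIS//DTD DITA Reference//EN",
    ]
    print(build_catalog(ids))
```

Write the returned string to catalog.xml and put an empty dummy.dtd in the same directory.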

Putting it all together, I created a DOS batch file to run Java and invoke Saxon:

java -cp c:\saxon9\saxon9.jar;C:\DITA-OT1.4.3\lib\resolver.jar ^
-Dxml.catalog.files=catalog.xml ^
net.sf.saxon.Transform ^
-r:org.apache.xml.resolver.tools.CatalogResolver ^
-x:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-y:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-xsl:my_transform.xsl ^
-s:my_content.xml

The Java -cp switch adds class paths for the saxon9.jar and resolver.jar files. The -D switch sets the system property xml.catalog.files to the location of the catalog.xml file.

The switches following the Java class (net.sf.saxon.Transform) are Saxon switches.

  • -r – class of the resolver
  • -x – class of the source file parser
  • -y – class of the stylesheet parser

Note, I’m using Windows (DOS) syntax here. If you are using Unix (Linux, Mac), separate the paths in the class path with a colon (:) and use the backslash (\) as a line continuation character.
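One way to sidestep the Windows versus Unix differences is to build the command in a script and let the language pick the class path separator. The sketch below is an assumption-laden illustration, not something I ran for this post: build_saxon_command is a hypothetical helper, and the jar paths you pass in are whatever applies on your machine. The Saxon switches are the ones from the batch file above.

```python
import os


def build_saxon_command(saxon_jar, resolver_jar, catalog, stylesheet, source,
                        path_separator=os.pathsep):
    """Return the java command as an argument list.

    os.pathsep is ';' on Windows and ':' on Unix, so the class path is
    joined correctly for the platform the script runs on.
    """
    classpath = path_separator.join([saxon_jar, resolver_jar])
    return [
        "java",
        "-cp", classpath,
        f"-Dxml.catalog.files={catalog}",
        "net.sf.saxon.Transform",
        # Resolver class, source file parser, and stylesheet parser.
        "-r:org.apache.xml.resolver.tools.CatalogResolver",
        "-x:org.apache.xml.resolver.tools.ResolvingXMLReader",
        "-y:org.apache.xml.resolver.tools.ResolvingXMLReader",
        f"-xsl:{stylesheet}",
        f"-s:{source}",
    ]


if __name__ == "__main__":
    # Hypothetical locations; substitute the paths on your own machine.
    command = build_saxon_command(
        "saxon9.jar", "resolver.jar", "catalog.xml",
        "my_transform.xsl", "my_content.xml",
    )
    print(" ".join(command))
```

Passing the list to subprocess.run(command, check=True) would run the transform. Because the arguments are a list rather than a single string, no line continuation characters or shell quoting are needed.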

When you run Saxon this way, you’ll notice two things: first, Saxon doesn’t complain about the DTD (yay!), but secondly, there is no DOCTYPE declaration in the output. I’ll address how to add the DOCTYPE declaration back to the output XML file in my next blog post.

News

Liberated type

(or should that be “Liberated typoes?”)

We have opened up free access to two of our white papers:

  • Hacking the DITA Open Toolkit, available in HTML or PDF (435 KB, 19 pages)
  • FrameMaker 8 and DITA Technical Reference, available in PDF (5 MB, 55 pages)

These used to be paid downloads.

Why the change of heart? Most of our business is consulting. To get consulting, we have to show competence. These white papers are one way to demonstrate our technical expertise.

(By this logic, our webcasts should also be free, but I’m not ready to go there. Why? We have fixed costs associated with the webcast hosting platform. Plus, once we schedule a webcast, we have to deliver it at the scheduled time, even if we’d rather be doing paying work. By contrast, we can squeeze in white paper development at our convenience.)

What are your thoughts? We are obviously not the only organization dealing with this issue…

Webinar

Webinar mania!

I have several webinar-related updates to share:

Next week, the State of Structure

You probably know that Scriptorium conducted an industry survey on structured authoring earlier this year. The report, The State of Structure in Technical Communication, is available in our online store for $200.

There is a cheaper option to get the highlights. On Tuesday, June 16, at 1 p.m. Eastern time, I’ll be delivering a one-hour webinar that highlights the most important findings.

Coming in July and August

Expect to see additional webinars in cooperation with our TechComm Alliance partners, Cherryleaf and HyperWrite. We are also welcoming Jack Molisani of ProSpring, who will offer excellent and candid career development advice. Watch this space for details about these upcoming events. Scriptorium consultants will also be offering additional content.

Recorded events

Two of our recent webinars are now available for download:

  • Hacking the DITA Open Toolkit
  • Documentation as Conversation

Each webinar lasts about one hour and is $20, either live or recorded. You can register for the Tuesday webcast and download recordings in our online store.

(Warning: The recorded webcast files are quite large.)

Humor Opinion

More cowbell!

About a year ago, we added Google Analytics to our web site. I have done some research to see what posts were the most popular in the past year:

  1. The clear winner was our FrameMaker 9 review. With 21 comments, I think it was also the most heavily commented post. Interestingly, the post itself is little more than a pointer to the PDF file that contains the actual review.
  2. InDesign CS4 = Hannibal post, which discussed InDesign’s encroachment on traditional FrameMaker features.
  3. A surprise…a post from 2006 in which Mark Baker discussed the merits (or lack thereof) of DITA in To DITA or not to DITA.

Our readers appear to like clever headlines, because I don’t think the content quality explains the high numbers for posts such as:

We noticed this pattern recently, when a carefully crafted, meticulously written post was ignored in favor of a throwaway post dashed off in minutes with a catchy title (Death to Recipes!).

For useful, thoughtful advice on blogging, I refer you to Tom Johnson and Rich Maggiani. I, however, have a new set of blogging recommendations:

  1. Write catchy titles
  2. Have an opinion, preferably an outrageous one
  3. More cowbell

Webinar

Documentation as conversation webinar

We have added Documentation as Conversation, presented by Anne Gentle, to our upcoming webinars. Anne is scheduled to present on June 9 at 11 a.m. Eastern time:

Even if your documentation system does not converse with your users, your documentation can help customers talk to each other and make the connections that help them do their jobs well or learn something new as if they were in a classroom with a community for classmates. This talk describes how you can think about documentation and user assistance in a conversational way, with the help of social media technology. I’ll discuss the topics in my new book, Conversation and Community: The Social Web for Documentation. I’ll describe the use of in-person Book Sprints that combine wikis and community events to gather together writers to accomplish documentation goals.

Anne is an expert, perhaps the expert, on using wikis and other social media to extend traditional documentation efforts. She’s also an excellent speaker, so I hope you’ll join us for this session.

Register for Documentation as Conversation ($20)

See all upcoming webinars

PS We are working on additional topics and looking for more speakers. Do you have topics you would like us to cover? Please let us know. We are working on a couple of sessions on document conversion.


Conferences

DocTrain’s demise and a challenge to presenters

Unfortunate news in my inbox this morning:

I regret to announce that DocTrain DITA Indianapolis is cancelled. DocTrain/PUBSNET Inc is shutting down.

As a business owner, messages like this strike fear in my heart. If it could happen to them…gulp. (This might be a good time to mention that we are ALWAYS looking for projects, so send them on over, please.) My condolences to the principals at DocTrain.

Meanwhile, I’m also thinking about what we can do in place of the event. I had a couple of presentations scheduled for DocTrain DITA, and Simon Bate was planning a day-long workshop on DITA Open Toolkit configuration.

So, here’s the plan. We are going to offer a couple of webinars based on the sessions we were planning to do at DocTrain DITA:

Each webinar is $20. We may record the webinars and make the recordings available later, but I’m not making any promises. Registration is limited to 50 people.

Here’s the challenge part: If you were scheduled to present at DocTrain DITA (or weren’t but have something useful to say), please set up a webcast of your presentation. It would be ultra-cool if we could replicate the event online (I know that the first week in June was cleared on your schedule!), but let’s get as much of this content as possible available. If you do not have a way to offer a webinar, let me know, and I’ll work with you to host it through Scriptorium.

And here’s my challenge to those of you who like to attend conferences: Please consider supporting these online events. If $20 is truly more than you can afford, contact me.

Conferences

Life in the desert

Last week, I attended the annual DocTrain West event, which was held this year in Palm Springs, California.

Weather in Palm Springs was spectacular as always with highs in the 80s during the day. Some of my more northerly friends seemed a bit shell-shocked by the sudden change from snow and slush to sun and sand. (North Carolina was 40 degrees when I left, so that was a nice change for me as well.)

Scott Abel did his usual fine job of organizing and somehow being omnipresent.

I promised to post my session slides. The closing keynote was mostly images and is probably not that useful without audio, so I’m going to point you to an article that covers similar ground (What do Movable Type and XML Have in Common, PDF link).

I have embedded the slides from my DITA to PDF session below.

I have also posted the InDesign template file and the XSL we built to preprocess the DITA XML into something that InDesign likes on our wiki. Note that running the XSL requires a working configuration of the DITA Open Toolkit. For more information, refer to the DITA to InDesign page on our wiki.

Opinion

I am not a Pod Person

Confession time: I don’t like podcasts.

And I think I know why.

I am a voracious reader. And by voracious, I mean that I often cook with a stirring spoon in one hand and a book in the other. I go through at least a dozen books a month (booksfree is my friend).

So why don’t I like podcasts?

  1. They’re inconvenient. I don’t have a lot of uninterrupted listening time, other than at the gym. And frankly, there’s a bizarre cognitive dissonance listening to Tom Johnson interview Bogo Vatovec while I’m lifting weights. I tried listening to a crafting podcast, but that was worse: my brain can’t handle auditory input describing crocheting techniques while simultaneously operating an elliptical machine. So I went back to Dr. Phil on the gym TV. It may rot my brain, but at least it doesn’t hurt.
  2. They’re inefficient. I can listen to a 30-minute podcast, or I can skim the equivalent text in 90 seconds.

I’ve been thinking about what would make a podcast more appealing to me, and realized that it’s not really the medium I object to, it’s my inability to control the delivery.

I’ll become a podcasting proponent when I perceive these properties:

  1. Better navigation. Podcasts, like other content, need to be divided into logical chunks. These chunks should be accessible via a table of contents and an index.
  2. Ability to skim. Podcasts need to provide the audio equivalent of flipping pages in a book or scrolling through a document while only reading the headings.

Depending on the software you use to consume podcasts, you may already have some of the features. For instance, a colleague told me that he listened to my recent DITA webinar at five times the normal speed:

I wanted to let you know about something in particular. I listened to it at 5x fast fwd in Windows Media Player while drinking a coke. My heart is still racing. You should try it. :o)

Do you enjoy podcasts? Do you have any special techniques for managing them efficiently?

Tools

Don’t type, drag to the cmd window

I spend a good deal of time with a Windows cmd.exe window open on my desktop. If I’m not running the DITA OT, I’m testing some Perl script, or Ant, or Python, or who knows.

A few years ago (in the Windows 98 days), I discovered a nifty cmd window trick. People are consistently amazed when I demonstrate it to them. Now I’m going to share it with you.

Say you need to change directory to some long and gnarly path name. You could type the whole thing in. Or, if you have Windows Explorer open on your desktop, you can:

  1. Type “cd ” in the cmd window (the space is important).
  2. Go to Windows Explorer and find the folder you want to navigate to.
  3. Drag and drop the folder from Windows Explorer to the cmd window.

Hey presto! The path name is copied to the cmd window. What’s more, if there are spaces in the path, the path is automatically quoted.

Now you can click in the cmd window and press Enter to perform the command.

Cool! No more typing long path names for this ToolSmith.

This works for filenames too. If I’m running a Perl script that needs to work on a file way down my directory tree, I type “perl myScriptName.pl “, then drag and drop the file name from Windows Explorer into my cmd window.

I’ll keep adding more ToolSmith’s Tricks as I use them. What’s your favorite trick?

Tools

WMF…that’ll shut ’em up

Which graphics formats should you use in your documentation? For print, the traditional advice is EPS for line drawings and TIFF for screen captures and photographs. That’s still good advice. These days, you might choose PDF and PNG for the same purposes. There are caveats for each of these formats, but in general, these are excellent choices.

Of course, everybody knows to stay away from WMF, the Windows Metafile Format. WMF doesn’t handle gradients, can’t have more than 256 colors, and refuses to play nice with anything other than Windows.

Think you’re too good to hang out with WMF? For your print and online documentation, perhaps. But it may be a great choice to give to your company’s PowerPoint users.

Are you familiar with this scenario? PowerPoint User saw some graphics in your documentation and thought they would work for some sales presentations. The screen captures are easy; you just give PowerPoint User PNGs or BMPs or whatever. It’s the line drawings that are the problem. PowerPoint User doesn’t have Illustrator and has never heard of EPS. PowerPoint User says, “Can you give me a copy of those pictures in a format that I can use in PowerPoint? Oh, and can you make that box purple and change that font for me first? And move that line just a little bit? And make that line thicker? And remove that entire right side of the picture and split it into two pictures?”

You want PowerPoint User to reuse the graphics; you’re all about reuse. But you have dealt with PowerPoint User before, and you know you will never get your real job done if you get pulled into the sucking vortex of PowerPoint User’s endless requests.

The secret is to give PowerPoint User the graphics in a format that can be edited from within PowerPoint (or Word): WMF. Here’s the drill that will make you a hero:

  1. Save your graphics as WMF.
  2. Place each WMF on a separate page in a PowerPoint or Word file.
  3. Tell PowerPoint User to double-click on a graphic to make it editable. (If you think your PowerPoint User is really dumb, you can double-click the graphic yourself and respond to the dialog box asking if you want to make the drawing editable before saving the file, but nobody is that dumb.)

WMF. It will make PowerPoint User go away…happy!
