Ignoring DOCTYPE in XSL Transforms using Saxon 9B
Recently I had to write some XSL transforms in which I wanted to ignore the DOCTYPE declarations in the source XML files. In one case, I didn’t have access to the DTD (and the files wouldn’t have validate even if I did). In the other case, the XML files were DITA files, but I had no need or interest in validating the files; I simply needed to run a transform that modified some character data in the files.
In the first case, I ended up writing a couple of SED scripts that removed and re-inserted the DOCTYPE declaration. By the time I encountered the second case, I wanted to do something less ham-fisted, so I started investigating how to direct Saxon to ignore the DOCTYPE declaration.
My first thought was to use the -x switch in Saxon. Perhaps I didn’t use it correctly, but I couldn’t get it to work. Even though I was using a non-validating parser (Piccolo), Saxon kept telling me that the DTD couldn’t be found.
I went back to the drawing board (aka Google) and found a note from Michael Kay that said, “to ignore the DTD completely, you need to use a catalog that redirects the DTD reference to some dummy DTD.” Michael provided a link to a very useful page in the Saxon Wiki that discussed using a catalog with Saxon. After a bit of experimentation, I got it working correctly. In this blog post, I’ve distilled the information to make it useful to others who need to ignore the DOCTYPE in their XSL.
Before I describe the catalog implementation, I’d like to point out a simple solution. This solution works best when a set of XML files are in a single directory and all files use the same DOCTYPE declaration in which the system ID specifies a file:
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
In this case, you don’t need a catalog. It’s easier to create an empty file named “topic.dtd” (a dummy DTD) and save it in the same directory as the XML files. The XML parser looks first for the system ID; if it finds a DTD file, it uses it. Case closed.
However, there are many cases in which this simple solution doesn’t work. The system ID (“topic.dtd” in the previous example) might specify a path that cannot be reproduced on your machine…or the XML files could be spread across multiple directories…or there could be many different DOCTYPEs…or…
In these cases, it makes more sense to set up a catalog file. To specify a catalog with Saxon, you must use the XML Commons Resolver from Apache (resolver.jar). You can download the resolver from SourceForge. The good thing is, if you have the DITA Open Toolkit installed on your machine, you already have a copy of the resolver.jar file. The file is in %DITA-OT%libresolver.jar. You specify the class path for the resolver in the Java command using the -cp switch (shown below).
The resolver requires you to specify a catalog.xml file, in which you map the the public ID (or system ID) in the DOCTYPE declaration to a local DTD file. The catalog.xml file I created looks like this:
<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//OASIS//DTD DITA Topic//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Concept//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Task//EN" uri="dummy.dtd"/>
<public publicId="-//OASIS//DTD DITA Reference//EN" uri="dummy.dtd"/>
Note that the uri attribute in each entry points to a dummy DTD (an empty file). The file path used for the dummy.dtd file is relative to the location of the catalog file.
Putting it all together, I created a DOS batch file to run Java and invoke Saxon:
java -cp c:saxon9saxon9.jar;C:DITA-OT1.4.3libresolver.jar ˆ
The Java -cp switch adds class paths for the saxon.jar and resolver.jar files. The -D switch sets the system property xml.catalog.files to the location of the catalog.xml file.
The switches following the Java class (net.sf.saxon.Transform) are Saxon switches.
- -r – class of the resolver
- -x – class of the source file parser
- -y – class of the stylesheet parser
Note, I’m using Windows (DOS) syntax here. If you are using Unix (Linux, Mac), separate the paths in the class path with a colon (:) and use the backslash () as a line continuation character.
When you run Saxon this way, you’ll notice two things: first, Saxon doesn’t complain about the DTD (yay!), but secondly, there is no DOCTYPE declaration in the output. I’ll address how to add the DOCTYPE declaration back to the output XML file in my next blog post.
Very informative stuff. I’ve always wondered how to do this, and like you I’ve grown tired of ploughing through batches with sed. I look forward to the sequel!