Tools

Ignoring DOCTYPE in XSL Transforms using Saxon 9B

Recently I had to write some XSL transforms in which I wanted to ignore the DOCTYPE declarations in the source XML files. In one case, I didn’t have access to the DTD (and the files wouldn’t have validated even if I did). In the other case, the XML files were DITA files, but I had no need or interest in validating them; I simply needed to run a transform that modified some character data in the files.

In the first case, I ended up writing a couple of SED scripts that removed and re-inserted the DOCTYPE declaration. By the time I encountered the second case, I wanted to do something less ham-fisted, so I started investigating how to direct Saxon to ignore the DOCTYPE declaration.

My first thought was to use the -x switch in Saxon. Perhaps I didn’t use it correctly, but I couldn’t get it to work. Even though I was using a non-validating parser (Piccolo), Saxon kept telling me that the DTD couldn’t be found.

I went back to the drawing board (aka Google) and found a note from Michael Kay that said, “to ignore the DTD completely, you need to use a catalog that redirects the DTD reference to some dummy DTD.” Michael provided a link to a very useful page in the Saxon Wiki that discussed using a catalog with Saxon. After a bit of experimentation, I got it working correctly. In this blog post, I’ve distilled that information for others who need to ignore DOCTYPEs in their own XSL transforms.

Before I describe the catalog implementation, I’d like to point out a simple solution. This solution works best when all of the XML files are in a single directory and share the same DOCTYPE declaration, one in which the system ID specifies a bare file name:

<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">

In this case, you don’t need a catalog. It’s easier to create an empty file named “topic.dtd” (a dummy DTD) and save it in the same directory as the XML files. The XML parser resolves the system ID first; if it finds a DTD file at that location, it uses it. Case closed.

However, there are many cases in which this simple solution doesn’t work. The system ID (“topic.dtd” in the previous example) might specify a path that cannot be reproduced on your machine…or the XML files could be spread across multiple directories…or there could be many different DOCTYPEs…or…

In these cases, it makes more sense to set up a catalog file. To specify a catalog with Saxon, you must use the XML Commons Resolver from Apache (resolver.jar). You can download the resolver from SourceForge. The good thing is, if you have the DITA Open Toolkit installed on your machine, you already have a copy of the resolver.jar file. The file is in %DITA-OT%\lib\resolver.jar. You specify the class path for the resolver in the Java command using the -cp switch (shown below).

The resolver requires you to specify a catalog.xml file, in which you map the public ID (or system ID) in the DOCTYPE declaration to a local DTD file. The catalog.xml file I created looks like this:

<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DITA Topic//EN" uri="dummy.dtd"/>
  <public publicId="-//OASIS//DTD DITA Concept//EN" uri="dummy.dtd"/>
  <public publicId="-//OASIS//DTD DITA Task//EN" uri="dummy.dtd"/>
  <public publicId="-//OASIS//DTD DITA Reference//EN" uri="dummy.dtd"/>
</catalog>

Note that the uri attribute in each entry points to a dummy DTD (an empty file). The file path used for the dummy.dtd file is relative to the location of the catalog file.
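
If some of your files reference the DTD by system ID rather than public ID, the catalog format also supports system entries. Here’s a minimal sketch (the system ID below is made up for illustration):

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Hypothetical system ID, redirected to the same empty dummy DTD -->
  <system systemId="http://example.com/dtds/topic.dtd" uri="dummy.dtd"/>
</catalog>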

Putting it all together, I created a DOS batch file to run Java and invoke Saxon:

java -cp c:\saxon9\saxon9.jar;C:\DITA-OT1.4.3\lib\resolver.jar ^
-Dxml.catalog.files=catalog.xml ^
net.sf.saxon.Transform ^
-r:org.apache.xml.resolver.tools.CatalogResolver ^
-x:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-y:org.apache.xml.resolver.tools.ResolvingXMLReader ^
-xsl:my_transform.xsl ^
-s:my_content.xml

The Java -cp switch adds class paths for the saxon9.jar and resolver.jar files. The -D switch sets the system property xml.catalog.files to the location of the catalog.xml file.

The switches following the Java class (net.sf.saxon.Transform) are Saxon switches.

  • -r – class of the resolver
  • -x – class of the source file parser
  • -y – class of the stylesheet parser

Note, I’m using Windows (DOS) syntax here. If you are using Unix (Linux, Mac), separate the paths in the class path with a colon (:) and use the backslash (\) as a line continuation character.

When you run Saxon this way, you’ll notice two things: first, Saxon doesn’t complain about the DTD (yay!); second, there is no DOCTYPE declaration in the output. I’ll address how to add the DOCTYPE declaration back to the output XML file in my next blog post.
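
For the curious, the standard XSLT hook for emitting a DOCTYPE is the doctype-public and doctype-system attributes on xsl:output. Here’s a minimal identity-transform sketch (reusing the DITA topic identifiers from the earlier example) that writes the declaration back out:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Write the DITA topic DOCTYPE on the result document -->
  <xsl:output method="xml"
      doctype-public="-//OASIS//DTD DITA Topic//EN"
      doctype-system="topic.dtd"/>
  <!-- Identity template: copy everything else through unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>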

Opinion

HTML 5: Browser Wars Reprise?

Recently, I ran across an article by Rob Cherny in Dr. Dobb’s Journal. He suggests that the added features in HTML 5 combined with an end to the development of XHTML point to a brighter standards-based future. He sees closed solutions like Flash, Silverlight, and JavaFX being supplanted directly by HTML 5 code. His view is that the web owes its success to standards.

It’s tempting to agree. Standards certainly allow for collaborative growth, though I’m not the least bit convinced that collaborative growth is the foundation of the web’s success. I believe that the web’s incredible success is really traceable to the simplicity and flexibility of HTML. Each new version takes us further from that simplicity.

Through the browser war years we saw the impact of new features in HTML—incompatibility among browsers. My sense is that the success of Flash is largely due to the fact that Adobe owns both ends of the problem. They create the tools that generate Flash code as well as the viewer. Web developers can pretty much assume that what they see, when they build a Flash-based solution, is what the end user will see.

I fear that we will head right back to the bad old days if HTML 5’s complex capabilities are widely employed. I suspect that the ‘wait and see’ period will last a pretty long time. I have other concerns about HTML 5—more on that later. What do you think—will your organization take advantage of these new capabilities as soon as they are available?

Opinion

Font snobbery? (I don’t think so.)

For its 2010 catalog, IKEA used Verdana font instead of the customized Futura it’s used for years. To say people noticed the switch would be an understatement:

“Ikea, stop the Verdana madness!” pleaded Tokyo’s Oliver Reichenstein on Twitter. “Words can’t describe my disgust,” spat Ben Cristensen of Melbourne. “Horrific,” lamented Christian Hughes in Dublin. The online forum Typophile closed its first post on the subject with the words, “It’s a sad day.” On Aug. 26, Romanian design consultant Marius Ursache started an online petition to get Ikea to change its mind. That night, Verdana was already a trending topic on Twitter, drawing more tweets than even Ted Kennedy.

As a fan of IKEA and its products, I can understand the reaction. If you showed me a page out of an IKEA catalog with just text and prices (and no pictures or funky product names, of course!), I could tell you in a heartbeat that the content was from IKEA.

Verdana may be easier to read if you’re looking at the IKEA catalog online, but that font lacks the designer-y flair of Futura. Because IKEA is known for its affordable cutting-edge design, Verdana just doesn’t seem to quite fit the bill.

This situation reminds me of a comment a friend made about a failed hotel in Raleigh, NC. He said, “Did you see the awful Brush Script on the hotel’s sign? Those people clearly didn’t know how to run a business.” I doubt the Brush Script killed the hotel, but that bad design decision gave my friend (and probably many others) a very unfavorable impression about the company.

Earlier this week, Sarah O’Keefe and I were doing some web research and came upon a web site that used Comic Sans. My reaction to that site was less than positive. I loathe Comic Sans, and I find it hard to take any company seriously that uses a font that emulates text in a comic book.

A company’s use of fonts can become iconic–think about the fonts used by Coca-Cola and FedEx in their logos, for example. Font choice does have an effect on how people perceive content, a product, or a company.

I don’t think reactions to fonts are limited to just those who work in publishing and design. No snobbery here at all. (But if noticing fonts makes me a card-carrying font snob, you better believe that card would have no Comic Sans on it.)

For more about the impact of fonts, check out the documentary Helvetica.

Conferences

Got plans for May 2010?

After my summer of complaints and criticism of STC and its various issues, I was more than a little surprised to be asked to manage the Design, Architecture, and Publishing track for next year’s STC Summit.

Hoist on my own petard (my obsession with Wordnik continues)…what could I do but agree? Or go into exile.

Several of the other conference organizers are people I know quite well:

  • The author of Managing Writers: A Real World Guide To Managing Technical Documentation, Richard Hamilton, is the track manager for Managing People, Projects, and Business. He knows his stuff.
  • The principal of UserAid, Paul Mueller, is track manager for three (THREE!) tracks: Education and Training, Web Technologies, and Emerging Technologies. He’s also the Deputy Chair of the conference. (private note to Paul: I take it you were not able to retrieve the goat pictures. Sorry about that.) Another excellent choice.
  • Ant Davey of the UK and Ireland chapter has the Communication and Interpersonal Skills and Professional Development tracks. I’ve worked on STC-related matters with Ant, and he’s a great choice for this track.
  • Rachel Houghton, Program Chair. She did great work on last year’s conference.
  • Alan Houser, conference manager. You may remember him as the guy who retrieved David Pogue from a poorly timed bathroom break during the opening session. I’ve known Alan for many years, and I expect another well-organized event, in which he solves the inevitable emergencies with typical aplomb.

(I’m sure that the other track managers are excellent as well, but I don’t know them personally.)

Here is the description of the Design, Architecture, and Publishing track:

Choice of appropriate design and architectures can improve the efficiency, usability, and quality of an organization’s technical publishing. This track explores issues in information design and system architectures for publishing, with particular emphasis on systems and solutions for organization-wide publishing. Suggested session topics include:

  • Visual communication, integrating text and graphics, page layout
  • Single-source publishing, for multiple delivery formats, multiple purposes, and multiple audiences
  • Methodologies and solutions for content management
  • Comparing and selecting delivery formats
  • Issues in structured authoring and publishing, including migration, design, and deployment
  • XML-based publishing
  • Using industry-standard publishing architectures, such as DITA
  • Accommodating localization workflows in the publishing process
  • Moving unstructured content to structure

And now I need your help in two areas:

  1. Submit your proposals. The quality of the conference is determined by the quality of the presentations. And that, of course, is determined by the quality of the proposals submitted. Please send in your best stuff. I suppose you can look into the other tracks if you must.
  2. Help review proposals. I need two or three people to help out in reviewing conference proposals in this track. I’ve done this in the past; it’s a relatively limited time commitment. You will be asked to read lots of proposals and evaluate them, probably in mid-October. Along with reviewers, I will eventually generate a list of recommendations for which proposals to accept. If you have significant expertise in topics in this track, and especially if you do not intend to submit a proposal of your own, please consider volunteering to help with this effort.

Some notes on this year’s process:

  • The deadline for proposal submission is October 5, 2009 at 10 a.m. Eastern time.
  • This is a direct quote from the conference page: “With the smaller number of sessions (for the most part) only one proposal per speaker will be accepted.” (You can still submit multiple proposals, but do not expect to have more than one accepted.)
  • Two speaker references are required (unless you have presented at this conference in the past four years, in which case we will review your evaluations). I personally intend to put a significant weighting on previous highly rated speaking experience.
  • In 2009, sessions were recorded. I assume this will happen again.
  • The conference is May 2-5, 2010, in Dallas, Texas.

Get started with a proposal

If you have questions, leave a comment or contact me. I look forward to seeing lots of compelling proposals.

News

Liberated type

(or should that be “Liberated typoes?”)

We have opened up free access to two of our white papers:

  • Hacking the DITA Open Toolkit, available in HTML or PDF (435 KB, 19 pages)
  • FrameMaker 8 and DITA Technical Reference, available in PDF (5 MB, 55 pages)

These used to be paid downloads.

Why the change of heart? Most of our business is consulting. To get consulting, we have to show competence. These white papers are one way to demonstrate our technical expertise.

(By this logic, our webcasts should also be free, but I’m not ready to go there. Why? We have fixed costs associated with the webcast hosting platform. Plus, once we schedule a webcast, we have to deliver it at the scheduled time, even if we’d rather be doing paying work. By contrast, we can squeeze in white paper development at our convenience.)

What are your thoughts? We are obviously not the only organization dealing with this issue…

News

Is this thing on?

If you are reading this, then we have succeeded in migrating our web site over to WordPress.

Of course, the process of managing our own content always takes a back seat to working with our customers’ content, so the process took longer than you might expect. 

We did learn a couple of things, most of which should sound awfully familiar if you are working on your own content strategy:

  • It’s not until you try to move into a new system that you recognize all the mistakes you made in the previous system.
  • PHP stands for Picky Hypochondriac Programming. I had several cases where code absolutely refused to work for no apparent reason. I had the resident PHP expert (Simon) look it over. Eventually, I gave up and retyped the code, and then it worked.
  • Learn to work with the tool and not against it. I have to credit a former coworker, Bruce Bicknell, for this little gem, which he originally applied to Word versus FrameMaker. When moving from Dreamweaver-based HTML to WordPress, take some time to learn best practices for WordPress. Don’t try to impose your existing Way of Doing Things onto the new system. It’s inefficient and it probably won’t work.
  • Content migration is always awful. To transfer our blog, I found a Blogger-to-WordPress converter. That worked pretty well, except that a couple of posts now have my name on them even though I didn’t write them. Transferring comments was a travesty that involved the support people at Haloscan (helpful) and cleaning out random comment triplication (gross manual labor).

But I hope you like the new site and blog. Please poke around and leave us feedback.

Reviews

The long and winding roads from DITA to PDF

by Sheila Loring

DITA XML is of little use to readers unless it’s converted to some kind of output. The DITA Open Toolkit (DITA OT) provides transforms and scripts that convert DITA to PDF output and a long list of other formats.

Producing PDF output from DITA content can be challenging. DITA XML is converted to an XSL-FO file, a combination of content and formatting instructions. You must know XSL-FO to customize the PDF, even just to add simple content such as headers and footers, logos, and so on.
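
To give a sense of what’s involved, here’s a minimal hand-coded XSL-FO document (a simplified sketch, not the markup the DITA Open Toolkit actually generates) that defines a page layout and places a running header on every page:

<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="body-page"
        page-width="8.5in" page-height="11in" margin="1in">
      <fo:region-body margin-top="0.5in"/>
      <!-- Header area at the top of each page -->
      <fo:region-before extent="0.5in"/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="body-page">
    <!-- Static content repeats in the header region on every page -->
    <fo:static-content flow-name="xsl-region-before">
      <fo:block font-size="9pt" text-align="right">Product Guide (running header)</fo:block>
    </fo:static-content>
    <fo:flow flow-name="xsl-region-body">
      <fo:block>Body content generated from the DITA topics goes here.</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>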

To forgo the programming, you can choose a page layout or help authoring tool, but these tools also have pitfalls. Page layout programs have varying degrees of DITA support. Help authoring tools let you style the PDF through CSS, but you can’t fine-tune page layout as you can in page layout programs.

These are just a few examples we discuss in our white paper “Creating PDF files from DITA content.” Read the white paper online (in HTML or PDF).

Webinar

New Webinar Series: Things to consider when moving to DITA

Scriptorium and JustSystems are announcing a three-webinar series on preparing to use DITA.

The first two webinars in the series describe the age-old problem of converting legacy content into DITA. Because a great deal of unstructured content is in either Adobe FrameMaker or Microsoft Word, we’re dedicating one webinar to converting Unstructured FrameMaker to DITA and the other to converting Microsoft Word to DITA.

The third webinar describes various re-use strategies you can apply to your DITA content.

The dates and times for the conversion webinars are:

  • Converting Unstructured FrameMaker to DITA – August 25, 2:00pm Eastern time.
  • Converting Microsoft Word to DITA – September 1, 2:00pm Eastern time.

The date and time for the third webinar (DITA reuse strategies) will be announced toward the end of August.

All of the webinars in the series are free, but you do have to register before attending. To sign up, follow this link to the JustSystems web site:

http://na.justsystems.com/webinars.php

Register now!

Webinar

Learn DITA and XML at your desk

For August and September, our webinar schedule is as follows:

DITA 101, August 18 at 11 a.m. Eastern time
Participants will learn about basic Darwin Information Typing Architecture (DITA) concepts, the business case for implementing DITA, and some typical uses of DITA. This webinar is ideal for those who are considering a move to structured authoring based on the DITA standard. Register

Demystifying DITA to PDF Publishing, September 10 at 11 a.m. Eastern time
When a company implements a DITA-based workflow, the most difficult technical obstacle is often setting up a PDF/print publishing workflow. This session discusses the advantages and disadvantages of using the DITA Open Toolkit, FrameMaker, InDesign, and other options to create PDF output from DITA content. Basic familiarity with DITA, Extensible Markup Language (XML), and related technologies is helpful but not required. Register

What Do Movable Type and XML Have in Common?, September 22 at 11 a.m. Eastern time
The invention of movable type changed the economics of information by making the process of copying a book by hand obsolete. More than 500 years later, XML seems to be doing the same to desktop publishing. But where movable type changed the economics of a mechanical process—creating printed copies—XML changes the economics of content authoring, formatting, and customization. This webinar takes a look at how publishing technologies revolutionize the way people consume information and how those technologies affect authors. Register

Each webinar is $20. During the sessions, you can interact with the presenter and other students through the chat interface or the audio connection. There is a question-and-answer session at the end of each webinar. The Q&A is not included in session recordings, which are available for download later. Participants in the sessions receive a free recording.

To register for these webcasts, or to purchase recordings of past webinars, go to our online store.

Reviews

Let the conversation begin

Conversation and Community: The Social Web for Documentation (XML Press, ISBN: 9780982219119) by Anne Gentle provides technical communicators with a roadmap for integrating social media — blogs, wikis, and much more — into their content development efforts. This is critical because, as Anne notes in the preface, “professional writers now have the tools to collaborate with their audience easily for the first time in history.”

Anne provides overviews of all the major social media concepts — from aggregation to syndication, wikis, discussion, presence, and much more. But it is Chapter 3, “Defining a Writer’s Role with the Social Web,” that will make this book a classic. Here, Anne lays out a detailed strategy for determining whether and how to introduce social media in an organization. Consider this:

It’s important to find a balance between allowing an individual’s authentic voice to speak on behalf of an organization and the requirements of institutional messaging and brand preservation. […] It’s also possible that you are ahead of the curve and need to help others see ways to apply social technologies for the company.

She goes on to explain just how to accomplish these things.

Wikis and blogs each get a chapter of their own, in which Anne discusses how to start and maintain these types of environments.

After reading so much of Anne’s work on her blog, I find it a bit odd to see her writing on paper in an actual book. The feeling that I’ve wandered into the wrong medium is augmented by extensive footnotes, most of which point to web site resources, and by the many examples of web-based content (such as videos or interactive mashups). However, it’s likely that the book’s target audience is more comfortable with paper.

Conversation and Community: The Social Web for Documentation provides an excellent introduction to wikis, blogs, forums, and numerous other social media technologies for the professional content creator. There is valuable (and perhaps career-preserving) information about how to develop a strategy for user-generated content that is compatible with your organization’s corporate culture.

If you think that community participation in your documentation is coming soon, read this book immediately. If you think that it’s not coming, you’re wrong, and you especially need to read this book.
[Disclosure: I reviewed an early draft of this book. I have met Anne in person a few times and we have ongoing email and blog correspondence.]
