Words in Boxes

Nouns, verbs, and occasionally adjectives.

Monday, March 16, 2009

Debugging XProc Pipelines

As I've worked more with XProc, I've written a couple of utility steps to help debug pipelines.  Although they started as quick hacks, they've continued helping out long enough to be worth sharing.  You can download the library here.

The more interesting of those is a step (wxp:assert) which asserts that the result of a given XPath expression evaluated against one document must equal the result of a second XPath expression evaluated against a second document.

Practically speaking, I can assert that the output from a step must match certain expectations.  If this assertion is met, the wxp:assert step functions just like a p:identity step, passing XML through its primary input to its primary output without modification.  If the assertion fails, the step sends an error message to the console, and optionally throws an XProc error and terminates pipeline execution. 
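Pieced together from the ports and options mentioned in this post, the step's interface looks roughly like the declaration below. This is only a sketch of the signature, not the library's actual source: the namespace URI is a placeholder, and which options are required and what fail-on-error defaults to are assumptions.

```xml
<!-- Interface sketch only: the real step also contains a subpipeline. -->
<p:declare-step type="wxp:assert"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:wxp="http://example.com/ns/wxp">
  <!-- The wxp namespace URI above is hypothetical. -->

  <!-- The document under test; it is passed through unchanged. -->
  <p:input port="source" primary="true"/>

  <!-- An optional second document to compare against. -->
  <p:input port="alternate" sequence="true"/>

  <!-- Carries a copy of whatever arrived on the source port. -->
  <p:output port="result" primary="true"/>

  <!-- Message printed when the assertion fails. -->
  <p:option name="label" required="true"/>

  <!-- XPath expressions, passed as strings. Marking them as required
       is an assumption; the real step may supply defaults. -->
  <p:option name="xpath-source" required="true"/>
  <p:option name="xpath-alternate" required="true"/>

  <!-- Whether a failed assertion stops the pipeline.
       The default shown here is an assumption. -->
  <p:option name="fail-on-error" select="'false'"/>
</p:declare-step>
```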

For example, here's an excerpt from an upconversion pipeline:

<p:xslt name="group-sections">
  <p:input port="stylesheet">
    <p:document href="group-sections.xslt"/>
  </p:input>
</p:xslt>

<wxp:debug-output step-name="6-sections-grouped" debug="true"/>

<wxp:assert label="No sections were deleted during grouping" 
            xpath-alternate="count(.//*[self::sect-1 | self::sect-2 | 
                             self::sect-3 | self::sect-4])" 
  <p:input port="alternate">
    <p:pipe port="result" 

(This also shows the wxp:debug-output step, which has probably been rendered obsolete by the latest version of Calabash.  It simply functions as a p:identity step that also writes the XML it receives to disk for later review.)

This instance of wxp:assert asserts that the number of subsection elements on the primary source input port matches the number of sect-1 through sect-4 elements on the alternate input port.  The source document is produced by the preceding step, and the alternate document is piped in from a step earlier in the pipeline.  This is a quick way to ensure I don't do anything too lame-brained in the "group sections" step.

Here's another instance from the same upconversion pipeline.  This tests the output from an XSLT step that uses regular expressions to parse out section numbers from title elements.

<wxp:assert label="No empty enums after parsing sections"
    xpath-source="count(.//enum[not(.//text()[string-length(normalize-space(.)) gt 0])])"
    xpath-alternate="0"/>

In this case, the alternate XPath expression is a constant (0), so I don't need to provide a document on the alternate port. 

Using wxp:assert steps does slow down a pipeline, so I usually remove them once I finish debugging it.  The major exception to this rule is that I leave them active in upconversion pipelines where manual intervention is acceptable and accuracy is more important than speed.

There are a few interesting things in the step's source worth checking out.  Here's a summary of how wxp:assert works:

  1. It combines the source and alternate documents into a single document using p:pack.
  2. It passes that document to an XSLT stylesheet, which uses the saxon:evaluate extension function to evaluate the two XPaths against their respective node sets.
  3. If the two results differ, it sends an error message to the console and, depending on the value of the fail-on-error option, throws an XProc error and terminates the pipeline.
  4. It uses a final p:identity step to pipe the initial input from the step's source port to its result port.
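To make step 2 more concrete, here is a minimal sketch of what such a comparison stylesheet could look like. It is not the stylesheet from the library. It assumes that p:pack has produced a wrapper element whose first child is the source document and whose second child, if present, is the alternate document; that the two XPath strings arrive as stylesheet parameters; and that each expression returns a single value. The parameter names and the assertion result element are made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- The XPath expressions and the label arrive as plain strings
       (hypothetical parameter names). -->
  <xsl:param name="xpath-source" select="'0'"/>
  <xsl:param name="xpath-alternate" select="'0'"/>
  <xsl:param name="label" select="''"/>

  <!-- Match the wrapper element, whatever it was named in p:pack. -->
  <xsl:template match="/*">

    <!-- Evaluate the first expression with the source document's root
         element as the context item. Note that this is an element,
         not a document node. -->
    <xsl:variable name="source-result"
        select="string(*[1]/saxon:evaluate($xpath-source))"/>

    <!-- Evaluate the second expression against the alternate document,
         falling back to the wrapper itself when no alternate document
         was supplied (for example, when the expression is a constant). -->
    <xsl:variable name="alternate-result"
        select="string((*[2], .)[1]/saxon:evaluate($xpath-alternate))"/>

    <!-- Report the outcome; a later step can inspect @passed. -->
    <assertion label="{$label}"
               passed="{$source-result eq $alternate-result}">
      <source><xsl:value-of select="$source-result"/></source>
      <alternate><xsl:value-of select="$alternate-result"/></alternate>
    </assertion>
  </xsl:template>

</xsl:stylesheet>
```

This sketch compares the two results as strings, which is enough for counts like the ones in the examples above. A later step in the pipeline could then test the passed attribute to decide whether to print the label and raise an error.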

Of course, using a Saxon extension function does tie the pipeline to Calabash (or at least, a Saxon-based processor). I'm interested in hearing ideas about alternate approaches.

I'm James Sulak, a software developer in Houston, Texas.

You can also find me on Twitter, or if you're curious, on my old-fashioned home page. If you want to contact me directly, you can e-mail comments@wordsinboxes.com.