Words in Boxes

Nouns, verbs, and occasionally adjectives.

Sunday, December 07, 2008

Getting Started With XProc, Part II

In my previous post, I showed how to set up a batch script to run Calabash in Windows.  This time I'll show how to write a useful — if simple — XProc pipeline.  Here it is:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
  </p:xslt>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt" />
    </p:input>
  </p:xslt>

</p:pipeline>

So what happens here?  When you execute it from the command line using:

runcalabash -i "source=in.xml" -o "result=out.xml" sort-and-print.xpl

this pipeline reads the input XML (in.xml), executes the first transform on it, passes the result to the second transform, executes that transform, and then saves the result as out.xml.  But the syntax of XProc is clear enough that you probably guessed that on your own.

The fundamental metaphor of XProc is that XML data "flows" into a pipeline and between its steps through a series of connected ports.  The two most common and important ports are the primary input port (usually called "source") and the primary output port (usually called "result").  There are others, and not every step actually has both of these, but that's another discussion. 

This isn't obvious in the above pipeline because all of the input and output ports, and the connections between them, are implicit.  There are more explicit ways to state most of the pipeline.  For example, it turns out that p:pipeline is just syntactic sugar for p:declare-step.  In other words:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" >
</p:pipeline>

is equivalent to:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" >
  <p:input port="source" primary="true" />
  <p:input port="parameters" kind="parameter" />
  <p:output port="result" primary="true" />
</p:declare-step>

Expanding on this, here's a more explicit version of our original pipeline:

<p:declare-step name="sort-and-print" xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true">
    <p:pipe step="print" port="result" />
  </p:output>
  <p:input port="parameters" kind="parameter" />
  
  <p:xslt name="sort">
    <p:input port="source">
      <p:pipe step="sort-and-print" port="source" />
    </p:input>
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
    <p:input port="parameters" kind="parameter">
      <p:pipe step="sort-and-print" port="parameters" />
    </p:input>
  </p:xslt>

  <p:xslt name="print">
    <p:input port="source">
      <p:pipe step="sort" port="result"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt"/>
    </p:input>
    <p:input port="parameters" kind="parameter">
      <p:pipe step="sort-and-print" port="parameters" />
    </p:input>
  </p:xslt>
  
</p:declare-step>

That's quite a mouthful, but it exposes the relationships between all the steps.  There are a few things worth noting.  First, the p:pipe element connects an input port of a step to an output port of another step (or to an input port of the enclosing pipeline).  In the above pipeline, these elements are redundant, since XProc automatically connects sequential sibling steps, but they become necessary when you need to connect steps that are not adjacent siblings in larger, more complicated pipelines.
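
To make that concrete, here is a sketch of a pipeline where the implicit connection is not enough.  It assumes you also want to keep the sorted intermediate document, so a p:store step (which writes its input to disk) sits between the two transforms.  The file name sorted.xml is made up for illustration.

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">

  <p:xslt name="sort">
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
  </p:xslt>

  <!-- Picks up the sorted document implicitly and writes it to a hypothetical file -->
  <p:store href="sorted.xml" />

  <!-- The preceding sibling is now p:store, so the source is connected back to "sort" by name -->
  <p:xslt name="print">
    <p:input port="source">
      <p:pipe step="sort" port="result" />
    </p:input>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt" />
    </p:input>
  </p:xslt>

</p:pipeline>

Here the "result" port of the sort step has two readers: p:store picks it up implicitly, and print reaches past p:store with p:pipe.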

Second, every step can be given a name using the "name" attribute, and that name must be unique within the pipeline.  These names can be referenced by the "step" attribute of p:pipe to explicitly connect ports.  If a step isn't given an explicit name, one is generated internally by the XProc processor. 

Third, the p:xslt step gets its stylesheet from a port that is just like any other port.  Most of the time, you will explicitly point to a stylesheet using a p:document element, but you have a lot more flexibility than that.  For example, you can programmatically create a stylesheet in one step, and then pipe it to a p:xslt step to execute it.  This would be a great alternative to using saxon:compile-stylesheet() and saxon:transform() within a single, multi-pass stylesheet.
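
As a rough sketch of that idea, the pipeline below generates a stylesheet in its first step and runs it in the second.  The extra "rules" input port and the rules_to_xslt.xslt stylesheet are assumptions invented for this example; the only requirement is that the first transform produces a valid XSLT stylesheet as its result.

<p:declare-step name="generate-and-run" xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source" primary="true" />
  <!-- Hypothetical second input: a document describing the transform to generate -->
  <p:input port="rules" />
  <p:input port="parameters" kind="parameter" />
  <p:output port="result" primary="true" />

  <!-- Step 1: turn the rules document into an XSLT stylesheet -->
  <p:xslt name="make-stylesheet">
    <p:input port="source">
      <p:pipe step="generate-and-run" port="rules" />
    </p:input>
    <p:input port="stylesheet">
      <p:document href="rules_to_xslt.xslt" />
    </p:input>
  </p:xslt>

  <!-- Step 2: run the generated stylesheet against the pipeline's main input -->
  <p:xslt name="apply-generated">
    <p:input port="source">
      <p:pipe step="generate-and-run" port="source" />
    </p:input>
    <p:input port="stylesheet">
      <p:pipe step="make-stylesheet" port="result" />
    </p:input>
  </p:xslt>

</p:declare-step>

Note that the second step needs both of its p:pipe elements: its stylesheet comes from the previous step, while its source comes from the pipeline itself.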

Fourth, p:output works a bit differently than you might expect (as discussed recently on the xproc-dev mailing list).  The p:output element does not define where the output of a step goes; it defines where the output comes from.  So, while it may initially seem like this

<p:output port="result">
  <p:document href="out.xml" />
</p:output>

will send the output to out.xml, it instead sends the content of out.xml to the "result" output port.  This means that the destination of a pipeline's or step's output cannot be defined within that pipeline or step.  It must be defined from outside, for example on the command line or in the input port of another step.
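
A small sketch makes the "comes from" reading easier to see.  In this hypothetical pipeline (the greeting element is invented for the example), the input is discarded with p:sink and the "result" port is fed from an inline document instead, so the pipeline should emit the same greeting no matter what you pass in as the source.

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source" />

  <!-- The content of p:output says where the result comes FROM -->
  <p:output port="result">
    <p:inline>
      <greeting>Hello from inside the pipeline</greeting>
    </p:inline>
  </p:output>

  <!-- Reads the source implicitly and throws it away -->
  <p:sink />

</p:declare-step>

Where that greeting ends up is still decided by whoever runs the pipeline, for instance with the -o "result=out.xml" option shown earlier.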

(If you want the other effect, you can use p:store, which can be thought of as the equivalent of <xsl:result-document/> in XSLT, but in that case the output of the p:store step is the name of the file you are writing, not the content of the output.)
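
Here is a minimal sketch of p:store in use, assuming it behaves as described in the XProc 1.0 specification.  The sorted document is written to a made-up file, sorted.xml, and the output of p:store is then passed along as the result of the whole pipeline.

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
  </p:xslt>

  <!-- Writes the sorted document to a hypothetical file -->
  <p:store name="save" href="sorted.xml" />

  <!-- Passes along what p:store reports, not what it wrote -->
  <p:identity>
    <p:input port="source">
      <p:pipe step="save" port="result" />
    </p:input>
  </p:identity>

</p:pipeline>

If the processor follows the specification, the pipeline's result should be a small c:result document (in the http://www.w3.org/ns/xproc-step namespace) containing the URI of sorted.xml, rather than the sorted data itself.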

That's it for now.  Next time:  How to create and use a library of custom XProc steps.

I'm James Sulak, a software developer in Houston, Texas.
