Words in Boxes

Nouns, verbs, and occasionally adjectives.

Monday, December 22, 2008

XProc, Part III — Turtles All The Way Down (Steps, Reuse, and Encapsulation)

This is my third post on XProc. (To see the first two, go here and here.)  In this post, I'll talk about the different categories of steps, how to create your own steps and reuse them across pipelines, and how this all relates to the fundamental metaphor of XProc. 

One task I do frequently is removing Arbortext Editor change-tracking markup from documents. Here's a simple pipeline that does just that:

<p:pipeline type="u:accept-changes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc">
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>

The fundamental unit of work in XProc is the "step," which you can think of in the same way you do subroutines in other programming languages. In general, a step takes XML as input, performs an operation on it, and outputs XML. There are three types of steps in XProc:

  • Atomic steps.  These are the most basic building blocks of XProc pipelines, and each carries out a single fundamental XML operation.  There are a number of built-in atomic steps, such as p:xslt (as shown above).  These built-in atomic steps will be the foundation for almost everything you do in XProc, so learn them.

  • Compound steps.  These are assembled from other XProc steps.  Sometimes they are just a series of implicitly connected atomic steps, and sometimes they use built-in logical control structures, such as p:for-each, to control their execution.

    It turns out that the above accept-changes pipeline, and in fact any pipeline you create, is also a compound step.  (The fact that p:pipeline is a shortcut for p:declare-step is a dead giveaway.)  This is an important idea, and we'll discuss it more below.  But for now, remember this:  a pipeline and a step are the same thing.

  • Multi-container steps.  There are only two multi-container steps: p:choose and p:try.  These contain two or more alternate pipelines.  You cannot define your own custom multi-container step.
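
To make the three categories concrete, here is a minimal sketch of a pipeline that uses one of each.  The chapter and section elements, the status attribute, and the attribute values being added are all made up for illustration.  The version attribute follows the final XProc 1.0 Recommendation; a processor built against an earlier draft may not expect it.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Multi-container step: p:choose holds alternate subpipelines -->
  <p:choose>
    <p:when test="/chapter/@status = 'draft'">
      <!-- Atomic step: a single basic XML operation -->
      <p:add-attribute match="/chapter"
        attribute-name="reviewed" attribute-value="no"/>
    </p:when>
    <p:otherwise>
      <!-- Atomic step: pass the document through unchanged -->
      <p:identity/>
    </p:otherwise>
  </p:choose>

  <!-- Compound step: p:for-each runs its subpipeline once
       for each section selected from the previous result -->
  <p:for-each>
    <p:iteration-source select="/chapter/section"/>
    <p:add-attribute match="/section"
      attribute-name="checked" attribute-value="yes"/>
  </p:for-each>

  <!-- p:for-each produces a sequence of documents, so wrap them
       back into a single document for the result port -->
  <p:wrap-sequence wrapper="sections"/>

</p:pipeline>
```

The enclosing p:pipeline is itself a compound step, so this one listing contains all three kinds.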

No doubt, in your day-to-day work, you'll never feel the need to stop and consider whether a certain step fits one category or another.  But the difference does have consequences, so it's worth spending a bit of time on it.

Let's say that you want to use the above accept-changes step as a step within a larger pipeline — for example, a pipeline that removes all of the notes from 2nd-level sections in a document.  Here's how you would do that:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Declare accept-changes step -->

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
  
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

What's happening is that we are declaring a new step type called "u:accept-changes," and then invoking an instance of it as the first step in the pipeline.  A few notes:

  • The u:accept-changes step declaration itself is not executed directly; only its instance (the u:accept-changes element named "remove-tracking-markup" in the pipeline flow) is.  More formally, a pipeline element that is the child of another pipeline element (unlike what we've seen before) is used to define a step that can be invoked later.  As an analogy to other programming languages, you are defining an object type and then creating an instance of it.  Also, this pipeline needs no explicit piping; the default implicit connections suffice.
  • A step's type attribute is different from its name attribute.  The value you place in type becomes the name of the element you use to invoke it (compare the type="u:accept-changes" declaration with the u:accept-changes element that invokes it).  That instance can itself be given a name, which can then be used when controlling the flow of XML using pipes.  So, a type refers to the object, and a name refers to the instance.
  • A step's type must be in a non-XProc namespace, which you must remember to declare. 
  • The p:serialization element defines how a specified pipeline output port serializes XML when outputting from the pipeline.  It is the equivalent of the xsl:output instruction in XSLT, and takes many of the same options.
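
To show what an instance name is for, here is a sketch of the same pipeline with the connection between the two steps written out explicitly.  It should behave the same as the version above that relies on implicit connections; it uses the same accept_changes.xslt file and just adds a p:input binding and a version attribute.

```xml
<p:pipeline name="delete-notes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <!-- Declare accept-changes step (type = what it is called) -->
  <p:pipeline type="u:accept-changes">
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>

  <!-- Invoke it (name = this particular instance) -->
  <u:accept-changes name="remove-tracking-markup"/>

  <p:delete name="delete-level-2-notes"
      match="/chapter/section/section/note">
    <p:input port="source">
      <!-- step refers to the instance name, not the step type -->
      <p:pipe step="remove-tracking-markup" port="result"/>
    </p:input>
  </p:delete>

</p:pipeline>
```

Explicit pipes like this only become necessary when a step needs input from something other than the step immediately before it.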

Of course, if you're going to just reference your new step type once within the same pipeline that you've declared it in, there's not really much of a point.

The real power of declaring your own step type comes from the ability to place it in an external library and create instances of it within any number of pipelines.  Here's how you do that, using p:import.  First, the library document:

<p:library 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
       
</p:library>

Note the root element p:library, which is used as a container for a collection of custom steps.  If I wanted, I could have made the p:pipeline element the root element, and imported that single step, but using p:library leaves more room for future growth.
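
For comparison, here is a sketch of the single-step alternative, where p:pipeline is the root element of the imported file.  The file name accept_changes.xpl is hypothetical.

```xml
<!-- accept_changes.xpl (hypothetical file name) -->
<p:pipeline type="u:accept-changes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <!-- The type attribute is what makes this pipeline
       callable as a step from the importing pipeline -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>

</p:pipeline>
```

A pipeline that wants this step would then point its p:import at accept_changes.xpl rather than at a library file.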

Here's the final pipeline, which imports the above library:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Import accept-changes step -->

  <p:import href="lib.xpl" />
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
 
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

There.  A nice and simple pipeline.  I (and anyone else) can reference this u:accept-changes step in any number of pipelines, and never worry about how it actually works.  (And, if you stored the transform inside the u:accept-changes definition using p:inline, it would be even more portable.)  Later, if I decide that I'm better served by using a Python script to accept changes instead of a transform, I can simply update the step declaration (using p:exec) once in the library.  My hope is that XProc will become a common language for sharing XML manipulation tasks. 
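
Here is a rough sketch of the p:inline variant of the library.  I don't reproduce the real accept_changes.xslt here; the identity transform below is only a placeholder showing where the stylesheet would go.

```xml
<p:library
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

  <p:pipeline type="u:accept-changes">
    <p:xslt>
      <p:input port="stylesheet">
        <!-- The stylesheet travels inside the library file,
             so there is no separate .xslt file to ship -->
        <p:inline>
          <!-- Placeholder: identity transform standing in for
               the real contents of accept_changes.xslt -->
          <xsl:stylesheet version="1.0">
            <xsl:template match="@*|node()">
              <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>
          </xsl:stylesheet>
        </p:inline>
      </p:input>
    </p:xslt>
  </p:pipeline>

</p:library>
```

Pipelines that import the library don't change at all, which is the point: the step's callers never see how it is implemented.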

It's also worth noting that if I gave this note-deleting pipeline a type attribute, then it too could be invoked as a step in another pipeline.  And that pipeline could be invoked as a step, and so on.  Turtles all the way down.
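
As a sketch of one more turtle: assume the note-deleting pipeline has been saved as delete_notes.xpl with type="u:delete-notes" added to its root element.  The file name, the type name, the "publish" pipeline, and the status attribute are all hypothetical.

```xml
<p:pipeline name="publish"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <!-- Assumption: delete_notes.xpl holds the note-deleting
       pipeline, declared with type="u:delete-notes" -->
  <p:import href="delete_notes.xpl"/>

  <!-- The whole note-deleting pipeline, which itself calls
       u:accept-changes, is now invoked as one step -->
  <u:delete-notes name="clean-up"/>

  <p:add-attribute name="mark-clean" match="/chapter"
    attribute-name="status" attribute-value="clean"/>

</p:pipeline>
```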

Here's where the difference between an atomic step and a compound step comes into play.  Although we defined our accept-changes step as a compound step (because it contains a subpipeline of one step), when we invoke an instance of it, that instance is an atomic step.

If that is confusing (and it sure confused me at first), it helps if you think about the step from two separate perspectives — from within, looking at the implementation details, and without, looking at what actually enters and leaves the step.  We've already covered the inside.  From the outside, when you call an instance, that instance is a black box which XML enters and XML leaves — exactly like a built-in atomic step.  The fact that it's actually implemented by assembling XProc steps is irrelevant to the calling pipeline. 
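
One way to see the "outside" view is to write the step out in its p:declare-step longhand, where the ports that p:pipeline supplies implicitly become visible.  This is a sketch of the equivalent form under the final XProc 1.0 Recommendation, not something you need to write yourself.

```xml
<p:declare-step type="u:accept-changes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <!-- The outside: everything a caller can see or connect to -->
  <p:input port="source" primary="true"/>
  <p:input port="parameters" kind="parameter" primary="true"/>
  <p:output port="result" primary="true"/>

  <!-- The inside: implementation details hidden from callers -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>

</p:declare-step>
```

A calling pipeline sees only the three port declarations.  Whatever sits below them can be swapped out without touching any caller.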

There's one more important XProc fact I've talked around but never explicitly stated.  The only information that can flow between steps through pipes is XML. At first, I thought this a stifling restriction, but as I work more with XProc, I think that this prohibition is the key to its power. 

Because, when you combine all these ideas, you end up with a very high level of enforced encapsulation that provides a very simple yet powerful abstraction for working with XML.  It allows you to worry about what needs to be done instead of how.  When any pipeline can be used as an atomic step, you can work fractally, making parts out of the same raw material as the whole, creating higher and higher levels of abstraction.

So, to summarize:

  • A pipeline is a step.
  • Although the declaration of a step may be a compound step, its invocation is not. 
  • XML is the only information that can flow between steps through pipes. 

That's it for now.  If you're looking for more XProc information, Dave Pawson has been working on his introduction to XProc, which is shaping up to be a great resource.

I may start posting my XProc-related material in a more suitable space, away from my blog.  I'm considering using the DocBook Website system.  Any thoughts?

I'm James Sulak, a software developer in Houston, Texas.

You can also find me on Twitter, or if you're curious, on my old-fashioned home page. If you want to contact me directly, you can e-mail comments@wordsinboxes.com.