Words in Boxes

Nouns, verbs, and occasionally adjectives.

Monday, December 22, 2008

XProc, Part III — Turtles All The Way Down (Steps, Reuse, and Encapsulation)

This is my third post on XProc. (To see the first two, go here and here.)  In this post, I'll talk about the different categories of steps, how to create your own steps and reuse them across pipelines, and how this all relates to the fundamental metaphor of XProc. 

One task I do frequently is removing Arbortext Editor change-tracking markup from documents. Here's a simple pipeline that does just that:

<p:pipeline type="u:accept-changes">
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>

The fundamental unit of work in XProc is the "step," which you can think of in the same way you do subroutines in other programming languages. In general, a step takes XML as input, performs an operation on it, and outputs XML. There are three types of steps in XProc:

  • Atomic steps. These are the most basic building blocks of XProc pipelines, and each carries out a single fundamental XML operation.  There are a number of built-in atomic steps, such as p:xslt (as shown above).  These built-in atomic steps will be the foundation for almost everything you do in XProc, so learn them.  

  • Compound steps.  These are assembled from other XProc steps.  Sometimes they are just a series of implicitly connected atomic steps, and sometimes they use built-in logical control structures, such as p:for-each, to control their execution.

    It turns out that the above accept-changes pipeline, and in fact any pipeline you create, is also a compound step.  (The fact that p:pipeline is a shortcut for p:declare-step is a dead giveaway.)  This is an important idea, and we'll discuss it more below.  But for now, remember this:  a pipeline and a step are the same thing.

  • Multi-container steps.  There are only two multi-container steps: p:choose and p:try.  These contain two or more alternate pipelines.  You cannot define your own custom multi-container step.

No doubt, in your day-to-day work, you'll never feel the need to stop and consider whether a certain step fits one category or the other.  But the difference does have consequences, so it's worth spending a bit of time on it.
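
To make the categories concrete, here is a minimal sketch that uses one step of each kind. It assumes the same chapter/section/note structure as the examples in this post. The version attribute is included because the final XProc 1.0 Recommendation expects it on a top-level pipeline; treat it as an assumption and drop it if your processor predates that.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Compound step: p:viewport runs its subpipeline once for each
       top-level section and splices the result back into the document. -->
  <p:viewport match="/chapter/section">
    <!-- Atomic step: inside the viewport, each section is its own
         document, so the section is the root element here. -->
    <p:delete match="/section/note"/>
  </p:viewport>

  <!-- Multi-container step: exactly one of the branches below runs. -->
  <p:choose>
    <p:when test="/chapter">
      <!-- Atomic step: remove notes from 2nd-level sections. -->
      <p:delete match="/chapter/section/section/note"/>
    </p:when>
    <p:otherwise>
      <!-- Atomic step: pass anything that is not a chapter through unchanged. -->
      <p:identity/>
    </p:otherwise>
  </p:choose>

</p:pipeline>
```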

Let's say that you want to use the above accept-changes step as a step within a larger pipeline — for example, a pipeline that removes all of the notes from 2nd-level sections in a document.  Here's how you would do that:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Declare accept-changes step -->

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
  
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

What's happening is that we are declaring a new step type called "u:accept-changes," and then invoking an instance of it as the first step in the pipeline.  A few notes:

  • The u:accept-changes step declaration itself is not executed directly; only its instance on line 21 of the listing is.  More formally, a pipeline element that is the child of another pipeline element (unlike what we've seen before) is used to define a step that can be invoked later. As an analogy to other programming languages, you are defining an object type and then creating an instance of it.  Note that there is no need in this pipeline for explicit piping; the default implicit connections suffice.
  • A step's type attribute is different from its name attribute.  The value you place in type becomes the name of the element you use to invoke it (lines 11 and 21 of the listing).  That instance can itself be given a name, which can then be used when controlling the flow of XML using pipes.  So, a type refers to the object, and a name refers to the instance.
  • A step's type must be in a non-XProc namespace, which you must remember to declare. 
  • The p:serialization element defines how a specified pipeline output port serializes XML when outputting from the pipeline.  It is the equivalent of the xsl:output instruction in XSLT, and takes many of the same options.
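
To illustrate the difference between type and name, here is a sketch of how the p:delete step in the pipeline above could read its input explicitly. The p:pipe element refers to the instance name (remove-tracking-markup), not to the type (u:accept-changes). In this pipeline the connection is redundant, because the implicit connection between sibling steps does the same thing.

```xml
<!-- A fragment that would replace the p:delete step in the pipeline above. -->
<p:delete name="delete-level-2-notes"
    match="/chapter/section/section/note">
  <p:input port="source">
    <!-- "step" takes the name of the instance, never its type. -->
    <p:pipe step="remove-tracking-markup" port="result"/>
  </p:input>
</p:delete>
```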

Of course, if you're going to just reference your new step type once within the same pipeline that you've declared it in, there's not really much of a point.

The real power of declaring your own step type comes from the ability to place it in an external library and create instances of it within any number of pipelines.  Here's how you do that, using p:import.  First, the library document:

<p:library 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
       
</p:library>

Note the root element p:library, which is used as a container for a collection of custom steps.  If I wanted, I could have made the p:pipeline element the root element, and imported that single step, but using p:library leaves more room for future growth.
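
For comparison, here is a sketch of the single-step alternative, where the step declaration is itself the root element of the imported document. The file name accept_changes.xpl is hypothetical, and the version attribute is an assumption based on the final XProc 1.0 Recommendation.

```xml
<!-- accept_changes.xpl (hypothetical file name): the step is the root element -->
<p:pipeline type="u:accept-changes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>
```

The importing pipeline would then point p:import at that file instead of at a library:

```xml
<p:import href="accept_changes.xpl"/>
```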

Here's the final pipeline, which imports the above library:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Import accept-changes step -->

  <p:import href="lib.xpl" />
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
 
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

There.  A nice and simple pipeline.  I (and anyone else) can reference this u:accept-changes step in any number of pipelines, and never worry about how it actually works.  (And, if you stored the transform inside the u:accept-changes definition using p:inline, it would be even more portable.)  Later, if I decide that I'm better served by using a Python script to accept changes instead of a transform, I can simply update the step declaration (using p:exec) once in the library.  My hope is that XProc will become a common language for sharing XML manipulation tasks. 
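
As a rough sketch of the p:inline idea, the library could carry the transform inside the step declaration. The stylesheet below is not the real accept_changes.xslt; it is a stand-in that assumes change tracking is marked with atict:add and atict:del elements, keeping the content of the former and dropping the latter. Adjust it to match your own markup.

```xml
<p:library
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <p:pipeline type="u:accept-changes">
    <p:xslt>
      <p:input port="stylesheet">
        <!-- The stylesheet travels with the step; no external file needed. -->
        <p:inline>
          <xsl:stylesheet version="1.0"
              xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
              xmlns:atict="http://www.arbortext.com/namespace/atict">

            <!-- Identity template: copy everything by default. -->
            <xsl:template match="@*|node()">
              <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>

            <!-- Assumed markup: accept an insertion by keeping its content. -->
            <xsl:template match="atict:add">
              <xsl:apply-templates/>
            </xsl:template>

            <!-- Assumed markup: accept a deletion by dropping it. -->
            <xsl:template match="atict:del"/>

          </xsl:stylesheet>
        </p:inline>
      </p:input>
    </p:xslt>
  </p:pipeline>

</p:library>
```

Pipelines that import this library invoke u:accept-changes exactly as before; only the declaration changed.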

It's also worth noting that if I gave this note-deleting pipeline a type attribute, then it too could be invoked as a step in another pipeline.  And that pipeline could be invoked as a step, and so on.  Turtles all the way down.
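
Here is a sketch of what that nesting could look like. The type u:delete-notes and the file name delete_notes.xpl are made up for illustration, and the version attribute is an assumption based on the final XProc 1.0 Recommendation. First, the note-deleting pipeline with a type attribute added:

```xml
<!-- delete_notes.xpl (hypothetical file name) -->
<p:pipeline type="u:delete-notes"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <p:import href="lib.xpl"/>

  <u:accept-changes/>
  <p:delete match="/chapter/section/section/note"/>

</p:pipeline>
```

Then a second pipeline that imports it and uses it as a single step, followed by the to_wordml.xslt transform from the previous post:

```xml
<p:pipeline
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc"
    version="1.0">

  <p:import href="delete_notes.xpl"/>

  <!-- The whole note-deleting pipeline, invoked as one step. -->
  <u:delete-notes/>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt"/>
    </p:input>
  </p:xslt>

</p:pipeline>
```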

Here's where the difference between an atomic step and a compound step comes into play.  Although we defined our accept-changes step as a compound step (because it contains a subpipeline of one step), when we invoke an instance of it, that instance is an atomic step.

If that is confusing (and it sure confused me at first), it helps if you think about the step from two separate perspectives — from within, looking at the implementation details, and without, looking at what actually enters and leaves the step.  We've already covered the inside.  From the outside, when you call an instance, that instance is a black box which XML enters and XML leaves — exactly like a built-in atomic step.  The fact that it's actually implemented by assembling XProc steps is irrelevant to the calling pipeline. 

There's one more important XProc fact I've talked around but never explicitly stated.  The only information that can flow between steps through pipes is XML. At first, I thought this a stifling restriction, but as I work more with XProc, I think that this prohibition is the key to its power. 

Because, when you combine all these ideas, you end up with a very high level of enforced encapsulation that provides a very simple yet powerful abstraction for working with XML.  It allows you to worry about what needs to be done instead of how.  When any pipeline can be used as an atomic step, you can work fractally, making parts out of the same raw material as the whole, creating higher and higher levels of abstraction.

So, to summarize:

  • A pipeline is a step.
  • Although the declaration of a step may be a compound step, its invocation is not. 
  • XML is the only information that can flow between steps through pipes. 

That's it for now.  If you're looking for more XProc information, Dave Pawson has been working on his introduction to XProc, which is shaping up to be a great resource.

I may start posting my XProc-related material in a more suitable space, away from my blog.  I'm considering using the DocBook Website system.  Any thoughts?

Tuesday, December 16, 2008

Recently In Markup

Sunday, December 07, 2008

Getting Started With XProc, Part II

In my previous post, I showed how to set up a batch script to run Calabash in Windows.  This time I'll show how to write a useful — if simple — XProc pipeline.  Here it is:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
  </p:xslt>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt" />
    </p:input>
  </p:xslt>

</p:pipeline>

So what happens here?  When you execute it from the command line using:

runcalabash -i "source=in.xml" -o "result=out.xml" sort-and-print.xpl

this pipeline reads the input XML (in.xml), executes the first transform on it, passes the result to the second transform, executes that transform, and then saves the result as out.xml.  But the syntax of XProc is clear enough that you probably guessed that on your own. 

The fundamental metaphor of XProc is that XML data "flows" into a pipeline and between its steps through a series of connected ports.  The two most common and important ports are the primary input port (usually called "source") and the primary output port (usually called "result").  There are others, and not every step actually has both of these, but that's another discussion. 

This isn't obvious in the above pipeline because all of the input and output ports, and the connections between them, are implicit.  There are more explicit ways to state most of the pipeline.  For example, it turns out that p:pipeline is just syntactic sugar for p:declare-step.  In other words:

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" >
</p:pipeline>

is equivalent to:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" >
  <p:input port="source" primary="true" />
  <p:input port="parameters" kind="parameter" />
  <p:output port="result" primary="true" />
</p:declare-step>

Expanding on this, here's a more explicit version of our original pipeline:

<p:declare-step name="sort-and-print" xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true">
    <p:pipe step="print" port="result" />
  </p:output>
  <p:input port="parameters" kind="parameter" />
  
  <p:xslt name="sort">
    <p:input port="source">
      <p:pipe step="sort-and-print" port="source" />
    </p:input>
    <p:input port="stylesheet">
      <p:document href="sort.xslt" />
    </p:input>
    <p:input port="parameters">
      <p:pipe step="sort-and-print" port="parameters" />
    </p:input>
  </p:xslt>

  <p:xslt name="print">
    <p:input port="source">
      <p:pipe step="sort" port="result"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="to_wordml.xslt"/>
    </p:input>
    <p:input port="parameters">
      <p:pipe step="sort-and-print" port="parameters" />
    </p:input>
  </p:xslt>
  
</p:declare-step>

That's quite a mouthful, but it exposes the relationships between all the steps.  There are a few things worth noting.  First, the p:pipe element connects the input of a step to another step.  In the above pipelines, these elements are redundant, since XProc automatically connects sequential sibling steps, but they become necessary when you need to connect non-sibling steps in larger, more complicated pipelines.

Second, every step can be given a name using the "name" attribute, and that name must be unique within the pipeline.  These names can be referenced by the "step" attribute of p:pipe to explicitly connect ports.  If a step isn't given an explicit name, one is generated internally by the XProc processor. 

Third, the p:xslt step gets its stylesheet from a port that is just like any other port.  Most of the time, you will explicitly point to a stylesheet using a p:document element, but you have a lot more flexibility than that.  For example, you can programmatically create a stylesheet in one step, and then pipe it to a p:xslt to execute it.  This would be a great alternative to using saxon:compile-stylesheet() and saxon:transform() within a single, multiple pass stylesheet. 
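
Here is a sketch of that generate-then-execute pattern. The files rules.xml and rules_to_xslt.xslt are hypothetical: the first stands for some configuration document, and the second for a transform that turns it into a stylesheet. The version attribute is an assumption based on the final XProc 1.0 Recommendation.

```xml
<p:pipeline name="main"
    xmlns:p="http://www.w3.org/ns/xproc"
    version="1.0">

  <!-- Step 1: build a stylesheet. Its result port carries XSLT, not content. -->
  <p:xslt name="generate-stylesheet">
    <p:input port="source">
      <p:document href="rules.xml"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="rules_to_xslt.xslt"/>
    </p:input>
  </p:xslt>

  <!-- Step 2: run the generated stylesheet against the pipeline's input. -->
  <p:xslt name="apply-generated">
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
    <p:input port="stylesheet">
      <p:pipe step="generate-stylesheet" port="result"/>
    </p:input>
  </p:xslt>

</p:pipeline>
```

Both connections in the second step have to be explicit, because the default connection would feed the generated stylesheet into the source port instead.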

Fourth, p:output works a bit differently than you might expect (as talked about recently on xproc-dev here and here).  The p:output element does not define where the output for a step goes; it defines where the output comes from.  So, while it may initially seem like this

<p:output port="result">
  <p:document href="out.xml" />
</p:output>

will send the output to out.xml, it instead sends the content of out.xml to the "result" output port.  What this means is that where the output port of a pipeline or a step goes cannot be defined within that pipeline or step.  It must be defined outside that step or pipeline — for example on the command line or in the input port of another step. 

(If you want the other effect, you can use p:store, which can be thought of as the equivalent of <xsl:result-document/> in XSLT, but in that case the output from the step is actually the location of the file you are writing and not the actual content of the output.) 
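
A minimal sketch of the p:store approach, assuming you want the sorted document written to out.xml from inside the pipeline rather than from the command line. This version declares no output port at all, and it binds the p:xslt parameters port to p:empty because the pipeline declares no parameter input. The version attribute is an assumption based on the final XProc 1.0 Recommendation.

```xml
<p:declare-step name="sort-and-store"
    xmlns:p="http://www.w3.org/ns/xproc"
    version="1.0">

  <p:input port="source"/>

  <p:xslt name="sort">
    <p:input port="stylesheet">
      <p:document href="sort.xslt"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

  <!-- Writes the document arriving on its source port to out.xml. -->
  <p:store href="out.xml"/>

</p:declare-step>
```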

That's it for now.  Next time:  How to create and use a library of custom XProc steps.

Saturday, December 06, 2008

This Week in Google Reader

Friday, December 05, 2008

Typography and the New York City Subway

An AIGA article and a New York Times blog post about the history of the typography of the NYC subway system are both good reads: 

Helvetica was originally created in Switzerland. It was a neutral typeface from a neutral country and gained runaway popularity starting in the 1960s for its modern grace. But the subway system looked elsewhere.

“It was an incredibly courageous thing to do at a time when Helvetica was riding high,” Mr. Shaw said.

That's right.  Typography as an act of bravery. 

(via Daring Fireball)

Thursday, December 04, 2008

Getting Started With XSpec

Lately I've been working with XSpec, a new testing framework for XSLT created by Jeni Tennison. I dabbled with Jeni's old testing framework, Tennison Tests, but I never could muster the discipline to use it as much as I should. XSpec makes testing frictionless, which is really how it must be if I'm going to make it a habit. 

Here's how to integrate it into Oxygen:

[Screenshot: the XSpec button on the Oxygen toolbar]

On Windows, XSpec is launched by running a batch file. It's simple to hook up this batch file to Oxygen as an external tool so that you can run an XSpec suite by clicking a button.  By being a little clever - and sticking to a consistent naming convention - we can set it up so that we can activate the tests either from the XSpec document itself or from the XSLT you're developing.  The caveat is that you must save each XSpec document in the same folder as the XSLT it tests, and give it the same name as that transform, except with the extension "xspec" instead of "xslt." (For example, my_stylesheet.xspec tests my_stylesheet.xslt.) If you want to follow a different convention, adjust the following instructions appropriately.

  1. Download XSpec and unzip it somewhere in your system path. I recommend getting the latest xspec.bat and transforms, since Jeni's fixed a few bugs from the initial release.
  2. Open up xspec.bat, and edit the third line to set the CP (or class path) variable to the location where Saxon is installed. (If you don't have Saxon, get it).
  3. Now we need to add XSpec as an external tool in Oxygen. To do so, go to Tools > External Tools > Preferences > New and fill out the fields like so:

    [Screenshot: the XSpec external tool settings in Oxygen]

    (Make sure to change the paths to match those on your computer.)

  4. I like to tell Oxygen to make XSpec-namespaced elements a different color. To do that, go to Options > Preferences > Editor > Colors > Elements by Prefix.  This makes it easier to differentiate between XSpec instructions, and the inline XML test data.

Now you should have a button on your toolbar labeled "Run XSpec Test." If it doesn't show up, make sure to activate the "External Tools" toolbar.

Here's an example of a successful test report:

[Screenshot: an XSpec test report]

Tuesday, December 02, 2008

Creativity and “Courageous Sucking”

From 43 Folders (emphasis mine):

I laid on the sidewalk. All the way down. On my gut on 50° of western San Francisco concrete.

And, I took my time, thinking about the aperture (all the way open for depth of field) and the available light (very little, so I put the camera right on the ground to steady it). I snapped a dozen or more shots with slightly different settings. No idea what I was doing. People walked by, cars passed, the L barreled by, but I kept shooting until I was satisfied that I might have something. Then, I grabbed the shoe, stood up, and trotted back up the hill, triumphant, with a recovered piece of footwear, plus what I suspected might be at least one pretty good photo.

I like how it turned out.

Yeah, I know, it’s no masterpiece, but I’m proud of it for reasons of my own. Because, last night, as I was splayed prone in the fog along Taraval Street, I realized I was getting a little better at this.

Not because I’d been magically touched with mythical creativity and skill, but because for a moment I was thinking more about how to use what I’d learned to get a good photo than I was about how I might have looked while doing it. And, that felt like a small turning point.

I'm James Sulak, a software developer in Houston, Texas.

You can also find me on Twitter, or if you're curious, on my old-fashioned home page. If you want to contact me directly, you can e-mail comments@wordsinboxes.com.