Tuesday, May 26, 2009

XProc Roundup


Interest in XProc has really picked up recently, which is exciting to see.  Here's a roundup of some of the recent activity:

  • XProc: Step by Step. Vojtech Toman (who's working on EMC's Calumet implementation) wrote a great introduction to XProc.  Calumet itself is scheduled to be released under a developer's license on June 15.  I don't use any of EMC's products, but it's encouraging that such a big player in the market is embracing XProc.
  • XProc Tutorial.  Roger Costello updated his excellent tutorial.  It's a bit of a shame that it's in PowerPoint, but it's well organized and comes with several sample scripts.
  • Why XProc Rocks.  Joel Amoussou talks about XProc and news publishing.
  • XML Pipelines / XProc for bioinformatics. Pierre Lindenbaum describes an interesting use of XProc.

The XProc spec is tantalizingly close to making the transition to proposed recommendation, and all the recent (and positive) attention makes me optimistic that XProc will be a standard that's actually used in the real world. 

Monday, March 16, 2009

Debugging XProc Pipelines

As I've worked more with XProc, I've written a couple of utility steps to help debug pipelines.  Although they started as quick hacks, they've continued helping out long enough to be worth sharing.  You can download the library here.

The more interesting of those is a step (wxp:assert) which asserts that the result of a given XPath expression evaluated against one document must equal that of a second XPath evaluated against a second document. 

Practically speaking, I can assert that the output from a step must match certain expectations.  If this assertion is met, the wxp:assert step functions just like a p:identity step, passing XML through its primary input to its primary output without modification.  If the assertion fails, the step sends an error message to the console, and optionally throws an XProc error and terminates pipeline execution. 

For example, here's an excerpt from an upconversion pipeline:

<p:xslt name="group-sections">
  <p:input port="stylesheet">
    <p:document href="group-sections.xslt"/>
  </p:input>
</p:xslt>

<wxp:debug-output step-name="6-sections-grouped" debug="true"/>

<wxp:assert label="No sections were deleted during grouping" 
            xpath-source="count(.//subsection)"
            xpath-alternate="count(.//*[self::sect-1 | self::sect-2 | 
                             self::sect-3 | self::sect-4])" 
            fail-on-error="false">
  <p:input port="alternate">
    <p:pipe port="result" 
            step="parse-initial-subsections"/>
  </p:input>
</wxp:assert>

(This also shows the wxp:debug-output step, which has probably been rendered obsolete by the latest version of Calabash.  It simply functions as a p:identity step that also writes the XML it receives to disk for later review.)

This instance of wxp:assert asserts that the number of subsection elements in the primary source input port matches the number of sect-1 through sect-4 elements on the alternate input port.  The source document is produced by the preceding step, and the alternate document is piped in from a step earlier in the pipeline.  This is a quick way to ensure I don't do anything too lame-brained in the "group sections" step.

Here's another instance from the same upconversion pipeline.  This tests the output from an XSLT step that uses regular expressions to parse out section numbers from title elements.

<wxp:assert label="No empty enums after parsing sections"
    xpath-source="count(.//enum[not(.//text()[string-length(normalize-space(.)) gt 0])])" 
    xpath-alternate="0"
    fail-on-error="true"/>

In this case, the alternate XPath expression is a constant (0), so I don't need to provide a document on the alternate port. 

Using wxp:assert steps does slow down a pipeline, so I usually remove them once I finish debugging the pipeline.  The major exception to this rule is that I leave them active in upconversion pipelines where manual intervention is acceptable and accuracy is more important than speed. 

There are a few interesting things in the step's source worth checking out.  Here's a summary of how wxp:assert works:

  1. It combines the two source documents into a single document using p:pack.
  2. It passes the packed document to an XSLT stylesheet, which uses the saxon:evaluate extension function to evaluate the two XPaths against their respective nodesets.
  3. If the two results differ, it reports the failure and, depending on the value of the fail-on-error option, throws an XProc error and terminates the pipeline.
  4. It uses a final p:identity step to pipe the initial input from the step's source port to its result port.

Of course, using a Saxon extension function does tie the pipeline to Calabash (or at least, a Saxon-based processor). I'm interested in hearing ideas about alternate approaches.

Wednesday, February 11, 2009

Code Review: Commenting XSLT Regular Expressions

You learn a lot from reading other people's code.  For example, the other day I ran into a clever trick in Jeni Tennison's XSpec code for commenting regular expressions in XSLT:

<xsl:variable name="attribute-regex" as="xs:string">
  <xsl:value-of>
    \s+
    (\S+)        <!-- 1: the name of the attribute -->
    \s*
    =
    \s*
    (       <!-- 2: the value of the attribute (with quotes) -->
      "([^"]*)"  <!-- 3: the value without quotes -->
      |
      '([^']*)'  <!-- 4: also the value without quotes -->
    )
  </xsl:value-of>
</xsl:variable>

The trick is the <xsl:value-of /> instruction, which casts its contents as a string.  An especially nice thing about this method is that you can refer to other variables within the declaration:

   (\S+)    <!-- 12: the name of the element being opened -->
   (        <!-- 13: the attributes of the element -->
     (      <!-- 14: wrapper for the attribute regex -->
       <xsl:value-of select="$attribute-regex" />  <!-- 15-18 attribute stuff -->
     )*
   )

Of course, to ignore all the extra white space in a regex constructed this way, you'll need to set the "x" flag in any <xsl:analyze-string />, replace(), or matches() that refers to it.
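To make the "x" flag concrete, here's a minimal sketch of a stylesheet that builds a commented regex this way and then uses it in both xsl:analyze-string and matches().  The date-regex variable and the result, date, and has-date elements are made up for illustration; they are not from the XSpec code.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical regex, laid out and commented as in the example above -->
  <xsl:variable name="date-regex" as="xs:string">
    <xsl:value-of>
      (\d{4})   <!-- 1: the year -->
      -
      (\d{2})   <!-- 2: the month -->
    </xsl:value-of>
  </xsl:variable>

  <xsl:template match="/">
    <result>
      <!-- flags="x" tells the processor to ignore the layout white space -->
      <xsl:analyze-string select="string(.)" regex="{$date-regex}" flags="x">
        <xsl:matching-substring>
          <date year="{regex-group(1)}" month="{regex-group(2)}"/>
        </xsl:matching-substring>
      </xsl:analyze-string>

      <!-- In matches() and replace(), the flag is the last argument -->
      <has-date>
        <xsl:value-of select="matches(string(.), $date-regex, 'x')"/>
      </has-date>
    </result>
  </xsl:template>

</xsl:stylesheet>
```

Without the flag, the line breaks and indentation inside the variable would be treated as literal characters to match, and the regex would quietly stop matching anything.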

Wednesday, February 04, 2009

“Broken gets fixed. Shoddy lasts forever”

Jack Moffett:

One of the developers I work with said this after I complained about a lingering issue in one of our products. It rings true. When deadlines are tight, and there is more work to get done than there are developers or hours in the schedule, it’s not the squeaky wheel, but the jammed one that gets the grease. The lesson, then, is to make sure it gets done right the first time. You never know when you’ll have the opportunity to revisit it.

(via Daring Fireball)

Tuesday, January 20, 2009

Stark and Adjustable Meanings

An interview of Alan Kay:

But at PARC our idea was, since you never step in the same river twice, the number-one thing you want to make the user interface be is a learning environment—something that’s explorable in various ways, something that is going to change over the lifetime of the user using this environment. New things are going to come on, and what does it mean for those new things to happen?

... Even if the user is an absolute expert, able to remember almost everything, I’m always interested in the difference between what you might call stark meaning and adjustable meaning.

I did quite a bit of study on that over the years to understand the influence of having something that you can read. It’s known that our basic language mechanism for both reading and hearing has a fast and a slow process. The fast process has basically a surface phrasal-size nature, and then there’s a slower one. This is why jokes require pauses; the joke is actually a jump from one context to another, and the slower guy, who is dealing with the real meanings, has to catch up to it.

There have been many, many studies of this. This argues that the surface form of a language, whatever it is, has to be adjustable in some form.

Sunday, January 11, 2009

Money Is Gas in the Car

Tim O'Reilly talks about the importance of working on things that are important, not just things that simply make money:

First off, though, I want to make clear that "work on stuff that matters" does not mean focusing on non-profit work, "causes," or any other form of "do-goodism." Non-profit projects often do matter a great deal, and people with tech skills can make important contributions, but it's essential to get beyond that narrow box. I'm a strong believer in the social value of business done right. We need to build an economy in which the important things are paid for in self-sustaining ways rather than as charities to be funded out of the goodness of our hearts.

...

Some of you may end up working at highflying companies. Some of you may succeed, and some of you may fail. I want to remind you that financial success is not the only goal or the only measure of success. It's easy to get caught up in the heady buzz of making money. You should regard money as fuel for what you really want to do, not as a goal in and of itself. Money is like gas in the car — you need to pay attention or you'll end up on the side of the road — but a well-lived life is not a tour of gas stations!

Thursday, January 08, 2009

What's Your Semantic Exit Strategy?

Quoting Cafe Con Leche quoting A List Apart:

We’ll start by posing the question: “why are we inventing these new elements?” A reasonable answer would be: “because HTML lacks semantic richness, and by adding these elements, we increase the semantic richness of HTML—that can’t be bad, can it?”

By adding these elements, we are addressing the need for greater semantic capability in HTML, but only within a narrow scope. No matter how many elements we bolt on, we will always think of more semantic goodness to add to HTML. And so, having added as many new elements as we like, we still won’t have solved the problem. We don’t need to add specific terms to the vocabulary of HTML, we need to add a mechanism that allows semantic richness to be added to a document as required. In technical terms, we need to make HTML extensible. HTML 5 proposes no mechanism for extensibility.

HTML 5, therefore, implements a feature that breaks a sizable percentage of current browsers, and doesn’t really allow us to add richer semantics to the language at all.

This is an important warning for anyone creating or maintaining a custom XML document schema. Creating a tight, semantically correct element for every type of information in your existing documents is easy (and dangerously seductive); creating a mechanism by which the schema's users themselves can gracefully expand the semantic tagging as documents grow is much harder, and ultimately much more important.  Be as general as you can get away with, and as specific as you dare.

Tuesday, January 06, 2009

Lessons from a Windows Reinstall

The Windows install on my fiancee's laptop, a 3-year-old Gateway, had become unstable and slow (as Windows does if left alone for three years) so it was time to wipe the hard drive and start again.  I methodically prepared, backing up all the data to two different locations — Dropbox and an external hard drive.  Also, at my father's urging, I created a drive image using the free trial of Acronis True Image, which I stored on an external hard drive.  I didn't think I'd need it, because the current install was so horrible that anything was preferable to it.

So, I rebooted, I popped in the Gateway system restore disk, and told it to format the drive and reinstall the operating system. 

When it was done, the computer booted into Windows.  Happy surprise, the restore CD didn't reinstall all of the stupid cruft software that all consumer laptops come loaded with.  However, a not-so-happy-surprise — the CD also didn't reinstall any of the Gateway-specific drivers the laptop needed to function, so it couldn't use its wireless card or suspend. 

No problem, I thought, I'll get them from the system restore partition on the primary hard drive.  Hmm, the restore process repartitioned the drive and killed it.  (Really?)  Okay, no problem, I'll go to Gateway's website and download the drivers.  Hmm, those don't work. (Seriously?) I'll go to the chipset manufacturer's website.  Hmm, not there.  (Oh, no.)

This problem was entirely my fault.  I should have made absolutely sure I had the drivers I needed before wiping the disk, so I'll spare you the whole rant of WHY DOESN'T GATEWAY'S CUSTOM SYSTEM RESTORE DISK CONTAIN THE DRIVERS TO ACTUALLY, YOU KNOW, RESTORE THE SYSTEM?  Because, that would be petty.

After some Internet searching, I discovered a free utility called Double Driver, which allows you to back up and restore all the drivers installed on your system.  Which was great, except I'd already killed the Windows installation with the drivers I needed. 

To make a long story short, here's what I did:

  1. Created a new drive image of my partially finished reinstall with Acronis True Image.
  2. Restored the original, pre-wipe drive image.
  3. Booted into Windows and ran Double Driver to create an archive of all the installed drivers on my USB drive.
  4. Restored the in-progress drive image.
  5. Used Double Driver to restore all the previously installed drivers.

Success!  (Well, Double Driver actually missed one file, which I had to retrieve from the backup image.  But close enough.)

Also, although Acronis True Image saved the day, I can't recommend it as a product, since its background service slowed to a crawl the two computers I installed it on.  Which is too bad.

So, recapping today's lessons, if you're reinstalling Windows, you should first:

  • Back up all your data.  Twice.
  • Create a drive image of your current install.  It will save your bacon. 
  • Create an archive of all your drivers using Double Driver.  This too will save your bacon.

Monday, December 22, 2008

XProc, Part III — Turtles All The Way Down (Steps, Reuse, and Encapsulation)

This is my third post on XProc. (To see the first two, go here and here.)  In this post, I'll talk about the different categories of steps, how to create your own steps and reuse them across pipelines, and how this all relates to the fundamental metaphor of XProc. 

One task I do frequently is removing Arbortext Editor change-tracking markup from documents. Here's a simple pipeline that does just that:

<p:pipeline type="u:accept-changes">
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="accept_changes.xslt"/>
    </p:input>
  </p:xslt>
</p:pipeline>

The fundamental unit of work in XProc is the "step," which you can think of in the same way you do subroutines in other programming languages. In general, a step takes XML as input, performs an operation on it, and outputs XML. There are three types of steps in XProc:

  • Atomic steps. These are the most basic building blocks of XProc pipelines, and each carries out a single fundamental XML operation.  There are a number of built-in atomic steps, such as p:xslt (as shown above).  These built-in atomic steps will be the foundation for almost everything you do in XProc, so learn them.  

  • Compound steps.  These are assembled from other XProc steps.  Sometimes they are just a series of implicitly connected atomic steps, and sometimes they use built-in logical control structures, such as p:for-each, to control their execution.

    It turns out that the above accept-changes pipeline, and in fact any pipeline you create, is also a compound step.  (The fact that p:pipeline is a shortcut for p:declare-step is a dead giveaway.)  This is an important idea, and we'll discuss it more below.  But for now, remember this:  a pipeline and a step are the same thing.

  • Multi-container steps.  There are only two multi-container steps: p:choose and p:try.  These contain two or more alternate pipelines.  You cannot define your own custom multi-container step.

No doubt, in your day-to-day work, you'll never feel the need to stop and consider whether a certain step fits one category or another.  But the difference does have consequences, so it's worth spending a bit of time on it.

Let's say that you want to use the above accept-changes step as a step within a larger pipeline — for example, a pipeline that removes all of the notes from 2nd-level sections in a document.  Here's how you would do that:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Declare accept-changes step -->

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
  
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

What's happening is that we are declaring a new step type called "u:accept-changes," and then invoking an instance of it as the first step in the pipeline.  A few notes:

  • The u:accept-changes step declaration itself is not executed directly; only its instance on line 21 is.  More formally, a pipeline element that is the child of another pipeline element (unlike what we've seen before) is used to define a step that can be invoked later. As an analogy to other programming languages, you are defining an object type and then creating an instance of it.  As a result, there is no need in this pipeline for explicit piping; the default implicit connections suffice.
  • A step's type attribute is different than its name attribute.  The value you place in type becomes the name of the element you use to invoke it (lines 11 and 21).  That instance can itself be given a name, which can then be used when controlling the flow of XML using pipes.  So, a type refers to the object, and a name refers to the instance.
  • A step's type must be in a non-XProc namespace, which you must remember to declare. 
  • The p:serialization element defines how a specified pipeline output port serializes XML when outputting from the pipeline.  It is the equivalent of the xsl:output instruction in XSLT, and takes many of the same options.
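To show where the instance name actually earns its keep, here's a rough sketch of a variation on the pipeline above that needs explicit piping.  The compare-notes pipeline and the before-and-after wrapper element are hypothetical, and I've added a version attribute on the assumption that your processor follows the final XProc 1.0 spec, which expects one on the top-level pipeline.

```xml
<p:pipeline name="compare-notes" version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <!-- The type: u:accept-changes becomes the element name used to invoke the step -->
  <p:pipeline type="u:accept-changes">
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>

  <!-- The name: remove-tracking-markup identifies this one instance -->
  <u:accept-changes name="remove-tracking-markup"/>

  <p:delete name="delete-level-2-notes"
      match="/chapter/section/section/note"/>

  <!-- Pipes point at instances by name, never at the type -->
  <p:wrap-sequence wrapper="before-and-after">
    <p:input port="source">
      <p:pipe step="remove-tracking-markup" port="result"/>
      <p:pipe step="delete-level-2-notes" port="result"/>
    </p:input>
  </p:wrap-sequence>

</p:pipeline>
```

The output is a single before-and-after element holding the document as it looked before and after the notes were deleted.  If the same step type were invoked twice, the two instances would need different names so that a pipe could tell them apart.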

Of course, if you're going to just reference your new step type once within the same pipeline that you've declared it in, there's not really much of a point.

The real power of declaring your own step type comes from the ability to place it in an external library and create instances of it within any number of pipelines.  Here's how you do that, using p:import.  First, the library document:

<p:library 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:pipeline type="u:accept-changes">    
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="accept_changes.xslt"/>
      </p:input>
    </p:xslt>
  </p:pipeline>
       
</p:library>

Note the root element p:library, which is used as a container for a collection of custom steps.  If I wanted, I could have made the p:pipeline element the root element, and imported that single step, but using p:library leaves more room for future growth.

Here's the final pipeline, which imports the above library:

<p:pipeline name="delete-notes" 
    xmlns:p="http://www.w3.org/ns/xproc" 
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:serialization port="result" encoding="utf-8" method="xml"  
    omit-xml-declaration="false"/>

  <!-- Import accept-changes step -->

  <p:import href="lib.xpl" />
  
  <!-- Pipeline flow starts here -->  

  <u:accept-changes name="remove-tracking-markup" />
 
  <p:delete name="delete-level-2-notes" 
    match="/chapter/section/section/note" />
     
</p:pipeline>

There.  A nice and simple pipeline.  I (and anyone else) can reference this u:accept-changes step in any number of pipelines, and never worry about how it actually works.  (And, if you stored the transform inside the u:accept-changes definition using p:inline, it would be even more portable.)  Later, if I decide that I'm better served by using a Python script to accept changes instead of a transform, I can simply update the step declaration (using p:exec) once in the library.  My hope is that XProc will become a common language for sharing XML manipulation tasks. 
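Here's roughly what the p:inline variation of the library might look like.  The embedded stylesheet is only a placeholder, not my actual accept_changes.xslt: it assumes that insertions are wrapped in atict:add and deletions in atict:del, and ignores every other kind of change-tracking markup.  The version attribute is also an addition, on the assumption that your processor follows the final XProc 1.0 spec.

```xml
<p:library version="1.0"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atict="http://www.arbortext.com/namespace/atict"
    xmlns:u="http://www.wordsinboxes.com/xproc">

  <p:pipeline type="u:accept-changes">
    <p:xslt>
      <!-- The stylesheet travels with the step instead of living in its own file -->
      <p:input port="stylesheet">
        <p:inline>
          <xsl:stylesheet version="2.0">

            <!-- Copy everything by default -->
            <xsl:template match="@*|node()">
              <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
              </xsl:copy>
            </xsl:template>

            <!-- Accept an insertion: keep the content, drop the wrapper -->
            <xsl:template match="atict:add">
              <xsl:apply-templates/>
            </xsl:template>

            <!-- Accept a deletion: drop the wrapper and its content -->
            <xsl:template match="atict:del"/>

          </xsl:stylesheet>
        </p:inline>
      </p:input>
    </p:xslt>
  </p:pipeline>

</p:library>
```

The calling pipeline doesn't change at all.  It still imports lib.xpl and invokes u:accept-changes, which is the whole point.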

It's also worth noting that if I gave this note-deleting pipeline a type attribute, then it too could be invoked as step in another pipeline.  And that pipeline could be invoked as a step, and so on.  Turtles all the way down.


Here's where the difference between an atomic step and a compound step comes into play.  Although we defined our accept-changes step as a compound step (because it contains a subpipeline of one step), when we invoke an instance of it, that instance is an atomic step.

If that is confusing (and it sure confused me at first), it helps if you think about the step from two separate perspectives — from within, looking at the implementation details, and without, looking at what actually enters and leaves the step.  We've already covered the inside.  From the outside, when you call an instance, that instance is a black box which XML enters and XML leaves — exactly like a built-in atomic step.  The fact that it's actually implemented by assembling XProc steps is irrelevant to the calling pipeline. 

There's one more important XProc fact I've talked around but never explicitly stated.  The only information that can flow between steps through pipes is XML. At first, I thought this a stifling restriction, but as I work more with XProc, I think that this prohibition is the key to its power. 

Because, when you combine all these ideas, you end up with a very high level of enforced encapsulation that provides a very simple yet powerful abstraction for working with XML.  It allows you to worry about what needs to be done instead of how.  When any pipeline can be used as an atomic step, you can work fractally, making parts out of the same raw material as the whole, creating higher and higher levels of abstraction.


So, to summarize:

  • A pipeline is a step.
  • Although the declaration of a step may be a compound step, its invocation is not. 
  • XML is the only information that can flow between steps through pipes. 

That's it for now.  If you're looking for more XProc information, Dave Pawson has been working on his introduction to XProc, which is shaping up to be a great resource.

I may start posting my XProc-related material in a more suitable space, away from my blog.  I'm considering using the DocBook Website system.  Any thoughts?

Tuesday, December 16, 2008

Recently In Markup