Words in Boxes

Nouns, verbs, and occasionally adjectives.

Wednesday, July 02, 2008

The Unsolvable Problem

(There are experts in my field.  I'm not one of them.  By far.  Remember that when I start selling you broad generalizations.) 

Rick Jelliffe, in two recent posts on the O'Reilly XML blog, discusses some of the technologies and problems behind taking XML content and programmatically laying it out on a printed page.  He doesn't say it, but I will — if you really and truly care about quality, it's a fundamentally impossible problem to solve.

One of my big work projects for the past year has been developing such a system for our books.  By "books" I mean law books, heavy thousand-page tomes full of section symbols and case law citations.  Like a lot of technical publishers, we store our content in XML.  Well, mostly.  We're still converting it from Word documents and refactoring our schemas, which is part of the charm and frustration of the whole enterprise.

XML turns out to work pretty well for data-heavy documents.  Unlike in a Word document, you identify information not by how it is styled (bold or italic) but by what it is (a phone number or recipe ingredient).  In other words, you're taking the semantic information that is usually only implicitly available through a document's formatting and making it explicit.  Two advantages of this, for those who haven't drunk the XML Kool-Aid, are the following.

  • First, it's much easier to extract information from that document to create things like reports or tables of contents. 
  • Second, by separating the content from the presentation, it's easy to publish that document in multiple formats — for example, in a printed book or on the Internet. 
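
To make the first point concrete, here is a minimal Python sketch of pulling a report out of semantically tagged content.  The element names (directory, listing, name, phone) and the sample data are invented for this illustration and are not taken from any real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element name here is invented for illustration.
SAMPLE = """
<directory>
  <listing><name>Acme Plumbing</name><phone>555-0101</phone></listing>
  <listing><name>Bayou Books</name><phone>555-0102</phone></listing>
</directory>
""".strip()

def phone_report(xml_text):
    """Return (name, phone) pairs for every listing in the document."""
    root = ET.fromstring(xml_text)
    return [
        (listing.findtext("name"), listing.findtext("phone"))
        for listing in root.iter("listing")
    ]
```

Because the phone numbers are tagged as phone numbers, the report never has to guess which bold or italic run of text happens to be one.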

There's a whole philosophy on designing markup schemas, but now's not the time to get into that.

But having divorced content from presentation, you're now faced with a very serious problem: How do you put them back together?

In theory, it's simple — you've defined what each chunk of content is, so you should be able to define in a program how each of those chunks should be displayed, and how it should interact with the other chunks of information near it in the document.

In reality, for print, this can be incredibly frustrating.  Jelliffe provides a good representative example:

To give an idea of what I mean by expert (also known as “quality” or “industrial”) typesetting and decision-making, consider the case of typesetting a Yellow Pages (phone directory for businesses categorized by type of business.) Imagine you have to produce a Yellow Pages document using your favorite tool. The page designer and sales force come up with a design and timetable. The layout will be five columns. Entries may not span pages. Some entries take up part of a column and should be put as near to alphabetical order as possible, but rather than break they can be placed before or after their alphabetical position with previous or subsequent entries swapped before them. And there may be two, three, four or five column display ads, which also have this arrangement. And there can even be ads that take a half page but span over two pages.

And it is important that ads should not be orphaned or widowed, with one ad on a previous page by itself or on a subsequent page. And there are 6,000 pages of this. And you get the final data 24 hours before you have to deliver it.

As those of you who have used InDesign or FrameMaker know, laying out a book or a newspaper isn't too difficult; it's just a series of aesthetic judgment calls.  A person, given enough time, would have absolutely no problem arranging page elements in the above telephone book.  However, when you try to create deterministic rules to do it, you start to run into problems.
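
Here is a deliberately naive sketch of one such rule, written in Python.  It is my own toy, not Jelliffe's method or any real product's algorithm: entries are unbreakable, they are placed strictly in order, and heights are measured in lines.

```python
def fill_columns(entries, column_height):
    """Greedily place unbreakable entries into columns, in the given order.

    entries: list of (label, height) pairs, with heights in lines.
    Returns a list of columns, each a list of labels.
    """
    columns = [[]]
    used = 0  # lines already filled in the current column
    for label, height in entries:
        if height > column_height:
            raise ValueError(f"{label!r} is taller than a column")
        if used + height > column_height:
            columns.append([])  # entry will not fit: start a new column
            used = 0
        columns[-1].append(label)
        used += height
    return columns
```

Given entries A (6 lines), B (5) and C (4) and a 10-line column, this rule puts A alone in the first column and leaves four lines empty beneath it.  Swapping C ahead of B would fill the column exactly, which is the kind of out-of-order placement the quote allows and a person would spot at a glance.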

One is simply the available tools — inexplicably, despite the explosion of XML usage in publishing in the last 5 years, the state of the art in automated XML layout software has not kept pace.  The system we use was launched in 1983, and still employs the same Unix user interface that we have to emulate in Windows.  It's very powerful, and gets the job done well, but does it in the same way a two-decade-old Ford pickup reliably hauls bales of hay. 

A second problem is limited lookahead ability.  For example, how much space should you reserve for an automatically generated page number in a cross reference to an upcoming chapter?  You don't know the page number until you have laid out every page up to that chapter.  If you leave too little space, say for a two-digit number instead of a three-digit number, then content will be bumped forward, which means you have to lay out the rest of the book again, which means that the page number will change... you can see the problem.
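
One common way to attack this kind of circularity is to run layout repeatedly until the number stops changing.  The Python below is a minimal sketch of that idea and not how any particular composition system works; `toy_layout` is a made-up stand-in for a real layout pass.

```python
def resolve_cross_reference(layout, guess=1, max_passes=10):
    """Re-run layout until the printed page number matches the real one.

    layout: a function that takes the page number printed in the cross
    reference and returns the page the target actually lands on.
    """
    for _ in range(max_passes):
        actual = layout(guess)
        if actual == guess:
            return actual  # the printed number agrees with the layout
        guess = actual  # try again with the page we just observed
    raise RuntimeError("layout did not settle")

def toy_layout(printed_page):
    """Hypothetical book: the target chapter starts on page 99 when the
    reference is one digit wide; a wider reference bumps it to page 100."""
    return 99 if len(str(printed_page)) < 2 else 100
```

With `toy_layout`, the guesses run 1, 99, 100 and then settle at 100.  Nothing guarantees that a real book settles at all, which is why the sketch gives up after a fixed number of passes.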

But the biggest issue is that page layout belongs to that class of problems that are really easy for a human but really hard for a computer.  Fundamentally it's an aesthetic problem, and aesthetic problems can never be fully solved by a series of deterministic rules. 

That's a bit dramatic.  I'm actually proud of the results I'm getting with my combination of XSLT 2 and XPP, and there are plenty of people out there doing far better work.  If you're willing to limit your markup, and limit what you expect your output to look like, you can actually get very good results using today's technology.

That's the key — accepting limits.  Your XML schema delimits the entire universe of what your content can say, and your programmed layout rules delimit the entire universe of what your pages can look like.  Remember, the entire point of XML is that you mark your content with semantic meaning, not with formatting information.  So you can never really deviate from that one-to-one relationship between semantics and formatting.

Going back to the telephone book, let's say for some reason that one company's listing, unlike all of the others, really needs to list an e-mail address.  It's really easy for a person manually laying out those pages to adjust on the fly and fit it in.  But if you haven't accounted for a particular type of information both in your schema and in your layout rules, you just can't do it.  In other words, if your phone book schema doesn't include an <email> element, then you can't put an e-mail address into your phone book, and you certainly can't style it.

(Well, you can, by putting the e-mail address in a <phone> element because they look the same on the page.  But this creative tagging is very, very, very bad for reasons I shouldn't need to explain.)
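
The same limit shows up on the layout side.  In this small Python sketch, the rule table and the element names are hypothetical, and there is deliberately no rule for an e-mail address.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout rules, keyed by element name. There is no "email" rule.
LAYOUT_RULES = {
    "name": lambda text: text.upper(),
    "phone": lambda text: f"Tel. {text}",
}

def render_listing(xml_text):
    """Format each child of a <listing> with the rule for its element name."""
    listing = ET.fromstring(xml_text)
    lines = []
    for child in listing:
        rule = LAYOUT_RULES.get(child.tag)
        if rule is None:
            # Content the rules never anticipated cannot be laid out.
            raise ValueError(f"no layout rule for <{child.tag}>")
        lines.append(rule(child.text or ""))
    return lines
```

A listing with a name and a phone number renders fine.  A listing that contains an <email> element raises an error, because nobody told the program what an e-mail address should look like on the page.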

Now, I'm purposely calling the layout problem unsolvable to make a point.  There are plenty of companies doing it and making money (including mine).  But they all, I suspect without exception, made compromises to get there.  All have solutions that approach the ideal but never reach it.  Truly automating page layout is fundamentally impossible, but that's what makes it so interesting.

I'm James Sulak, a software developer in Houston, Texas.

You can also find me on Twitter, or if you're curious, on my old-fashioned home page. If you want to contact me directly, you can e-mail comments@wordsinboxes.com.