Words in Boxes

Nouns, verbs, and occasionally adjectives.

Sunday, June 29, 2008

Everybody Stand Back. I Know Regular Expressions.

The ever-readable Jeff Atwood has just posted a love letter to regular expressions.  Go read it, but first, stop and appreciate this xkcd comic:

I am also a lover of regexes.  I don't understand how I ever got along in those dark days before I knew how to use them.  Their use is a skill as fundamental to using a computer as knowing how to properly chop vegetables is to cooking. 

From this point forward, this post is going choose-your-own-adventure.  If you aren't a programmer, choose path "A."  If you are, then choose path "B."

Path A


Regular expressions aren't just for programmers.  Using them is a skill that anyone who spends more than a few hours a week in Microsoft Word should take the time to master.  Regexes help you make large numbers of repetitive changes in a document, such as finding and deleting duplicate text, finding all page-number references so you can update cross-references, or finding and reformatting all valid dates.  (Strictly speaking, Word's version, called "wildcards," has fewer features and a slightly different syntax than true regexes, but it's close enough.) 
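
If you're curious what those chores look like as patterns, here is a rough sketch in Python rather than in Word.  The sample sentence and the variable names are made up for illustration, and the date pattern only recognizes the month/day/year shape; it doesn't check that the date is real.  In Word's wildcard mode the date search should look something like [0-9]{1,2}/[0-9]{1,2}/[0-9]{4}, but treat that as an approximation and check the help file.

```python
import re

# Made-up sample text with two doubled words and two dates.
text = "The the report is due on 6/29/2008, and and the review is on 7/4/2008."

# Collapse an immediately repeated word ("The the" -> "The"), ignoring case.
# \1 refers back to whatever the first group (\w+) matched.
deduped = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)

# Find anything shaped like a month/day/year date.
dates = re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", deduped)

print(deduped)
print(dates)
```

The same two ideas, a back-reference to catch repeats and a digit-count pattern to catch dates, are what you would type into the "Find" box, just with Word's slightly different spelling.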


Regexes look scary, but they're really not.  Microsoft has a great two-part tutorial (1 & 2) introducing the functionality, and the built-in help file isn't bad, either.  So no excuses.  Open up the "Find" dialog box and become a regex hero.

Path B


Jeff, in his example, uses regular expressions in a small script to detect dangerous HTML tags in a web input form.  This is exactly what regexes are good for.  What they are not good for — as he says — is actually parsing markup.  If you need to modify or retrieve information from marked-up documents, there are plenty of good HTML and XML parsers available in almost any major programming language.  Use one of those.  Parsing markup is a truly difficult problem, so save yourself the senseless pain of constructing baroque regexes that will never work as well as real parsers at handling the infinite yet valid variations of markup.  For example:

<para />
<para/>
<para></para>

All of these are functionally identical.  So are these:

<para align="right" >
<para align='right'>

And forget about balanced tag matching or handling nested elements.  I've seen some really nasty Perl regexes for parsing XML, and you don't want to go there.
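
To make the point concrete, here is a minimal sketch that pits one naive regex against Python's standard-library XML parser.  The three sample strings and the variable names are my own, chosen to mirror the variations above; the regex is deliberately simplistic, the kind you write with only one spelling of the tag in mind.

```python
import re
import xml.etree.ElementTree as ET

# Three spellings of the same empty, right-aligned <para> element.
variants = [
    '<para align="right" />',
    "<para align='right'/>",
    '<para align="right"></para>',
]

# A naive pattern that expects double quotes and no extra whitespace.
naive = re.compile(r'<para align="right">')
regex_hits = [bool(naive.search(v)) for v in variants]

# A real parser normalizes quoting, whitespace, and empty-element syntax,
# so every variant yields the same tag name and attribute value.
parsed = [ET.fromstring(v) for v in variants]
parser_hits = [el.tag == "para" and el.get("align") == "right" for el in parsed]

print(regex_hits)   # the regex recognizes only one of the three
print(parser_hits)  # the parser recognizes all of them
```

You could keep patching the regex to cover each new variation, but that is exactly the baroque path the parser lets you skip.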

If you need to use regular expressions on the content of XML markup, one good choice is XSLT 2.0's <xsl:analyze-string/> instruction.  The XSLT processor parses away the irrelevant distinctions in the raw XML text, like those shown above, leaving you free to work with a conceptual document tree.  That means you can regex the actual text content of an XML file, and even choose which specific subset of content to regex, all without having to worry about those pesky, inconsistent tags.  You'll be a regex hero.
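
The same division of labor works outside XSLT.  What follows is not XSLT and not <xsl:analyze-string/> itself, just a rough Python analogue of the idea: let a parser build the tree, pick the elements you care about, and run the regex only on their text.  The sample document, the element names, and the date pattern are all assumptions made up for this sketch.

```python
import re
import xml.etree.ElementTree as ET

# A made-up document; note the inconsistent quoting and spacing in the tags.
doc = ET.fromstring(
    "<article>"
    "<title>Released 1/1/2008</title>"
    "<para align='right'>Updated 6/29/2008</para>"
    '<para align="left" >Reviewed 7/4/2008</para>'
    "</article>"
)

# Matches text shaped like month/day/year; it does not validate the date.
date = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")

# Search only the text inside <para> elements; the <title> is skipped.
found = []
for para in doc.iter("para"):
    text = "".join(para.itertext())  # text content only, no tags
    found.extend(date.findall(text))

print(found)
```

The regex never sees a tag or an attribute, so the quoting and whitespace differences simply stop mattering.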

I'm James Sulak, a software developer in Houston, Texas.
