Monday, August 20, 2007

Fighting the (West)Law

From O’Reilly Radar:

Carl Malamud has this funny idea that public domain information ought to be... well, public. He has a history of creating public access databases on the net when the provider of the data has failed to do so or has licensed its data only to a private company that provides it only for pay. His technique is to build a high-profile demonstration project with the intent of getting the actual holder of the public domain information (usually a government agency) to take over the job.

Carl's done this in the past with the SEC's Edgar database, with the Smithsonian, and with Congressional hearings. But now, he's set his eyes on the crown jewels of public data available for profit: the body of Federal case law that is the foundation of multi-billion dollar businesses such as WestLaw.

In a site that just went live tonight, Carl has begun publishing the full text of legal opinions, starting back in 1880, and outlined a process that will eventually lead to a full database of US Case law. Carl writes:

1. The short-term goal is the creation of an unencumbered full-text repository of the Federal Reporter, the Federal Supplement, and the Federal Appendix.

2. The medium-term goal is the creation of an unencumbered full-text repository of all state and federal cases and codes.

More information is available from the New York Times.

First off, this is great. While I can’t imagine that many people will actually use it (even attorneys, as I explain below), as a principle, public law should be as publicly accessible as it can be. Even though this material is available for free in libraries, putting it online is a huge step forward.

I am, however, concerned about accuracy. I work at a small legal publisher, and we spend a lot of effort verifying, through both manual reads and electronic compares, the accuracy of the public law we print. Malamud is starting by scanning the ultrafiche version of Federal Reporters published in the 1880s, cleaning up the images, and using optical character recognition to pull out the text. OCR is certainly not 100 percent accurate, especially on images that look like this:

Granted, these are raw and unprocessed images, and my knowledge of OCR is anecdotal at best. But details matter in case law. Entire legal galaxies can fit between the words “is” and “isn’t.” If I were an attorney, I would be loath to rely on a database of case law that I wasn’t confident was 100 percent accurate, and that reliability is a large part of what Westlaw’s users shell out for. The other part, Westlaw’s proprietary headnotes and key number system, won’t be available on Malamud’s site for obvious reasons.
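To make the “is” versus “isn’t” worry concrete, here is a rough sketch of a word-level electronic compare using Python's standard difflib module. It only illustrates the general idea; the function name and the sample sentences are made up for this example, and it is not a description of any publisher's actual process.

```python
import difflib

def word_differences(verified: str, ocr: str) -> list[tuple[str, str, str]]:
    """Return (operation, verified_words, ocr_words) for each span where the texts differ."""
    a, b = verified.split(), ocr.split()
    # autojunk=False keeps difflib from ignoring frequently repeated words in long texts.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Hypothetical sample text: a trusted copy and an OCR'd copy with one bad word.
verified = "The statute is applicable to the defendant."
ocr = "The statute isn't applicable to the defendant."

for op, old, new in word_differences(verified, ocr):
    print(op, repr(old), "->", repr(new))
```

A compare like this only helps when a trusted copy exists to compare against, which is exactly what a freshly scanned 1880s reporter lacks.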

Another site, Altlaw, produced by Tim Wu, has put most federal case law from the past 15 years online. Quoted in the New York Times:

“I’m a legal academic and I woke up one day and thought, ‘Why can’t I get cases the same way I get stuff on Google?’” said Tim Wu, a Columbia law professor who is one of the leaders of the project. “People should be able to get cases easily. This is a big exception to the way information has opened up over the past decade.”

I don’t know how user-friendly the Westlaw website or case law downloads are for actual attorneys, but I’ve certainly had plenty of frustrations trying to parse that data – extracting only the bits of a case that cite a federal rule, for instance.

These sites would be much more useful (for me at least) if they were parsed and marked up in XML, with case citations, statute citations, etc. clearly labeled and linked. Since case law has some pretty specific stylistic conventions, this wouldn’t be too difficult. Couple this with an XQuery engine and some full-text search, and attorneys – and the rest of us – could find stuff in case law more easily than we find stuff in Google.
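As a minimal sketch of that kind of markup, the Python below tags citations to the three reporters named in Malamud's short-term goal. Everything in it is an assumption made for illustration: the `<citation>` and `<opinion>` element names, the attribute names, the sample sentence, and the deliberately simplified pattern, which covers only a few reporter abbreviations and would miss plenty of real-world variants.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Simplified, hypothetical pattern: volume, reporter abbreviation, first page.
# Handles F., F.2d, F.3d, F. Supp., F. Supp. 2d, and Fed. Appx. only.
CITATION = re.compile(
    r"\b(?P<volume>\d{1,4})\s+"
    r"(?P<reporter>F\.\s?Supp\.(?:\s?2d)?|F\.(?:2d|3d)?|Fed\.\s?Appx\.)\s+"
    r"(?P<page>\d{1,5})\b"
)

def mark_up_citations(text: str) -> str:
    """Escape text for XML and wrap each recognized citation in a <citation> element."""
    parts, last = [], 0
    for m in CITATION.finditer(text):
        parts.append(escape(text[last:m.start()]))
        parts.append(
            f'<citation volume="{m["volume"]}" reporter={quoteattr(m["reporter"])} '
            f'page="{m["page"]}">{escape(m.group(0))}</citation>'
        )
        last = m.end()
    parts.append(escape(text[last:]))
    return "".join(parts)

# Made-up sample sentence, not taken from a real opinion.
sample = "See Smith v. Jones, 410 F.2d 1052 (2d Cir. 1969) & 12 F. Supp. 2d 100."
marked = mark_up_citations(sample)

# ElementTree's limited path support stands in here for a real XQuery engine.
doc = ET.fromstring(f"<opinion>{marked}</opinion>")
found = [(c.get("volume"), c.get("reporter"), c.get("page")) for c in doc.iter("citation")]
print(found)
```

Once the citations are elements with attributes, a question like "which opinions cite volume 410 of F.2d" becomes a structured query instead of a text search.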

