Revisiting the Blair and Maron Study: Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System

Whether at Sedona, Georgetown, Legaltech or any other of the many discovery conferences one might attend, a common debate centers on the efficacy of keyword search. “Keyword search is dead,” some argue, touting the effectiveness of the newer predictive analytics engines. “Long live keyword search,” comes the retort from lawyers who have relied on it for decades, both to find legal precedent and, more recently, relevant documents for their cases.

Often, the critics of keyword searching cite the 1985 Blair and Maron study for the Association for Computing Machinery, which suggested that full-text retrieval systems brought back only 20 percent of the relevant documents. That assertion is true, but I wonder how many of the debaters have ever read the study itself. My guess is not many, including me. So I decided to give it a read.

Here is my CliffsNotes guide to the study. Unlike many scientific papers, the study is an easy read for people interested in the subject. It might be helpful for those who want to join in the debate at the next conference.

The Problem

The test case stemmed from an accident involving the BART system (Bay Area Rapid Transit). The lawyers handling the case had amassed what was then considered a massive collection: fully 40,000 documents comprising over 350,000 pages of scanned materials. The study doesn’t say so directly, but it suggests that the document text had to be keyed into the system. I am not sure effective OCR existed back then, and few if any documents were generated on a PC at that time.

The question to be answered was whether a full-text retrieval system might be as effective as, or more effective than, hand indexing the documents with issue codes and a summary description. Although it seems ironic today, I believe Blair and Maron were arguing for the latter approach as more reliable and cost-effective. At the least, they decided to test keyword search to determine how well it did at recalling relevant documents.

Using IBM STAIRS

The researchers were fortunate to have access to the newest IBM STAIRS (STorage And Information Retrieval System) software. It offered the latest in Boolean search and could sort the results set in ascending or descending order (by date or ranking), so long as the search didn’t retrieve more than 200 documents. The system likely filled a room with hardware but had far less power than your smartphone today. But it was the latest and greatest.

The Protocol

Fortunately, the researchers had access to the trial team: the two senior lawyers principally responsible for the defense and two seasoned paralegals. When asked, the lawyers using the system specified that STAIRS should retrieve at least 75 percent of the documents relevant to a particular information request in order to be considered effective. Indeed, they suggested that retrieving 75 percent of the relevant documents was “essential” to the defense of the case.

The lawyers divided key case documents into three groups: Vital, Satisfactory and Marginally Relevant. The rest were considered irrelevant. The task was to find as many of the key documents as possible using full-text retrieval.

This may surprise a few readers, but the team conducted 51 different information requests for the study. In each case they generated keyword searches that they believed would be effective. These searches were turned into proper Boolean queries by, you guessed it, the paralegals helping on the case and then submitted to STAIRS.
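
To make the protocol concrete, here is a minimal sketch, in Python, of how a Boolean keyword query of the sort the paralegals built might be evaluated against a document collection. The documents and the query term are hypothetical; the study does not reproduce the actual STAIRS queries.

```python
# A toy illustration of Boolean keyword retrieval of the kind STAIRS performed.
# The documents and the query term below are hypothetical examples.
import re

documents = {
    "doc1": "Memo regarding the accident at the station platform last week.",
    "doc2": "Minutes of the meeting: we all know why we are here.",
    "doc3": "Report on the unfortunate situation and the shunt correction system.",
}

def tokens(text):
    """Lowercase a document and split it into word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_and(collection, *terms):
    """Return the ids of documents containing every query term (Boolean AND)."""
    return [doc_id for doc_id, text in collection.items()
            if all(term in tokens(text) for term in terms)]

# A query built around the word the lawyers expected the documents to use.
print(boolean_and(documents, "accident"))   # -> ['doc1']

# doc3 is relevant but never uses the word "accident," so it is missed --
# the failure mode Blair and Maron describe later in the paper.
```

The searcher only ever sees what comes back; nothing in the result set signals that a relevant document like doc3 was left behind.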

The retrieved documents were not viewed on a computer screen – it was a decade too early for that. Rather, the paralegals made copies of the documents from the war room and handed them to the attorneys for review. Doubtless, each had a stack of yellow sticky notes for document classification.

Each attorney was asked to consider the results set and determine whether it met the initial recall criteria of 75 percent. If not, and that was the norm for early rounds, the attorneys were able to continue revising the query until they were satisfied with the results. As Blair and Maron stated:

In the test, each query required a number of revisions, and the lawyers were not generally satisfied until many retrieved sets of documents had been generated and evaluated.

They noted also that the lawyers and paralegals were permitted as much interaction as necessary to make sure they were satisfied with the results.

The Results

So, how did they do?

The lawyers thought it went great. They were convinced they had found 75 percent of the relevant documents for each of the 51 requests. The research showed otherwise. For two of the requests, the lawyers found 50 percent of the relevant documents. For the rest, the numbers were much lower, dropping to as low as 4 percent for a few. The average recall was 20 percent, while the average precision of the searches stood at 80 percent. In essence, the lawyers were satisfied because they found a lot of relevant documents in their results sets. However, they were fooled because those documents represented only a small fraction of the relevant population.
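
For readers who want the arithmetic behind that last point, here is a short sketch. The individual counts are invented for illustration; only the 20 percent recall and 80 percent precision averages come from the study.

```python
# Recall and precision from hypothetical counts chosen to match the study's
# reported averages (20 percent recall, 80 percent precision).

relevant_in_collection = 1000   # relevant documents that actually exist
retrieved = 250                 # documents the search returned
relevant_retrieved = 200        # retrieved documents that turned out to be relevant

recall = relevant_retrieved / relevant_in_collection   # 200 / 1000 = 0.20
precision = relevant_retrieved / retrieved             # 200 / 250  = 0.80

print(f"recall = {recall:.0%}, precision = {precision:.0%}")

# The result set looks excellent -- four out of five documents are relevant --
# but 800 relevant documents were never retrieved at all.
```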

Here are the results for a number of the requests (some were held out for bias testing and others not shown in this chart).

[Blair and Maron, Table 1: results for a selection of the information requests]

Here is a scatter plot of all the requests.

[Blair and Maron, Figure 4: scatter plot of results for all the requests]

As you can see, the numbers hover around the 20 percent recall mark and below.

What can we make of this?

When they learned the results, the lawyers (and likely the paralegals) were astonished. Blair and Maron made a number of follow-up inquiries to see what additional inferences they might draw. In short order they ruled out the following:

  1. There did not seem to be a meaningful difference between the searches created by different lawyers.
  2. The fact that the lawyers ran through several iterations did not seem to improve the results.
  3. Retrieval was not meaningfully improved when the lawyers used the computer system directly rather than working through the paralegals. In most cases the results were slightly better, but not to a significant degree.

Perhaps the most interesting part of the study is their speculation on why recall turned out to be so low. The authors offer a lot of insightful comments. They started by describing what is considered a classic problem with keyword search:

The low values of Recall occurred because full-text retrieval is difficult to use to retrieve documents by subject because its design is based on the assumption that it is a simple matter for users to foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents. This assumption is not a new one; it goes back over 25 years to the early days of computing.

They went on to address the difficulty of choosing the right keywords in the context of an actual case. Here the lawyers were focused on the accident which was the subject of the litigation:

Formal queries were constructed that contained the word “accident(s)” along with several relevant proper nouns. In our search for unretrieved relevant documents, we later found that the accident was not always referred to as an “accident,” but as an “event,” “incident,” “situation,” “problem,” or “difficulty,” often without mentioning any of the relevant proper names. The manner in which an individual referred to the incident was frequently dependent on his or her point of view. Those who discussed the event in a critical or accusatory way referred to it quite directly as an “accident.” Those who were personally involved in the event, and perhaps culpable, tended to refer to it euphemistically as, inter alia, an “unfortunate situation,” or a “difficulty.” Sometimes the accident was referred to obliquely as “the subject of your last letter,” “what happened last week was . . . ,” or, as in the opening lines of the minutes of a meeting on the issue, “Mr. A: We all know why we’re here . . . .” Sometimes relevant documents dealt with the problem by mentioning only the technical aspects of why the accident occurred, but neither the accident itself nor the people involved. Finally, much relevant information discussed the situation prior to the accident and, naturally, contained no reference to the accident itself.

Here was another great example of the problems of following a linguistic trail:

Sometimes we followed a trail of linguistic creativity through the database. In searching for documents discussing “trap correction” (one of the key phrases), we discovered that relevant, unretrieved documents had discussed the same issue but referred to it as the “wire warp.” Continuing our search, we found that in still other documents trap correction was referred to in a third and novel way: the “shunt correction system.” Finally, we discovered the inventor of this system was a man named “Coxwell” which directed us to some documents he had authored, only he referred to the system as the “Roman circle method.” Using the Roman circle method in a query directed us to still more relevant but unretrieved documents, but this was not the end either. Further searching revealed that the system had been tested in another city, and all documents germane to those tests referred to the system as the “air truck.” At this point the search ended, having consumed over an entire 40-hour week of on-line searching, but there is no reason to believe that we had reached the end of the trail; we simply ran out of time.
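
The remedy modern systems reach for is query expansion: searching on every known name for a concept at once. As a rough sketch only, using the alternative phrases from the passage above but with matching logic that is my own illustration rather than anything STAIRS offered, an expanded query might look like this:

```python
# Query expansion: OR together every known name for the same concept.
# The phrase list comes from the passage quoted above; the matching logic
# is a simplified illustration, not the STAIRS implementation.

trap_correction_names = [
    "trap correction",
    "wire warp",
    "shunt correction system",
    "roman circle method",
    "air truck",
]

def matches_any(text, phrases):
    """True if the document mentions any of the alternative phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)

doc = "Mr. Coxwell reported that the air truck performed well in the field tests."
print(matches_any(doc, trap_correction_names))   # -> True

# The catch, as Blair and Maron found, is that you only learn these synonyms
# by reading documents you have already retrieved; a 40-hour week of searching
# still did not exhaust the list.
```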

The authors also offered a classic discussion of the problems caused by spelling errors:

Even misspellings proved an obstacle. Key search terms like “flattening,” “gauge,” “memos,” and “correspondence,” which were essential parts of phrases, were used effectively to retrieve relevant documents. However, the misspellings “flatening,” “guage,” “gage,” “memoes,” and “correspondance,” using the same phrases, also retrieved relevant documents. Misspellings like these, which are tolerable in normal everyday correspondence, when included in a computerized database become literal traps for users who are asked not only to anticipate the key words and phrases that may be used to discuss an issue but also to foresee the whole range of possible misspellings, letter transpositions, and typographical errors that are likely to be committed.
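
Today the misspelling problem is usually handled with fuzzy matching rather than by asking the searcher to enumerate every possible typo. Here is a minimal sketch using Python’s standard library difflib; the variant spellings are the ones quoted above, but the cutoff value is an arbitrary choice for illustration, and the approach is not something STAIRS offered.

```python
# Fuzzy matching of query terms against misspelled document terms using the
# standard library. The misspellings are the ones quoted above; the cutoff
# is an arbitrary illustrative choice.
from difflib import get_close_matches

document_terms = ["flatening", "guage", "gage", "memoes", "correspondance"]

for query_term in ["flattening", "gauge", "memos", "correspondence"]:
    hits = get_close_matches(query_term, document_terms, n=3, cutoff=0.7)
    print(f"{query_term!r} also matches: {hits}")

# A 1985 Boolean engine matched terms literally, so every one of these variants
# had to be guessed in advance or the documents containing them went unretrieved.
```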

What did Blair and Maron Conclude?

Perhaps most interesting is the position Blair and Maron took at the conclusion of their study. At the time, the alternative was human indexing and summarizing of document contents. Blair and Maron concluded that keyword search would never be effective for a large document collection, even though this one, at only 40,000 documents, would not be considered large by today’s standards. Rather, they seemed to be making the case that human coders were the only reasonable solution, one that would be cheaper and more effective in the long run.

You almost have to smile when you read their argument for this conclusion:

However, there are costs associated with a full-text system that a manual system does not incur. First, there is the increased time and cost of entering the full text of a document rather than a set of manually assigned subject and context descriptors. The average length of a document record on the system we evaluated was about 10,000 characters. In a manually assigned index-term system of the same type, we found the average document record to be less than 500 characters. Thus, the full-text system incurs the additional cost of inputting and verifying 20 times the amount of information that a manually indexed system would need to deal with. This difference alone would more than compensate for the added time needed for manual indexing and vocabulary construction. The 20-fold increase in document record size also means that the database for a full-text system will be some 20 times larger than a manually indexed database and entail increased storage and searching costs. Finally, because the average number of searchable subject terms per document for the full-text retrieval system described here was approximately 500, whereas a manually indexed system might have a subject indexing depth of about 10, the dictionary that lists and keeps track of these assignments (i.e., provides pointers to the database) could be as much as 50 times larger on a full-text system than on a manually indexed system. A full-text retrieval system does not give us something for nothing. Full-text searching is one of those things, as Samuel Johnson put it so succinctly, that “. . . is never done well, and one is surprised to see it done at all.”

At the time, I believe Lexis and Westlaw were still using offshore coders to key in the text of case decisions to keep costs under control. I wonder what Blair and Maron would think about the size of large cases today (which can have as many as 100 times as many documents) or the evolution of keyword search backed by modern predictive analytics. I don’t think they would be arguing for the efficacy of human coding and summarization anymore.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.