Latest Grossman-Cormack Research Supports Using Review Teams for TAR Training

A key debate in the battle between TAR 1.0 (one-time training) and TAR 2.0 (continuous active learning) is whether you need a “subject matter expert” (SME) to do the training. With first-generation TAR engines, this was considered a given. Training had to be done by an SME, which many interpreted as a senior lawyer intimately familiar with the underlying case. Indeed, the big question in the TAR 1.0 world was whether you could use several SMEs to spread the training load and get the work done more quickly.

SME training presented practical problems for TAR 1.0 users—primarily because the SME had to look at a lot of documents before review could begin. You started with a “control” set, often 500 documents or more, to be used as a reference for training. Then, the SME needed to review thousands of additional documents to train the system. After that, the SME had to review and tag another 500 documents to validate the effectiveness of the training. All told, the SME could expect to look at and judge 3,000 to 5,000 or more documents before the review could start.
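Tallied up, that pre-review workload is easy to estimate. A minimal sketch, using the illustrative figures from the text (the function name and default counts are mine, not the authors'):

```python
def sme_workload(control=500, training=2000, validation=500, docs_per_hour=60):
    """Estimate the SME's document count and hours before review can begin.

    Defaults are illustrative: a 500-document control set, roughly 2,000
    training documents, and a 500-document validation set, reviewed at
    60 documents per hour.
    """
    total_docs = control + training + validation
    hours = total_docs / docs_per_hour
    return total_docs, hours

docs, hours = sme_workload()
# 3,000 documents at 60 per hour is 50 hours of SME time before review starts
```

Pushing the training count toward the upper end of the range (4,000 or more documents) pushes the SME past 80 hours—two full work weeks before the review team touches a document.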

Adding to the inconvenience, this training process assumed you had already collected all of the documents and that no new issues would arise. If new documents appeared later in the process, the SME would have to go through the control set, training and validation process again and, possibly, yet again, depending on how the collections progressed. The same was true if you wanted to rank on new issues, perhaps ones not considered at the start of the case.

No wonder review administrators complained about delays in getting going. Few senior attorneys want to sit and review discovery documents, particularly when many of them are irrelevant to the case or of marginal importance. At 60 documents an hour, reviewing 3,000 documents comes to 50 hours of mind-numbing work just to train the system. Meanwhile, the review team is sitting on its hands.

In prior writings, we have questioned whether SMEs were needed to train TAR systems. Our research suggested that review teams backed by smart QC from SMEs could do as well, more quickly and at lower total cost.

Others have pointed out that SMEs, being human after all, aren’t necessarily consistent in their own tagging—even against the same documents viewed weeks or months later. Some of us, in private conversations at least, have wondered whether the SME is the best person to tag documents for training purposes. Senior lawyers often make fine-grained distinctions about individual documents that make sense but might not be as effective in training the algorithm as perhaps a more liberal interpretation of relevance made by a review team member. We offered anecdotal evidence to support these ideas but in many cases we couldn’t back them with hard and fast research.

Grossman & Cormack Strike Again

In their most recent research, Maura R. Grossman and Gordon V. Cormack, joined by two of Professor Cormack’s colleagues from the University of Waterloo, Adam Roegiest and Charles L.A. Clarke, tackle this question for TAR 1.0 systems using one-time training with training documents selected randomly. Their paper, “Impact of Surrogate Assessments on High-Recall Retrieval,” is slated for delivery at SIGIR 2015 (the Special Interest Group on Information Retrieval conference), which takes place during the August annual meeting of the Association for Computing Machinery in Santiago, Chile.

After reviewing related research conducted by others,[1] the authors report on a series of experiments they ran against different document sets to compare the effectiveness of SMEs with regular reviewers, whom they call “surrogate assessors.” With apologies to the authors, I will use the term reviewer rather than surrogate assessor.

The Research

The authors based their research on three different document sets (corpora) taken from various TREC tracks, including the 2009 Legal track.

In essence, the authors ran a series of experiments comparing the training effectiveness of the Topic Authority, or a single person designated as an SME, against training done by one or more independent reviewers. In each instance, the training and reference decisions made by the designated SME were treated as the “gold” standard.

The tests included:

  1. Comparing the results from single reviewers working alone;
  2. Combining judgments from several reviewers; and
  3. Alternating decisions from different reviewers.

In addition, the authors decided to test the efficacy of training using a more liberal review standard than might be expected from an SME. Specifically, they took the results from earlier work done by the Waterloo TREC team, who created a third category for their tagging decisions called “iffy.” They treated documents marked iffy as positive training seeds, which made the tagging more liberal than it otherwise might have been even using a review team.
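The liberal standard amounts to a simple mapping from the reviewers' three-way tags to binary training labels. A minimal sketch (the tag names and function are hypothetical; only the rule of counting “iffy” documents as positive seeds comes from the paper):

```python
def liberal_label(tag: str) -> int:
    """Map a three-way reviewer tag to a binary training label.

    Under the liberal standard, documents tagged 'iffy' are treated as
    positive training seeds alongside clearly relevant ones.
    """
    return 1 if tag in ("relevant", "iffy") else 0

# Hypothetical reviewer tags for four documents:
tags = ["relevant", "iffy", "not_relevant", "iffy"]
labels = [liberal_label(t) for t in tags]  # [1, 1, 0, 1]
```

Under a conservative standard, the same "iffy" documents would be mapped to 0, shrinking the pool of positive training examples.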

The Results

The results were interesting to say the least. The first conclusion was that the individual designated as the Topic Authority generally provided better training results than those of the individual reviewers going head to head. By better, the authors simply mean that you would have to review fewer documents in order to achieve 75% recall.
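That yardstick—how deep into the ranked list you must read before hitting 75% recall—can be computed directly. A minimal sketch, assuming gold-standard labels are known for the ranked documents:

```python
def depth_for_recall(ranked_relevance, target=0.75):
    """Return how many top-ranked documents must be reviewed to reach
    the target recall.

    ranked_relevance: gold-standard 0/1 labels in ranked order
    (1 = relevant). A better classifier ranks relevant documents
    higher, so this depth is smaller.
    """
    total_relevant = sum(ranked_relevance)
    found = 0
    for depth, rel in enumerate(ranked_relevance, start=1):
        found += rel
        if found >= target * total_relevant:
            return depth
    return len(ranked_relevance)

# 4 relevant documents; 3 of them sit in the top 3 ranks,
# so 75% recall is reached after reviewing just 3 documents:
depth = depth_for_recall([1, 1, 1, 0, 1, 0, 0, 0])  # 3
```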

Topic Authority as the Gold Standard?

One possible explanation for this lies in the fact that the Topic Authority is providing the standard by which his/her judgment is correct or not. In effect, it becomes a self-fulfilling prophecy: I am better at picking things I like than you are because I decide what I like and then pick others to match. As the authors point out:

We question whether it is possible to sweep away uncertainties in relevance determination simply by arbitrarily deeming relevance to be the judgment of a single authoritative assessor. It is well known that informed, qualified assessors disagree, and even the same assessor will disagree with him or herself, at different times and in different circumstances. We wonder whether it is useful to expend heroic efforts to anticipate the judgments of one particular assessor, and posit, instead, that it might be better to target a hypothetical “reasonable authority,” selected from a pool of equally competent choices. In any event, it is important when evaluating the recall of a retrieval effort, to ask, “according to whom?” 75% recall measured through independent assessment is a formidable achievement, but the same 75% recall measured through self-assessment is unremarkable.

Put another way (and following on personal discussion with Gordon Cormack and our chief scientist, Jeremy Pickens), the notion of a topic authority is an artificial construct. It stems from the fallacy that one person’s conception of relevance is infallible, when we know that different people will classify marginal documents differently and even contradict themselves when given a chance to review the same document at different times. So, of course, a classifier trained by that person will better reflect that person’s judgments than a classifier trained by somebody else.

Going further, if you manually review the entire dataset, the topic authority will achieve 100% recall and 100% precision, by definition. A second, equally knowledgeable and competent reviewer might achieve only 70% recall and 70% precision. If you switched the topic authority roles, the results would likely be reversed.
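The symmetry is easy to see with toy numbers. A hypothetical sketch (the document IDs are invented for illustration) showing that swapping which reviewer serves as the gold standard produces mirror-image scores:

```python
def recall_precision(gold, judged):
    """Score one reviewer's relevant set against another's, treated as gold.

    gold, judged: collections of document IDs each reviewer marked relevant.
    """
    gold, judged = set(gold), set(judged)
    true_positives = len(gold & judged)
    recall = true_positives / len(gold)
    precision = true_positives / len(judged)
    return recall, precision

# Two equally competent reviewers who agree on 7 of 10 relevance calls:
authority = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
reviewer = {1, 2, 3, 4, 5, 6, 7, 11, 12, 13}

# Measured against the authority, the reviewer scores 70%/70%:
scores = recall_precision(authority, reviewer)  # (0.7, 0.7)

# Swap who counts as the gold standard and the numbers simply mirror:
swapped = recall_precision(reviewer, authority)  # (0.7, 0.7)
```

Neither reviewer is objectively "better"; the score depends entirely on who gets to define relevance.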

As the authors state:

Finally, we show that our results still hold when the role of surrogate [reviewer] and authority are interchanged, indicating that the results may simply reflect differing conceptions of relevance between surrogate and authority, as opposed to the authority having special skill or knowledge lacked by the surrogate.

Using Multiple Reviewers

A different result emerged when the judgments of several reviewers were used for training, particularly when the reviewers were instructed to interpret relevance liberally.

As the authors stated:

Our hypothesis—that surrogate assessors taking a more liberal view of relevance would produce better classifiers—is supported by the results presented in Figures 3a and 4a, where training using the liberal assessor is seen to achieve significantly better recall depth than both the conservative assessor and the NIST assessor.

Put another way, if you instruct the review team to intentionally take a liberal interpretation of relevance, you will likely do as well in the training as you would with the SME, at lower cost and with greater speed.

In sum, this research supports the notion that a review team can do an equal or better job at training a classifier, especially when instructed to construe relevance broadly. One reason may be that the broad interpretation helps the classifier by including additional words that help it find more relevant documents.

Limitations of the Research

This paper focused on only one of the TAR 1.0 protocols: SPL or simple passive learning. It did not consider active learning approaches used in one-time training protocols or the TAR 2.0 approach called continuous active learning.

Our experiments study only the case of simple passive learning, where a fixed training set is used to train a learning method to rank the entire corpus, and the top ranked documents are reviewed until high recall is achieved. Although this practice appears to be widely employed in eDiscovery today, the state of the art is perhaps better represented by interactive, active-learning approaches. Accordingly, our results are applicable only to the former method; their utility in guiding individual stages of an interactive or active approach has not been established.
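The simple passive learning protocol the authors describe—train once on a fixed set, rank the whole corpus, review from the top down—can be sketched as follows. A toy word-overlap scorer stands in for a real learning method, and the documents are invented for illustration:

```python
def train_scorer(training_docs):
    """Build a toy relevance scorer from a fixed set of labelled documents.

    training_docs: list of (text, label) pairs, label 1 = relevant.
    A real SPL system would train a proper learning method; this
    stand-in scores documents by word overlap with the positive seeds.
    """
    positive_vocab = set()
    for text, label in training_docs:
        if label == 1:
            positive_vocab.update(text.lower().split())

    def score(text):
        words = text.lower().split()
        return sum(w in positive_vocab for w in words) / max(len(words), 1)

    return score

def rank_corpus(corpus, score):
    """Rank the entire corpus once; reviewers then work down from the top
    until the desired recall is reached."""
    return sorted(corpus, key=score, reverse=True)

# One-time training on a fixed set (hypothetical documents):
training = [("merger agreement draft", 1), ("weekly lunch menu", 0)]
corpus = ["weekly lunch order", "final merger agreement", "office party invite"]

score = train_scorer(training)
ranked = rank_corpus(corpus, score)  # "final merger agreement" ranks first
```

The key contrast with continuous active learning is that nothing here feeds reviewer judgments back into the scorer: the model is trained once and never updated.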

In fact, the authors note findings by Dr. Jeremy Pickens, Catalyst’s senior applied research scientist, which suggest that non-authoritative training assessments may improve high-recall effectiveness when active learning methods are used. See, In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey).

With continuous active learning, it doesn’t make sense to talk about having an SME do the training. The SME would have to do the whole review him or herself. After all, with CAL, review is training and training is review.


[1] Their review included two articles written by Dr. Jeremy Pickens, Catalyst’s senior applied research scientist: Assessor Disagreement and Text Classifier Accuracy, co-authored with William Webber, and In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey), published on this blog.



About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.