Killing Two Birds With One Stone: Latest Grossman/Cormack Research Shows that CAL is Effective Across Multiple Issues

No actual birds were harmed in the making of this blog post!

Since the advent of Technology Assisted Review (aka TAR, predictive coding or computer-assisted review), one open question has been whether you have to run a separate TAR process for each item in a document request. As litigation professionals know, it is rare to have only one numbered request in a Rule 34 pleading. Rather, you can expect to see scores of requests (typically as many as the local rules allow).

Often the requests will focus around a single subject, but occasionally the requests will cover a broader range of topics. What are best practices in such a case?

Our Insight Predict technology is based on an advanced Continuous Active Learning (CAL) protocol (mixing contextual diversity with relevance feedback). Over the past few years, we have opted for a combined process, at least in most cases. Our experience suggested that a unified CAL process was effective for requests across similar topics, achieving equal or better results than separate processes with far less effort. Sampling at the end of the process confirmed we were finding high percentages of relevant documents and not leaving many relevant documents behind. We were figuratively killing two birds (actually a whole flock) with one proverbial CAL stone.

There has been debate on this question among TAR professionals. In their 2013 law review article, for example, Karl Schieneman and Tom Gricks suggested that training seeds should be specially selected to cover all aspects of the multiple requests:

In sum, training the technology-assisted review tool is an important step in the process, and one that requires explicit consideration under Rule 26(g). In order to ensure that counsel is conducting a comprehensive, good faith search, the creation of the seed set should reasonably reflect the full breadth of relevance within the entire ESI collection.

“The Implications of Rule 26(g) on the Use of Technology-Assisted Review,” Federal Courts Law Review, 7(1):239-274 at 263 (2013). I have heard reports of others claiming that separate TAR processes should be run for each of the individual requests.

Seminal Research on CAL Topical Coverage

I am excited to report that there is now independent research supporting a unified approach. Gordon Cormack and Maura Grossman recently released the results of new peer-reviewed research they conducted. Their paper, “Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review,” will be presented in August at SIGIR 2015, the annual conference of the Special Interest Group on Information Retrieval of the Association for Computing Machinery, in Santiago, Chile.

In brief, the authors wanted to investigate whether CAL “achieves high recall for technology-assisted review, not only for an overall information need, but also for various facets of that information need, whether explicit or implicit.” Specifically, they wanted to see whether CAL would be effective across multiple document requests, which they call “facets,” when used in a single process rather than running separate processes for each facet.

The authors fashioned several experiments, first using topics from the TREC 2009 Legal Track program and then using a second dataset made available by Reuters for academic research, the Reuters Corpus Volume 1 (RCV1). See Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. This gave them a number of examples involving multiple topics/requests for their testing. Indeed, for the RCV1 collection, there were as many as 40 topics they could test against.

The bottom line of their findings is this: CAL proved to be effective at finding relevant documents relating to each of the individual requests/topics even when run as a single process. Here are the results for the TREC topics:

The chart shows the recall performance for a single, combined review (the Overall line) and then tracks the progress made against finding relevant documents for each of the individual requests/topics. As you can see, the recall gain curves for the independent topics varied. The CAL process found documents for Topics 202 and 207 more quickly than Topics 201 and 203.[1] Why is that? It could be because Topics 202 and 207 had a lot more relevant documents than Topics 201 and 203. Also, Topic 207 was about fantasy football, which might have been easier for the algorithm to identify.

The significant point is that all of the lines converged at about the 80% recall level. What that shows is that the combined CAL process succeeded in finding relevant documents across the topics as the review progressed. By the time the review would likely stop (at about 80% in this case), the coverage spanned all of the topics. The combined process thus found the relevant documents without requiring that the team undertake four separate TAR review projects.[2]
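For readers who want to see what sits behind curves like these, here is a minimal sketch of how overall and per-request recall could be computed at a given review depth. It is only an illustration: the function name, the document ids and the request labels are all hypothetical, and it is not the code Cormack and Grossman used.

```python
from collections import defaultdict

def recall_by_facet(review_order, facets_of, depth):
    """Return (overall_recall, {facet: recall}) after reviewing `depth` documents.

    review_order: document ids in the order a single combined review saw them.
    facets_of:    maps each relevant document id to the set of requests
                  (facets) it is responsive to. Non-relevant documents
                  are simply absent from this mapping.
    """
    # Count how many relevant documents exist for each facet.
    total = defaultdict(int)
    for facets in facets_of.values():
        for facet in facets:
            total[facet] += 1

    # Count what the review has found within the first `depth` documents.
    found = defaultdict(int)
    found_docs = 0
    for doc in review_order[:depth]:
        if doc in facets_of:
            found_docs += 1
            for facet in facets_of[doc]:
                found[facet] += 1

    overall = found_docs / len(facets_of)
    per_facet = {facet: found[facet] / total[facet] for facet in total}
    return overall, per_facet

# Toy data (hypothetical): five relevant documents spread over four requests.
facets_of = {
    "d1": {"request_A"},
    "d2": {"request_B"},
    "d3": {"request_B"},
    "d4": {"request_C"},
    "d5": {"request_A", "request_D"},
}
review_order = ["d2", "x1", "d3", "d4", "x2", "d1", "x3", "d5"]

early_overall, early = recall_by_facet(review_order, facets_of, depth=4)
final_overall, final = recall_by_facet(review_order, facets_of, depth=8)
```

In this toy run, some requests reach full recall early while others lag, and all of them catch up by the end of the review. That is the same pattern of convergence the chart illustrates, though the toy numbers prove nothing on their own.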

The same results obtained for the wider variety of topics contained in the RCV1 collection. Here is one example showing the topic coverage of the unified CAL process:

Why is this important? Because the alternative would seem to be running separate and independent CAL projects for each request/topic. Doing so would be costly both in terms of review costs and time. Imagine the effort it would take to initiate and manage 40 separate TAR projects just to cover a single request for production.

A New Approach to Determining the Stopping Point?

In the course of their research, Cormack and Grossman suggested a practical way to determine when to stop the CAL review, one with which I agree. Up to now, most of the debate on this topic has focused on how to do what many call “elusion” sampling. I wrote a two-part article on the topic suggesting that the elusion sample would have to be much larger than most people expected if we wanted a reasonable margin of error. You can read those articles here and here.
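To make the margin-of-error point concrete, here is a rough sketch of how the uncertainty in an elusion sample carries over into the recall estimate. It is not taken from those articles. It assumes a simple random sample of the unreviewed documents and uses a Wilson score interval at roughly 95% confidence (z = 1.96), and the review counts are invented for illustration.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion: k hits in a sample of n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def recall_range(found, discard_size, sample_size, sample_hits):
    """Return (low, point, high) recall implied by an elusion sample.

    found:        relevant documents located by the review so far.
    discard_size: documents left unreviewed (the discard pile).
    sample_size:  documents sampled from the discard pile.
    sample_hits:  sampled documents that turned out to be relevant.
    """
    lo_rate, hi_rate = wilson_interval(sample_hits, sample_size)
    point = found / (found + discard_size * sample_hits / sample_size)
    # A higher elusion rate means more missed documents, hence lower recall.
    low = found / (found + discard_size * hi_rate)
    high = found / (found + discard_size * lo_rate)
    return low, point, high

# Hypothetical review: 8,000 relevant found, 90,000 documents unreviewed,
# and a 2% elusion rate observed in both samples.
small = recall_range(8000, 90000, sample_size=500, sample_hits=10)
large = recall_range(8000, 90000, sample_size=5000, sample_hits=100)

for label, (low, point, high) in (("n=500", small), ("n=5000", large)):
    print(f"{label}: recall about {point:.1%}, range {low:.1%} to {high:.1%}")
```

Both samples give the same point estimate, but the smaller sample leaves a much wider range of plausible recall values. That is the practical problem with modest elusion samples.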

Perhaps drawing from their experiments, Cormack and Grossman suggested we can determine the stopping point for a CAL review by watching the richness of the CAL batches as the review progresses. As the authors noted, CAL based on relevance feedback caused the batches to increase in richness until they came close to 100%. At some point during the review, that figure drops until it falls well below the richness level for the entire collection. The authors suggested that review can stop when batch richness drops to a certain level:

[I]n these experiments, stopping the review when marginal precision falls below one-tenth of its previously sustained value is a good predictor of high recall for the overall information need, as well as the facets, with proportionate effort.

Does that number apply to all situations? Cormack and Grossman make no such claim. Research would be needed before we could draw such a conclusion. But personally I believe that the key to determining when to stop a CAL review should be tied to the level of effort required to find additional relevant documents rather than a set recall percentage. When batch richness falls dramatically, the cost to continue the review increases, at least as a function of finding relevant documents.
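Here is a minimal sketch of how a team might watch for that drop. The paper does not spell out how to measure the “previously sustained value,” so I am assuming it means the best average richness seen over any run of a few consecutive earlier batches. The function name, the window size and the sample figures are all hypothetical.

```python
def stopping_batch(batch_precisions, ratio=0.1, window=3):
    """Return the index of the first batch whose precision falls below
    `ratio` times the best `window`-batch average seen before it.

    batch_precisions: fraction of relevant documents in each review batch,
                      in review order.
    Returns None if no batch meets the stopping condition.
    """
    sustained = 0.0
    for i, precision in enumerate(batch_precisions):
        if i < window:
            continue  # not enough history yet to call anything "sustained"
        # Average richness of the `window` batches just before this one.
        recent = sum(batch_precisions[i - window:i]) / window
        sustained = max(sustained, recent)
        if sustained > 0 and precision < ratio * sustained:
            return i
    return None

# Hypothetical batch richness: climbs, holds near 90%, then tails off.
richness = [0.4, 0.7, 0.9, 0.95, 0.9, 0.85, 0.5, 0.2, 0.08, 0.05]
stop_at = stopping_batch(richness)
```

With these made-up figures the rule fires once batch richness drops under about a tenth of the roughly 90% level sustained earlier in the review. The ratio and window are parameters precisely because the right values would need to be tested against real reviews.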

Our experiments suggest that when a review achieves sustained high precision, and then drops off substantially, one may have confidence that substantially all facets of relevance have been explored. In addition to offering a potentially better prediction of completeness, precision can be readily calculated throughout the review, while recall cannot. Further research is necessary to determine the extent to which marginal precision may afford a reliable quantitative estimate of review completeness, including coverage of different facets of relevance.

Assuming a base threshold of recall has been achieved, say 70% or better, this approach would allow the documents to speak for themselves as to when the review should stop. The Federal Rules require reasonable efforts to obtain the requested information, not perfection nor extraordinary steps.

Killing Two Birds?

The Cormack/Grossman research set out to answer one question: Are relevant documents missed if we don’t run separate TAR projects for each document request? The answer, from this research at least, is “no.”

For all experiments, our results are the same: CAL achieves high overall recall, while at the same time achieving high recall for the various facets of relevance, whether topics or file properties. While early recall is achieved for some facets at the expense of others, by the time high overall recall is achieved—as evidenced by a substantial drop in overall marginal precision—all facets (except for a single outlier case that we attribute to mislabeling) also exhibit high recall. Our findings provide reassurance that CAL can achieve high recall without excluding identifiable categories of relevant information.

I am hopeful we will see more research on this topic in the coming year.

While I am not a hunter, and have nothing against our feathered friends, killing two birds with one stone is a good thing for electronic discovery. Clients are burdened with increasing costs for discovery, and they are looking for alternatives to the “turn over every rock” approach to litigation. If we can quickly find a good stone and go hunting, that is a good thing. Substantial savings in both time and review costs are the prize if we can hit the target with one shot.


[1] However, I am advised that some of the topics in the RCV1 collection had thousands of times as many documents as did others. So that theory may not be correct.

[2] In point of fact, Insight Predict would allow you to tag for all four issues separately and rank them in four projects if desired. However, this would still require extra work, at least in running systematic samples for each of the individual projects and managing multiple processes.



About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” named to the FastCase 50 as a legal visionary and named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.