Does Recall Measure TAR’s Effectiveness Across All Issues? We Put It To The Test

For some time now, critics of technology-assisted review (TAR) have opposed using general recall as a measure of its effectiveness. Overall recall, they argue, does not account for the fact that general responsiveness covers an array of more specific issues. The documents relating to each of those issues exist within the collection in different numbers, so prevalence can vary widely from one issue to the next.

Since general recall measures effectiveness across the entire collection, the critics’ concern is that you will find a lot of documents from the larger groups and only a few from the smaller groups, yet overall recall may still be very high. Using overall recall as a measure of effectiveness can theoretically mask a disproportionate and selective review and production. In other words, you may find a lot of documents about several underlying issues, but you might find very few about others.

As an example, consider a simple case where general responsiveness consists of only two discrete issues. Imagine a 250,000-document collection with 50,000 responsive documents (or 20 percent prevalence). Issue one consists of 47,000 documents, while issue two is much less prevalent and consists of only 3,000 documents. If you found 85 percent of the documents from issue one (or 39,950 documents), you could hit 80 percent recall across the entire collection (40,000 of the 50,000 responsive documents) after finding only 50 of the documents from issue two, or just 1.7 percent of the issue two documents.
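The arithmetic in this example can be checked with a short sketch. The function and variable names are illustrative only, and the figures are the hypothetical ones from the two-issue example above.

```python
def recall(found: int, total: int) -> float:
    """Fraction of the relevant documents that were found."""
    return found / total

# Hypothetical two-issue collection from the example above.
issue_totals = {"issue_one": 47_000, "issue_two": 3_000}
issue_found = {"issue_one": 39_950, "issue_two": 50}

# Overall recall pools every responsive document, regardless of issue.
overall = recall(sum(issue_found.values()), sum(issue_totals.values()))

# Per-issue recall looks at each issue separately.
per_issue = {
    name: recall(issue_found[name], issue_totals[name])
    for name in issue_totals
}

print(f"overall recall: {overall:.1%}")
for name, value in per_issue.items():
    print(f"{name}: {value:.1%}")
```

Overall recall comes out at 80.0 percent, while the per-issue figures are 85.0 percent for issue one and 1.7 percent for issue two. That gap is exactly what the critics worry about.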

Wouldn’t it be great if you could have some level of comfort that you won’t face this problem when you run a TAR review?

Simulated Predict Review Achieves High Recall Across the Board

We recently put this concern directly to the test. Using Insight Predict, we ran a review on a collection of documents that had been fully coded for a real-world production. We found that the critics’ fears were not substantiated. By the time the responsiveness review reached a reasonable level of recall overall, Predict had effectively achieved a reasonable level of recall for each of the underlying issues as well.

Here is how the simulation ran.

We received a collection of just under 700,000 documents that had previously been reviewed for production in an active litigation. Of those, 521,669 documents had responsiveness judgments and each document was also coded for relevance to 13 separate issues. There were 218,775 responsive documents, putting collection richness at just under 42 percent. The issue-coding prevalence ranged from extremely sparse at 0.13 percent (681 documents) to 27.7 percent (114,724 documents).

Using these previously coded documents, we ran a simulated Predict responsiveness review. We began by selecting a random set of documents to start training. Once these initial seeds were selected, we applied the responsiveness judgments to those documents and submitted them to Predict to rank the entire collection. The initial seeds were considered to have been seen and reviewed, and the balance of the collection was considered unseen. Next, we selected a batch of roughly 1,000 unseen documents from the top of the ranking, applied the responsiveness judgments to these newly selected documents, and then submitted all of the seen documents back to the algorithm. From there, the algorithm re-ranked the entire collection, a new batch of unseen documents was selected, and the process repeated.
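For readers who want to see the shape of this loop, here is a minimal sketch of a simulated review on synthetic data. It is not Insight Predict's algorithm: the token-rate ranker (`train`, `score`) is a toy stand-in assumed purely for illustration, the documents are made up, and the seed and batch sizes are scaled down from the roughly 1,000-document batches described above.

```python
import random

def train(seen, docs, labels):
    """Toy stand-in ranker: smoothed per-token responsiveness rate, centered on zero."""
    pos, tot = {}, {}
    for i in seen:
        for tok in docs[i]:
            tot[tok] = tot.get(tok, 0) + 1
            pos[tok] = pos.get(tok, 0) + (1 if labels[i] else 0)
    return {tok: (pos[tok] + 1) / (tot[tok] + 2) - 0.5 for tok in tot}

def score(weights, doc):
    """Sum the learned token weights; unknown tokens contribute nothing."""
    return sum(weights.get(tok, 0.0) for tok in doc)

def simulate_review(docs, labels, seed_size, batch_size, rng):
    """Return document indices in the order a simulated review would see them."""
    order = rng.sample(range(len(docs)), seed_size)  # random initial seeds
    seen = set(order)
    while len(seen) < len(docs):
        # Retrain on every judgment collected so far, then re-rank the unseen documents.
        weights = train(seen, docs, labels)
        unseen = [i for i in range(len(docs)) if i not in seen]
        unseen.sort(key=lambda i: score(weights, docs[i]), reverse=True)
        batch = unseen[:batch_size]  # next batch comes from the top of the ranking
        order.extend(batch)
        seen.update(batch)
    return order

# Synthetic collection (hypothetical): about 30 percent responsive documents.
rng = random.Random(0)
docs, labels = [], []
for _ in range(2000):
    responsive = rng.random() < 0.3
    vocab = ["contract", "pricing", "merger"] if responsive else ["lunch", "parking", "newsletter"]
    tokens = set(rng.sample(vocab, 2)) | {rng.choice(["re", "fw", "hello"])}
    docs.append(tokens)
    labels.append(responsive)

order = simulate_review(docs, labels, seed_size=50, batch_size=100, rng=rng)
```

Because the existing coding supplies every judgment, the loop needs no human input. The returned `order` is all that is required to measure recall at any review depth afterward.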

We tracked responsiveness recall throughout the simulated Predict review. We also tracked the recall for each of the 13 issues as the review progressed. We plotted the responsiveness recall, as well as the recall for each individual issue, on a single gain curve (see Figure 1). On Figure 1, the blue line is the gain curve for the responsiveness review, and the green lines represent the gain curves for each of the underlying substantive issues.
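The tracking described here can be sketched as a single pass over the review order. The function name `gain_curves`, its parameters, and the tiny example data are assumptions made for illustration, and plotting is left out.

```python
def gain_curves(order, responsive, issues, step):
    """Recall for responsiveness and for each issue, sampled every `step` documents.

    order      -- document ids in the order they were reviewed
    responsive -- set of ids coded responsive
    issues     -- dict mapping issue name -> set of ids coded for that issue
                  (issue names are assumed not to collide with "responsive")
    """
    targets = {"responsive": responsive, **issues}
    found = {name: 0 for name in targets}
    curves = {name: [] for name in targets}
    depths = []
    for n, doc_id in enumerate(order, start=1):
        for name, ids in targets.items():
            if doc_id in ids:
                found[name] += 1
        # Record a point at every checkpoint and at the end of the review.
        if n % step == 0 or n == len(order):
            depths.append(n)
            for name, ids in targets.items():
                curves[name].append(found[name] / len(ids) if ids else 0.0)
    return depths, curves

# Tiny made-up example: six documents, three responsive, two issues.
order = [3, 1, 4, 0, 2, 5]
responsive = {1, 3, 4}
issues = {"issue_a": {1, 3}, "issue_b": {4}}
depths, curves = gain_curves(order, responsive, issues, step=2)
```

In this toy run, `issue_b` sits at zero recall after two documents even though responsiveness recall is already two-thirds. A per-issue curve exists to expose exactly that kind of lag.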

Figure 1

Looking at Figure 1, the responsiveness review achieved 80 percent recall after the simulated review of roughly 261,000 documents (notably, eliminating the linear review of more than 155,000 documents). By the time the responsiveness review hit 80 percent recall, the level of recall for all 13 substantive issues exceeded 70 percent, with 10 of the issues exceeding 80 percent.
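Reading these numbers off a set of gain curves amounts to finding the first point where overall recall reaches the target and then checking each issue at that same depth. The sketch below uses a hypothetical helper, `issue_recall_at_target`, and made-up curve values. It is not the data behind Figure 1.

```python
def issue_recall_at_target(depths, curves, target=0.80, overall="responsive"):
    """Find the first checkpoint where overall recall reaches `target`.

    Returns the review depth at that checkpoint and each issue's recall there,
    or (None, {}) if the target is never reached.
    """
    for k, value in enumerate(curves[overall]):
        if value >= target:
            at_target = {
                name: curve[k] for name, curve in curves.items() if name != overall
            }
            return depths[k], at_target
    return None, {}

# Illustrative numbers only.
depths = [1000, 2000, 3000, 4000]
curves = {
    "responsive": [0.35, 0.62, 0.81, 0.93],
    "issue_a": [0.40, 0.70, 0.86, 0.95],
    "issue_b": [0.10, 0.45, 0.72, 0.90],
}

depth, at_target = issue_recall_at_target(depths, curves)
above_70 = [name for name, r in at_target.items() if r > 0.70]
above_80 = [name for name, r in at_target.items() if r > 0.80]
```

With these illustrative values, the target is reached at a depth of 3,000 documents. Both issues exceed 70 percent recall at that point, and one of them exceeds 80 percent.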

What this means in practice is that a general responsiveness review using Predict will often uncover the majority of documents relating to the individual substantive issues underlying a responsiveness judgment. Although every collection and every review is obviously different, there is a good chance that Predict will locate a substantial body of documents relating to even the sparsest of issues. This capability goes a long way toward providing comprehensive insight into your document collection, and assuaging the concern that general recall may not be an appropriate measure of the effectiveness of a Predict review.

The Simulated Predict Review is Consistent with Existing Studies

The fact that Predict was able to achieve high recall across the board is consistent with the existing studies of continuous active learning (CAL) protocols. Grossman and Cormack studied this specific phenomenon, and reported the results of their studies in their SIGIR paper, “Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review.” They recognized that “[w]hile CAL has been shown to achieve high recall with less effort than competing methods, it has been suggested that CAL’s emphasis on the most-likely relevant documents may bias it to prefer documents like the ones it finds first, causing it to fail to discover an important, but dissimilar, class of relevant documents.” So they set out to “test the ability of CAL to achieve high recall on all facets of an information need.”

There were two components to the Grossman-Cormack study. In one, they combined four related TREC topics into a single, composite topic. They ran a CAL review of the composite topic, and tracked the recall of the composite topic as well as each subtopic. In the other, they used a number of Reuters newswires (known as RCV1) that were classified into a number of categories, with related categories further combined into four top-level groups. They ran a CAL review of the top-level groups, and tracked the recall of each group as well as each constituent category.

Ultimately, Grossman and Cormack determined that a general continuous active learning review would achieve high levels of recall for every aspect underlying the general review. According to their paper:

Continuous active learning achieves high recall for technology-assisted review, not only for an overall information need, but also for various facets of that information need, whether explicit or implicit. Through simulations…, we show that continuous active learning, applied to a multi-faceted topic, efficiently achieves high recall for each facet of the topic. Our results assuage the concern that continuous active learning may achieve high overall recall at the expense of excluding identifiable categories of relevant information.

The Flexibility of Predict

On a final note, two of the nice features of Predict are that it is simple to use and flexible in application. There may be times when you would like to delve even more deeply into specific substantive issues underlying responsiveness, even though the Predict review located many of the pertinent documents. Not a problem. Just spin up a new Predict project focused on the particular issue of interest, use the existing documents to rank the collection, and watch the remaining documents float to the top of the ranked list for prioritized review. Either way, you can use Predict to make sure that you pretty much know everything about everything.


About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.