Review Efficiency Using Insight Predict

An Initial Case Study

Much of the discussion around Technology Assisted Review (TAR) focuses on “recall,” which is the percentage of the relevant documents found in the review process. Recall is important because lawyers have a duty to take reasonable (and proportionate) steps to produce responsive documents. Indeed, Rule 26(g) of the Federal Rules of Civil Procedure effectively requires that an attorney certify, after reasonable inquiry, that discovery responses and any associated production are reasonable and proportionate under the totality of the circumstances.

In that regard, achieving a recall rate of less than 50% does not seem reasonable, nor is it often likely to be proportionate. Current TAR decisions suggest that reaching 75% recall is likely reasonable, especially given the potential cost to find additional relevant documents. Higher recall rates, 80% or higher, would seem reasonable in almost every case.

But recall is only half of the story. Achieving any level of recall comes at a price. That price can be expressed in terms of “precision,” which is the ratio of relevant to non-relevant documents that must be reviewed to reach any level of recall. The cost of review is a function of the precision of your TAR process, just as it is driven by the level of recall you have to attain.
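Both measures can be expressed directly in code. Here is a minimal sketch; the counts are hypothetical, chosen only to illustrate the arithmetic, and are not drawn from any of the cases discussed here:

```python
def recall(relevant_found: int, relevant_total: int) -> float:
    """Fraction of all relevant documents in the collection that the review found."""
    return relevant_found / relevant_total

def precision(relevant_found: int, docs_reviewed: int) -> float:
    """Fraction of the reviewed documents that turned out to be relevant."""
    return relevant_found / docs_reviewed

# Hypothetical review: 8,000 relevant documents in the collection;
# the team reviews 20,000 documents and finds 6,400 of the relevant ones.
print(recall(6_400, 8_000))      # 0.8  -> 80% recall
print(precision(6_400, 20_000))  # 0.32 -> 32% precision
```

Note that the two measures pull in opposite directions: pushing a review to higher recall generally means accepting lower precision in the final batches.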

How many documents must be reviewed to find one relevant document in a TAR process? There is no one answer to that question. The precision of any TAR process will depend on a number of factors including the nature of the documents themselves, the algorithm used, the effectiveness of the training process and the level of recall obtained at the point of measurement.

For example, in several studies Maura Grossman and Gordon Cormack have suggested one should expect to review two documents for every relevant one found (based on achieving 75% recall). This amounts to a 50% precision rate, which seems pretty good, particularly in collections with low richness.

In an effort to contribute to this discussion, we took a look at three simulations and a dozen cases where our clients used Predict, Catalyst’s advanced TAR 2.0 technology, for their review. Our purpose in doing so was to calculate the precision rates obtained to see if we could discern a pattern. In doing so we recognized that our small sample wasn’t statistically representative, either in the way the cases were selected or in their number. Rather, we had data for these cases and decided we would report on them for whatever value could be derived. At some future point, we hope to aggregate more data on a larger case population and repeat the experiment.

The Projects

We can’t say too much about the cases or the simulations because of client confidentiality. As a result, we have simply named them Sim 1 through Sim 3 and Case 1 through Case 12. For each we have listed the number of documents in the collection as well as the estimated richness of the collection. And, because the question from clients is most often framed in terms of the number of documents that will need to be reviewed, we essentially show precision as its reciprocal, i.e., the number of documents the team reviewed to find each relevant document.

We named this statistic Predict Efficiency, and we generally want the figure to be as close to 1.0 (a “perfect” review) as possible. Thus, for example, the richness in Case 1 was well under 1%. The team had to review 5.77 documents for each relevant document found. At the other end of the spectrum, in Sim 2, with richness running at almost 42%, the team had to review just one and one-half documents for each relevant one found. Sim 2 was obviously more efficient, since the team needed to review fewer non-responsive documents to find one responsive document.
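Predict Efficiency, as we use it here, is simply the reciprocal of precision. A minimal sketch, with illustrative document counts chosen only to reproduce the ratios described above:

```python
def predict_efficiency(docs_reviewed: int, relevant_found: int) -> float:
    """Documents reviewed per relevant document found (the reciprocal of precision)."""
    return docs_reviewed / relevant_found

# Illustrative counts matching the ratios described above:
print(predict_efficiency(5_770, 1_000))  # 5.77 -> a Case 1-style sparse review
print(predict_efficiency(1_500, 1_000))  # 1.5  -> a Sim 2-style rich review
```

The closer the figure is to 1.0, the fewer non-responsive documents the team had to read along the way.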
In this regard, we should note that none of these reviews stopped at 75% recall. All went above 80% and many went above 90%. As a result, we would expect to see lower precision figures than might be expected at a 75% recall rate.

Here is a plot showing the precision numbers (as Predict Efficiency) for all fifteen projects.

Many of the cases came in at a roughly two to one precision ratio. This is in line with the Grossman and Cormack results and, to our knowledge, beats a keyword review by a wide margin.

A few of the cases had higher numbers, e.g., over five to one, but there were typically reasons for this. For example, it is not at all unusual to see only modest precision results with extremely sparse collections like Cases 1, 8 and 12, where richness was quite low. In several of the other cases, the review continued well beyond the 75% and 80% recall marks, at which point there are simply fewer responsive documents left in the collection and, therefore, in the final batches being reviewed.

The average across these cases was just under three to one, reflecting a precision rate of just about 37.5%. Is that a good result for the team? When compared to linear review, there is no question about it: in each case the team was able to achieve higher than required recall while still reviewing only a fraction of the total population. And with keyword search, we doubt many would achieve similar levels of recall without having to review a far larger percentage of the documents.
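Converting between a review ratio and a precision percentage is just a matter of taking the reciprocal. A quick check of the figures above, assuming the ratios are exact:

```python
def precision_from_ratio(docs_per_relevant: float) -> float:
    """Precision implied by a given documents-reviewed-per-relevant ratio."""
    return 1.0 / docs_per_relevant

print(precision_from_ratio(2.0))              # 0.5   -> the 2:1 Grossman/Cormack benchmark
print(round(precision_from_ratio(8 / 3), 3))  # 0.375 -> 37.5% at just under 3:1
```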

We can make one other observation, particularly by comparing the three simulations to the actual case reviews. Each of the simulated reviews was below the average in terms of the number of documents that had to be reviewed to find one responsive document, and two of the three exhibited the best results of all projects.

Ultimately, that is probably not surprising. Real reviews suffer, for example, from incorrect or inconsistent coding that gets fixed through the QC process. What that means practically is that the algorithm is improperly trained at times during the review process, and then rectified through QC. A simulation uses the final coding decisions once a review is finished, so every coding decision that informs the algorithm is correct, and the algorithm is optimally trained. Again, this is not a statistical analysis, but does provide some insight into what you might expect when you are running a Predict review.


As we mentioned at the outset, this is not by any means a statistical analysis of Predict efficiency. But we can observe a few trends, even among these few examples. First, the realities of an actual Predict review — things like coding errors and inconsistency, quality control measures, relevance drift, etc. — will likely make the review less efficient than a perfect, simulated review where the true coding decisions are always known and applied. As a corollary, a thorough, careful and considered review, with fewer discrepancies, will likely improve efficiency. Second, sparse collections will also likely be less efficient than collections with a more reasonable richness level. Finally, these numbers are not inconsistent with our general observations of the performance of Predict, even outside these specific cases. So, in the absence of a more precise statistical evaluation, you may be able to use this data as a quick rule of thumb to guide your projections as you plan a review directed toward achieving a high level of recall using Predict.


About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.


About Andrew Bye

Andrew is the director of machine learning and analytics at Catalyst, and a search and information retrieval expert. Throughout his career, Andrew has developed search practices for e-discovery, and has worked closely with clients to implement effective workflows from data delivery through statistical validation. Before joining Catalyst, Andrew was a data scientist at Recommind. He has also worked as an independent data consultant, advising legal professionals on workflow and search needs. Andrew has a bachelor’s degree in linguistics from the University of California, Berkeley and a master’s in linguistics from the University of California, Los Angeles.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.