Ask Catalyst: Is Recall A Fair Measure Of The Validity Of A Production Response?

[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]  

This week’s question:

Is recall a fair measure of the validity of a production response?

Today’s question is answered by John Tredennick, founder and CEO.

This question actually arose out of a discussion I recently had on LinkedIn. A commenter there questioned whether recall is a fair measure of the validity of a production response. The person who initiated the discussion argued that it is not. In fact, he called it “wrongheaded.” I disagreed. I believe recall—built on a reasonable search/investigation process—is not only a good measure for the success of a production, but the best one currently available to us.

First, I don’t take issue with the notion that good recall alone doesn’t guarantee that the most important documents will be found. Imagine a world where there were 100 relevant documents, but 99 were duplicates and of only marginal relevance. One could produce 99 percent of the “relevant” documents and leave out the smoking gun. That is certainly conceivable.
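To put numbers on that hypothetical, recall is simply the proportion of the relevant documents that actually gets produced. Here is a minimal sketch of the arithmetic, using the made-up figures above (they are illustrative, not drawn from any real matter):

```python
# Hypothetical from above: 100 relevant documents exist, 99 marginal
# near-duplicates are produced, and the one "smoking gun" is missed.
relevant_in_collection = 100
relevant_produced = 99

recall = relevant_produced / relevant_in_collection
print(f"Recall: {recall:.0%}")  # 99% recall, yet the key document was left behind
```

High recall, in other words, measures completeness, not the importance of what was missed.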

But that isn’t how it works in the real world, at least not in my 30 years of legal experience. In most cases, the trial team diligently searches for important documents using a variety of techniques. They start by reading the complaint or speaking to their client. Then they move on to key witnesses, focusing on correspondence (email today) or other relevant files. They take depositions to learn more about the case.

As they learn more, they get better informed about what matters and can better direct the investigation. Key documents typically surface throughout the process.

TAR is designed to supplement and enhance the investigation rather than detract from it. In a TAR 2.0 process at least, the team uses the techniques at its disposal to find relevant documents. As relevant documents are found, they can be used as training seeds to help the TAR 2.0 algorithm find even more relevant documents.

In the LinkedIn discussion, it was stated that the “superior performance [of TAR] has not been established.” I have to respectfully disagree. In almost every case, TAR will find a given percentage of relevant documents more quickly than keyword search or linear review. The research on this point is strong, if not overwhelming, involving hundreds of examples. Our research and experience in hundreds of actual cases confirm the point as well. Although our work is not peer reviewed, as the academic papers are, we have published case studies showing how effective a TAR 2.0 process can be.

I don’t think the LinkedIn commenter was taking issue with this point but rather was pushing the argument that machine learning might inherently miss some of the “most relevant” documents. In one respect, that could be true. Producing 80 percent of the relevant documents does not mean that key relevant documents weren’t left behind in the remaining 20 percent that were not produced.

But how is that different from using keyword search to cull the population prior to review? The commenter and I agree that it is impractical to require humans to review every document that might be collected. Using keyword search to reduce review populations doesn’t promise that the most relevant documents will be found either. To the contrary, there is strong evidence that keyword search is worse at finding relevant documents than a good TAR process.

In many respects this is a validation problem. For our cases, we suggest taking a statistically valid sample of the “discard pile” to estimate how many relevant documents might remain. In most cases, the ones that turn up in the sample are not smoking guns or otherwise “highly relevant” documents. More often they are marginally relevant and thus not important to the case.
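As an illustration of that kind of discard-pile sample, here is a minimal sketch in Python. The pile size, sample size, and hit count are made-up numbers, and the normal-approximation confidence interval is just one of several ways a statistician might compute the range:

```python
import math

# Illustrative figures only, not numbers from any actual matter.
discard_pile_size = 500_000   # documents not slated for production
sample_size = 2_400           # simple random sample drawn from the discard pile
relevant_in_sample = 12       # relevant documents reviewers found in the sample

# Point estimate of the prevalence of relevant documents in the discard pile
p_hat = relevant_in_sample / sample_size

# Normal-approximation 95% confidence interval on that prevalence
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
low, high = max(0.0, p_hat - margin), p_hat + margin

print(f"Estimated relevant documents remaining: ~{p_hat * discard_pile_size:,.0f}")
print(f"95% CI: {low * discard_pile_size:,.0f} to {high * discard_pile_size:,.0f}")
```

The point of the exercise is not the particular interval method but the step itself: reviewing the sampled documents tells you whether anything important is likely hiding in what you plan to leave behind.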

Assuming something highly relevant were found in the final sample, I would suggest counsel would have a duty to investigate further. This could be done by adding the newly found documents as training seeds and doing more review. Or by searching on the new content to see if more relevant documents turned up. Or by doing both.

In fact, I haven’t heard stories of this happening, at least in a validation sample taken at the end of a review. Nor, frankly, have I heard stories of highly relevant documents being left out of a TAR 2.0 production.

It is good to remember that document productions are part of a larger discovery process. I have had cases where a smoking gun turned up after a production, perhaps by being referenced at a deposition. The typical reaction is to request that counsel go back and look for more like it (whether through revised keyword search or otherwise). As you find more of these kinds of documents, you continue your discovery education or perhaps settle the case. This is a natural part of the discovery process.

Ultimately, I would never vote to “replace human review with machine predictions,” as the commenter suggested some might. Our TAR 2.0 continuous learning process is built around human thinking and review, with machine learning simply helping the reviewers get there faster and more efficiently—sort of like substituting a car for a person on foot. The human drives the car and makes the decisions about the destination, but the car cuts travel time by orders of magnitude.

A good TAR 2.0 system will do the same. The lawyers are encouraged to analyze the case and find relevant documents as quickly and efficiently as possible. The review team gets the benefit of this work, while they add to the mix by finding other relevant documents. The TAR 2.0 system helps by synthesizing the team’s efforts into a ranking that pushes additional relevant documents to the surface. As smoking guns are unearthed, they become training seeds to find more of those highly relevant documents.
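For readers curious about the mechanics, here is a minimal sketch of that kind of continuous-learning ranking loop. It is not Catalyst’s TAR 2.0 implementation; the scikit-learn classifier, the toy documents, and the seed judgments are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy collection: in practice these would be the documents in the review set.
documents = [
    "pricing agreement with competitor",      # judged relevant
    "let's fix the bid before the deadline",  # judged relevant
    "lunch menu for the holiday party",       # judged not relevant
    "quarterly revenue and pricing summary",  # not yet reviewed
    "soccer schedule for the weekend",        # not yet reviewed
    "notes on the competitor pricing call",   # not yet reviewed
]

# Seed judgments from the trial team's early investigation (index -> label).
judgments = {0: 1, 1: 1, 2: 0}

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents)

# One round of the loop: train on what has been judged, rank the rest,
# send the top-ranked documents to human reviewers, fold their calls back in.
model = LogisticRegression()
labeled_idx = list(judgments)
model.fit(features[labeled_idx], [judgments[i] for i in labeled_idx])

unreviewed = [i for i in range(len(documents)) if i not in judgments]
scores = model.predict_proba(features[unreviewed])[:, 1]
ranked = sorted(zip(unreviewed, scores), key=lambda pair: pair[1], reverse=True)

for doc_id, score in ranked:
    print(f"doc {doc_id}: predicted relevance {score:.2f} -> route to reviewer")
# Reviewer decisions on these documents become new training seeds, and the
# loop repeats until a discard-pile sample suggests little of value remains.
```

The humans make every relevance call; the model only decides what they look at next, which is why the review finishes faster without taking judgment away from the lawyers.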

The end result is that the review (human, not machine) gets done more quickly and at a much lower cost because the humans don’t have to review so many irrelevant documents. How can that be wrongheaded?

4 thoughts on “Ask Catalyst: Is Recall A Fair Measure Of The Validity Of A Production Response?”

  1. William Hamilton

    This argument against TAR, that it does not necessarily find every conceivable relevant document (notwithstanding that other search methods are much inferior), resembles the argument that we should not try to be ethical because human nature is such that being ethical is difficult and we will never be perfect.

  2. William Kellermann

    John,
    As always, an insightful and well-crafted post. As to the issue of “recall” as a fair measure of production response, I think you are right on the money, generally, but the discussion is targeting the wrong part of the process. An element of search is inherent in each step of the electronic discovery process. What is the recall at the identification phase? What is the recall at the collection phase? Recall in objective culling (date cuts, deNISTing, etc.)? The quality of the production is more likely affected by a failure at an earlier step than it is by document review. If you look at some of the biggest “failure to produce” sanction cases, it was not the attorney review that failed. Qualcomm v. Broadcom comes to mind. And a couple of surveys show judges believe (in my mind rightly so) electronic discovery failures center on identification and collection failures, not review. Those surveys show over 80% of the failures are grounded in identification, preservation and collection, with 0% from review. Those failures could be cast as ‘search failures.’ So this “debate,” like many others that captivate audiences the world over, is much ado about nothing. Now, if you move auto-classification schemes to behind the firewall, you could both measure recall and really do something to alleviate >80% of eDiscovery failure. Wouldn’t that be something!

  3. William Kellermann

    Recall and precision are important. So is error. Recent surveys of judges indicate 85% of eDiscovery errors occur in identification, preservation and collection. The same surveys demonstrate judges find zero eDiscovery error in the review phase. So the attack against TAR is dramatically misplaced, unless one wants to make a true Luddite argument about preserving the jobs of review attorneys. Production failures are the result of failures much earlier in the process than TAR. If one wanted to improve the science of eDiscovery, the focus should be on recall and precision at the identification and collection phase, assuming identification was accurate enough to ensure preservation. Errors at each earlier phase, as well as processing, accumulate such that the process is inherently unreliable long before a single record hits the review database. It is also one of the reasons judges are probably tiring of the law and motion over TAR.

  4. Pingback: TAR Wars Episode IV: The CAL Zone
