This week’s question:
Is recall a fair measure of the validity of a production response?
Today’s question is answered by John Tredennick, founder and CEO.
This question actually arose out of a discussion I recently had on LinkedIn. A commenter there questioned whether recall is a fair measure of the validity of a production response. The person who initiated the discussion argued that it is not. In fact, he called it “wrongheaded.” I disagreed. I believe recall—built on a reasonable search/investigation process—is not only a good measure for the success of a production, but the best one currently available to us.
First, I don’t take issue with the notion that good recall alone doesn’t guarantee that the most important documents will be found. Imagine a world where there were 100 relevant documents, but 99 were duplicates and of only marginal relevance. One could produce 99 percent of the “relevant” documents and leave out the smoking gun. That is certainly conceivable.
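The arithmetic of that hypothetical is simple enough to show directly. A minimal sketch (the numbers come from the thought experiment above, not a real matter):

```python
def recall(produced_relevant: int, total_relevant: int) -> float:
    """Recall: the fraction of all relevant documents that were produced."""
    return produced_relevant / total_relevant

# The hypothetical: 100 relevant documents, 99 marginal near-duplicates
# produced, and the one smoking gun left behind.
print(recall(99, 100))  # 0.99
```

Recall counts documents, not importance, which is exactly why the hypothetical works: 99 percent recall while the single document that matters most stays in the unproduced 1 percent.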
But that isn’t how it works in the real world, at least not in my 30 years of legal experience. In most cases, the trial team diligently searches for important documents using a variety of techniques. They start by reading the complaint or speaking to their client. Then they move on to key witnesses, focusing on correspondence (email today) or other relevant files. They take depositions to learn more about the case.
As they learn more, they get better informed about what matters and can better direct the investigation. Key documents typically surface throughout the process.
TAR is designed to supplement and enhance the investigation rather than detract from it. In a TAR 2.0 process at least, the team uses the techniques at its disposal to find relevant documents. As relevant documents are found, they can be used as training seeds to help the TAR 2.0 algorithm find even more relevant documents.
In the LinkedIn discussion, it was stated that the “superior performance [of TAR] has not been established.” I have to respectfully disagree. In almost every case, TAR will find a given percentage of relevant documents more quickly than keyword search or linear review. The research on this point is strong, if not overwhelming, involving hundreds of examples. Our own research and experience in hundreds of actual cases confirm the point as well. Although our work is not peer reviewed, as the academic papers are, we have published case studies showing how effective a TAR 2.0 process can be.
I don’t think the LinkedIn commenter was taking issue with this point but rather was pushing the argument that machine learning might inherently miss some of the “most relevant” documents. In one respect, that could be true. Producing 80 percent of the relevant documents does not mean that key relevant documents weren’t left behind in the remaining 20 percent that were not produced.
But how is that different from using keyword search to cull the population prior to review? The commenter and I agree that it is impractical to require humans to review every document that might be collected. Using keyword search to reduce review populations doesn’t promise that the most relevant documents will be found either. To the contrary, there is strong evidence that keyword search is worse at finding relevant documents than a good TAR process.
In many respects this is a validation problem. For our cases, we suggest taking a statistically valid sample of the “discard pile” to estimate how many relevant documents might remain. In most cases, the ones that turn up in the sample are not smoking guns or otherwise “highly relevant” documents. More often they are marginally relevant and thus not important to the case.
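One common way to run that check is an “elusion” sample of the discard pile. Here is a minimal sketch, assuming a simple random sample and a normal approximation for the confidence bound; the numbers are hypothetical, and a real validation protocol would typically use exact binomial methods:

```python
import math

def estimate_discard_relevant(discard_size: int, sample_size: int,
                              relevant_in_sample: int, z: float = 1.96):
    """Point estimate and normal-approximation 95% upper bound for the
    number of relevant documents remaining in the discard pile, based on
    a simple random sample drawn from that pile."""
    p = relevant_in_sample / sample_size          # observed relevance rate
    se = math.sqrt(p * (1 - p) / sample_size)     # standard error of the rate
    point = p * discard_size                      # best estimate of docs missed
    upper = min(1.0, p + z * se) * discard_size   # conservative upper bound
    return point, upper

# Hypothetical numbers: 400,000 discarded documents, 1,500 sampled,
# 6 relevant documents found in the sample.
point, upper = estimate_discard_relevant(400_000, 1_500, 6)
```

The point of the exercise is not the raw count but what the sampled documents look like: if the handful that turn up are marginal, the team has evidence the discard pile holds nothing important.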
Assuming something highly relevant were found in the final sample, I would suggest counsel would have a duty to investigate further. This could be done by adding the newly found documents as training seeds and doing more review. Or by searching on the new content to see if more relevant documents turned up. Or by doing both.
In practice, I haven’t heard stories of this happening, at least in a validation sample taken at the end of a review. Nor, frankly, have I heard stories of highly relevant documents being left out of a TAR 2.0 production.
It is good to remember that document productions are part of a larger discovery process. I have had cases where a smoking gun turned up after a production, perhaps referenced at a deposition. The typical reaction is to request that counsel go back and look for more like it (whether through revised keyword search or otherwise). As you find more of these kinds of documents, you continue your discovery education or perhaps settle the case. This is a natural part of the discovery process.
Ultimately, I would never vote to “replace human review with machine predictions,” as the commenter suggested some might. Our TAR 2.0 continuous learning process is built around human thinking and review, with machine learning simply helping the reviewers get there faster and more efficiently—sort of like substituting a car for a person on foot. The human drives the car and makes decisions about the destination, but the car cuts travel time by orders of magnitude.
A good TAR 2.0 system will do the same. The lawyers are encouraged to analyze the case and find relevant documents as quickly and efficiently as possible. The review team gets the benefit of this work, and adds to the mix by finding other relevant documents. The TAR 2.0 system helps by synthesizing the team’s efforts into a ranking which pushes additional relevant documents to the surface. As smoking guns are unearthed, they become training seeds to find more of those highly relevant documents.
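The loop described above can be sketched in miniature. This is a toy illustration only, not Catalyst’s actual algorithm: simple term overlap stands in for the real learning model, and a tiny simulated collection stands in for a document population:

```python
from collections import Counter

def cal_round(documents, seeds, labels, reviewer, batch_size=2):
    """One round of a continuous-learning loop: build a term profile from
    the relevant seed documents, rank the unreviewed documents against it,
    have a human review the top batch, and fold new finds back into the seeds."""
    profile = Counter(t for doc_id in seeds for t in documents[doc_id])
    unreviewed = [d for d in documents if d not in labels]
    ranked = sorted(
        unreviewed,
        key=lambda d: sum(profile[t] for t in documents[d]),  # overlap score
        reverse=True,
    )
    for doc_id in ranked[:batch_size]:
        labels[doc_id] = reviewer(doc_id)  # the human makes the relevance call
        if labels[doc_id]:
            seeds.add(doc_id)  # a newly found relevant document becomes a seed
    return seeds, labels

# A tiny simulated collection (term sets stand in for document text).
documents = {
    "d1": {"merger", "price", "email"},
    "d2": {"merger", "smoking", "gun"},
    "d3": {"lunch", "schedule"},
    "d4": {"price", "fixing", "merger"},
}
truly_relevant = {"d2", "d4"}  # simulated ground truth, for the demo only

seeds, labels = {"d4"}, {"d4": True}  # one relevant doc found by the trial team
seeds, labels = cal_round(documents, seeds, labels, lambda d: d in truly_relevant)
```

Note that every relevance decision in the loop is made by the human reviewer; the ranking only decides what the reviewer sees next, which is the division of labor the car analogy describes.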
The end result is that the review (human, not machine) gets done more quickly and at a much lower cost because the humans don’t have to review so many irrelevant documents. How can that be wrongheaded?