A Discussion About Dynamo Holdings: Is 43% Recall Enough?

In September 2014, Judge Ronald L. Buch became the first to sanction the use of technology assisted review (aka predictive coding) in the U.S. Tax Court. See Dynamo Holdings Limited Partnership v. Commissioner of Internal Revenue, 143 T.C. No. 9. We mentioned it here.

This summer, Judge Buch issued a follow-on order addressing the IRS commissioner’s objections to the outcome of the TAR process, which we chronicled here. In that opinion, he affirmed the petitioner’s TAR process and rejected the commissioner’s challenge that the production was not adequate. In doing so, the judge debunked what he called the two myths of review: that human review is the “gold standard” and that any discovery response is, or can be, perfect.

While the outcome of the opinion is relatively straightforward—a TAR-based production doesn’t have to be perfect—the procedures followed by the parties, the level of recall obtained and the arguments advanced had us scratching our heads. We found ourselves in a heated discussion about the TAR protocol used by the parties, how they managed the training process and why nobody took steps to validate the results.

John Tredennick: To start, the court didn’t mandate a particular TAR methodology but rather stated that it is the obligation of the producing party to determine the method of responding to the discovery. The court went further to say that if the requesting party “can articulate a meaningful shortcoming in that response,” it was free to seek relief.

This position has been articulated by The Sedona Conference and by other courts, but I think it can’t be stated enough. From the days of paper discovery, it has always been the right and responsibility of the responding party to choose the method of identifying and producing relevant documents. That kind of rule is clear and will reduce the uncertainty over “transparency” that has hindered the adoption of newer cost-reducing review methods like TAR. I stated my views on this in “None of Your Beeswax: Or Do I Have to Invite Opposing Counsel to my Predictive Ranking Party?”

What do you think? Should we go back to transparency and the tag-team approach to reviewing documents?

Tom Gricks: Looking at the agreed TAR protocol in the case, I think that’s somewhat of a trick question. I am a strong believer in Sedona Principle 6, and the responding party’s right to determine the best way to locate and produce documents, consistent with the Federal Rules of Civil Procedure. We haven’t historically given the requesting party the right to participate in the production process, and there’s no reason to deviate from that philosophy in a TAR case. And TAR has clearly progressed to the point that there is no longer any need to engraft transparency merely for the sake of providing comfort in the TAR process, as I did in Global Aerospace. As long as validation evinces a reasonable production, that should be the end of the inquiry.

That said, I think this was in fact an extremely transparent TAR protocol—to some extent the antithesis of Sedona Principle 6. While the respondents chose the TAR tool, the commissioner actually trained the tool. And then the production (what I would call the presumptively relevant set) selected by the tool was turned over to the commissioner for review in its entirety, subject to clawback. I can’t imagine a more transparent TAR protocol — the commissioner saw every responsive and nonresponsive document used to train the tool, and every responsive and nonresponsive document selected by the tool for potential production. Other than selecting the (mostly) random samples used to train the tool, the commissioner saw everything.

Personally, I think that’s too liberal, and almost an abdication of the producing party’s responsibility to make a reasonable inquiry under the Federal Rules.

Given the opportunity to challenge the response, I have a more bothersome question for you. Although it’s not entirely clear, the ESI protocol suggests that the collection was about 3 percent rich, with about 13,500 responsive documents. According to the court, the production was only 3 percent rich (precision), with about 5,800 responsive documents. That’s roughly 43 percent recall, and a result that approaches a random selection of documents. With today’s TAR tools, would that generally be considered a reasonable result?

John Tredennick: You raise an important point, one you can’t really see simply by reading Judge Buch’s recent order. Rather, the key information about the numbers can only be found by retrieving the ESI protocol itself, which was set forth in Judge Buch’s Order Concerning ESI Discovery.

That document is interesting on a number of levels, including the chart laying out the estimated number of relevant documents needed to achieve different levels of recall. In this case, the commissioner demanded a 95 percent recall rate. The parties estimated they would need to find 12,705 out of a total of about 13,500 relevant documents to achieve 95 percent recall.

As you note, the commissioner found only about 5,800 relevant documents in the production. I don’t know how one justifies a production that includes only 43 percent of the relevant documents.
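
To put rough numbers on it (these are approximations drawn from the ESI protocol and the order, not exact counts from the record), the arithmetic is simple:

```python
# Back-of-the-envelope recall check using the approximate figures discussed
# above (estimates from the ESI protocol and the order, not exact counts).
estimated_responsive_total = 13_500   # ~3 percent of the collection
responsive_produced = 5_800           # responsive documents found in the production

recall = responsive_produced / estimated_responsive_total
print(f"Estimated recall: {recall:.0%}")   # roughly 43 percent
```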

Surprisingly, the commissioner didn’t raise that point. Rather, he challenged the production as inadequate because not all of the documents found through keyword search were included. I wonder why they missed what seemed to be an obvious and more compelling argument. And why couldn’t the producing party find more than 43 percent of the relevant documents?

I believe one reason they couldn’t find more relevant documents is that they used a TAR 1.0 system trained on randomly selected seeds. When there are only a few relevant documents in the collection, it becomes hard to find enough relevant seeds to train the algorithm.

Do you think it would have been different using a TAR 2.0 engine that incorporated continuous learning into the process?

Tom Gricks: I think it absolutely would have been different if this review were done with a TAR 2.0 tool. The simple fact is that continuous learning, the backbone of a TAR 2.0 tool, is much more effective than TAR 1.0 in low richness situations.

In their paper, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, Maura Grossman and Gordon Cormack evaluated the performance of two TAR 1.0 protocols against what they called a “Continuous Active Learning” or “CAL” protocol for collections for which the richness ranged from 0.25% to 3.92%. In nearly every case, it took many, many more documents for the TAR 1.0 protocols to achieve a reasonable recall. Continuous active learning is much more efficient and effective in finding the responsive documents in these low richness collections.

And that makes sense to me. If you consider what was done here, you see that the first two rounds of training consisted of 1,000-document random samples. Since richness was only roughly 3 percent, there were probably only about 60 positive training documents in both sets combined. Even if the remaining 12 sets (another 1,200 documents) had a higher richness, there simply weren’t a lot of positive training documents among the 3,200 that the commissioner reviewed, which hindered the training of the tool.
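
Roughly, and assuming the later sets were about 100 documents each, the training-set arithmetic looks like this:

```python
# Rough training-set arithmetic under the assumptions above (approximate
# figures inferred from the protocol, not exact counts).
richness = 0.03                    # ~3 percent of the collection is responsive
random_training = 2 * 1_000        # two 1,000-document random training samples
later_training = 12 * 100          # twelve later sets, assumed ~100 documents each

expected_positive_seeds = random_training * richness              # ~60 positive documents
total_reviewed_for_training = random_training + later_training    # 3,200 documents
print(f"~{expected_positive_seeds:.0f} positive seeds among "
      f"{total_reviewed_for_training:,} training documents")
```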

Because CAL relies on relevance feedback, and proactively looks for documents like the ones coded as positive, it can be much more effective with low richness collections. Gordon Cormack once compared CAL to a bloodhound—once you put it on the scent of what you are looking for, it follows the scent from document to document to continually find more.
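
For readers who like to see the mechanics, here is a minimal sketch of a continuous active learning loop. It uses a generic scikit-learn classifier and a placeholder review function purely for illustration; it is not the tool used in Dynamo Holdings or any vendor’s actual implementation.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, review_one, batch_size=100, review_budget=5_000):
    """Sketch of a CAL loop: train on reviewer decisions, surface the
    highest-scoring unreviewed documents, and retrain after every batch."""
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(documents)   # simple bag-of-words features
    labels = {}  # doc index -> True (responsive) / False, as coded by the reviewer

    # Seed with randomly chosen documents until both classes appear
    # (in practice the seeds might come from keyword hits instead).
    while len(set(labels.values())) < 2 and len(labels) < review_budget:
        idx = random.randrange(len(documents))
        if idx not in labels:
            labels[idx] = review_one(documents[idx])

    while len(labels) < review_budget:
        reviewed = list(labels)
        model = LogisticRegression(max_iter=1000)
        model.fit(X[reviewed], [labels[i] for i in reviewed])

        # Relevance feedback: queue the unreviewed documents the model now
        # scores as most likely responsive -- the "bloodhound" behavior.
        unreviewed = [i for i in range(len(documents)) if i not in labels]
        if not unreviewed:
            break
        scores = model.predict_proba(X[unreviewed])[:, 1]
        next_batch = [i for _, i in sorted(zip(scores, unreviewed), reverse=True)[:batch_size]]
        for idx in next_batch:
            labels[idx] = review_one(documents[idx])

    return labels  # every reviewed document and its coding decision
```

In practice, the stopping point would be driven by validation sampling rather than a fixed review budget, but the core loop of review, retrain and re-rank is the same.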

That leads me to another observation made by the court that I find interesting. In discussing the interplay between recall and precision, the court said:

Those numbers are often in tension with each other: as the predictive coding model is instructed to return a higher percentage of responsive documents, it is likely also to include more nonresponsive documents. Thus, when setting the recall rate at 95%, the Commissioner likewise chose a model that would return more nonresponsive documents (in this case, a precision rate of 3%).

In my experience, that’s not typically how continuous active learning operates. What do you think?

John Tredennick: I agree. When you are doing one-time training, there is no opportunity to improve the algorithm over the course of the review. With a continuous learning process, by contrast, the algorithm keeps learning as the review progresses.

If you have trained on only a couple thousand documents, the algorithm will have a much harder time distinguishing between relevant and non-relevant documents. As a result, you may end up with a 3 percent precision rate. That means you have to review 97 irrelevant documents for every three relevant ones. That isn’t my idea of a good process. Indeed, it sounds more like what you would expect in a linear review.

When you use a TAR 2.0 system, the algorithm is continuously learning during the review. As it gets smarter about relevance, it keeps pushing the likely relevant documents to the front of the review. As a result, the review team sees a much higher percentage of relevant documents, and precision can improve with each new batch. For a low richness collection like this, we might expect to see 50 percent precision: one relevant document for each non-relevant one you review.
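
The review-effort implication of those precision figures is easy to see with a little arithmetic:

```python
# Rough arithmetic on review effort at the precision levels mentioned above.
for precision in (0.03, 0.50):
    nonrelevant_per_relevant = (1 - precision) / precision
    print(f"At {precision:.0%} precision, roughly {nonrelevant_per_relevant:.0f} "
          f"non-relevant document(s) reviewed for every relevant one")
# 3% precision  -> ~32 non-relevant per relevant (i.e., 97 for every 3 relevant)
# 50% precision -> 1 non-relevant per relevant
```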

You can see this in the Grossman and Cormack study you cited: Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery. In one example labeled Matter 201, the total population was about 750,000 documents. To get to 75 percent recall using a TAR 1.0 approach (one-time training; 2,000 training documents), you would have to review over 284,000 documents. If, instead, you used a TAR 2.0 continuous learning protocol, the job would be done after reviewing only 6,000 documents. Precision using a TAR 2.0 system would be about 41 percent. Precision using a TAR 1.0 system would be less than 1 percent.
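
Those figures hang together, as a quick consistency check shows (the numbers are the approximate ones cited above, not recomputed from the study’s data):

```python
# Quick consistency check of the Matter 201 figures cited above (approximate
# numbers from the discussion, not recomputed from the study itself).
tar1_reviewed = 284_000     # documents reviewed to reach 75% recall, TAR 1.0
tar2_reviewed = 6_000       # documents reviewed to reach 75% recall, TAR 2.0 (CAL)
tar2_precision = 0.41

# Both protocols reach the same recall, so they find roughly the same number of
# relevant documents; precision is simply relevant-found / documents-reviewed.
relevant_found = tar2_reviewed * tar2_precision             # ~2,460 documents
implied_tar1_precision = relevant_found / tar1_reviewed     # ~0.9%, i.e. under 1%
print(f"~{relevant_found:,.0f} relevant; implied TAR 1.0 precision ~{implied_tar1_precision:.1%}")
```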

Moving on, here is another thing that puzzled me. According to the opinion, the technical professionals suggested that the commissioner review and tag another 1,000 documents as a validation sample to “test the performance model” but warned that the added review would be unlikely to improve the model. Likely tired of reviewing largely irrelevant documents, the commissioner declined to do this. Thus, there was no attempt to validate the results or to show the level of recall actually achieved.

If you don’t validate your results in some fashion, how do you know what level of recall you achieved? And, what did validation have to do with improving the model? The purpose of validation, at least as I understand it, is to demonstrate that there aren’t many relevant documents left in the discard (non-reviewed) pile.
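
One common way to do that is an elusion-style sample of the discard pile. The sketch below is generic, with hypothetical sample results; it is not the validation step contemplated in the Dynamo protocol:

```python
# Sketch of an elusion-style validation: sample the discard (non-produced)
# pile, review the sample, and estimate how many responsive documents remain.
# The inputs below are hypothetical, for illustration only.

def estimate_recall(produced_responsive, discard_size, sample_size, sample_responsive):
    """Point estimate of recall from a random sample of the discard pile."""
    elusion_rate = sample_responsive / sample_size      # responsive rate in the discard
    estimated_missed = elusion_rate * discard_size      # responsive documents left behind
    return produced_responsive / (produced_responsive + estimated_missed)

# e.g., 5,800 responsive produced, ~227,000 documents not produced,
# 1,000 of them sampled, 30 of the sample coded responsive:
print(f"{estimate_recall(5_800, 227_000, 1_000, 30):.0%}")   # ~46 percent recall
```

A point estimate like this still needs a confidence interval around it before anyone relies on it, but it at least tells you whether the discard pile is hiding a meaningful number of responsive documents.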

What do you make of it?

Tom Gricks: That was indeed an odd and confusing part of the protocol. The way I read it, I think you are right. It looks to me like they were planning to use that final validation sample in pretty much the same way they apparently used the second 1,000-document sample. That is, they would review and code the sample to further train the tool, and then run the tool against that same sample to see how well the model was performing.

Using the same set to both train and validate the tool violates everything I have ever heard about the proper use of a control set. Every expert I know stresses the importance of keeping the control set separate from the training documents. Going further, I don’t see how they could use the control set for “validation.” So nothing about this validation step in the protocol seems appropriate.

In my opinion, validation is the most important step in the TAR process, which really takes us back to the two myths raised by Judge Buch. First, human review is not perfect or the “gold” standard. Second, no process produces perfect results, even a TAR process. Rather, your obligation is to ensure that your response constitutes a reasonable inquiry under the Federal Rules of Civil Procedure—not perfect, but reasonable. You do that with an appropriate sampling protocol to confirm production of a reasonable quantum of responsive documents, as you noted.

Frankly, I don’t think that was done here, and I’m not sure I agree that this production satisfied the Federal Rules. From the numbers we have seen in the filings, the TAR recall seemed no better than an average linear review, perhaps worse. Maybe that’s good enough here. We can’t tell without a proportionality analysis. But as a general matter, TAR should be performing much better than linear review.

Ultimately, John, this case is puzzling to me. On one hand, I agree wholeheartedly with the principles underlying Judge Buch’s opinion and his commentary on the two myths of review. On the other, I simply do not see how the protocol used, the training methods and the validation process made sense. And the results from using TAR in this case did not seem to be anywhere near the level of effectiveness achievable with modern TAR tools.

John Tredennick: I am with you. I have no doubt there is more to the story than we see in the opinions we could access, but a process that returns 43 percent recall would not seem to meet the obligation to take reasonable steps to find and produce relevant documents. I can’t understand why the commissioner didn’t raise that issue, since he was clearly unhappy with the results. Like Judge Buch, I would not hold a TAR process to a higher standard than any other discovery process, but this response wouldn’t pass muster even for a linear review.

2 thoughts on “A Discussion About Dynamo Holdings: Is 43% Recall Enough?”

  1. Larry Briggi

    Great discussion, gentlemen.
    With the learning curve the industry has been going through over the last 5+ years regarding TAR and all the various methods and options, it is understandable that opposing parties would want to know how it was applied and what the numerical results were. But ultimately, it is the responding party’s burden to produce responsive documents whether or not TAR is used. One of TAR’s advantages, and maybe its biggest drawback, is that it does provide numbers.
    “Perfect” has likely never been achieved in a linear document review, but the numbers TAR provides beg the question “What success rate is acceptable?” Barring factors we may not be aware of, I too would question a result of 43% recall.

  2. Bill Dimm

    I don’t agree that they hit only 43% recall. Assuming that they made their measurement competently on a sample of 1,000 random documents when choosing their relevance score cutoff aiming to hit 95% recall, I believe they hit between 80% and 99% recall (since there would only be about 33 relevant documents in the sample, this is roughly a +/-15% recall measurement, not the usual +/-5%).

    The producing party seems to think that there are 13,374 responsive documents (since hitting 95% recall would involve finding 12,705 responsive documents according to their table) in the full population of 406,939 documents. Assuming that they got this number by finding the prevalence to be 3.3% on a sample of 1,000 documents, the confidence interval for the prevalence is 2.3% to 4.6%, so there are somewhere between 9,360 and 18,719 responsive documents in the full population. If they produced only 5,797 responsive documents, their recall is between 31% and 62% with 95% confidence.

    But did they produce only 5,797 responsive documents? 5,797 is the number of documents that the requesting party chose to keep. If there was any inconsistency between how the 180,000 produced documents were reviewed by the requesting party and how the 1,000-document sample was reviewed, things could be way off. For example, in the 1,000-document sample, marginally responsive documents may have been tagged as responsive (an effort to ensure that everything that is even somewhat relevant gets produced), whereas the requesting party may have chosen not to keep marginally responsive documents (thus, 5,797 would not be all of the responsive documents produced – it excludes the marginally responsive ones).

    The fact that the calculation giving 43% (or 31% to 62%) recall involves comparing numbers from two different reviews makes it much more susceptible to bias than the single review of 1,000 documents giving 95% (or 80% to 99%) recall using the “direct recall” method.
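
    For anyone who wants to reproduce the interval arithmetic above, here is a short sketch using an exact Clopper-Pearson binomial interval; the inputs are the assumed counts from this comment (33 responsive documents observed in a 1,000-document sample):

```python
# Sketch reproducing the interval arithmetic above, assuming 33 responsive
# documents observed in a 1,000-document sample (exact Clopper-Pearson
# binomial confidence interval, computed with SciPy).
from scipy.stats import beta

def clopper_pearson(successes, n, confidence=0.95):
    """Exact two-sided binomial confidence interval for a proportion."""
    alpha = 1 - confidence
    lo = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lo, hi

population = 406_939
produced_responsive = 5_797

lo, hi = clopper_pearson(33, 1_000)
print(f"Prevalence: {lo:.1%} to {hi:.1%}")                                        # ~2.3% to 4.6%
print(f"Responsive documents: {lo * population:,.0f} to {hi * population:,.0f}")  # ~9,360 to ~18,719
print(f"Recall: {produced_responsive / (hi * population):.0%} to "
      f"{produced_responsive / (lo * population):.0%}")                           # ~31% to ~62%
```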

    Even a poorly performing TAR 1.0 system would be expected to hit more than 43% recall if 44% of the population (180,000 / 406,939) was produced. A system that performs worse than producing documents randomly would be quite an anomaly.

