TAR for Smart Chickens

Special Master Grossman offers a new validation protocol in the Broiler Chicken Antitrust Cases

Validation is one of the more challenging parts of technology assisted review. We have written about it, and the attendant difficulty of proving recall, several times.

The fundamental question is whether a party using TAR has found a sufficient number of responsive[1] documents to meet its discovery obligations. For reasons discussed in our earlier articles, proving that you have attained a sufficient level of recall to justify stopping the review can be a difficult problem, particularly when richness is low.

Special Master Maura Grossman recently issued an Order crafting a new validation protocol in In re Broiler Chicken Antitrust Litigation (Jan. 3, 2018), which is currently pending in the Northern District of Illinois. You can download a copy of the Order here.

While the Order was issued in the context of what seems to be document intensive litigation, the validation method it offers is important because it could work for other matters, whether the review is based on TAR 1.0, TAR 2.0 or even a simple linear review.[2]

Broiler Chickens?

This matter involves antitrust claims brought against a dozen or so poultry producers from around the country who were raising broiler chickens. A quick trip to Wikipedia tells us that broiler chickens are a gallinaceous domesticated fowl, bred and raised specifically for meat production. Typical broilers have white feathers and yellowish skin.  Brock, brock, brock.

Early on in this case, the court appointed Maura Grossman to act as Special Master for discovery issues. (Order here.) One of her first steps was to work out an ESI protocol, a copy of which can be found here.

From there, and presumably after much discussion among the parties, Special Master Grossman issued her “Order Regarding Search Methodology for Electronically Stored Information.” The Order covers a wide range of topics ranging from deduplication and threading to culling and keyword search. Our focus in this article will be on its validation protocol (Part III of the Order).

The Goal

Special Master Grossman began her validation discussion with a simple affirmation of the twin goals for a production review:

The review process should incorporate quality-control and quality-assurance procedures to ensure a reasonable production consistent with the requirements of Federal Rule of Civil Procedure 26(g).

She articulated the following validation protocol that applies to all documents “identified for review for responsiveness and/or privilege following the application of keywords or other culling criteria.” In doing so, she presumed that the parties had agreed that the collection process for that review had been established as complete and adequate.

The completeness of the review will be assessed by estimating recall.  Recall measures how much of the relevant material in the entire collection has actually been found and coded.  To calculate recall, you simply divide the number of relevant documents that have been coded by the total number of relevant documents in the collection.  So if you’ve coded 15,000 relevant documents and there are 20,000 total relevant documents in the collection, the review has 75% recall (15,000/20,000).

In order to calculate recall without looking at every single document in the collection, we’re going to have to do some sampling to arrive at a recall estimate.  We know how many relevant documents have been coded, but we don’t know how many relevant documents are in the entire collection.  We can easily get a reasonable estimate of the total number of relevant documents by sampling the documents that haven’t been reviewed (the “discard pile”) for relevance, and then adding in all of the relevant documents that have been coded (making slight adjustments for coding errors).
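The estimate described above can be sketched in a few lines of Python. This is only an illustration: the function names are ours, the discard pile size and sample counts are hypothetical numbers chosen to reproduce the 75% example, and the sketch ignores the slight adjustments for coding errors.

```python
def estimate_total_relevant(coded_relevant, discard_size, sample_size, sample_relevant):
    """Coded relevant documents plus the relevant documents projected
    from a random sample of the discard pile."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    projected_in_discard = sample_relevant * discard_size / sample_size
    return coded_relevant + projected_in_discard


def recall(coded_relevant, total_relevant):
    """Share of all relevant documents that the review actually found."""
    return coded_relevant / total_relevant


# Hypothetical inputs: 15,000 relevant documents coded, a 100,000-document
# discard pile, and 100 relevant documents found in a 2,000-document sample.
total = estimate_total_relevant(15_000, discard_size=100_000,
                                sample_size=2_000, sample_relevant=100)
print(round(total))           # 20000
print(recall(15_000, total))  # 0.75
```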

Creating a Validation Sample

As a first step toward estimating recall, Special Master Grossman divided the review documents into three categories:

  1. C1: Documents identified as responsive by the review. This does not include non-responsive family members.
  2. C2: Documents coded as non-responsive by a human.
  3. C3: Documents excluded from manual review by a TAR system as non-responsive (also called the discard pile or the null set).

She then set forth a sampling protocol requiring the validation team to select 500 documents randomly from C1 (sample D1) to estimate the precision of the responsiveness coding, 500 from C2 (sample D2) to estimate the number of false negatives, and 2,000 from C3 (sample D3) to estimate the number of responsive documents left in the discard pile. The total Validation Sample (D1, D2 and D3) would be 3,000 documents.

The 3,000 documents selected randomly from C1, C2 and C3 are then combined into a single “Validation Sample.” Previous tagging (responsive, non-responsive) must be hidden from view along with information showing from which sample the document came. The goal is that a reviewer should not be able to tell anything about the prior tagging history of the new Validation Sample. The review would thus be “blind.”
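One way to picture the blind sample is the short Python sketch below. It is not taken from the Order: the function name, the document IDs and the pool sizes are hypothetical, and a real review platform would handle this internally. It draws the three random samples, shuffles them into one review queue, and keeps the answer key (which sample each document came from) separate from what the SME sees.

```python
import random


def build_blind_validation_sample(c1_ids, c2_ids, c3_ids,
                                  sizes=(500, 500, 2000), seed=None):
    """Return (review_queue, answer_key).

    review_queue: shuffled document IDs with no hint of their origin.
    answer_key:   doc ID -> "D1"/"D2"/"D3", withheld from the SME.
    Assumes document IDs are unique across the three pools.
    """
    rng = random.Random(seed)
    answer_key = {}
    pools = (("D1", c1_ids), ("D2", c2_ids), ("D3", c3_ids))
    for (label, pool), size in zip(pools, sizes):
        pool = list(pool)
        # Simple random sample without replacement from each category
        for doc_id in rng.sample(pool, min(size, len(pool))):
            answer_key[doc_id] = label
    review_queue = list(answer_key)
    rng.shuffle(review_queue)  # interleave D1, D2 and D3 documents
    return review_queue, answer_key


# Hypothetical pools, for illustration only
c1 = [f"C1-{i}" for i in range(20_000)]
c2 = [f"C2-{i}" for i in range(50_000)]
c3 = [f"C3-{i}" for i in range(300_000)]
queue, key = build_blind_validation_sample(c1, c2, c3, seed=7)
print(len(queue))  # 3000
```

In practice the IDs shown to the SME should not reveal the pool either; the prefixes here only make the example easy to check.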

Creating the Composite Validation Sample

SME Review

The next step is to select one or more “subject matter experts” or “SMEs” to review the Validation Sample. There has been a lot of talk about who qualifies as an SME—a senior lawyer, a junior partner, a sharp associate?  But titles shouldn’t matter. For this protocol, an SME is defined as “someone knowledgeable about the subject matter of the litigation.” Grossman specifies that it should be “an attorney who is familiar with the RFPs and the issues in the case.”

She also reiterated that the review must be blind, which is a key requirement to meet her first objective of quality control:

During the course of the review of the Validation Sample, the SME shall not be provided with any information concerning the Subcollection or Subsample from which any document was derived or the prior coding of any document. The intent of this requirement is to ensure that the review of the Validation Sample is blind; it does not preclude a Party from selecting as SMEs attorneys who may have had prior involvement in the original review process.

That the SMEs can have prior involvement in the review process is important. It means that, during the review process, the producing party can make sure the review team has the same view on responsiveness as the SMEs. If the validation team is divorced from the process, they may have a markedly different view of responsiveness. In that case the resulting recall calculations may be different, and possibly lower, than expected.

Harkening back to Grossman’s first goal for the process, blind tagging by the SMEs provides an important QC step to help ensure that documents marked responsive are responsive and documents marked non-responsive are not responsive.


We now move to part three of the process, the final validation step.

After the SME has finished reviewing and tagging the Validation Sample, the responding party is directed to create a table showing the following information for each document in the Validation Sample:

  1. The Bates number for the document.
  2. The sample group from which the document came (D1, D2 or D3).
  3. The SME’s responsiveness coding.

In addition, the party is directed to provide a copy of each non-produced (non-privileged) document found in the sample. These would come from sample sets D2 (marked non-responsive by the review team) and D3 (unreviewed documents).

The final step in the process is to calculate recall. This is done through a relatively simple formula set forth in the Appendix to the Order. For a TAR review, it works like this:

Relevant docs found = % relevant found in D1 X C1 (responsive)

Relevant docs miscoded = % relevant found in D2 X C2 (non-responsive)

Relevant docs not reviewed = % relevant found in D3 X C3 (non-reviewed)

Thus, if the SMEs marked 450 out of the 500 documents in sample D1 responsive, that comes to 90%. To estimate the actual number of responsive documents in C1 (produced as responsive), you simply multiply the total number produced by the estimated percentage that are actually responsive.

For example, if 20,000 documents were produced as responsive, we would now estimate—based on the SME’s coding—that only 18,000 of them were actually responsive (90% of 20,000).

You can make the same calculations to determine how many documents marked non-responsive were miscoded and should therefore be counted as unfound responsive documents.  And finally, how many responsive documents remain in C3, the non-reviewed set (discard pile or null set).

The end game here is to determine the estimated recall of the production. Is it 70% or greater? 80%? 90%? Or perhaps below 60%?

The formula to make this estimate is now simple based on the figures determined above. To determine the percentage of responsive documents produced, we simply calculate:

# Relevant Docs Found / (# Relevant Found + # Relevant Miscoded + # Relevant Not Reviewed)

Going back to our simple example, let’s use 18,000 as the number of relevant documents found (produced). Let’s use 1,000 as the number of relevant documents miscoded. And, let’s use 3,000 as the number of relevant docs not reviewed.

Our calculation comes out like this: 18,000 / (18,000 + 1,000 + 3,000)

That comes to 18,000 / 22,000, which suggests the party has produced about 82% of the responsive documents. That is a recall of roughly 82%.
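The arithmetic above can be reproduced with a short Python sketch. The function names are ours, not the Order's. The article gives only the projected counts for the miscoded and non-reviewed documents (1,000 and 3,000), so the C2 and C3 sizes and sample hits below are hypothetical values picked to yield those counts.

```python
def projected_responsive(sample_responsive, sample_size, pool_size):
    """Project the SME's sample coding onto the full category:
    (% responsive in the sample) x (size of the category)."""
    return pool_size * sample_responsive / sample_size


def estimate_recall(found, miscoded, not_reviewed=0.0):
    """Found / (Found + Miscoded + Not Reviewed).
    Leave not_reviewed at 0 for a linear review, which has no C3."""
    return found / (found + miscoded + not_reviewed)


found = projected_responsive(450, 500, 20_000)           # 18,000
miscoded = projected_responsive(10, 500, 50_000)         # 1,000 (hypothetical C2)
not_reviewed = projected_responsive(20, 2_000, 300_000)  # 3,000 (hypothetical C3)

print(f"{estimate_recall(found, miscoded, not_reviewed):.1%}")  # 81.8%
```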

Simple enough? Yes, once you work through the protocol.

Applying the Protocol to a Linear Review

For parties choosing to do a linear review, Grossman directed that 2,500 documents be sampled from C2 (to make the Validation Sample equal for TAR and linear reviews). In a linear review, there would be no documents to validate in category C3.

Special Master Grossman also set out the formula for calculating recall in a linear review. You simply make the first and second calculations from above.

Relevant docs found = % relevant found in D1 X C1

Relevant docs miscoded = % relevant found in D2 X C2

Once you have those figures, you can quickly calculate recall:

# Relevant Docs Found / (# Relevant Found + # Relevant Miscoded)

Built In Flexibility

There is a lot to say about this validation protocol, but for starters know that it goes beyond simple recall percentages. To the contrary, Special Master Grossman states emphatically:

An estimate of recall shall be computed to inform the decision-making process . . . however, the absolute number in its own right shall not be dispositive of whether or not a review is substantially complete. Also of concern is the novelty and materiality (or conversely, the duplicative or marginal nature) of any responsive documents identified in Subsamples D(2) and/or D(3).

She goes on to state:

It should be noted that, when conducted by an SME . . . a recall estimate on the order of 70% to 80% is consistent with, but not the sole indicator of, an adequate (i.e., high-quality) review. A recall estimate somewhat lower than this does not necessarily indicate that a review is inadequate, nor does a recall in this range or higher necessarily indicate that a review is adequate; the final determination also will depend on the quantity and nature of the documents that were missed by the review process.

Thus, achieving estimated recall of 75% might suggest an adequate review, but if the SME finds a number of important documents marked as non-responsive or in the discard pile, the review might not be adequate. Conversely, if the newly identified responsive documents are marginal in importance or duplicative of other produced documents, then it is likely that the review will be deemed adequate if it achieves a reasonable degree of recall.

What do we think?

What do we make of this new protocol? It certainly meets Special Master Grossman’s goals for a reasonable production in that it provides: 1) an independent QC process to supplement  the general review QC process; and 2) a straightforward method to estimate recall and provide validation that the production process was reasonable. We are not aware of any other validation process, either proposed or used in reported cases, that covers both bases as well as this methodology.

In that regard, let’s talk about having the SME do a blind review of the Validation Sample, specifically the requirement that the validating SME not know from which pool (C1, C2, or C3) a particular document was drawn. The simple reason is this: No matter how careful or how professional the SMEs are, if they know that a document is from C1 (the documents already tagged as responsive), they may be subtly (and even unconsciously) influenced to confirm that it is indeed responsive. And conversely, if the validating SME knows that it is from the discard pile, there may be an unconscious desire to mark it non-responsive. This would weaken, if not nullify, the validation, because they would just be confirming what has already happened.

How about drawbacks? There is a lot to think about here and we plan to offer follow-on analysis through future articles or perhaps a webinar. But, for starters we offer three thoughts.

1. The Cost: The protocol requires that SMEs review at least 3,000 documents. If we assume that the SME reviews 60 documents an hour and bills at a middling $450 per hour, it could cost at least $22,500 to validate the review. Is that reasonable?

It could be, at least for large productions. But note that the process may need to be repeated. If the SME disagrees with the reviewers’ interpretation (perhaps tagging more documents in C2 and C3 as responsive than the review team might), then recall will be lower than expected. If recall dips much below 70%, review might need to continue and the validation process would have to be repeated.

Also, Grossman didn’t mandate that the SME be a senior member of the trial team. A good lawyer reviewer who understands the case and RFPs might be able to do the blind review. If so, the hourly rate might drop to, say, $80. In that case the validation might cost more like $4,000.
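The cost estimate is simple arithmetic, sketched below in Python so the assumptions are easy to vary. The function name is ours, and the review speed and hourly rate are the assumptions stated above rather than figures from the Order.

```python
def validation_cost(num_docs, docs_per_hour, hourly_rate):
    """Estimated cost of having an SME review the Validation Sample."""
    hours = num_docs / docs_per_hour
    return hours * hourly_rate


# 3,000 documents at 60 documents an hour is 50 hours of SME time
print(validation_cost(3_000, docs_per_hour=60, hourly_rate=450))  # 22500.0
```

Because the cost scales linearly with both sample size and hourly rate, a lower-cost reviewer or a smaller negotiated sample reduces it proportionally.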

2. Small Cases: Will this process work for smaller cases? We have reported success in TAR projects involving as few as 10,000 documents. You can read about it here.

Let’s work through an example. Say you have 10,000 documents to review for possible production. Going further, assume your team reviews 3,000 documents before stopping, with 1,000 of them marked responsive. That leaves another 7,000 documents in the discard pile.

If we work through the protocol, an SME would be required to review another 3,000 documents, which would double the size of the review. Using reviewer rates (say $50 an hour), the cost of the initial review would be about $2,500. If SMEs then tagged another 3,000 docs, the cost would at least double. Is that reasonable?

Special Master Grossman put a provision in her Order that might provide relief in this situation. She stated:

Should a producing Party believe that the sample sizes . . . would be disproportionate or unduly burdensome under the circumstances, that Party shall promptly raise the issue with the requesting Party. To the extent a dispute remains concerning the sample sizes to be used after good faith negotiations have occurred, either Party may request the assistance of the Special Master in resolving such dispute.

In this case, we might suggest that the sample sizes be much smaller. Our proposal might be that the SMEs review 350 docs each from C1 and C2 and 500 from C3. This would come to a total of 1,200 documents, which seems more proportionate for a population of this size.

3. The Statistics: What do we make of the underlying validation statistics? This is beyond the scope of this article but let us note the obvious. The protocol creates three samples and each resulting estimate comes with a margin of error. What happens to the calculations if we consider the weakest numbers from each sample?

We talked about this problem in depth in these articles: Measuring Recall in E-Discovery Review, Part Two: No Easy Answers; Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize.

Just realize that when you consider margins of error, recall estimates can go from a reasonable percentage to something that seems less reasonable. For example, a sample suggesting a 75% recall may drop to 50% when you use the low values from the positive sample and the high value from the non-reviewed sample.
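To see how margins of error can move the estimate, here is a rough Python sketch. It is our illustration, not part of the Order: it assumes a 95% Wilson score interval for each sample proportion (one common choice among several), uses hypothetical sample counts and category sizes, and combines the extreme ends of all three intervals at once, which overstates the real uncertainty.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a sample proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def recall_range(samples):
    """samples maps "D1"/"D2"/"D3" to
    (responsive_in_sample, sample_size, category_size).
    Returns (pessimistic_recall, optimistic_recall)."""
    bounds = {}
    for name, (hits, n, pool) in samples.items():
        low, high = wilson_interval(hits, n)
        bounds[name] = (low * pool, high * pool)
    # Pessimistic: fewest found, most miscoded, most left in the discard pile
    worst = bounds["D1"][0] / (bounds["D1"][0] + bounds["D2"][1] + bounds["D3"][1])
    # Optimistic: most found, fewest miscoded, fewest left behind
    best = bounds["D1"][1] / (bounds["D1"][1] + bounds["D2"][0] + bounds["D3"][0])
    return worst, best


# Hypothetical counts and category sizes, for illustration only
samples = {
    "D1": (450, 500, 20_000),
    "D2": (10, 500, 50_000),
    "D3": (20, 2_000, 300_000),
}
worst, best = recall_range(samples)
print(f"point estimate: {18_000 / 22_000:.1%}")
print(f"pessimistic: {worst:.1%}, optimistic: {best:.1%}")
```

The spread between the pessimistic and optimistic figures is driven mostly by the discard pile sample, because a small change in its responsive rate is multiplied by the largest category.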

Could that be the case here? We don’t believe so and we support the approach Special Master Grossman advocates. As we have noted in the past, placing too much credence on the ends of the margin of error range might make validation disproportionate, which doesn’t benefit anyone.

The bottom line is we commend Special Master Grossman for her validation protocol. It provides a new approach to a difficult issue, it adds a valuable QC component to the process and it provides a practical solution for a difficult problem.

We don’t know if the chickens have come home to roost but we think a few might have.

By John Tredennick and Jeremy Pickens*

The comments made in this post represent the opinions of the authors and do not necessarily reflect the views of Catalyst Repository Systems or any of its other employees, clients or affiliates.

1. Typically we use “relevant” to refer to the positive documents found during a review because the documents are relevant to the inquiry that spawned the review. In this case, the validation method is for production documents so the word “responsive” seems more appropriate.

2. For the record, this validation methodology works for keyword culling validation as well.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” named to the FastCase 50 as a legal visionary and named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.


About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As Chief Scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.