Ask Catalyst: How Can You Validate Without A Control Set?

[This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]

We received this question:

I hear you don’t use a control set in your TAR 2.0 processes. If so, how can you validate your results?

Today’s question is answered by Thomas Gricks, managing director of professional services.

This question really took me by surprise the first time someone raised it. Insight Predict doesn’t use a control set, and never has, so I didn’t understand how a control set could be used for validation in the first place. Without more input from folks who are still using control sets, most of whom come from a TAR 1.0 mindset, I didn’t really have an answer. So the next time the question was raised, I decided to investigate control sets and find out how they might be used for validation.

I am going to answer this question in two parts. Today, I am going to discuss control sets and why they are not accurate tools for validating results. Next week, I will explain the procedure Catalyst uses for validation, which does not involve control sets.

(Let me begin by thanking Meesun Yang of Paul Hastings LLP for giving me an overview of the use of control sets.)

Let’s start by looking at what validation is, because validation actually has a precise application within the TAR process. According to The Grossman-Cormack Glossary of Technology-Assisted Review, validation is “the act of confirming that a process has achieved its intended purpose.” In the typical TAR context, then, validation is the act of confirming that the TAR process has achieved a target recall level — nothing more and nothing less.
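To make “target recall” concrete: recall is simply the fraction of all relevant documents in the collection that the process actually found. A quick sketch, with purely hypothetical numbers:

```python
def recall(found_relevant: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that the process retrieved."""
    return found_relevant / total_relevant

# Hypothetical matter: the collection holds 10,000 relevant documents,
# and the TAR process surfaced 7,600 of them.
print(f"recall = {recall(7_600, 10_000):.0%}")  # prints "recall = 76%"
```

Validating a TAR process, in the Grossman-Cormack sense, means confirming that this number meets the target — say, 75 or 80 percent — and nothing more.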

Now, consider the purpose of a control set. If you look at the Grossman-Cormack Glossary, a “control set” in TAR jargon is a very specific group of documents, having a very specific purpose:

Control Set: A Random Sample of Documents coded at the outset of a search or review process that is separate from and independent of the Training Set. Control Sets are used in some Technology-Assisted Review processes. They are typically used to measure the effectiveness of the Machine Learning Algorithm at various stages of training, and to determine when training may cease.

By definition, then, the control set is coded at the outset of the review and is intended to measure the effectiveness of the training as it progresses, but not necessarily to confirm that the TAR process has in fact achieved a target recall level at the end of the review.

This is consistent with my understanding of how a control set is used in practice. First, the control set is coded. Each document in the control set is marked as either relevant or nonrelevant. Then, as the algorithm is being trained, the effectiveness of the algorithm is periodically measured by looking at how well the algorithm predicts and matches the relevant and nonrelevant calls in the control set. Once the algorithm reaches an acceptable level of consistency (e.g., it correctly predicts 80 percent of the coding calls in the control set), or the algorithm ceases to improve (as measured against the control set), training concludes and the tool is used to generate the final set of relevant documents. This is what we at Catalyst have come to refer to as a typical “TAR 1.0” workflow — train to stability and then stop.
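That train-to-stability loop can be sketched in a few lines. Everything here is invented for illustration — the `ToyClassifier`, its plateau behavior, and the 80-percent/stability stopping rules are stand-ins for a generic TAR 1.0 tool, not a description of any real product:

```python
import random

class ToyClassifier:
    """Stand-in for a TAR 1.0 learning algorithm: each training round
    nudges its agreement with the human coding upward, with diminishing
    returns (purely illustrative dynamics)."""
    def __init__(self):
        self.skill = 0.5  # starting agreement rate with human calls

    def train_round(self):
        self.skill += (0.85 - self.skill) * 0.3  # approaches a plateau

    def predict(self, doc_is_relevant: bool) -> bool:
        # Agrees with the human call `skill` fraction of the time.
        return doc_is_relevant if random.random() < self.skill else not doc_is_relevant

# Step 1: the control set is coded at the outset -- here, 200 documents
# randomly marked relevant (True) or nonrelevant (False).
random.seed(7)
control_set = [random.random() < 0.3 for _ in range(200)]

# Step 2: train, periodically measuring agreement against the control set.
clf = ToyClassifier()
prev_agreement = 0.0
for round_no in range(1, 21):
    clf.train_round()
    agreement = sum(clf.predict(d) == d for d in control_set) / len(control_set)
    # Step 3 (TAR 1.0 stopping rule): good enough, or no longer improving.
    if agreement >= 0.80 or abs(agreement - prev_agreement) < 0.005:
        break
    prev_agreement = agreement

print(f"stopped after round {round_no} with {agreement:.0%} control-set agreement")
```

Once the loop exits, the trained model is simply run over the collection to generate the production set — which is exactly where the validation question arises.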

Although some users refer to this process as validation, it really isn’t. It may validate the effectiveness of the algorithm with reference to the control set, but it doesn’t confirm the level of recall in the production set. In fact, William Webber has studied this precise issue and has concluded that “a control sample used to guide the producing party’s process cannot also be used to provide a statistically valid estimate of that process’s result.” (See The Bias of Sequential Testing in Predictive Coding.)

Instead, Webber determined that the level to which the algorithm characterizes the control set will inevitably overstate the level of recall in the final production set. In other words, if the algorithm correctly identifies 80 percent of the relevant documents in the control set, the recall level in the final production set is most likely something less than 80 percent. According to Webber, “if a statistically valid estimate of true production effectiveness is required … then a separate certification sample must be made.”
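The statistical effect behind this conclusion can be shown with a small simulation (hypothetical numbers; this sketches the sequential-testing bias itself, not any real TAR tool): if you re-check the same control set round after round and stop the moment the estimate crosses your target, the estimate you stop on is systematically higher than the true recall.

```python
import random

def stopping_estimate(true_recall=0.75, control_relevant=100,
                      threshold=0.80, max_rounds=50):
    """Simulate training rounds in which recall is estimated each round
    from the same control set, stopping the first time the estimate
    crosses the threshold. Returns the estimate at stopping."""
    estimate = 0.0
    for _ in range(max_rounds):
        # Noisy estimate: how many of the control set's relevant
        # documents the classifier happens to find this round.
        found = sum(random.random() < true_recall
                    for _ in range(control_relevant))
        estimate = found / control_relevant
        if estimate >= threshold:
            break  # "stop training" -- on an unusually lucky estimate
    return estimate

random.seed(42)
runs = [stopping_estimate() for _ in range(2_000)]
mean_estimate = sum(runs) / len(runs)
print(f"true recall: 0.75, mean control-set estimate at stopping: {mean_estimate:.3f}")
```

Even though true recall is fixed at 75 percent, the average estimate at the moment of stopping comes out above the 80 percent threshold, because stopping is triggered precisely by the rounds where sampling noise runs high — which is why Webber insists on a separate certification sample.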

In fact, there are a number of reasons why control sets are problematic. Our chief scientist Jeremy Pickens has written on this blog about the problems with control sets, and his post was a follow-up to a post by Ralph Losey calling for the abolition of control sets in e-discovery.

As Dr. Pickens wrote, in a dynamic, continuous learning TAR 2.0 environment, control sets “are not only unnecessary but irrelevant.” Next week, in part two of this answer to your question, I will explain how we at Catalyst validate results.


About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.