Why Control Sets are Problematic in E-Discovery: A Follow-up to Ralph Losey

In a recent blog post, Ralph Losey lays out a case for the abolition of control sets in e-discovery, particularly when one is following a continuous learning protocol. Here at Catalyst, we could not agree more with this position. From the very first moment we rolled out our TAR 2.0 continuous learning engine, we have not only recommended against the use of control sets, we deliberately chose never to implement them in the first place, so we never risked steering clients awry.

Losey points out three main flaws with control sets. These may be summarized as (1) knowledge issues, (2) sequential testing bias, and (3) representativeness. In this blog post I offer my own take and evidence on each of these three points, and add a fourth difficulty with control sets: rolling collections.

Knowledge Issues

Knowledge issues arise because the attorney’s understanding of what the case is about is never complete at the very beginning of the TAR process. The attorney might call a document one way at the start and make the opposite call on that same document later in the process. It is therefore problematic to measure overall TAR progress against a control set whose judgments date from an earlier, less complete point of understanding.

Sequential Testing Bias

Sequential testing bias is a trickier topic, so I’ll leave the deeper explanations to Losey’s post and to William Webber, whom he also cites. In a nutshell, however, the concern is this: if you repeat a random test often enough, you will eventually observe an outcome (such as hitting a high recall stopping point) that occurred purely by chance.

This general concept is illustrated in a classic XKCD comic, popular in math and computer geek circles. There is no correlation between jelly beans (of any color) and acne, but if you repeat the test often enough, one outcome will, by chance alone, appear to show a correlation.
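The jelly-bean effect is easy to demonstrate with a small simulation. The sketch below is purely illustrative (the 20 colors, the 0.05 threshold, and the trial count are my assumptions, not anything from Losey's post): every individual test is run on data where no real effect exists, yet when you run 20 of them together, a "significant" result usually shows up somewhere.

```python
import random

random.seed(0)

ALPHA = 0.05    # conventional significance threshold
COLORS = 20     # jelly-bean colors tested; none is truly linked to acne
TRIALS = 10_000 # how many times we repeat the whole 20-color experiment

def null_study_is_significant():
    # Under the null hypothesis, a test crosses p < ALPHA with
    # probability ALPHA: a false positive by pure chance.
    return random.random() < ALPHA

# Count experiments in which at least one of the 20 null studies
# came up "significant" anyway.
hits = sum(
    any(null_study_is_significant() for _ in range(COLORS))
    for _ in range(TRIALS)
)
rate = hits / TRIALS
# Analytically this should land near 1 - (1 - 0.05)**20, about 0.64.
print(f"{rate:.2f}")
```

Roughly two out of three batches of 20 null studies produce at least one "green jelly bean." Repeatedly peeking at a control set during training is the same kind of repeated test, which is the heart of the sequential testing concern.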


Representativeness

Losey’s third critique is about whether a random control set sample is representative of the collection. This problem is of course more acute when richness is low, but even when richness is higher, is it really the case that a single random sample of 500, 1,000, or 1,534 documents (or whatever) in your control set is going to topically represent every single aspect of relevance, such that that one sample will be able to measure the progress of an entire TAR session?
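A quick back-of-the-envelope calculation shows why a single random sample can miss entire pockets of relevance. This is only a sketch under an assumed pocket size (a narrow sub-issue making up 0.1% of the collection; the number is mine, not from the source), using the standard probability that a simple random sample contains no document from such a pocket:

```python
def miss_probability(pocket_share: float, sample_size: int) -> float:
    """Chance that a simple random sample contains zero documents
    from a topical pocket occupying pocket_share of the collection."""
    return (1.0 - pocket_share) ** sample_size

# Sample sizes mirror the ones mentioned above; the 0.1% pocket is
# an illustrative assumption.
for n in (500, 1000, 1534):
    p_miss = miss_probability(0.001, n)
    print(f"sample={n:4d}  P(pocket completely unseen) = {p_miss:.2f}")
```

Even the 1,534-document sample misses a 0.1%-sized pocket about a fifth of the time, and a 500-document sample misses it more often than not. A control set that never saw a pocket cannot measure progress on it.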

At the DESI VI Workshop in June 2015, I published a paper, “An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress,” in which I examined the relationship between control set measurements and actual (real) progress toward high recall results. The paper is worth at least a quick skim due to its empirical nature, but the high-level takeaway is that control sets often do not correlate well with progress. And for more reading on why a random sample might not hit every topic, see this Catalyst blog post from July 2014.

In addition to my empirical tests showing the problem of using control sets to monitor progress, there is a logical argument to be made. It is this: if we really believe that a sampled control set topically represents a collection so completely that it can reliably detect training progress (i.e., that improvement on the control set will stop exactly when the full ranking or classification over the whole collection stops improving, and neither before nor after), why not simply use that control set itself as a one-shot training of the algorithm? If the control set really contains everything there is to know about relevance, such that it can detect any change in the ranking of relevant documents, why not just train on it once and be done?

Of course, when you turn it around like that, it becomes clear that a random sample-based control set would never be enough to train a whole system on. So why do we think that same control set would be capable of detecting changes, especially at high recall? (For more in-depth reading on random sampling and how well it works for training, see Gordon Cormack and Maura Grossman’s SIGIR 2014 paper, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” and in particular their study of full random sampling for training, aka simple passive learning.)
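The "especially at high recall" caveat can be made concrete with another rough sketch (the 1,000-document control set and 1% richness below are illustrative assumptions of mine): at low richness, a control set contains so few relevant documents that its recall estimates are far too coarse to detect whether training has actually stopped improving.

```python
import math

def control_set_resolution(sample_size: int, richness: float):
    """Rough resolution of recall estimates from a control set."""
    expected_relevant = sample_size * richness
    # Recall measured over the control set can only move in steps of
    # 1/expected_relevant: with 10 relevant docs, steps of 10%.
    step = 1.0 / expected_relevant
    # Normal-approximation standard error of a recall estimate near 80%,
    # with the control set's relevant docs as the denominator.
    se = math.sqrt(0.8 * 0.2 / expected_relevant)
    return expected_relevant, step, se

rel, step, se = control_set_resolution(sample_size=1000, richness=0.01)
print(f"expected relevant docs in control set: {rel:.0f}")
print(f"recall granularity:                    {step:.0%}")
print(f"std error of recall estimate near 80%: {se:.0%}")
```

With only around ten relevant documents in the control set, recall can only be read off in 10% increments, with a standard error on the order of 13 points. An instrument that blunt cannot tell you whether you have reached, say, 80% versus 95% recall.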

Rolling Collections

Finally, in addition to these three areas that Losey mentioned, there is one more area in which control set-based TAR protocols are problematic. This is a real-world problem rather than a philosophical or theoretical one. Except in rare circumstances, one does not manage to assemble one’s entire collection at the outset. If you have already created a control set and more documents arrive (e.g., from a new custodian), those new documents will not be represented by the control set. This is even worse than a random sample failing to hit subpockets of relevance: with rolling collections, the random sample never even had a chance to include any newly arrived document whatsoever. We delve into this issue in greater detail in Section 4 of our blog post from February 2014.


In summary, there are many reasons why control sets are problematic and why, in a dynamic, continuously learning environment, they are not only unnecessary but irrelevant. We are glad to see more voices in the industry joining what we’ve been saying for years, and we welcome the continued open discussion of protocols.

As for those of you who still believe in the necessity of a control set, watch out for those green jelly beans!


About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As Chief Scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.