Comparing Active Learning to Random Sampling: Using Zipf’s Law to Evaluate Which is More Effective for TAR

Maura Grossman and Gordon Cormack just released another blockbuster article,  “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’” 7 Federal Courts Law Review 286 (2014). The article was in part a response to an earlier article in the same journal by Karl Schieneman and Thomas Gricks, in which they asserted that Rule 26(g) imposes “unique obligations” on parties using TAR for document productions and suggested using techniques we associate with TAR 1.0 including:

Training the TAR system using a random “seed” or “training” set as opposed to one relying on judgmental sampling, which “may not be representative of the entire population of electronic documents within a given collection.”

From the beginning, we have advocated a TAR 2.0 approach that uses judgmental seeds (selected by the trial team using all techniques at their disposal to find relevant documents). Random seeds are a convenient shortcut to approximating topical coverage, especially when one doesn’t have the algorithms and computing resources to model the entire document collection. But they are neither the best way to train a modern TAR system nor the only way to eliminate bias and ensure full topical coverage. We have published several research papers and articles showing that documents selected via continuous active learning and contextual diversity (active modeling of the entire document set) consistently beat training documents selected at random.

In this latest article and in a recent peer-reviewed study (which we discussed in a recent blog post), Cormack and Grossman also make a compelling case that random sampling is one of the least effective methods for training. Indeed, they conclude that even the worst examples of keyword searches are likely to bring better training results than random selection, particularly for populations with low levels of richness.

Ralph Losey has also written on the issue recently, arguing that relying on random samples rather than judgmental samples “ignores an attorney’s knowledge of the case and the documents. It is equivalent to just rolling dice to decide where to look for something, instead of using your own judgment, your own skills and insights.”

Our experience, like theirs, is that judgmental samples selected using attorneys’ knowledge of the case can get you started more effectively, and that any possible bias arising from the problem of “unknown unknowns” can be easily corrected with the proper tools. We also commonly see document collections with very low richness, which makes these points even more important in actual practice.

Herb Roitblat, the developer of OrcaTec (which apparently uses random sampling for training purposes), makes cogent arguments for the superiority of a random-only sampling approach. (See his posts here and here.) His main argument is that training using judgmental seeds backed by review team judgments leads to “bias” because “you don’t know what you don’t know.” Our experience, which is now backed by the peer-reviewed research of Cormack and Grossman, is that there are more effective ways to avoid bias than simple random sampling.

We certainly agree with Roitblat that there is always a concern for “bias” – at least in the sense of not knowing what you don’t know (rather than any potential “lawyer manipulation” that Ralph Losey properly criticizes in his recent post). But it isn’t necessarily a problem that prevents us from ever using judgmental seeds. Sometimes – depending on the attorneys’ skill and knowledge and on the nature of the relevant information in the matter itself – judgmental selection of training documents can indeed cover all relevant aspects of a matter. At other times, judgmental samples will miss some topics because of the problem of “unknown unknowns.” This deficiency can be easily corrected by using an algorithm such as contextual diversity, which models the entire document population and actively identifies topics that need human attention, rather than blindly relying on random samples to hit those pockets of documents the attorneys missed.

The goal of this post, however, is not to dissect the arguments on either side of the random sampling debate. Rather, we want to have a bit of fun and show you how Zipf’s Law and the many ways it is manifest in document populations argue strongly for the form of active learning we use to combat the possibility of bias. Our method is called “contextual diversity” and Zipf’s law can help you understand why it is more efficient and effective than random sampling for ensuring topical coverage and avoiding bias.

What is Contextual Diversity?

A typical TAR 1.0 workflow often involves an expert reviewing a relatively small set of documents, feeding those documents into the TAR system to do its thing, and then having a review team check samples to confirm the machine’s performance. But in TAR 2.0, we continuously use all the judgments of the review teams to make the algorithm smarter (which means you find relevant documents faster). Like Cormack and Grossman, we feed documents ranked high for relevance to the review team and use their judgments to train the system. However, our continuous learning approach also throws other options into the mix to further improve performance, combat potential bias, and ensure complete topical coverage. One of these options that addresses all three concerns is our “contextual diversity” algorithm.

Contextual diversity refers to documents that are highly different from the ones already seen and judged by human reviewers (and thus under a TAR 2.0 approach have been used in training), no matter how those documents were initially selected for review. Because our system ranks all of the documents in the collection on a continual basis, we know a lot about the documents – both those the review team has seen and (more importantly) those the review team has not yet seen. The contextual diversity algorithm identifies documents based on how significant and how different they are from the ones already seen, and then selects training documents that are the most representative of those unseen topics for human review.

It’s important to note that we’re not solving the strong AI problem here – the algorithm doesn’t know what those topics mean or how to rank them.  But it can see that these topics need human judgments on them and then select the most representative documents it can find for the reviewers. This accomplishes two things: (1) it is constantly selecting training documents that will provide the algorithm with the most information possible from one attorney-document view, and (2) it is constantly putting the next biggest “unknown unknown” it can find in front of attorneys so they can judge for themselves whether it is relevant or important to their case.

We feed in enough of the contextual diversity documents to ensure that the review team gets a balanced view of the document population, regardless of how any initial seed documents were selected. But we also want the review team focused on highly relevant documents, not only because this is their ultimate goal, but also because these documents are highly effective at further training the TAR system as Cormack and Grossman now confirm. Therefore, we want to make the contextual diversity portion of the review as efficient as possible. How we optimize that mix is a trade secret, but the concepts behind contextual diversity and active modeling of the entire document population are explained below.

Contextual Diversity: Explicitly Modeling the Unknown


In the above example, assume you started the training with contract documents found either through keyword search or witness interviews. You might see terms like the ones above the blue dotted line showing up in the documents. Documents 10 and 11 have human judgments on them (indicated in red and green), so the TAR system can assign weights to the contract terms (indicated in dark blue).

But what if there are other documents in the collection, like those shown below the dotted line, that have highly technical terms but few or none of the contract terms?  Maybe they just arrived in a rolling collection. Or maybe they were there all along but no one knew to look for them. How would you find them based on your initial terms? That’s the essence of the bias argument.

With contextual diversity, we analyze all of the documents. Again, we’re not solving the strong AI problem here, but the machine can still plainly see that there is a pocket of different, unjudged documents there. It can also see that one document in particular, 1781, is the most representative of all those documents, being at the center of the web of connections among the unjudged terms and unjudged documents. Our contextual diversity engine would therefore select that one for review, not only because it gives the best “bang for the buck” for a single human judgment, but also because it gives the attorneys the most representative and efficient look into that topic that the machine can find.
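To make the idea of a “most representative” unjudged document concrete, here is a minimal sketch. It is not our production algorithm. It assumes each document is reduced to a set of terms, scores similarity with a simple Jaccard overlap, and multiplies how different a document is from the judged ones by how connected it is to the other unjudged ones. The term sets and the function names (jaccard, pick_most_representative) are hypothetical, made up for illustration; only the document numbers echo the example above.

```python
def jaccard(a, b):
    """Overlap between two term sets: 0.0 (nothing shared) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_most_representative(docs, judged_ids):
    """Return the unjudged document that is both unlike the judged documents
    and most connected to the other unjudged documents.

    Illustrative scoring only (an assumption): score = novelty * centrality.
    """
    judged = [docs[d] for d in judged_ids]
    unjudged = {d: t for d, t in docs.items() if d not in judged_ids}
    best_id, best_score = None, -1.0
    for doc_id, terms in unjudged.items():
        # Novelty: 1 minus the closest match among already-judged documents.
        novelty = 1.0 - max((jaccard(terms, j) for j in judged), default=0.0)
        # Centrality: total similarity to every other unjudged document.
        centrality = sum(
            jaccard(terms, other)
            for other_id, other in unjudged.items()
            if other_id != doc_id
        )
        score = novelty * centrality
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id


# Hypothetical term sets: two judged contract documents and a pocket of
# unjudged technical documents.
docs = {
    10: {"contract", "agreement", "breach"},
    11: {"contract", "indemnity", "agreement"},
    1781: {"valve", "pressure", "turbine", "gasket"},
    1800: {"valve", "pressure"},
    1802: {"turbine", "gasket"},
    1805: {"pressure", "turbine", "contract"},
}
selected = pick_most_representative(docs, judged_ids={10, 11})
print(selected)
```

With these made-up term sets, document 1781 scores highest because it shares terms with every other unjudged document and none with the judged ones.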

So Who is This Fellow Named Zipf?

Zipf’s law was named after the famed American linguist George Kingsley Zipf, who died in 1950. The law refers to the fact that many types of data, including city populations and a host of other things studied in the physical and social sciences, seem to follow a Zipfian distribution, which is part of a larger family of power law probability distributions. (You can read all about Zipf’s Law in Wikipedia, where we pulled this description.)

Why does this matter? Bear with us, you will see the fun in this in just a minute.

It turns out that the frequency of words and many other features in a body of text tends to follow a Zipfian power law distribution. For example, you can expect the most frequent word in a large population to be twice as frequent as the second most common word, three times as frequent as the third most common word and so on down the line. Studies of Wikipedia itself have found that the most common word, “the,” is twice as frequent as the next, “of,” with the third most frequent word being “and.” You can see how the frequency drops here:
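The “twice as frequent, three times as frequent” pattern described above is just a count that falls off as 1/rank. A minimal sketch, assuming an idealized Zipf law with exponent 1 and a made-up count of 1,200 for the top-ranked word (real text only approximates this):

```python
def zipf_frequencies(n_ranks, top_count):
    """Expected count at each rank under an idealized Zipf law (exponent 1)."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# Hypothetical numbers: the most frequent word appears 1,200 times.
counts = zipf_frequencies(n_ranks=5, top_count=1200)
for rank, count in enumerate(counts, start=1):
    print(f"rank {rank}: about {count:.0f} occurrences")
```

Under this idealization the second-ranked word appears half as often as the first, the third a third as often, and so on.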


Topical Coverage and Zipf’s Law

Here’s something that may sound familiar. Ever seen a document population where documents about one topic were pretty common, and then those about another topic were somewhat less common, and so forth down to a bunch of small, random stuff? We can model the distribution of subtopics in a document collection using Zipf’s law too. And doing so makes it easier to see why active modeling and contextual diversity are both more efficient and more effective than random sampling.

Here is a model of our document collection, broken out by subtopics. The subtopics are shown as bubbles, scaled so that their areas follow a Zipfian distribution. The biggest bubble represents the most prevalent subtopic, while the smaller bubbles reflect increasingly less frequent subtopics in the documents.

Now to be nitpicky, this is an oversimplification. Subtopics are not always discrete, boundaries are not precise, and the modeling is much too complex to show accurately in two dimensions. But this approximation makes it easier to see the main points.

So let’s start by taking a random sample across the documents, both to start training a TAR engine and also to see what stories the collection can tell us:

We’ll assume that the documents are distributed randomly in this population, so we can draw a grid across the model to represent a simple random sample. The red dots reflect each of 80 sample documents. The portion of the grid outside the circle is ignored.

We can now represent our topical coverage by shading the circles covered by the random sample.

You can see that a number of the randomly sampled documents hit the same topical circles. In fact, over a third (32 out of 80) fall in the largest subtopic. A full dozen are in the next largest. Others hit some of the smaller circles, which is a good thing, and we can see that we’ve colored a good proportion of our model yellow with this sample.

So in this case, a random sample gives fairly decent results without having to do any analysis or modeling of the entire document population. But it’s not great. And with respect to topical coverage, it’s not exactly unbiased, either. The biggest topics have a ton of representation, a few tiny ones are now represented by a full 1/80 of the sample, and many larger ones were completely missed. So a random sample has some built-in topical bias that varies randomly – a different random sample might have biases in different directions. Sure, it gives you some rough statistics on what is more or less common in the collection, but both attorneys and TAR engines usually care more about what is in the collection rather than how frequently it appears.

So what if we actually can perform analysis and modeling of the entire document population? Can we do better than a random sample? Yes, as it turns out, and by quite a bit.

Let’s attack the problem again by putting attorney eyes on 80 documents – the exact same effort as before – but this time we select the sample documents using a contextual diversity process. Remember: our mission is to find representative documents from as many topical groupings as possible to train the TAR engine most effectively, to avoid any bias that might arise from judgmental sampling, and to help the attorneys quickly learn everything they need to from the collection. Here is the topical coverage achieved using contextual diversity for the same size review set of 80 documents:

Now look at how much of that collection is colored yellow. By actively modeling the whole collection, the TAR engine with contextual diversity uses everything it can see in the collection to give reviewing attorneys the most representative document it can find from each subtopic. By using its knowledge of the documents to systematically work through the subtopics, it avoids massively oversampling the larger ones and relying on random samples to eventually hit all the smaller ones (which, given the nature of random samples, need to be very large to have a decent chance of hitting all the small stuff). It achieves much broader coverage for the exact same effort.
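You can get a feel for the gap with a rough simulation. This sketch rests on several assumptions that are not taken from the bubble model: it uses 200 discrete subtopics whose sizes follow an idealized Zipf law, assigns every document to exactly one subtopic, and stands in for contextual diversity with an idealized selector that takes one document from each of the largest subtopics. It is meant to show the shape of the argument, not to reproduce the figures above.

```python
import random


def zipf_topic_weights(n_topics):
    """Share of the collection in each subtopic, proportional to 1/rank."""
    raw = [1.0 / rank for rank in range(1, n_topics + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def random_sample_coverage(weights, sample_size, rng):
    """Number of distinct subtopics hit by one simple random sample."""
    drawn = rng.choices(range(len(weights)), weights=weights, k=sample_size)
    return len(set(drawn))


def one_per_topic_coverage(weights, sample_size):
    """Idealized systematic selection: one document per subtopic,
    working down from the largest, so no judgment is spent on repeats."""
    return min(sample_size, len(weights))


rng = random.Random(0)             # fixed seed so the run is repeatable
weights = zipf_topic_weights(200)  # 200 subtopics is an assumed figure
trials = [random_sample_coverage(weights, 80, rng) for _ in range(1000)]
avg_random = sum(trials) / len(trials)
systematic = one_per_topic_coverage(weights, 80)

print(f"random sample of 80: about {avg_random:.1f} subtopics on average")
print(f"one-per-topic selection of 80: {systematic} subtopics")
```

The random sample spends many of its 80 judgments on the largest subtopics, so it reaches fewer distinct subtopics than a selector that never repeats one.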

Below is a comparison of the two different approaches to selecting a sample of 80 documents. The subtopics colored yellow were covered by both. Orange indicates those that were found using contextual diversity but missed by the random sample of the same size. Dark blue shows those smaller topics that the random sample hit but contextual diversity did not reach in the first 80 seed documents.

Finally, here is a side-by-side comparison of the topical coverage achieved for the same amount of review effort:

Now imagine that the attorneys started with some judgmental seeds taken from one or two topics. You can also see how contextual diversity would help balance the training set and keep the TAR engine from running too far down only one or two paths at the beginning of the review by methodically giving attorneys new, alternative topics to evaluate.

When subtopics roughly follow a Zipfian distribution, we can easily see how simple random sampling tends to produce inferior results compared to an active learning approach like contextual diversity. (In fact, systematic modeling of the collection and algorithmic selection of training documents beats random sampling even if every topic were the exact same size, but for other reasons we will have to discuss in a separate post.) For tasks such as a review for production where the recall and precision standards are based on “reasonableness” and “proportionality,” random sampling – while not optimal – may be good enough. But if you’re looking for a needle in a haystack or trying to make sure that the attorneys’ knowledge about the collection is complete, random sampling quickly falls farther and farther behind active modeling approaches.

So while we strongly agree with the findings of Cormack and Grossman and their conclusions regarding active learning, we also know through our own research that the addition of contextual diversity to the mix makes the results even more efficient.

After all, the goal here is to find relevant documents as quickly and efficiently as possible while also quickly helping attorneys learn everything they need to know to litigate the case effectively. George Zipf is in our corner.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.