The Importance of Contextual Diversity in Technology Assisted Review

How do you know what you don’t know? This is a classic problem when searching a large volume of documents in litigation or an investigation.

In a technology assisted review (TAR), a key concern for some is whether the algorithm has missed important relevant documents, especially those that you may know nothing about at the outset of the review. This is because most modern TAR systems focus exclusively on relevance feedback, which means that the system feeds you the unreviewed documents that are likely to be the most relevant because they are most like what you have already coded as relevant. In other words, what is highly ranked depends on the documents that were tagged previously.

When you train a TAR algorithm using documents with which you are already familiar, or documents you located using a focused keyword search, the algorithm assumes you know the full scope of your review. The TAR tool assumes you generally know what topics, concepts and themes to look for.

But what about other relevant documents you didn’t find? Maybe they arrived in a rolling collection. Or maybe they existed all along but no one knew to look for them. How would you find them based on your initial search terms? When there are unexpected documents, concepts or terms in the collection, you could miss them simply because you don’t know to search for them.

This introduces the potential for what is called review bias, that is, looking only for documents with concepts that you know, and essentially ignoring potentially relevant documents that you don’t know anything about.

Contextual diversity, which can be used as part of a continuous active learning (CAL) process, is a powerful tool to combat the risk of missing pockets of potentially relevant documents, by finding documents that are different from those already seen and used for training. It ensures that reviewers aren’t missing documents that are relevant but different from the mainstream of documents being reviewed.

Below we provide an overview of contextual diversity, a brief summary of how it works and its important use cases and benefits in many types of reviews.

What Is Contextual Diversity?

A typical TAR 1.0 (i.e., first-generation TAR system) workflow involves a subject matter expert (often a senior lawyer) reviewing several thousand documents for training purposes before the TAR algorithm can rank the remainder of the population. It is an iterative process that entails significant human time, effort and cost for training and re-training the system (particularly when new documents are added to the collection). The review team can’t begin until the SME completes the training, and depending on the SME’s availability and appetite for reviewing random documents, the review can be held up for days or weeks.

In a TAR system based on CAL, we continuously use all the judgments of the review team to make the algorithm smarter, which means you find relevant documents faster. Documents ranked high for relevance are fed to the review team, whose judgments in turn further train the system. The CAL approach can also include contextual diversity, which improves performance, combats potential bias, and ensures topical coverage.

Contextual diversity refers to documents that are different from the ones already seen and judged by human reviewers. Because the system ranks all of the documents on a continual basis, we know a lot about documents—both those the review team has seen but also (and more importantly) those the review team has not yet seen. The contextual diversity algorithm identifies documents based on how significant and how different they are from the ones already seen, and then selects training documents that are the most representative of those unseen topics for human review.

It’s important to note that the algorithm doesn’t know what those topics mean or how to rank them. But, it can see that these topics need human judgments on them and then select the most representative documents it can find for the reviewers to assess their relevance (or not).
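To make the idea concrete, here is a minimal, purely illustrative sketch of contextual-diversity seed selection. This is not Catalyst’s actual algorithm: the bag-of-words tokenization, Jaccard distance and greedy farthest-first selection are all simplifying assumptions standing in for whatever representation and distance measure a real TAR engine uses.

```python
# Illustrative sketch (NOT Catalyst's actual algorithm): select unseen
# documents that are most different from everything already reviewed.

def tokens(doc):
    """Crude bag-of-words representation of a document."""
    return set(doc.lower().split())

def distance(a, b):
    """Jaccard distance between two token sets (1.0 = nothing in common)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def diversity_seeds(seen, unseen, k=2):
    """Greedily pick k unseen documents, each maximally different from
    the reviewed documents and from seeds already picked."""
    reference = [tokens(d) for d in seen]
    seeds = []
    candidates = list(unseen)
    for _ in range(min(k, len(candidates))):
        # Score each candidate by its distance to the *closest* covered
        # document, then take the one farthest from everything covered.
        best = max(candidates,
                   key=lambda d: min(distance(tokens(d), r) for r in reference))
        seeds.append(best)
        candidates.remove(best)
        reference.append(tokens(best))  # the new seed now counts as covered
    return seeds

seen = ["merger agreement draft terms", "merger closing conditions"]
unseen = ["merger agreement final terms",
          "environmental audit toxic spill report",
          "environmental audit groundwater findings"]
print(diversity_seeds(seen, unseen, k=1))
# → ['environmental audit toxic spill report']
```

Note what happens in the example: the near-duplicate merger document is skipped, and the sketch surfaces the environmental document, a topic no reviewer has judged yet, which is exactly the behavior contextual diversity is meant to provide.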

This process is iterative, meaning that it’s re-computed every time the TAR system re-ranks your collection, which is often several times an hour during active review. Another way to think about contextual diversity is as “continuous active exploration.” As the review progresses and more documents are reviewed, the algorithm explores deeper into smaller and smaller pockets of different, unseen documents.

The system feeds in enough of the contextual diversity documents to ensure that the review team gets a balanced view of the document population.
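A simple way to picture that balance is as batch composition: most review slots go to the top-ranked documents, while a fixed share is reserved for diversity seeds. The sketch below is hypothetical; the `diversity_share` parameter and the blending logic are assumptions for illustration, not a documented setting of any particular TAR product.

```python
# Hypothetical sketch of blending relevance-ranked documents with
# contextual-diversity seeds in one review batch. The 20% share is an
# assumed, illustrative parameter.

def build_batch(ranked_unseen, diversity_seeds, batch_size=10, diversity_share=0.2):
    """Fill a review batch mostly with top-ranked documents, reserving
    a share of slots for contextual-diversity seeds."""
    n_div = max(1, int(batch_size * diversity_share))
    batch = list(diversity_seeds[:n_div])
    # Top-ranked documents fill the remaining slots, skipping duplicates.
    for doc in ranked_unseen:
        if len(batch) >= batch_size:
            break
        if doc not in batch:
            batch.append(doc)
    return batch

batch = build_batch(ranked_unseen=["a", "b", "c", "d", "e", "f"],
                    diversity_seeds=["X", "Y"],
                    batch_size=5, diversity_share=0.2)
print(batch)  # → ['X', 'a', 'b', 'c', 'd']
```

The design point the sketch captures: diversity seeds cost only a small fraction of reviewer effort per batch, yet they guarantee the team keeps seeing material from parts of the collection the relevance ranking alone would never surface.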

Here’s why this is so important in a TAR review.

Practical Ways Contextual Diversity is Used

Contextual diversity serves a number of important purposes in a TAR review, and can save substantial costs of review in almost any review in which TAR is used. In addition to training efficiency, discussed above, here are some additional practical benefits and use cases.

1. Rolling collections

Rolling collections provide one of the best examples of how contextual diversity works. TAR 1.0 systems typically train against a static reference set, which means the system must be retrained every time new documents are added to the collection.

With contextual diversity, you can integrate rolling document uploads into the review process. When you add new documents to the mix, they simply join in the ranking process and become part of the review. Depending on whether the new documents are different or similar to documents already in the population, they may integrate into the rankings immediately or fall to the bottom. In the latter case, the contextual diversity algorithm pulls samples from the new documents for review. As the new documents are reviewed, they integrate further into the ranking. This process of seeking contextually diverse groups of documents continues throughout the review to penetrate deeper and deeper into the collection to locate groups of unknown documents. We wrote about a good case example in a previous post.

2. Proving a negative in an investigation

Proving to a government agency that you simply don’t have any responsive documents can be a costly proposition; by adding contextual diversity as a strategy, you maximize the breadth of your search. If you still fail to locate any documents of value, you have essentially shown there are no responsive documents in the collection and any further review would not be worth the effort. Contextual diversity, in a sense, is another set of eyes looking for the requested documents.

3. More thoroughly reviewing an opponent’s rolling productions

Finding relevant documents in an opposing party’s production is rarely easy. And when those productions are large and arrive on a rolling basis (and the opposing party is trying to bury revealing or damaging documents within a large, late production), the search can be even more cumbersome, costly and time-consuming—and key documents may not be noticed for some time, if at all. However, with a contextual diversity engine re-ranking and analyzing the entire document set on a continual basis, a pocket of new documents unlike anything reviewers have seen before is recognized immediately, and exemplars from those new pockets are pulled as contextual diversity seeds and put in front of reviewers in the very next batch of documents to be reviewed.

4. Satisfying your obligation to make a reasonable inquiry in responding to discovery

Finally, employing contextual diversity helps establish that you’ve met the standard of reasonable inquiry by using every tool at your disposal to ensure that your search for responsive documents was reasonable, thorough and proportional.

Of course, contextual diversity also improves performance for ECA and other searches where finding themes, topics and concepts matters more than achieving high recall. For example, you may want to find the best examples of all the different things the documents can tell you, rather than every document of a certain type. After you’ve searched for the topics and themes you know to look for, you can use contextual diversity sampling to efficiently review whatever topics remain unseen.

In sum, when a TAR system with continuous active learning includes a companion contextual diversity algorithm, the system is generally better at locating responsive documents than a TAR system that doesn’t have this feature. In our experience, contextual diversity allows the algorithm to penetrate much more deeply into the collection to effectively let you know precisely what you don’t know.

For a more thorough discussion on contextual diversity, download Catalyst’s free eBook, TAR for Smart People or register for this related webinar, Coming Up Empty: Strategies for Proving a Negative in an Investigation.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named one of the top six “E-Discovery Trailblazers” by the American Lawyer, named to the FastCase 50 as a legal visionary and named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.


About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.