Ask Catalyst: What is Contextual Diversity and Why Is It Important in TAR?

[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]

We received this question:

What is contextual diversity and why is it important to a technology-assisted review process?

Today’s question is answered by Mark Noel, managing director of professional services. 

Contextual Diversity is an exploratory tool found only in the Insight Predict system that runs automatically as part of a technology-assisted review project. Many TAR systems concentrate exclusively on relevance feedback — that is, giving you the unreviewed documents predicted to be the most relevant. But Insight Predict’s Contextual Diversity system also adds in some exploratory documents to help make sure you’ve looked into all the corners of your document collection — even the ones that you don’t know about.

A classic problem with searching a large volume of documents is that you don’t know what you don’t know. If there are unexpected documents, concepts or terms in the collection, you could miss them simply because you don’t know to search for them. For example, a collection could contain emails among people who used code words in order to actively evade searching. Even though you can’t see all the details of all the documents, Insight Predict can. Predict can also see what you’ve already reviewed, and what you haven’t yet seen.

The Contextual Diversity system is constantly searching through the entire collection for the next biggest clump of similar but unseen stuff. From that unseen region, it picks the best example and puts it in front of you to review. The technical description of this is “explicit modeling of the unknown,” but that just means that the machine is actively making sure you get a look into those unexplored pockets of the collection.

This process is iterative, meaning that it’s re-computed every time Predict re-ranks your collection, which is often several times an hour during active review. So another way to think about Contextual Diversity could be “continuous active exploration.” As the review progresses and more documents are reviewed, it explores deeper into smaller and smaller pockets of different, unseen documents.
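To make the idea concrete, here is a toy sketch of that exploration loop in Python. It is not Insight Predict's actual algorithm, which this post does not spell out. The word-overlap similarity measure, the 0.3 threshold, the sample documents and the function names are all assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Word-set overlap between two documents, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def next_exploratory_doc(docs, reviewed, threshold=0.3):
    """Return the ID of the unreviewed document that best represents the
    largest clump of documents unlike anything reviewed so far.
    Returns None when no unexplored pocket remains.
    (Illustrative only; similarity measure and threshold are assumptions.)"""
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}

    # "Unseen" means unreviewed AND dissimilar to every reviewed document.
    unseen = sorted(
        d for d in docs
        if d not in reviewed
        and all(jaccard(tokens[d], tokens[r]) < threshold for r in reviewed)
    )
    if not unseen:
        return None

    def clump_size(d):
        # How many unseen documents (including d itself) resemble d.
        return sum(jaccard(tokens[d], tokens[o]) >= threshold for o in unseen)

    # max() keeps the first of any tied candidates, so ties go to the lowest ID.
    return max(unseen, key=clump_size)


def exploration_order(docs):
    """Re-run the selection after each review, as the ranking is re-computed."""
    reviewed, order = set(), []
    while (pick := next_exploratory_doc(docs, reviewed)) is not None:
        order.append(pick)
        reviewed.add(pick)
    return order


# Hypothetical mini-collection: a big topic, a smaller "code word" topic, a stray.
docs = {
    "a1": "quarterly sales report for the northeast region",
    "a2": "quarterly sales report for the southwest region",
    "a3": "quarterly sales report for the midwest region",
    "b1": "the blue heron lands at midnight",
    "b2": "the blue heron lands at dawn",
    "c1": "lunch order pizza friday",
}

print(exploration_order(docs))
```

In this toy collection the loop surfaces one document from the large sales-report topic first, then one from the smaller "blue heron" pocket, then the stray lunch order. Each pick is a look into a progressively smaller pocket, and none repeats a topic the reviewer has already seen.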

This kind of active exploration is much more efficient than random sampling for making sure you see all the topics in the collection, because topics are not all of the same size. Random sampling would oversample the large topics and miss many of the smaller ones. But active exploration of all the documents, constantly updated as you review more documents, lets you quickly and systematically explore smaller topics without wasting time reviewing redundant documents from the largest topics.
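A quick back-of-the-envelope calculation shows why topic size matters. The sketch below computes how many distinct topics a simple random sample is expected to touch. The collection is hypothetical (one topic of 900 documents plus ten topics of 10 documents each), and the function name is ours.

```python
from math import comb


def expected_topics_seen(topic_sizes, sample_size):
    """Expected number of distinct topics hit by a simple random sample
    drawn without replacement from the whole collection."""
    total = sum(topic_sizes)
    all_samples = comb(total, sample_size)
    # A topic is missed when every sampled document comes from outside it.
    return sum(
        1 - comb(total - size, sample_size) / all_samples
        for size in topic_sizes
    )


# Hypothetical collection: one dominant topic and ten small ones.
topic_sizes = [900] + [10] * 10

for n in (11, 100, 300):
    print(n, round(expected_topics_seen(topic_sizes, n), 2))
```

With 11 randomly chosen documents, you would expect to see only about 2 of the 11 topics, because nearly every draw lands in the dominant topic. An idealized explorer that always picks from a not-yet-seen topic would cover all 11 with the same 11 documents. Real exploration will not be that perfect, but the gap illustrates the point.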

Contextual Diversity serves a number of important roles in a TAR review:

  1. Training efficiency — By looking for other “flavors” of responsive documents, it helps ensure that your TAR training doesn’t go too far down one path to the exclusion of any others. You get better training performance earlier than with relevance feedback alone.
  2. Knowledge generation — Active exploration can help you search more effectively when doing ECA, investigations, opposing party production reviews, and any other task where you’re doing a relevance search rather than a recall search (i.e., you want to find the best examples of all the different things the documents can tell you rather than finding all the documents of a certain type). After you’ve searched for everything you know to look for, you can use Contextual Diversity sampling to efficiently review whatever topics are still unseen.
  3. Rolling collections — Adding new documents to a collection means adding new topics to the collection (especially if the new data comes from new custodians). Since Contextual Diversity is an automated process that is constantly running, it automatically starts sampling from any new topics added to the collection. This keeps you from having to take new samples, review new control sets, or take any other special action to accommodate the newly loaded documents.
  4. Defensible productions — Contextual Diversity adds another component to your defensibility story. In addition to getting a quantitative measure of recall like 90%, you also get a qualitative assurance that it is very unlikely any significant clumps of different, unseen documents have been missed. Contextual Diversity runs continuously during the review, identifying all those different, unseen pockets and surfacing hundreds or maybe even thousands of exploratory documents to human reviewers to decide whether they’re important. It helps ensure that your search for responsive documents was both reasonable and thorough.

As you can see, automatically running continuous, active exploration of the entire document collection has many powerful uses throughout discovery. That’s what Contextual Diversity provides.