Ask Catalyst: Why Can’t You Tell Me Exactly How Much TAR Will Save Me?

[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]

We received this question:

Why can’t you tell me exactly how much I’ll save on my upcoming review project by using technology assisted review?

Today’s question is answered by Mark Noel, managing director of professional services. 

Well, I could tell you, but it would be the classic lawyer’s answer: “It depends.”

Okay, that’s not very helpful, I know. “What does it depend on?” I hear you ask with a slight note of frustration.

A big part of that answer is homogeneity. What do we mean by that?

A quick look at the dictionary gives us some synonyms for homogeneous: similar, consistent, uniform, unvaried. As you might imagine, a search for things that are similar and uniform can be a lot easier than a search for things that are inconsistent and wildly different. And there are multiple properties of a document collection and a search task that can be more or less homogeneous.

For example, a search performed as part of an FCPA investigation may be very focused on one narrow topic of interest, while an antitrust second request may contain 40 different requests for production covering a wide variety of different (though related) topics. The former would be a more homogeneous search, while the latter would be a more heterogeneous search. The first is typically easier, since there aren’t as many different subtopics or “flavors” of relevance that the TAR engine has to learn about in order to achieve high recall.

But it’s not just the search task that can be more or less homogeneous. The document collection itself has many properties that might be more or less varied:

  • The number and distribution of different document types (e.g., email, loose documents, images, audio, structured data).
  • The subject matter of the different documents, which is often related to the number of different custodians and the differences in their roles.
  • Languages used in the documents.
  • Average length of documents.
  • Specialized vocabulary related to the subject matter of the case or investigation.

Let’s look at an example. Imagine that we’re trying to find photos rather than trying to find emails and Office documents. You could have three different “cases,” each with 100,000 photos, and each with (say) 10,000 relevant photos. Thus, you’d have an identical richness of 10% in all three cases. But imagine one photo case is about the “Nike logo,” the second is about “apples” (the fruit), and the third is about “new beginnings.” In the first case, you’re trying to find all instances of photos that show the Nike logo; in the second, apples; and in the third, new beginnings.
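For concreteness, the richness figure above, and the recall target mentioned earlier, are just simple ratios. A quick sketch (the 9,000 "found" documents below are a made-up number, not from the example):

```python
# Richness: the fraction of the collection that is relevant.
# It is identical in all three hypothetical photo cases.
collection_size = 100_000
relevant = 10_000
richness = relevant / collection_size
print(f"richness = {richness:.0%}")  # prints "richness = 10%"

# Recall measures a search's success instead: the fraction of the
# relevant documents the search actually found. The 9,000 here is
# purely hypothetical.
found = 9_000
recall = found / relevant
print(f"recall = {recall:.0%}")  # prints "recall = 90%"
```

Identical richness, as the example shows, tells you nothing about how hard it will be to reach a given recall; that depends on homogeneity.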

The numbers in each case are exactly the same: 100,000 documents to search and 10,000 documents to find. But the difficulty of finding the documents is going to differ. Finding instances of the Nike logo is going to be relatively easy for a machine. Yes, it has to deal with perspective shifts, mud obscuring part of the logo, and the logo appearing in several combinations of foreground and background color. But generally the logo looks like the logo. It’s pretty homogeneous. Finding all 10,000 Nike logos is going to be easy because of that homogeneity.

Now consider apples. Apples are a bit more difficult to find than Nike logos. Not only do you have perspective shifts (e.g., taking a photo of the apple from the side, the top or the bottom), but the apples themselves are not all exactly the same shape and size. They’re similar, but they’re not as consistent, uniform and unvaried as the Nike “swoosh.” In other words, they’re not as homogeneous.

Further, apples come in many different colors: red, green, yellow, pink and mixed. Apples may be mature, or they may be small fruits still on the tree. Some apples will be whole, some will be sliced, some will have a bite taken out of them, some will be rotten. Some might have leaves attached to the stem, while others don’t. Finding all 10,000 apple photos is therefore going to be a little more difficult, even though we’re searching through a collection of the same size for the same number of relevant photos.

Finally, picture the third case, involving “new beginnings.” Imagine just a few of the ways that you might illustrate this concept and you’ll see that photos of new beginnings are going to be very heterogeneous. You’ll have photos of the first day of school. You’ll have plants coming out of the ground in the springtime. You’ll have landscape photos at dawn. Addiction or cancer treatment centers. Moving trucks. The starting line on a racetrack. You get the idea. There are hundreds of different ways that a photo could represent new beginnings. On top of that, if you thought there was variety among apples, there is going to be even more variety among school kids, plants, buildings, and landscapes. There is very little homogeneity in the relevant documents in this collection.

So, you can see how both the homogeneity of the search task and the homogeneity of the documents being searched will affect the difficulty of the search. Without knowing all the properties of your document collection and the search tasks ahead of time, it is very difficult to predict exactly how much review you’ll save by using TAR 2.0 and continuous active learning (CAL).

That said, CAL is still the way to bet. Gordon Cormack and Maura Grossman’s 2014 research paper on TAR learning protocols makes that clear. Sometimes CAL will save you 30 percent of the review effort other methods require, everything else being equal. Other times, the other methods will require many times the effort CAL does. CAL always beat the other methods; the only question was how big its margin of victory would be in any given case.
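To make the CAL protocol concrete, here is a minimal sketch of the loop using scikit-learn. Everything here is invented for the demo: the "documents" are random feature vectors rather than text, the batch size of 50 and the 80 percent recall stopping point are arbitrary, and a real system cannot peek at true labels the way this toy's stopping check does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "collection": 2,000 documents as feature vectors, with
# hidden relevance labels defined by a simple linear rule.
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.2).astype(int)

relevant_total = y.sum()
target = 0.8 * relevant_total  # stop once 80% recall is reached (demo only)

# Seed set: pretend a keyword search surfaced one relevant example,
# plus a handful of non-relevant documents for contrast.
seed_rel = int(np.flatnonzero(y == 1)[0])
seed_irr = [int(i) for i in np.flatnonzero(y == 0)[:9]]
reviewed = [seed_rel] + seed_irr

# The CAL loop: train on everything reviewed so far, rank the rest,
# send the top-ranked batch to reviewers, and repeat.
while y[reviewed].sum() < target and len(reviewed) < len(X):
    model = LogisticRegression(max_iter=1000).fit(X[reviewed], y[reviewed])
    pool = np.setdiff1d(np.arange(len(X)), reviewed)
    scores = model.predict_proba(X[pool])[:, 1]
    batch = pool[np.argsort(scores)[::-1][:50]]   # top-ranked 50
    reviewed.extend(int(i) for i in batch)        # "reviewers" label them

recall = y[reviewed].sum() / relevant_total
print(f"reviewed {len(reviewed)} of {len(X)} docs, recall {recall:.0%}")
```

Even in this toy, the core idea is visible: each round of review feeds the model, and the model in turn decides what gets reviewed next, so relevant documents surface early and the review can stop well short of the full collection.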

And finally, Insight Predict has a specialized tool for dealing with heterogeneity that wasn’t even a part of that research: Contextual Diversity. Among Insight Predict’s sub-algorithms is one that continuously models what remains unknown about the collection and presents reviewers with documents that best represent the topics they have seen the least. Such an algorithm further reduces the challenges of heterogeneity.
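Catalyst has not published the internals of Contextual Diversity, so the sketch below is only a generic stand-in for the idea: partition the collection into topical clusters, count how much review coverage each cluster has received, and pull the next document from the least-seen topic. All data and parameters are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))  # stand-in document vectors
reviewed = set(rng.choice(1000, 100, replace=False).tolist())

# Partition the collection into topical clusters.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
labels = km.labels_

# Count reviewed documents per cluster and find the least-seen topic.
coverage = np.bincount(labels[list(reviewed)], minlength=8)
least_seen = int(np.argmin(coverage))

# From that topic, pick the unreviewed document closest to the cluster
# center, i.e., the one that best represents the unseen material.
candidates = [i for i in range(len(X))
              if labels[i] == least_seen and i not in reviewed]
center = km.cluster_centers_[least_seen]
pick = min(candidates, key=lambda i: np.linalg.norm(X[i] - center))
print(f"next document for review: {pick} (topic {least_seen})")
```

Mixing selections like this into the ranked batches is one generic way a system can keep a heterogeneous collection's minor topics from being starved of reviewer attention.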

So while it’s difficult to quote a precise number beforehand telling you how much you’ll save in a given case by using TAR 2.0 and CAL, we can show that for any given case it will significantly outperform any alternatives.