How Much Can I Save with CAL? A Closer Look at the Grossman/Cormack Research Results

As most e-discovery professionals know, two leading experts in technology assisted review, Maura R. Grossman and Gordon V. Cormack, recently presented the first peer-reviewed scientific study on the effectiveness of several TAR protocols, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).

Perhaps the most important conclusion of the study was that an advanced TAR 2.0 protocol, continuous active learning (CAL), proved to be far more effective than the two standard TAR 1.0 protocols used by most of the early products on the market today—simple passive learning (SPL) and simple active learning (SAL).

To quote Grossman and Cormack:

The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning [CAL], require substantially and significantly less human review effort . . . to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents [SPL]. …

Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling [SAL], while avoiding the vexing issue of “stabilization” – determining when training is adequate, and therefore may stop.

But how much can you expect to save using CAL over the simple passive and active learning methods used by TAR 1.0 programs? While every case is different, as are the algorithms that different vendors employ, we can draw some interesting conclusions from the Grossman/Cormack study that will help answer this question.

Comparing CAL with SPL and SAL

Grossman and Cormack compared the three TAR protocols—continuous active, simple passive and simple active learning—across eight different matters. Four were from an earlier TREC program and four were from actual litigated cases. You can read more about their methods and results here.

After charting the results for each matter, they offered summary figures. Here I will show them for a typical TAR 1.0 project with 2,000 training seeds.[1]

A quick visual inspection confirms that the CAL protocol requires the review of far fewer documents than either simple passive or simple active learning. In Matter 201, for example, a CAL review requires inspection of 6,000 documents in order to find 75% of the relevant files. In sharp contrast, reviewers using an SPL protocol would have to view 284,000 documents. For SAL, they would have to review almost as many: 237,000 documents. At $4 per document for review and QC, the extra cost of using the TAR 1.0 protocols would come to roughly $924,000 for SAL and more than $1.1 million for SPL.
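As a rough check on that math, here is a minimal sketch in Python using the Matter 201 figures above and the assumed $4-per-document rate for review and QC:

```python
# Matter 201: documents reviewed to reach 75% recall (from the study's charts)
docs_reviewed = {"CAL": 6_000, "SAL": 237_000, "SPL": 284_000}
COST_PER_DOC = 4  # assumed $4 per document for review and QC

for protocol in ("SAL", "SPL"):
    extra_docs = docs_reviewed[protocol] - docs_reviewed["CAL"]
    print(f"{protocol}: {extra_docs:,} extra documents -> ${extra_docs * COST_PER_DOC:,}")
# SAL: 231,000 extra documents -> $924,000
# SPL: 278,000 extra documents -> $1,112,000
```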

Clearly, some of the other matters had numbers that were much closer. Matter C, for example, required the review of 4,000 documents for a CAL protocol, compared to 5,000 for SAL and 9,000 for SPL (clearly the least efficient of the three approaches). In such a case the savings are much smaller, hardly justifying a switch in TAR applications. So what might we expect as a general rule if we were considering different approaches to TAR?

Averaging the Results Across Matters

Lacking more comparative data, one way to answer this question is to use the averages across all eight matters as the basis for our analysis. Using the magic of Excel, it is easy to add these figures to our chart.

Our average matter size is just over 640,000 documents. The CAL protocol would require review of 9,375 documents. With SPL you would have to review 207,875 documents. With SAL, you would only have to review 95,375 documents. Clearly SAL is to be preferred over SPL, but it still requires the review of 86,000 more documents than CAL.

How much would that cost? To determine this, there are several factors to consider. First, the TAR 1.0 protocols, SPL and SAL, require that a subject matter expert (SME) do the initial training; CAL does not. Thus, we have to determine the hourly rate of the SME (typically much higher than a regular reviewer's). We then have to determine how many documents an hour the expert (and, later, the reviewers) can get through. Lastly, we need a figure for reviewer costs.

Here are some working assumptions. If you take issue with any of them, it is easy enough to recalculate these figures based on different assumptions.

  1. Cost for a subject matter expert: $350/hour.
  2. Cost for a standard reviewer: $60/hour.
  3. Documents per hour reviewed (for both SME and reviewer): 60.

If we use these assumptions and apply them to our matter averages, we can estimate the cost of using each of the three protocols.
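For readers who want to check or rerun the math, here is a minimal sketch of that cost model in Python. The document counts are the eight-matter averages above, and the rates come from the working assumptions; the charts in the original post may fold in additional components (such as QC), so treat this as a simplified approximation rather than the study's exact model.

```python
# Eight-matter averages: documents reviewed to reach 75% recall
docs_reviewed = {"CAL": 9_375, "SAL": 95_375, "SPL": 207_875}

TRAINING_SEEDS = 2_000  # typical TAR 1.0 training set
SME_RATE = 350          # $/hour for the subject matter expert
REVIEWER_RATE = 60      # $/hour for a standard reviewer
DOCS_PER_HOUR = 60      # assumed for both SME and reviewers

def review_cost(total_docs: int, protocol: str) -> float:
    # TAR 1.0 protocols bill the training seeds at the SME rate;
    # with CAL, reviewers train as they review, all at the reviewer rate.
    seeds = TRAINING_SEEDS if protocol in ("SAL", "SPL") else 0
    sme_hours = seeds / DOCS_PER_HOUR
    review_hours = (total_docs - seeds) / DOCS_PER_HOUR
    return sme_hours * SME_RATE + review_hours * REVIEWER_RATE

for protocol, docs in docs_reviewed.items():
    print(f"{protocol}: ${review_cost(docs, protocol):,.0f}")
```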

On an average review, at least based on these eight matters, you can expect to save over a quarter million dollars in review costs by using continuous active learning rather than simple passive learning, and about $115,000 compared to a simple active learning system. These are significant sums.

What About Using More Training Seeds?

As I mentioned earlier, Grossman and Cormack also reported the results when substantially more training seeds were used: 5,000 and 8,000. If your subject matter expert is willing to review that many training documents, the cost savings from using CAL are smaller. However, at 60 documents an hour, your SME will spend 83 hours (about two weeks) doing the training with 5,000 seeds, and more than 133 hours (about 3.5 weeks) with 8,000 seeds. Even worse, the SME may have to redo the training if new documents come in later.
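The training-time arithmetic is easy to verify; here is a quick sketch under the same 60-documents-per-hour assumption (and an assumed 40-hour review week):

```python
DOCS_PER_HOUR = 60
HOURS_PER_WEEK = 40  # assumed full-time review week

for seeds in (2_000, 5_000, 8_000):
    hours = seeds / DOCS_PER_HOUR
    print(f"{seeds:,} seeds: {hours:.0f} hours (~{hours / HOURS_PER_WEEK:.1f} weeks)")
# 2,000 seeds: 33 hours (~0.8 weeks)
# 5,000 seeds: 83 hours (~2.1 weeks)
# 8,000 seeds: 133 hours (~3.3 weeks)
```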

That said, here is how the numbers worked out for 5,000 training seeds.

And for 8,000 training seeds.

The first thing to note is that the number of documents that ultimately have to be reviewed decreases as you add more training seeds. This seems logical and supports the fundamental CAL notion that the more training documents you give the algorithm, the better the results. However, also note that the total review cost for SAL increases as you go from 5,000 to 8,000 training seeds. This is because we assume you have to pay considerably more for SME training than for review team time; with CAL, the reviewers do the training.
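To see why more seeds can cost more overall, look at the per-document rates implied by the assumptions above: an SME at $350/hour reviewing 60 documents an hour costs about $5.83 per training document, versus $1.00 per document for a regular reviewer. A short sketch makes the point; the SAL document counts here are hypothetical, chosen only for illustration (the study's actual 5,000- and 8,000-seed figures appear in the charts):

```python
SME_RATE, REVIEWER_RATE, DOCS_PER_HOUR = 350, 60, 60

def sal_cost(total_docs: int, seeds: int) -> float:
    # The SME trains the seeds; the review team handles the rest.
    return (seeds * SME_RATE + (total_docs - seeds) * REVIEWER_RATE) / DOCS_PER_HOUR

# Hypothetical counts: suppose the extra seeds trim the total review
# from 70,000 to 67,000 documents.
print(f"5,000 seeds: ${sal_cost(70_000, 5_000):,.0f}")  # $94,167
print(f"8,000 seeds: ${sal_cost(67_000, 8_000):,.0f}")  # $105,667
```

Shifting 3,000 documents from the reviewer rate to the SME rate adds about $14,500, while reviewing 3,000 fewer documents saves only about $3,000, so the total cost rises.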

How Much Time Can I Save?

So far, we have only spoken about cost savings. What about time savings? We can quickly see how much time the CAL protocol saves as well.

For 2,000 training seeds:

For 5,000 training seeds:

And, for 8,000 training seeds:

As with cost savings, there are substantial review time savings to be had using CAL over simple passive learning and simple active learning. The savings range from 121 hours (SAL at 8,000 training seeds) to as much as 3,308 hours (SPL at 2,000 training seeds).
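These hour figures follow directly from the document counts. For instance, applying the 60-documents-per-hour assumption to the 2,000-seed averages reproduces the largest number in that range:

```python
DOCS_PER_HOUR = 60
docs_reviewed = {"CAL": 9_375, "SAL": 95_375, "SPL": 207_875}  # 2,000-seed averages

for protocol in ("SAL", "SPL"):
    saved_docs = docs_reviewed[protocol] - docs_reviewed["CAL"]
    print(f"{protocol}: {saved_docs / DOCS_PER_HOUR:,.0f} hours saved vs. CAL")
# SAL: 1,433 hours saved vs. CAL
# SPL: 3,308 hours saved vs. CAL
```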

So How Much Can I Save with CAL?

“A lot” is the answer, based on the Grossman/Cormack research. We have published similar studies with similar results. Given this evidence, it is hard to imagine why anyone would continue using the out-of-date TAR 1.0 protocols.

There are a number of other benefits that go beyond cost and time savings. First, CAL works well with low richness collections, as Grossman/Cormack point out. While some populations have high percentages of relevant documents, not all do. Why not choose one protocol that covers both ends of the spectrum equally well?

Second, as mentioned earlier, the CAL protocol allows for the continuous addition of documents without need for costly and time-consuming retraining. Simply add the new documents to the collection and keep reviewing. This is particularly true if you use our contextual diversity engine to find documents that are different from those you have already seen. Contextual diversity protects against the possibility of bias stemming from using documents found through keyword searches. See our postings about contextual diversity here.

Third, review can begin immediately. With TAR 1.0 protocols, the review team can’t start until an SME completes the training. Depending on the SME’s schedule and appetite for reviewing random documents, the review can be held up for days or weeks. With CAL, the review starts right away.

These are just a few ways in which the TAR 1.0 protocols cause real world problems. Why pay more in review costs and time to use an inferior protocol? How much can you save with CAL?


Footnotes

[1] Grossman and Cormack also offered similar information for larger training sets, namely 5,000 and 8,000 documents, which I will discuss later. We start with data from the 2,000 document training sets because they are typical of TAR 1.0 reviews. Very few senior SMEs want to review larger numbers of documents in order to train the system. And, assuming rolling document collections, even fewer SMEs would be willing to engage in such an exercise multiple times. With CAL, there is no requirement that SMEs do training. Reviewers do it as part of their work.

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.