In a recent blog post, we reported on a technology-assisted review simulation that we conducted to compare the effectiveness and efficiency of a family-based review versus an individual-document review. That post was one of a series here reporting on simulations conducted as part of our TAR Challenge – an invitation to any corporation or law firm to compare its results in an actual litigation against the results that would have been achieved using Catalyst’s advanced TAR 2.0 technology, Insight Predict.
As we explained in that recent blog post, the simulation used actual documents that were previously reviewed in an active litigation. Based on those documents, we conducted two distinct experiments. The first was the family vs. non-family test. In this blog post, we discuss the second experiment, testing a TAR 1.0 review against a TAR 2.0 review.
Both of these experiments are reported in greater detail in this report.
TAR 1.0 vs. TAR 2.0
The question we answer in this simulation is whether it is more effective to conduct a simple active learning (SAL, or TAR 1.0) review trained by experts or a continuous active learning (CAL, or TAR 2.0) review trained by all available reviewers.
The TAR 1.0 approach uses a protocol that trains with the most skilled or knowledgeable reviewers for a finite amount of training. That training phase is followed by a batched-out review in which no further machine learning takes place: the entire review team reviews documents in the order predicted by the machine learning algorithm, as trained by the expert reviewers.
The TAR 2.0 approach uses all available reviewers, begins ranking and re-ranking the moment the first document is judged by any reviewer, and does not cease re-ranking until the target recall point is hit.
In our prior post, we compared family versus individual-document review. For this experiment, we keep it simple and take the individual-document approach in all conditions, so that the effect of expert versus everyone training, and of simple versus continuous learning, can be observed directly, independent of the noise added by a family-based review. Thus, the basic parameters of the experiment are as follows:
Charting the Results
Figure 5 shows the main result. (Note that the numbering of the figures continues from our previous blog post based on this same data.) The TAR 2.0 line is shown in blue. TAR 1.0 lines are shown in a fade from red to green, with successive lines reflecting greater amounts of training. That is, because different levels of TAR 1.0 training could be done, we show the gain curve after approximately every 3,600 documents of training.
However, because this information is so dense, we also show the same information broken out into three separate graphs. In Figure 6, we show training after approximately 3,600, 7,200 and 10,800 documents. In Figure 7, we show training after approximately 14,400, 18,000 and 21,600 documents. And in Figure 8, we show training after approximately 25,200, 28,800, 32,400, 36,000 and 39,600 documents.
Figure 5 conclusively shows that while TAR 1.0 can get close in places to TAR 2.0, at no point does it outperform it. Of course, the answer is not quite as simple as just saying that one approach is better. Take, for example, the first (reddest) TAR 1.0 gain curve in Figure 7, which was produced after training on approximately 14,400 documents. If the stopping point is 85% recall (halfway between 2.39 and 2.69 on the y-axis), then there is practically no difference between the TAR 1.0 and the TAR 2.0 approach. However, if the stopping point is 90% recall (2.69 on the y-axis), then the TAR 2.0 approach beats the TAR 1.0 approach by approximately 28,000 documents.
Or take the gain curves on Figure 8. These show that with enough expert training, the TAR 1.0 approach gets to 94-96% recall at about the same point as the TAR 2.0 approach. However, if the target is 80% recall, the TAR 2.0 approach beats the TAR 1.0 approach by approximately 17,000 documents.
Therefore, in addition to presenting the raw results, we would like to do a cursory analysis of some of the factors that go into interpreting these results. While a full discussion of these factors is beyond the scope of this write-up, certain general observations can be made.
The three factors that we believe should go into a full analysis of these results are:
- Knowing when to stop.
- The cost of using the expert.
- The time it takes to execute a simple learning protocol.
Which of these factors is most important might change from matter to matter. Sometimes time is of the essence; sometimes cost is more important. The goal of this report is simply to raise awareness of the effect of TAR 1.0 vs. TAR 2.0 protocols on these factors.
Knowing When to Stop
The first problem of an expert-trained, TAR 1.0 protocol is knowing when to stop. As Figures 5 through 8 show, the overall gain curve is sensitive to the stopping point. Stop training too early (Figure 6) and it will take much longer to get to high recall. Stop training too late (Figure 8) and you’ll more quickly get to high recall after that point, but you will have done so much training that your overall review effort (and therefore cost) is still greater than it should be.
The problem is hitting that sweet spot of exactly the right amount of training. It is beyond the scope of this report to delve into those challenges, but the point is that they are challenges. In general, the fewer critical decisions that have to be made, the better. The TAR 2.0 approach requires only one critical decision: When to stop reviewing. But the TAR 1.0 approach requires two critical decisions: (1) when to stop training and then (2) when to stop reviewing. While it is not impossible to get both decisions correct, it is much more difficult than getting just one decision correct. Figures 5 through 8 show the consequences of incorrectly deciding the stopping point for training.
Cost of Using the Expert
In order to do a comparison between TAR 1.0 and TAR 2.0, we must select one of the TAR 1.0 gain curves from Figure 5 as the basis of the comparison. Arguably, the best among these curves is the one that was produced using approximately 14,400 expert training documents. We reproduce this curve in Figure 9.
Here again, even though this is arguably the best TAR 1.0 curve, the TAR 2.0 curve beats it at all points. At 70% recall, TAR 2.0 wins by about 5,000 documents, at 85% recall by less than 1,000 documents, and at 90% recall by over 41,000 documents. Of course, there is always the question of whether it is possible to hit this curve by training neither too little nor too long. Glossing over that issue for the moment, we assume that we’ve been able to achieve this gain curve by training for exactly the correct amount of time.
So the question is: Even though TAR 1.0 is close to TAR 2.0 in the raw number of reviewed documents, what is the total cost of review? That total cost necessarily includes not only the review work, but the training work as well. And if it costs more to put eyeballs on a training document than on a review document, that must be factored in.
In this matter, there was no cost difference between the best reviewers (who were used as proxy for the experts) and the regular reviewers. They were both paid at approximately the same rate. If there were a cost difference, this analysis could be repeated to take that differential into account. But, in general, an expert reviewer tends to cost much more per hour than contract reviewers. A rule of thumb in the industry would be about $50 an hour for the contract reviewer and $400 an hour for the SME. For comparison, we are also going to show analysis using SMEs paid $100 an hour and $200 an hour.
For this analysis, we make the assumption that all reviewers work at a rate of about 50 documents per hour. Thus, a $50-an-hour reviewer costs $1 per document, a $100-an-hour reviewer costs $2 per document, a $200-an-hour reviewer costs $4 per document, and a $400-an-hour reviewer costs $8 per document.
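This conversion from hourly rate to per-document cost is simple enough to sketch directly (a minimal illustration, assuming the 50-documents-per-hour pace stated above):

```python
# Per-document review cost under the assumed pace of 50 documents/hour.
DOCS_PER_HOUR = 50

def cost_per_document(hourly_rate):
    """Dollars per document for a reviewer billing `hourly_rate` per hour."""
    return hourly_rate / DOCS_PER_HOUR

for rate in (50, 100, 200, 400):
    print(f"${rate}/hr reviewer -> ${cost_per_document(rate):.0f} per document")
# $50/hr -> $1/doc, $100/hr -> $2/doc, $200/hr -> $4/doc, $400/hr -> $8/doc
```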
Figure 10 shows the results. Along the y-axis is still the cumulative number of responsive documents found, in simulated review order. For the x-axis, however, instead of raw document count, we show cumulative review cost: each document advances the curve by what it cost to review. With TAR 2.0, all training may be done by $50-an-hour contract reviewers. Thus, every document costs $1 to review. For TAR 1.0, our analysis assumes expert training at anywhere from $50 to $400 an hour, or $1 to $8 per document, and then batched-out review at the contract rate of $1 per document.
Under the assumption that the expert costs the same as the contract reviewer, the difference between TAR 1.0 and TAR 2.0 is the same on a cost basis as it is on a total document review count basis. For example, the gap in cost at 85% recall would be less than $1,000. However, if the expert costs even a modest $100 an hour, as opposed to the contract reviewer’s $50, then the total cost of the TAR 2.0 review to get to 85% recall is around $45,000 while the total cost of the TAR 1.0 review is around $60,000. If the expert’s hourly rate is $200, then it will cost almost $116,000 to get to 85% recall. With the $400-an-hour expert, the cost would be $158,000 – over $100,000 more than the TAR 2.0 review, even though the difference in documents reviewed would be less than 1,000.
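The cost comparison above reduces to a simple model: TAR 1.0 bills training at the expert rate and review at the contract rate, while TAR 2.0 bills everything at the contract rate. The document counts in the example below are hypothetical round numbers for illustration, not the actual counts from the gain curves:

```python
DOCS_PER_HOUR = 50
CONTRACT_RATE = 50  # $/hour for a contract reviewer

def tar1_cost(training_docs, review_docs, expert_hourly_rate):
    """TAR 1.0: training billed at the expert rate, review at the contract rate."""
    per_doc_expert = expert_hourly_rate / DOCS_PER_HOUR
    per_doc_contract = CONTRACT_RATE / DOCS_PER_HOUR
    return training_docs * per_doc_expert + review_docs * per_doc_contract

def tar2_cost(total_docs):
    """TAR 2.0: every document, training included, at the contract rate."""
    return total_docs * CONTRACT_RATE / DOCS_PER_HOUR

# Hypothetical illustration: 14,400 expert training documents plus 40,000
# batched-out documents, versus 45,000 documents reviewed under TAR 2.0.
print(tar1_cost(14_400, 40_000, expert_hourly_rate=400))  # 155200.0
print(tar2_cost(45_000))                                   # 45000.0
```

The model makes the structural point visible: even when the two workflows review similar raw document counts, the expert multiplier on the training phase alone can dominate the cost gap.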
For additional comparisons, we show two more gain curves and their associated cost curves. Figures 11 and 12 show the gain curve and associated cost curves after training for approximately 25,200 documents. Training takes longer, but in the gain curve the TAR 1.0 approach catches up to the TAR 2.0 approach at about 94% recall. However, the cost to catch up to that gain curve, as shown in Figure 12, is much larger – perhaps even prohibitively so – because of the additional expert training cost.
Finally, Figures 13 and 14 show the gain curve and associated cost curves after training for approximately 7,200 documents. The training costs are less, but the resulting TAR 1.0 batched-out gain curve is also less effective, which makes the total cost to get to the same level of recall higher as well.
One more figure may be of interest. Figure 15 shows the cost curves of the TAR 1.0 system after training on 7,200, 14,400 and 25,200 documents, all presuming a $200-an-hour expert. From Figure 15, we see that these three TAR 1.0 curves with different amounts of training hit 90% recall at vastly different points, in terms of the raw number of documents reviewed. However, when one takes into account the cost of reviewing the documents – not just their raw number – each of these techniques hits 90% recall at about the same point: roughly $138,000.
What this means is that one can use more expert training and get a better ranking, but the value of a better ranking is offset by the additional cost of the expert to get to that better ranking.
Furthermore, the TAR 2.0 approach still gets to the same 90% recall point at a cost of about $65,000 – a savings of $76,000. We can conclude, therefore, that the TAR 2.0 approach is not only better in terms of raw document counts, but also in terms of total cost savings.
The Time It Takes
The final analysis that we perform is of elapsed time. One advantage of TAR 2.0 is that one can hit the ground running with one’s entire review team. By contrast, TAR 1.0 necessitates that the experts finish training before the rest of the review team can start their work. This creates a bottleneck, as there are usually many fewer experts than there are contract reviewers.
We begin with the gain curves from Figure 9. Though the TAR 2.0 approach is ahead at all points, these curves are relatively close at the vast majority of recall points. So, using these curves as the basis, we calculate the time it took to achieve these recall levels.
We know that, in this case, the firm started with eight reviewers and moved to four core reviewers for the later stages of the review. For simplicity’s sake, we average that to a review team of six across the entire review. Furthermore, we know that two of those reviewers were the skilled expert reviewers. So we presume a TAR 2.0 review team of six people across the entire review, whereas for the TAR 1.0 workflow, we presume two people doing training and six people doing the batched-out review. Finally, we assume a review rate of about one document per minute. In the TAR 2.0 approach, all six reviewers work in parallel throughout the entire process. In the TAR 1.0 approach, two reviewers work in parallel during training, and six reviewers work in parallel during batched-out review. These assumptions let us create the time-based gain curves in Figure 16.
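These assumptions can be sketched as a small elapsed-time model. The one-document-per-minute pace and the two-expert/six-reviewer team sizes come from the write-up; the 8-hour review day is our own added assumption (with it, 14,400 training documents reviewed by two experts take 15 working days, consistent with the roughly 15 days of training noted below):

```python
DOCS_PER_MINUTE = 1        # per reviewer, per the simulation's assumption
MINUTES_PER_DAY = 8 * 60   # assuming an 8-hour review day (our assumption)

def elapsed_days(docs, reviewers):
    """Working days for `reviewers` people to review `docs` documents in parallel."""
    return docs / (reviewers * DOCS_PER_MINUTE * MINUTES_PER_DAY)

def tar1_days(training_docs, review_docs):
    # Two experts train first (in parallel with each other);
    # only then do six reviewers begin the batched-out review.
    return elapsed_days(training_docs, 2) + elapsed_days(review_docs, 6)

def tar2_days(total_docs):
    # All six reviewers work in parallel from the very first document.
    return elapsed_days(total_docs, 6)

print(elapsed_days(14_400, 2))  # 15.0 days of expert-only training
```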
As you can see, TAR 1.0’s expert-based training is a real bottleneck in the overall elapsed time. Even though the curves in Figure 9 are quite similar to each other, the fact that only two reviewers work in parallel during training almost doubles the amount of time (13.5 days under TAR 2.0, 25 days under TAR 1.0) to get to 80% recall, and it more than doubles the amount of time (22 days under TAR 2.0, 45 days under TAR 1.0) to get to 90% recall.
In fact, after 15 days, the TAR 1.0 system has just barely finished its training, whereas at that same point, the TAR 2.0 system has hit a respectable 84% recall. TAR 1.0 has only finished training when TAR 2.0 has already put eyeballs on just about everything that it needs to.
In conclusion, the TAR 2.0 “training and review using everyone” workflow far and away outperforms the TAR 1.0 expert-only training and limited learning workflow, not only in terms of the raw number of documents that need to be reviewed, but also in the cost of doing the review, the time it takes to do the review, and the ease with which the review can be done.