In this research we answer two main questions: (1) What is the efficiency of a TAR 2.0 family-level document review versus a TAR 2.0 individual document review, and (2) How useful is expert-only training (i.e. TAR 1.0 with an expert), relative to TAR 2.0’s ability to conflate training and review using non-expert judgments?
1. EXPERIMENT DESCRIPTION
In this section we present a quick overview of our experimental methodology.
1.1 Matter (Data)
Even though these experiments are simulations, they are based on real data. The documents themselves are from an active litigation matter, and represent a complete review for production. The judgments are the real reviewer judgments on those documents. The experts (see Section 3) are the reviewers specifically identified by litigation counsel as the ones most skilled and knowledgeable in making relevance judgments. The system is Insight Predict. Thus, in a real world situation, with the exact same documents and the exact same judgments, the review would have proceeded exactly as each of these experiments indicates. The data is real.
The core of our experiments involves the process of simulation. A basic simulation starts with ground truth (relevance judgments assigned to docids) that is assembled from the actual final judgments given to documents during the course of an already-completed review. It then proceeds in the following manner:
(1) Initial (starting) documents are selected and added to the set of “seen” documents
(2) All documents in the seen set are assigned relevance judgments based on the ground truth values
(3) The now-judged documents in the seen set are fed to the Predictive Algorithm, and the collection is reranked
(4) Based on the goals of the simulation, unseen documents (not already in the seen set) are selected from the predictive rankings and added to the seen set
(5) If no more unseen documents remain, the process terminates. Otherwise, the simulation returns to step (2)
More details about how these simulations (experiments) play out in practice will be given in the following sections.
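The five steps above can be sketched as a short loop. This is only an illustrative sketch, not Insight Predict's implementation: the `rank_unseen` callback, the `batch_size` parameter, and all other names are hypothetical stand-ins for the predictive algorithm and the selection goal of a given simulation.

```python
from typing import Callable, Dict, List, Set

# Hypothetical ranker signature (an assumption, not a real API): it receives the
# judged documents and the unseen ids, and returns the unseen ids ordered from
# most to least likely responsive.
Ranker = Callable[[Dict[str, bool], List[str]], List[str]]


def simulate_review(
    ground_truth: Dict[str, bool],
    seed_ids: List[str],
    rank_unseen: Ranker,
    batch_size: int,
) -> List[str]:
    """Return the order in which documents enter the simulated review."""
    review_order: List[str] = list(seed_ids)  # step 1: initial documents
    seen: Set[str] = set(seed_ids)
    while len(seen) < len(ground_truth):  # step 5: stop once nothing is unseen
        # Step 2: judge every seen document from the ground truth.
        judged = {doc_id: ground_truth[doc_id] for doc_id in review_order}
        # Step 3: retrain on the judged documents and rerank the unseen ones.
        unseen = [doc_id for doc_id in ground_truth if doc_id not in seen]
        ranking = rank_unseen(judged, unseen)
        # Step 4: move the top-ranked unseen documents into the seen set.
        batch = ranking[:batch_size]
        if not batch:  # defensive: a ranker that returns nothing ends the run
            break
        review_order.extend(batch)
        seen.update(batch)
    return review_order
```

The returned list is the total ordering over the collection that the rest of the report evaluates.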
1.3 Gain Curve
No simulation is complete without a metric, some way of evaluating or visualizing the result of the simulation. The visualization that we will use in this report is a gain curve. As the simulation proceeds, more and more documents are being selected and added to the set of “seen” documents. This process of iteratively selecting more and more unseen documents, until the entire collection has been added, imposes a total ordering over all the documents in the collection, i.e. the order in which all the documents in the collection were reviewed by the simulated review team. We therefore plot this document ordering along the x-axis of a two-dimensional plot. However, not every document in the collection is a responsive document. Therefore, along the y-axis we plot the cumulative number of responsive documents that have been seen to that point in the simulated review (to that point along the x-axis). The idea is that a more effective result is one in which there is a higher rise in cumulative responsiveness earlier in the review.
The advantage of the gain curve is that it takes everything into account, all documents (simulatedly) reviewed in the order in which they were (simulatedly) reviewed, including but not limited to control sets, random samples for richness estimation purposes, judgmental samples, and so on. This allows the full comparison of an entire process, and everything that goes into that process for whatever reason, rather than just a final, often obfuscatory metric such as F1.
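As a minimal sketch of how such a curve could be computed from a simulated review order, assuming the ground truth is a mapping from document id to a responsive/non-responsive flag and that the order covers the whole collection (the function names here are hypothetical):

```python
from itertools import accumulate
from typing import Dict, List


def gain_curve(review_order: List[str], ground_truth: Dict[str, bool]) -> List[int]:
    """Cumulative count of responsive documents after each reviewed document."""
    return list(accumulate(int(ground_truth[doc_id]) for doc_id in review_order))


def docs_to_reach_recall(curve: List[int], target_recall: float) -> int:
    """Number of reviewed documents at which the curve first reaches the target recall.

    Assumes the curve spans the entire collection, so that its last value is
    the total number of responsive documents.
    """
    if not curve:
        return 0
    needed = target_recall * curve[-1]
    for position, found in enumerate(curve, start=1):
        if found >= needed:
            return position
    return len(curve)
```

Plotting the list returned by `gain_curve` against its index gives the x-axis (review order) and y-axis (cumulative responsive documents) described above; `docs_to_reach_recall` reads off where the curve crosses a given recall level.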
2. FAMILY VERSUS INDIVIDUAL DOCUMENT TAR REVIEW
2.1 Family Experiment #1
In this first experiment we test how the production review would have proceeded under a family-based TAR review versus an individual (non-family) document-based TAR review. Each condition (family and non-family) is run as a separate simulation and the results are shown on the same plot for comparison.
As per Section 1.2, a simulation generally proceeds by feeding iteratively growing sets of seen (and therefore simulatedly judged) documents to the core ranking engine and selecting (the ever-changing) top-ranked unseen docs. This is the basic procedure for the individual (non-family) document approach: At each iteration, the top unseen documents are selected and added to the simulated review. However, in the family-based approach, there is a slight difference. During the top-document selection phase, not only are the top documents selected, but any as-yet unseen family members of any of these documents are added to the simulated review as well. The following is a breakdown of the parameters for the experiment:
In both conditions (family and individual), all aspects but one were held constant. Both conditions started with the same 660 initial seed documents identified and foldered in Insight (of which 72 were relevant and 588 were non-relevant). Both follow the TAR 2.0, continuous learning (CAL) protocol, in which training is review, review is training, and learning never stops.
The feature extraction was the same, and the core learning/ranking algorithm was the same. So the primary difference is the review selection mechanism. Again, in the family condition, when a document was predicted by Insight Predict to be relevant and therefore selected and inserted into the (experiment-simulated) reviewer queue, any as-yet simulation-unseen documents that belong to the same family as the predicted document are also added to the queue in the same position. In comparison, in the individual document condition, only the predicted document itself is added into the queue.
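The difference in selection mechanism can be illustrated with a small sketch. The data structures (`family_of`, `members_of`) and the function name are assumptions made for illustration; the actual queueing logic in Insight Predict is not described in this report.

```python
from typing import Dict, List, Set


def expand_with_families(
    top_docs: List[str],
    family_of: Dict[str, str],
    members_of: Dict[str, List[str]],
    seen: Set[str],
) -> List[str]:
    """Build the family-condition queue for one iteration.

    Each predicted document is followed immediately by its as-yet unseen
    family members, so they occupy the same position in the queue.
    """
    queue: List[str] = []
    queued: Set[str] = set(seen)
    for doc_id in top_docs:
        if doc_id in queued:  # already pulled in as someone's family member
            continue
        queue.append(doc_id)
        queued.add(doc_id)
        family_id = family_of.get(doc_id)
        for member in members_of.get(family_id, []):
            if member not in queued:
                queue.append(member)
                queued.add(member)
    return queue
```

In the individual document condition the queue for an iteration is simply `top_docs` itself, which is why the family condition grows faster per iteration.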
One more item to note is the update rate parameter. During our years of research, we have found that a system that retrains (updates) more frequently also produces better results. So when comparing family-based against document-based TAR, we wanted to hold the update rate constant, so as not to give unfair advantage to one condition just because it updates more frequently than the other. To wit: Our goal was to run simulations in which we updated the rankings in each experiment after selecting the top 250 documents. However, we found that (on average) when 250 top documents were selected under the family condition, another ≈ 435 documents came in as family members of those documents. Thus, on average, the family condition was updated every ≈ 685 documents. Therefore, instead of updating every 250 documents in the individual document condition, we switched that parameter to 685. Therefore, at each iteration, each condition has “seen” roughly the same number of documents.
The results of the experiment are found in Figure 1:
We should note a couple of “features” of this graph. The first is the scale. Rather than expressing things in terms of precision and/or recall, we are expressing things in terms of their raw numbers. The raw number of simulatedly reviewed documents is along the x-axis, and the raw number of responsive documents is along the y-axis. The numbers are given at scale: 10^5 along the x-axis, 10^4 along the y-axis. Additionally, the tick marks are placed at every 10% recall point. That is, 2.39 (i.e. 23,900 documents) on the y-axis is the 8th tick mark, which is also 80% recall. This is to enhance interpretability.
The first thing that should be clear here is that the individual document review more than triples the performance of the family-based review, when measured against the theoretical best obtainable performance. That is, given that there are (approximately) 30,000 responsive docs, the bare minimum that would need to be reviewed in an eyeballs-on review to get to 80% recall is 30,000 * 0.8 = 24,000. That’s assuming that one could actually do it without looking at a single non-responsive document, ever, which is of course not a realistic assumption but nevertheless serves as a useful upper baseline. The individual document approach gets there at about 36,000 documents, which means a “waste” of 12,000 documents, while the family-based approach gets there at about 70,000 documents, which is a waste of 46,000 documents, or 3.83 times (383%) as much waste.
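The waste comparison is simple arithmetic on the approximate figures quoted above; the variable names in this short worked example are ours.

```python
# Approximate figures quoted in the text.
responsive_total = 30_000
target_recall = 0.80
reviewed_individual = 36_000  # documents reviewed to reach 80% recall, individual condition
reviewed_family = 70_000      # documents reviewed to reach 80% recall, family condition

# Theoretical floor: review only responsive documents and nothing else.
perfect_floor = round(responsive_total * target_recall)

# "Waste" is everything reviewed beyond that floor.
waste_individual = reviewed_individual - perfect_floor
waste_family = reviewed_family - perfect_floor
waste_ratio = waste_family / waste_individual

print(perfect_floor, waste_individual, waste_family, round(waste_ratio, 2))
```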
Next, we present another slight variation on this graph. One of Catalyst’s long-term warnings about reviewing as families is simply that it’s inefficient, not that good predictions can’t be had. To illustrate this, we present a secondary “perfect” line. This red dotted line is based on a family-based review. That is, if somehow an oracle were to present only families with at least one responsive document, this dotted red line is the rate at which one would (on average) achieve a perfect result, i.e. find 100% of the families with at least one relevant document and not a single family without a responsive document. Of course, even families with at least one relevant document have multiple non-responsive documents within them, which is why the perfect family line is worse than the perfect individual document line. What is interesting, however, is that up until about 85% recall, Insight Predict on individual documents actually does better than the perfect family approach. This shows just how much cost there is to family-based review.
The results of the experiment are found in Figure 2:
2.2 Family Experiment #2
In the previous experiment, we compared a raw individual document-based continuous learning review against a family-based review. The individual document approach was much more effective. However, it is often the case that families need to be produced to opposing counsel, not individual documents. Therefore, we propose an alternative workflow that satisfies the legal requirement to produce families with at least one responsive document, but does not share the same inefficiencies of a full family-based review. And we demonstrate the effectiveness of this workflow with another simulation experiment.
This second family-based experiment proceeds as follows: Documents are reviewed on an individual basis until a target recall point has been hit. At that point, all unreviewed family members of only documents that have been marked responsive are added to the review queue. We call this “individual document review with post hoc family padding”. To give an overall sense of how this review protocol works, rather than selecting a single target recall stopping point, we halt the individual document review at various recall points: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, and 95% recall. The results are shown in the chart below, in thick blue lines.
The first thing that we note is that the requirement to review all family members of even just the relevant documents found at that point in the review adds significant cost to the review. For example, notice the result at the 80% recall point (2.39 on the y-axis). That stopping point happens about 36,000 documents into the review. And as noted above, at that point, 24,000 documents are responsive and only 12,000 documents are not.
But adding the unreviewed family members of those 24,000 responsive documents increases the review queue by approximately 17,000 unseen additional documents, 1,500 of which are responsive. So recall does go up to 85%, but at a cost of around 15,500 additional non-relevant documents, i.e. more than a doubling of wasted effort.
Nevertheless, as the chart shows, this is still much more effective than a full family-based review.
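The padding step described in this section can be sketched as follows. The function and its arguments (`family_of`, `members_of`) are hypothetical names chosen for illustration, not part of any described system.

```python
from typing import Dict, List, Set


def post_hoc_family_padding(
    reviewed: List[str],
    judgments: Dict[str, bool],
    family_of: Dict[str, str],
    members_of: Dict[str, List[str]],
) -> List[str]:
    """Return the unreviewed family members of documents marked responsive.

    This runs once, after the individual document review has stopped at its
    target recall point, so it does not influence the earlier rankings.
    """
    already: Set[str] = set(reviewed)
    padding: List[str] = []
    for doc_id in reviewed:
        if not judgments[doc_id]:
            continue  # families of non-responsive documents are not pulled in
        for member in members_of[family_of[doc_id]]:
            if member not in already:
                padding.append(member)
                already.add(member)
    return padding
```

With the figures quoted above, stopping at 80% recall (about 36,000 documents reviewed, 24,000 of them responsive) and adding roughly 17,000 padding documents, 1,500 of them responsive, gives 25,500 of the approximately 30,000 responsive documents, or 85% recall, while the non-responsive total rises from 12,000 to about 27,500.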
2.3 Family Experiment #3
In our third and final family experiment, we propose a second alternative family workflow. In the previously proposed protocol, documents were reviewed on an individual basis, and then families were only “filled out” at the conclusion of the review, once the target recall point had been hit. Another approach would be to make the family review dynamic. That is, documents are still predicted and selected on an individual basis. But if a document is tagged as responsive, its family members are immediately brought into the review queue. However, if a document is not marked as responsive, its family members are not brought in.
At first glance, this would appear to not be that different from the previous proposal in Section 2.2. However, it not only changes the training update rate, but it also brings in different sets of responsive and non-responsive documents that would be used for predictions. This might have the potential to steer the review in different directions. Remember, in the previous approach, family members were only brought in at the end, after the continuous review had hit the target recall point, so they did not affect the document orderings. So we have to examine the effect of this protocol change. We name this approach “responsive-only family review”, and it is indicated in the following chart with a thick purple line.
Overall, though there are slight differences, the responsive-only family review protocol is about the same as the end result of the individual followed by post hoc family padding protocol. The former is slightly better at lower recall, slightly worse at higher recall, and better again at very high recall – though we have certain suspicions about that last 5% of responsive documents that might be interesting to consider and evaluate before we read too much into these results. Nevertheless, this second responsive-only family review is still significantly better than the full family review at almost all recall points.
3. EXPERT TAR 1.0 (SAL) VERSUS EVERYONE TAR 2.0 (CAL) REVIEW
The second question that we answer is whether it is more effective to conduct a simple learning (SAL or TAR 1.0) review trained using experts, or a continuous learning (CAL or TAR 2.0) review trained using all available reviewers. The former approach trains using only the expert reviewers (the most highly skilled or most knowledgeable ones) for a short, finite period. Training is followed by a batched-out review in which no further machine learning takes place, and the entire review team reviews documents in the order predicted by the algorithm as trained on the expert judgments. The latter approach uses all available reviewers, begins ranking and re-ranking the moment the first document is judged by any reviewer, and does not cease re-ranking until the target recall point is hit.
For these experiments we are not going to complicate matters by doing family versus individual document review. We are going to take the individual document review approach in all conditions, so that the effect of expert versus everyone, of simple (limited) versus continuous learning can be directly observed, independent of the noise added by a family-based review. Thus, the basic parameters of the experiment are as follows:
Before presenting the results, we need to discuss one caveat about the overall experiment. In an ideal world, when doing an experiment such as this one, there would be two sets of “official” judgments on every document in the collection: one from the expert or most highly skilled reviewers and one from the regular reviewers. Thus, as training documents are being selected, those selections would be able to range across any and all documents in the collection. As things stand, however, the documents for which we have expert judgments are a subset of the entire judged collection. And for good reason: paying two sets of reviewers to review the entire collection is cost prohibitive.
Nevertheless, it means that when it came time for the TAR 1.0 approach to select a specific document that it could have used for training, that document might not have been available, because it was not judged in the original review by an expert. The TAR 1.0 approach could only use documents that had an expert judgment. So the question remains as to whether the outcome of the TAR 1.0 experiment is as accurate as it otherwise could be.
On the other hand, there is an argument to be made that this might not matter. In the original review, the expert reviewers were intermingled with the regular reviewers. There was no systematic bias toward what documents those expert reviewers were seeing; they were seeing a smattering of every kind of document, throughout the entire review. Thus it is just as likely that the TAR 1.0 approach was made better by not having access to (not being “watered down” by) certain documents, as it was that the TAR 1.0 approach was made worse. Overall, I do not wish to cast any significant doubt on this experiment; however, as a good scientist I have to do a mental check of all parameters of my experiment and at least make those known so that they may be discussed.
Figure 5 shows the main result. The TAR 2.0 line is shown in blue. TAR 1.0 lines are shown in a fade from red to green, as more and more training is done. That is, because different amounts of TAR 1.0 training could be done, we show the gain curve after approximately every 3600 documents of training.
However, this information is much too dense; there are far too many lines to make sense of it all. So we also show the same information broken out into three separate graphs. In Figure 6, we show training after approximately 3600, 7200, and 10800 documents. In Figure 7 we show training after approximately 14400, 18000, and 21600 documents. And in Figure 8 we show training after approximately 25200, 28800, 32400, 36000, and 39600 documents.
Figure 5 conclusively shows that while the Expert TAR 1.0 approach can get close in places, at no point does it outperform the Everyone TAR 2.0 approach. Of course, the answer is not quite as simple as just saying that one approach is better. Take for example the first (reddest) Expert TAR 1.0 gain curve in Figure 7, which was produced after training for approximately 14400 documents. If the stopping point is 85% recall (halfway between 2.39 and 2.69 on the y-axis), then there is practically no difference between the expert and the everybody approach. However, if the stopping point is 90% recall (2.69 on the y-axis), then the everybody approach beats the expert approach by approximately 28,000 documents.
Or take the gain curves on Figure 8. These show that with enough expert training, the TAR 1.0 approach gets to 94-96% recall at about the same point as the TAR 2.0 approach. However, if the target is 80% recall, the TAR 2.0 approach beats the TAR 1.0 approach by approximately 17,000 documents.
Therefore, in addition to presenting the raw results, we would like to do a cursory analysis of some of the various factors that go into interpreting these results. While a full discussion of these factors is beyond the scope of this write-up, certain general observations can be made. The three factors that we believe should go into a full analysis of these results are: (1) knowing when to stop, (2) the cost of using the expert, and (3) the time it takes to execute a simple learning protocol. Which of these three factors is most important in any given moment might change from matter to matter. Sometimes time is of the essence; sometimes monetary factors are more important. The goal of this report is simply to raise awareness of the effect of TAR 1.0 vs. TAR 2.0 protocols on these factors.
3.3.1 Knowing When to Stop
The first problem of an expert-trained, TAR 1.0 protocol is knowing when to stop. As Figures 5 through 8 show, the overall gain curve is very sensitive to the stopping point. Stop training too early (Figure 6) and it will take much longer (more review effort needed) to get to high recall. Stop training too late (Figure 8) and you’ll more quickly get to high recall after that point, but you will have done so much training that your overall review effort (and therefore cost) is still greater than it should be.
The problem is hitting that sweet spot, of exactly the right amount of training: Not too much and not too little. It is beyond the scope of this report to delve into those challenges, but the point is that they are challenges. If this is a point that is of more interest, we recommend the following paper: “An Exploratory Analysis of Control Sets for Measuring E-Discovery Progress”. In general, the fewer critical decisions that have to be made, the better. The TAR 2.0 approach requires only one critical decision: The decision when to stop reviewing. In contrast, the TAR 1.0 approach requires two critical decisions: (1) The decision when to stop training and then (2) the decision when to stop reviewing. As such, while it is not impossible to get both decisions correct, it is much more difficult than just getting one decision correct. Figures 5 through 8 show the consequences of getting the training stopping point decision incorrect.
3.3.2 Monetary Cost of Using the Expert
In order to do a comparison between expert-trained TAR 1.0 and everybody TAR 2.0, we must select one of the TAR 1.0 gain curves from Figure 5 as the basis of the comparison. Arguably, the best among these curves is the one that was produced using approximately 14400 expert training documents. We reproduce this curve in Figure 9.
Again, even though this is (arguably) the best TAR 1.0 curve, the TAR 2.0 curve beats it at all points: At 70% recall TAR 2.0 wins by only about 5,000 documents, at 85% recall by fewer than 1,000 documents, and at 90% recall by over 41,000 documents. Of course, as per Section 3.3.1, there is always the question of whether it is possible to hit this curve by neither training too little nor too long. Glossing over that issue for the moment, we assume that we’ve been able to achieve this gain curve by training for exactly the correct amount of time. So the question is: Even though TAR 1.0 is close to TAR 2.0 in terms of the raw number of reviewed documents, what is the total cost of review? That total cost necessarily includes not only the review work, but the training work as well. And if it costs more to put eyeballs on a training document than on a review document, that must be taken into account.
In this particular matter, there was no cost difference between the best reviewers (which were used as proxy for the experts) and the regular reviewers. They were both paid at approximately the same rate. If there were a cost difference, this same analysis can be repeated to focus more directly on that specific differential. But in general, an expert reviewer tends to cost much more per hour than contract reviewers. A rule of thumb in the industry would be about $50/hour for the contract reviewer and $400/hour for the SME. For comparison, we are also going to show analysis using $100/hour and $200/hour SMEs as well. So let’s examine these results in terms of that cost. For this analysis, I make the assumption that all reviewers work at a rate of about 50 documents per hour. Thus, a $50/hour reviewer costs about $1 per document, a $100/hour reviewer costs $2 per document, a $200/hour reviewer costs $4 per document, and a $400/hour reviewer costs about $8 per document.
The results of our analysis are found in Figure 10. Along the y-axis is still the cumulative number of responsive documents found, and is still shown in simulated review order. However, instead of that x-axis being expressed in terms of raw document count, it is expressed in terms of the dollar amount to review each document. In the TAR 2.0 approach, training is review and review is training, and all (training = review) may be done by $50/hour contract reviewers. So every document costs $1 to review. For the TAR 1.0 approach, our analysis assumes expert training with anywhere from $50/hour to $400/hour ($1 per document to $8 per document), and then batched out review (no additional learning) at the contract rate of $1 per document.
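The cost model behind these curves can be written down directly from the stated assumptions (50 documents per hour, $50/hour contract reviewers, and a range of expert rates). The sketch below uses hypothetical function names and a hypothetical review count in the usage note; the dollar figures discussed in the text are read from the actual curves, which this sketch does not attempt to reproduce.

```python
DOCS_PER_HOUR = 50  # review speed assumed in the text for all reviewers


def cost_per_document(hourly_rate: float) -> float:
    """Dollar cost of one reviewed document at a given hourly rate."""
    return hourly_rate / DOCS_PER_HOUR


def tar2_cost(docs_reviewed: int, contract_rate: float = 50.0) -> float:
    """TAR 2.0: training is review, so every document is billed at the contract rate."""
    return docs_reviewed * cost_per_document(contract_rate)


def tar1_cost(
    training_docs: int,
    docs_reviewed: int,
    expert_rate: float,
    contract_rate: float = 50.0,
) -> float:
    """TAR 1.0: expert-rate training first, then batched-out review at the contract rate.

    docs_reviewed counts every document seen so far, training included.
    """
    trained = min(training_docs, docs_reviewed)
    batched_out = docs_reviewed - trained
    return trained * cost_per_document(expert_rate) + batched_out * cost_per_document(contract_rate)
```

For instance, under these assumptions a hypothetical TAR 1.0 review of 46,000 documents with 14,400 training documents and a $100/hour expert comes to `tar1_cost(14400, 46000, 100.0)`, i.e. 14,400 documents at $2 plus 31,600 documents at $1.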
Under the assumption that the expert costs the same amount as the contract reviewer, the difference between TAR 1.0 and TAR 2.0 is the same on a cost basis as it is on a total document review count basis. For example, the gap in cost at 85% recall (on this curve) would be less than $1000. However, if the expert reviewer costs as little as $100/hour as opposed to the contract reviewer’s $50/hour, then the total cost of the TAR 2.0 review to get to 85% recall is around $45,000 while the total cost of the TAR 1.0 review is around $60,000. If the expert reviewer costs $200/hour, then it will cost almost $116,000 to get to 85% recall. With the $400/hour expert the cost would be $158,000, over a hundred thousand dollars more than the TAR 2.0 review, even though the total difference in number of documents reviewed would be less than 1000.
It should be clear that the reason this is happening boils down to the cost of the expert training. Part of this exercise is theoretical, in that we are assuming different expert training costs. But part of this exercise is realistic, in that we are using the actual matter and judgments to get a sense of the relative proportion of training and review that is necessary to get a good TAR 1.0 result. By showing a range of expert costs in Figure 10, we get a sense of the distribution over total cost under an expert TAR 1.0 versus an everybody TAR 2.0 workflow.
For additional comparisons, we show two more gain curves and their associated cost curves. Figures 11 and 12 show the gain curve and associated cost curves after training for approximately 25200 documents. Training takes longer, but in the gain curve the TAR 1.0 approach catches up to the TAR 2.0 approach at about 94% recall. However, the cost to catch up to that gain curve, as shown in Figure 12, is much larger – perhaps even prohibitively so – because of the additional expert training cost.
Finally, Figures 13 and 14 show the gain curve and associated cost curves after training for approximately 7200 documents. The training costs are less, but the resulting TAR 1.0 batched-out (no additional learning) gain curve is also less effective, which makes the total cost to get to the same level of recall higher as well.
There is one more figure which may be of interest. Figure 15 shows the cost curves of the expert TAR 1.0 system after training on 7200, 14400, and 25200 documents, respectively – all presuming a $200/hour expert. From Figure 15 we see that these three TAR 1.0 curves with different amounts of training hit 90% recall at vastly different points, i.e. at vastly different raw numbers of total reviewed documents. However, when one takes into account the cost of reviewing the documents, not just the raw number, each of these techniques hits 90% recall at about the same point: At about $138,000. Essentially, what this means is that one can use more expert training and get a better ranking, but the value of a better ranking is offset by the additional cost of the expert to get to that better ranking.
Furthermore, the TAR 2.0 approach still gets to the same 90% recall point at a cost of about $65,000 (a savings of about $76,000) so all of that is a moot point, anyway. We therefore conclude from the range of experiments that the TAR 2.0 approach is not only better in terms of raw document counts, but in terms of total monetary cost as well.
3.3.3 Expert-Attributable Time Bottlenecks
The final analysis that we perform is an elapsed clock time analysis. One advantage of the TAR 2.0 approach is that one can hit the ground running with one’s entire review team. In comparison, an expert-driven TAR 1.0 approach necessitates that the experts finish training before the rest of the review team can start their work. This becomes a bottleneck, as there are usually many fewer expert reviewers than there are contract reviewers.
For this analysis we begin with the gain curves from Figure 9. Though the TAR 2.0 approach is ahead at all points, these curves are relatively close at the vast majority of recall points. So using these curves as the basis, we calculate how long it takes, in total elapsed clock time, to achieve these recall levels.
As the basis of our calculation, we are going to use the stated fact that in this case, the firm started with a team of 8 reviewers and moved to 4 core reviewers for the late stages of the review. For simplicity’s sake I am going to average that to a review team size of 6 people across the entire review. Furthermore, we were told that two of those reviewers were the skilled, “expert” reviewers. So for this analysis we presume a TAR 2.0 review team of 6 people across the entire review, whereas for the TAR 1.0 workflow we presume 2 people doing training and 6 people doing the batched-out review. Finally, we assume a document review rate of about one document per minute. In the TAR 2.0 approach, all 6 reviewers work in parallel throughout the entire process. In the TAR 1.0 approach, 2 reviewers work in parallel during training, and 6 reviewers work in parallel during batched-out review. These assumptions let us create the time-based gain curves in Figure 16.
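The conversion from document counts to elapsed time can be sketched as below. The team sizes and the one-document-per-minute rate come from the text; the eight-hour working day is our own assumption (the report does not state the length of a working day), and the function names are hypothetical.

```python
DOCS_PER_REVIEWER_HOUR = 60  # about one document per minute, as assumed in the text
HOURS_PER_DAY = 8            # assumption: not stated in the report


def elapsed_days(docs: int, reviewers: int) -> float:
    """Working days needed for a team reviewing in parallel."""
    return docs / (reviewers * DOCS_PER_REVIEWER_HOUR * HOURS_PER_DAY)


def tar2_elapsed_days(docs_reviewed: int, team_size: int = 6) -> float:
    """TAR 2.0: the whole team works in parallel from the first document."""
    return elapsed_days(docs_reviewed, team_size)


def tar1_elapsed_days(
    training_docs: int,
    docs_reviewed: int,
    experts: int = 2,
    team_size: int = 6,
) -> float:
    """TAR 1.0: experts train first, then the full team does the batched-out review."""
    trained = min(training_docs, docs_reviewed)
    return elapsed_days(trained, experts) + elapsed_days(docs_reviewed - trained, team_size)
```

Under the eight-hour-day assumption, training on 14,400 documents with two experts takes `elapsed_days(14400, 2)`, i.e. 15 working days, before the other reviewers can begin.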
As we see from those curves, TAR 1.0 expert-based training is a real bottleneck in terms of the overall elapsed clock time of the entire process. Even though the curves in Figure 9 are quite similar to each other, the fact that only two reviewers work in parallel during training almost doubles the amount of time (13.5 days under TAR 2.0, 25 days under TAR 1.0) to get to 80% recall, and more than doubles the amount of time (22 days under TAR 2.0, 45 days under TAR 1.0) to get to 90% recall.
In fact, one way of looking at this is that after 15 days, the TAR 1.0 system has just barely finished doing its training, whereas at that same point in time the TAR 2.0 system has hit a respectable 84% recall. TAR 1.0 has only finished training when TAR 2.0 has already put eyeballs on just about everything that it needs to.
We can certainly undertake additional time-based analysis, including but not limited to different TAR 1.0 gain curves and different numbers of experts and review team sizes. But, generally, this example more than illustrates the time bottleneck problems of a TAR 1.0 workflow.
In conclusion, the TAR 2.0 “training and review using everyone” workflow far and away outperforms the TAR 1.0 expert-only training, finite (limited) learning workflow not only in terms of the raw number of documents that need to be reviewed, but also in the cost of doing the review, the time it takes to do the review, and the ease with which the review can be done (i.e. one does not have to make two critical decisions about when to stop training and when to stop review, only one decision about when to stop review).
 J. Pickens. An exploratory analysis of control sets for measuring e-discovery progress. In Proceedings of the ICAIL 2015 Workshop on Using Machine Learning and Other Advanced Techniques to Address Legal Problems in EDiscovery and Information Governance (DESI VI Workshop), San Diego, California, 2015.
 J. Tredennick, J. Pickens, and J. Eidelman. Predictive coding 2.0: New and better approaches to non-linear review. http://www.legaltechshow.com/r5/cob_page.aspcategory_id=72044&initial_file=cob_pageltech_agenda.asp#ETA3, January 2012. LegalTech Presentation.