Comparing Family-Level Review Against Individual-Document Review: A Simulation Experiment

In two recent posts, we’ve reported on simulations of technology assisted review conducted as part of our TAR Challenge—an opportunity for any corporation or law firm to compare its results in an actual, manual review against the results it would have achieved using Catalyst’s advanced TAR 2.0 technology, Insight Predict.

Today, we are taking a slightly different tack. We are again conducting a simulation using actual documents that were previously reviewed in an active litigation. However, this time, we are using the documents to conduct two experiments.

This post reports on the first of these experiments. In this first experiment, we test the question of how the review would have proceeded in a family-based versus individual (non-family) document-based TAR review. We’ll run each condition (family and non-family) as a separate simulation and show the results on the same plot for comparison.

In the second experiment, which we will report on in a subsequent blog post, we answer the question of whether it is more effective to conduct a simple learning (SAL or TAR 1.0) review trained using experts, or a continuous learning (CAL or TAR 2.0) review trained using all available reviewers.

Both of these experiments are reported in greater detail in this report. This post is an abridged version of the information contained in that report.

Testing Family vs. Non-Family

At the outset, note that even though these experiments are simulations, they are based on real data. The documents we use are from an actual litigation and represent a complete review for production. The judgments are the real reviewer judgments on those documents. In a real-world setting, with the exact same documents and the exact same judgments, the review would have proceeded exactly as these experiments indicate.

When conducting a simulation, we start with ground truth – the relevance judgments assigned to docids – that we’ve assembled from the actual final judgments given to documents during the course of a previously completed review. We then proceed in the following manner:

  1. Initial (starting) documents are selected and added to the set of “seen” documents.
  2. All documents in the seen set are assigned relevance judgments based on the ground truth values.
  3. The now-judged documents in the seen set are fed to the TAR algorithm, and the collection is re-ranked.
  4. Based on the goals of the simulation, unseen documents (not already in the seen set) are selected from the predictive rankings and added to the seen set.
  5. If no more unseen documents remain, the process terminates. Otherwise, the simulation returns to step (2).
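
To make the loop concrete, here is a minimal sketch of the simulation in Python. The names used here (rank_collection, ground_truth, batch_size) are our own illustrations, not Insight Predict’s actual API, and the ranking step is treated as a black box.

```python
def simulate_review(collection, ground_truth, seed_docs, batch_size=250):
    """Simulate a CAL review: judge seen docs, re-rank, pull the next batch."""
    seen = list(seed_docs)                                 # step 1: initial seed set

    while True:
        judged = {d: ground_truth[d] for d in seen}        # step 2: apply ground-truth judgments

        # Step 3: re-rank the collection using the judged documents.
        # rank_collection() stands in for the TAR engine; it is assumed, not real.
        ranking = rank_collection(collection, judged)

        unseen = [d for d in ranking if d not in judged]   # step 4: top-ranked unseen docs
        if not unseen:                                     # step 5: stop when nothing is left
            break
        seen.extend(unseen[:batch_size])

    return seen                                            # documents in simulated review order
```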

As we go through the simulation, we plot the results on a gain curve showing the order in which responsive and non-responsive documents would have been reviewed with Predict. A gain curve provides a simple but effective visual means to compare the TAR results against a linear review because it shows the cumulative number of responsive documents found as the review progresses.
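
As a rough illustration, a gain curve can be drawn directly from the simulated review order and compared against a linear-review baseline. This sketch assumes the review order returned by the simulate_review sketch above and uses matplotlib.

```python
import matplotlib.pyplot as plt

def plot_gain_curve(review_order, ground_truth):
    """Plot cumulative responsive documents found versus documents reviewed."""
    gain, found = [], 0
    for doc in review_order:
        found += 1 if ground_truth[doc] else 0
        gain.append(found)

    reviewed = range(1, len(gain) + 1)
    plt.plot(reviewed, gain, label="Predictive ranking")
    # A linear review finds responsive documents at a constant rate equal to overall richness.
    plt.plot(reviewed, [found * i / len(gain) for i in reviewed],
             linestyle="--", label="Linear review")
    plt.xlabel("Documents reviewed")
    plt.ylabel("Responsive documents found")
    plt.legend()
    plt.show()
```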

Family Experiment #1

As noted, this experiment tests the question of how the review would have proceeded in a family-based versus individual (non-family) document-based TAR review. Our simulation proceeds by feeding iteratively growing sets of seen (and therefore judged) documents to the core ranking engine and selecting the top-ranked unseen docs.

For the individual (non-family) approach, we follow this procedure: At each iteration, the top unseen documents are selected and added to the simulated review. However, in the family-based approach, there is a slight difference. Not only are the top documents selected, but any as-yet unseen family members of any of these documents are also added to the simulated review. The following is a breakdown of the parameters for the experiment:

For both conditions (family and individual), all aspects but one are held constant. Both conditions start with the same 660 initial seed documents identified and foldered in Insight (of which 72 were relevant and 588 were non-relevant). Both follow the TAR 2.0 continuous learning (CAL) protocol, in which training is review and review is training.

Table 1: Family versus Individual Document TAR

The feature extraction is the same, and the core learning/ranking algorithm is the same. The primary difference is the review-selection mechanism. Again, in the family condition, when a document is predicted by Insight Predict to be relevant and therefore selected and inserted into the simulated reviewer queue, any as-yet simulation-unseen documents that belong to the same family as the predicted document are also added to the queue in the same position. By comparison, in the individual document condition, only the predicted document itself is added to the queue.
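
To make the selection mechanics concrete, here is a minimal sketch of the two conditions, assuming a hypothetical families mapping from each document to its family members. Only the batch-selection step differs; note that, as described below, the batch size counts only the predicted documents, so family members come in on top of it.

```python
def select_batch(ranking, seen, batch_size, families=None):
    """Pick the next review batch from the predictive ranking.

    Individual condition (families=None): take the top-ranked unseen documents.
    Family condition: also pull in any unseen family members, inserted at the
    same queue position as the predicted document that triggered them.
    """
    batch, seen, predicted = [], set(seen), 0
    for doc in ranking:
        if doc in seen:
            continue
        batch.append(doc)
        seen.add(doc)
        predicted += 1
        if families is not None:
            for member in families.get(doc, []):
                if member not in seen:
                    batch.append(member)     # family member rides along, unranked
                    seen.add(member)
        if predicted >= batch_size:          # batch size counts predicted docs only
            break
    return batch
```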

One more note. Through years of research, we have found that a system that retrains more frequently produces better results. So when comparing family-based against document-based TAR, we wanted to hold the update rate constant, so as not to give an unfair advantage to one condition simply because it updates more frequently than the other.

Our goal was to run simulations in which we updated the rankings in each experiment after selecting the top 250 documents. However, we found that, on average, when the top 250 documents were selected under the family condition, another 435 or so documents came in as family members of those documents. Thus, on average, the family condition was updated every 685 documents. Instead of updating every 250 documents in the individual document condition, therefore, we switched that parameter to 685, so that at each iteration each condition has “seen” roughly the same number of documents.

The results of the experiment are shown in Figure 1:

Figure 1: Family vs. Individual [family (red), individual (blue), perfect (dotted blue), manual (dotted black)]

As you can see, measured against the theoretical best-obtainable performance, the family-based review wastes nearly four times as much effort as the individual document review. That is, given that there are approximately 30,000 responsive documents, the bare minimum that would need to be reviewed in an eyeballs-on review to reach 80% recall is 30,000 * 0.8 = 24,000. That assumes one could actually do it without ever looking at a single non-responsive document, which is not a realistic assumption but nevertheless serves as a useful best-case baseline. The individual-document approach reaches 80% recall at about 36,000 documents, a “waste” of 12,000 documents, while the family-based approach gets there at about 70,000 documents, a waste of 46,000 documents, or 3.83 times as much waste.
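
For reference, the waste arithmetic behind those figures, using the approximate numbers reported above:

```python
responsive_total = 30_000
target_recall = 0.8

minimum_review = responsive_total * target_recall   # 24,000: the zero-waste ideal
individual_waste = 36_000 - minimum_review          # ~12,000 non-responsive docs reviewed
family_waste = 70_000 - minimum_review              # ~46,000 non-responsive docs reviewed

print(family_waste / individual_waste)              # ~3.83x as much wasted effort
```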

The next graph shows a slight variation. One of Catalyst’s long-standing warnings about reviewing as families is simply that it is inefficient, not that good predictions can’t be had. To illustrate this, we add a second “perfect” line. This dotted red line plots a perfect family-based review: if an oracle were somehow to present only families containing at least one responsive document, the dotted red line shows the rate at which one would, on average, find 100% of those families without reviewing a single family that lacks a responsive document.

Of course, even families with at least one relevant document have multiple non-responsive documents within them, which is why the perfect family line is worse than the perfect individual document line. What is interesting, however, is that up until about 85% recall, Insight Predict on individual documents does better than the perfect family approach. This shows just how much cost there is to family-based review.

Figure 2: Family vs. Individual [family (red), individual (blue), individual perfect (dotted blue), family perfect (dotted red), manual (dotted black)]

Family Experiment #2

In the previous experiment, we compared a raw individual document-based continuous learning review against a family-based review. However, it is often the case that families, not individual documents, must be produced to opposing counsel. Therefore, our second experiment proposes an alternative workflow that satisfies the legal requirement to produce families containing at least one responsive document, but does not suffer the same inefficiencies as a full family-based review.

This time, documents are reviewed on an individual basis until a target recall point is hit. At that point, all unreviewed family members of only documents that have been marked responsive are added to the review queue. We call this “individual document review with post hoc family padding.” To give an overall sense of how this review protocol works, rather than selecting a single target recall stopping point, we halt the individual document review at various recall points: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 85%, 90%, and 95% recall. The results are shown in the chart below, in thick blue lines.
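
In sketch form, the padding step is simple: once the individual review stops, only the unseen relatives of documents already marked responsive are appended to the queue. The families and ground_truth structures are again our own illustrations, not Insight Predict’s actual API.

```python
def post_hoc_family_padding(reviewed, ground_truth, families):
    """After the individual review stops, queue unseen family members of responsive docs."""
    seen, padding = set(reviewed), []
    for doc in reviewed:
        if not ground_truth[doc]:            # only responsive documents trigger padding
            continue
        for member in families.get(doc, []):
            if member not in seen:
                padding.append(member)
                seen.add(member)
    return padding                           # appended to the end of the review queue
```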

Figure 3: Family vs. Individual [family (red), individual (blue), post hoc family padding (thick blue), individual perfect (dotted blue), family perfect (dotted red), manual (dotted black)]

Note that the requirement to review all family members of even just the relevant documents found at that point in the review adds significant cost. For example, notice the result at the 80% recall point (2.39 on the y-axis). That stopping point happens about 36,000 documents into the review. As noted above, at that point, 24,000 documents are responsive and only 12,000 documents are not.

But adding the unreviewed family members of those 24,000 responsive documents increases the review queue by approximately 17,000 unseen documents, of which 1,500 are responsive. Recall does go up to 85%, but at a cost of around 15,500 additional non-relevant documents, i.e. more than a doubling of wasted effort.

Nevertheless, as the chart shows, this is still much more effective than a full family-based review.

Family Experiment #3

In our third and final family experiment, we propose a second alternative family workflow. In the previous experiment, documents were reviewed on an individual basis, and then families were “filled out” only at the conclusion of the review, once the target recall point had been hit. Another approach would be to make the family review dynamic. That is, documents are still predicted and selected on an individual basis. However, if a document is tagged as responsive, its family members are immediately brought into the review queue. If a document is not marked as responsive, its family members are not brought in.

At first glance, this would appear not to differ much from the previous proposal. In fact, however, it not only changes the training update rate, but it also brings in different sets of responsive and non-responsive documents to be used for predictions. This has the potential to steer the review in different directions. Remember, in the previous approach, family members were only brought in at the end, after the continuous review had hit the target recall point, so they did not affect the document ordering. So we have to examine the effect of this protocol change. We name this approach “responsive-only family review,” and it is indicated in the following chart with a thick purple line.
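
In sketch form, the dynamic variant simply moves the family lookup inside the batch-selection loop, so a responsive document’s family enters the queue (and thus the training set) immediately. In the simulation, the ground-truth judgment stands in for the reviewer’s responsiveness tag; the names here are illustrative only.

```python
def responsive_only_family_batch(ranking, seen, ground_truth, families, batch_size):
    """Select top-ranked unseen docs; when one is responsive, queue its family at once."""
    batch, seen, predicted = [], set(seen), 0
    for doc in ranking:
        if doc in seen:
            continue
        batch.append(doc)
        seen.add(doc)
        predicted += 1
        if ground_truth[doc]:                    # responsive: family joins the queue now
            for member in families.get(doc, []):
                if member not in seen:
                    batch.append(member)
                    seen.add(member)
        if predicted >= batch_size:
            break
    return batch
```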

Figure 4: Family vs. Individual [family (red), individual (blue), post hoc family padding (thick blue), responsive-only family review (thick purple), individual perfect (dotted blue), family perfect (dotted red), manual (dotted black)]

Overall, there are only slight differences. The responsive-only family review protocol ends up roughly where the individual review followed by post hoc family padding does. The former is slightly better at lower recall, slightly worse at higher recall, and better again at very high recall – though we have some suspicions about that last 5% of responsive documents that would be worth examining before we read too much into these results. Nevertheless, this second responsive-only family review is still significantly better than the full family review at almost all recall points.

In our next post, we will look at whether it is more effective to conduct a simple learning (SAL or TAR 1.0) review trained using experts, or a continuous learning (CAL or TAR 2.0) review trained using all available reviewers.
