TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?

For all of its complexity, technology-assisted review (TAR) in its traditional form is easy to sum up:

  1. A lawyer (subject matter expert) sits down at a computer and looks at a subset of documents.
  2. For each, the lawyer records a thumbs-up or thumbs-down decision (tagging the document). The TAR algorithm watches carefully, learning during this training.
  3. When training is complete, we let the system rank and divide the full set of documents between (predicted) relevant and irrelevant.[1]
  4. We then review the relevant documents, ignoring the rest.
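For the technically inclined, here is a minimal sketch of that one-shot workflow. It uses scikit-learn purely as a generic stand-in classifier; the documents, tags, and confidence cutoff are placeholders, and nothing here describes the internals of any particular TAR product.

```python
# A minimal sketch of a traditional (TAR 1.0) one-shot ranking.
# scikit-learn is a generic stand-in, not any vendor's actual algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder data: the full collection plus the SME's training decisions.
collection = ["contract terms ...", "lunch plans ...", "merger discussion ...",
              "fantasy football ...", "pricing strategy ...", "spam offer ..."]
training_idx = [0, 1, 2, 3]      # documents the SME looked at
training_tags = [1, 0, 1, 0]     # 1 = thumbs-up (relevant), 0 = thumbs-down

# Steps 1-2: the algorithm learns from the SME's judgments.
features = TfidfVectorizer().fit_transform(collection)
model = LogisticRegression().fit(features[training_idx], training_tags)

# Step 3: rank the entire collection once, then split at a cutoff.
scores = model.predict_proba(features)[:, 1]   # predicted probability of relevance
cutoff = 0.5                                   # placeholder confidence threshold
review_pile = [i for i, s in enumerate(scores) if s >= cutoff]
set_aside = [i for i, s in enumerate(scores) if s < cutoff]

# Step 4: only review_pile goes to human reviewers; set_aside is never reviewed.
print(f"Review {len(review_pile)} documents; set aside {len(set_aside)}.")
```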

The benefits from this process are easy to see. Let’s say you started with a million documents that otherwise would have to be reviewed by your team. If the computer algorithm predicted with the requisite degree of confidence that 700,000 are likely not-relevant, you could then exclude them from the review for a huge savings in review costs. That is a great result, particularly if you are the one paying the bills.

But is that it? Once you “part the waters” after the document ranking, you are stuck reviewing the 300,000 that fall on the relevant side of the cutoff. If I were the client, I would wonder whether there were steps you could take to reduce the document population even further. While reviewing 300,000 documents is better than a million, cutting that to 250,000 or fewer would be even better.

Can we reduce the review count even further?

The answer is yes, if we can change the established paradigm. TAR 1.0 was about the benefits of identifying a cutoff point after running a training process using a subject matter expert (SME). TAR 2.0 is about continuous ranking throughout the review process—using review teams as well as SMEs. As the review teams work their way through the documents, their judgments are fed back to the computer algorithm to further improve the ranking. As the ranking improves, the cutoff point is likely to improve as well. That means even fewer documents to review, at a lower cost. The work gets done more quickly as well.

It can be as simple as that!

Insight Predict is built around this idea of continuous ranking. While you can use it to run a traditional TAR process, we encourage clients to take more than one bite at the ranking apple. Start the training by finding as many relevant documents (responsive, privileged, etc.) as your team can identify. Supplement these documents (often called seeds) through random sampling, or use our contextual diversity sampling to view documents selected for their distinctiveness from documents already seen.[2]

The computer algorithm can then use these training seeds as a basis to rank your documents. Direct the top-ranked ones to the review team for their consideration.

In this scenario, the review team starts quickly, working from the top of the ranked list. As they review documents, you feed their judgments back to the system to improve the ranking, supplemented with other training documents chosen at random or through contextual diversity. Meanwhile, the review team continues to draw from the highest-ranked documents, using the most recent ranking available. They continue until the review is complete.[3]
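Here is a simplified sketch of that loop, again using a generic classifier as a stand-in. The `get_reviewer_judgments` callback, the batch size, and the stopping test are all placeholders of my own, not a description of Insight Predict’s internals.

```python
# A simplified sketch of a continuous-ranking (TAR 2.0) loop.
# The classifier and the review callback are generic placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def continuous_review(collection, seed_judgments, get_reviewer_judgments,
                      batch_size=300, stop_when=None):
    """collection: list of document texts.
    seed_judgments: dict {doc_index: 0/1} from the initial seed effort; it
    should contain both relevant and not-relevant examples.
    get_reviewer_judgments: callable(list_of_indices) -> dict {index: 0/1},
    standing in for the human review team.
    stop_when: optional callable(judgments) -> bool that ends the review."""
    features = TfidfVectorizer().fit_transform(collection)
    judgments = dict(seed_judgments)

    while stop_when is None or not stop_when(judgments):
        # Re-train on everything judged so far and re-rank the whole collection.
        idx = list(judgments)
        model = LogisticRegression().fit(features[idx], [judgments[i] for i in idx])
        scores = model.predict_proba(features)[:, 1]

        # Hand the highest-ranked unreviewed documents to the review team.
        unreviewed = [i for i in range(len(collection)) if i not in judgments]
        if not unreviewed:
            break
        unreviewed.sort(key=lambda i: scores[i], reverse=True)
        batch = unreviewed[:batch_size]

        # The team's decisions flow straight back into the next training round.
        judgments.update(get_reviewer_judgments(batch))

    return judgments
```

The `stop_when` test is deliberately left open here; as footnote [3] notes, deciding when the review is complete is a topic of its own.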

Does it work?

Logic tells us that continuously updated rankings will produce better results than a one-time process. As you add more training documents, the algorithm should improve. At least, that is the case with our system. While rankings based on a few thousand training documents can be quite good, they almost always improve through the addition of more training documents. As our Senior Research Scientist Jeremy Pickens says: “More is more.” And more is better.

And while more is better, it does not necessarily mean more work for the team. Because our system accepts additional training documents and continually refines its rankings based on those exemplars, the review team ends up reviewing fewer documents, saving both time and money.

Testing the hypothesis

We decided to test our hypothesis using three different review projects. Because each had already gone through linear review, we had what Dr. Pickens calls “ground truth” about all of the records being ranked. Put another way, we already knew whether the documents were responsive or privileged (which were the goals of the different reviews).[4]

Thus, in this case we were not working with a partial sample or drawing conclusions based on a sample set. We could run the ranking process as if the documents had not been reviewed but then match up the results to the actual tags (responsive or privileged) given by the reviewers.

The process

The tests began by picking six documents at random from the total collection. We used those six documents as training seeds for an initial ranking of the entire collection.[5]

From there, we simulated delivering new training documents to the reviewers. We included a mix of highly ranked and random documents, along with others selected for their contextual diversity (meaning they were different from anything previously selected for training). We used this technique to help ensure that the reviewers saw a diverse range of documents—hopefully improving the ranking results.

Our simulated reviewers made judgments on these new documents based on tags from the earlier linear review. We then submitted their judgments to the algorithm for further training and ranking. We continued this train-rank-review process, working in batches of 300, until we reached an appropriate recall threshold for the documents.

What do I mean by that? At each point during the iteration process, Insight Predict ranked the entire document population. Because we knew the true responsiveness of every document in the collection, we could easily track how far down in the ranking we would have to go to cover 50%, 60%, 70%, 80%, 90%, or even 95% of the relevant documents.
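The measurement itself is simple once you have ground truth. The sketch below is my own illustration, with made-up document IDs: walk down the ranked list and report how deep you must go before a target fraction of the known relevant documents has appeared.

```python
def review_depth_for_recall(ranked_doc_ids, relevant_ids, target_recall):
    """How many documents, taken from the top of the ranking, must be
    reviewed before target_recall of the known relevant documents appear."""
    relevant_ids = set(relevant_ids)
    needed = target_recall * len(relevant_ids)
    found = 0
    for depth, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            found += 1
        if found >= needed:
            return depth
    return len(ranked_doc_ids)  # target recall never reached

# Illustrative only: a ten-document ranking with three known relevant documents.
ranking = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6", "d0"]
relevant = {"d7", "d9", "d4"}
print(review_depth_for_recall(ranking, relevant, 0.80))  # -> 8
```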

From there, we plotted the information to compare how many documents you would have to review using a one-time ranking process versus a continuous ranking approach. For clarity and simplicity, I chose two recall points to display: 80% (a common recall level) and 95% (high but achievable with our system). I could have presented several other recall rates as well, but doing so would have made the charts more confusing than necessary. The curves all looked similar in any event.

The research studies

Below are charts showing the results of our three case studies. These charts are different from the typical yield curves because they serve a different purpose. In this case, we were trying to demonstrate the efficacy of a continuous ranking process rather than a single ranking outcome.

Specifically, the X-axis shows the number of documents that were manually tagged and used as seeds for the simulated review process. The Y-axis shows the number of documents the review team would have to review (based on the seeds input to that point) to reach a desired recall level. The black diagonal line crossing the middle represents the simulated review counts, which were continually fed back to the algorithm for additional training.
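To make the chart layout concrete, here is a rough sketch of how such a chart could be assembled. The data points are hypothetical round numbers used purely for illustration; they are not taken from any of the three studies.

```python
# Hypothetical illustration of the seed-count vs. review-depth chart.
# None of these numbers come from the actual case studies.
import matplotlib.pyplot as plt

seeds_reviewed = [0, 2_500, 5_000, 10_000, 20_000, 30_000, 40_000]          # X-axis
docs_needed_80 = [100_000, 70_000, 55_000, 45_000, 38_000, 34_000, 32_000]  # Y-axis

plt.plot(seeds_reviewed, docs_needed_80, color="red",
         label="Documents needed for 80% recall")
plt.plot(seeds_reviewed, seeds_reviewed, color="black",
         label="Documents reviewed so far (training seeds)")
# The review could stop where the black line meets the red curve.
plt.xlabel("Training seeds reviewed")
plt.ylabel("Documents to review to reach target recall")
plt.legend()
plt.show()
```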

This will all make more sense when I walk you through the case studies. The facts of these cases are confidential, as are the clients and actual case names. But the results are highly interesting, to say the least.

Research study 1: Wellington F matter (responsive review)

This case involved a review of 85,506 documents. Of those, 11,460 were judged responsive. That translates to a prevalence (richness) rate of about 13%. Here is the resulting chart from our simulated review:

[Chart: Wellington F matter, simulated responsive review]

There is a lot of information on this chart so I will take it step by step.

The black diagonal line represents the number of seeds given to our virtual reviewers. It starts at zero and continues along a linear path until it intersects the 95% recall line. After that, the line becomes dashed to reflect the documents that might be included in a linear review but would be skipped in a TAR 2.0 review.

The red line represents the number of documents the team would have to review to reach the 80% recall mark. By that I simply mean that after you reviewed that number of documents, you would have seen 80% of the relevant documents in the population. The counts (from the Y-axis) range from a starting point of 85,506 documents at zero seeds (essentially a linear review)[6] to 27,488 documents (intersection with the black line) if you used continuous review.

I placed a grey dashed vertical line at the 2,500 document mark. This figure is meant to represent the number of training documents you might use to create a one-time ranking for a traditional TAR 1.0 process.[7] Some systems require a larger number of seeds for this process but the analysis is essentially the same.

Following the dashed grey line upwards, the review team using TAR 1.0 would have to review 60,161 documents to reach a recall rate of 80%. That number is lower than the 85,000+ documents that would be involved with a linear review. But it is still a lot of documents and many more than the 27,488 required using continuous ranking.

With continuous ranking, we would continue to feed training documents to the system and continually improve the yield curve. The additional seeds used in the ranking are represented by the black diagonal line as I described earlier. It continues upwards and to the right as more seeds are reviewed and then fed to the ranking system.

The key point is that the black solid line intersects the red 80% ranking curve at about 27,488 documents. At this point in the review, the review team would have seen 80% of the relevant documents in the collection. We know this is the case because we have the reviewers’ judgments on all of the documents. As I mentioned earlier, we treated those judgments as “ground truth” for this research study.[8]

What are the savings?

The savings come from the reduction of documents required to reach the 80% mark. By my calculations, the team would be able to reduce its review burden from 60,161 documents in the TAR 1.0 process to 27,488 documents in the TAR 2.0 process—a reduction of another 32,673 documents. That translates to an additional 38% reduction in review attributable to the continuous ranking process. That is not a bad result. If you figure $4 a document for review costs,[9] that would come to about $130,692 in additional savings.

It is worth mentioning that total savings from the TAR process are even greater. If we can reduce the total document population from 85,506 to 27,488 documents, that represents a reduction of 58,018 documents, or about 68%. At $4 a document, the total savings from the TAR process comes to $232,072.

Time is Money: I would be missing the boat if I stopped the analysis here. We all know the old expression, “Time is money.” In this case, the time savings from continuous ranking over a one-time ranking can be just as important as the savings on review costs. If we assumed your reviewer could go through 50 documents an hour, the savings for 80% recall would be a whopping 653 hours of review time avoided. At eight hours per review day, that translates to 81 review days saved.[10]
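For readers who want to rerun these numbers with their own assumptions, the arithmetic is simple enough to capture in a few lines. This sketch reproduces the Wellington F calculation at 80% recall, using the same $4-per-document and 50-documents-per-hour placeholders discussed in the footnotes.

```python
def review_savings(docs_one_time, docs_continuous, cost_per_doc=4.0,
                   docs_per_hour=50, hours_per_day=8):
    """Savings from reviewing docs_continuous documents instead of docs_one_time."""
    docs_saved = docs_one_time - docs_continuous
    return {
        "documents_saved": docs_saved,
        "dollars_saved": docs_saved * cost_per_doc,
        "review_hours_saved": docs_saved / docs_per_hour,
        "review_days_saved": docs_saved / docs_per_hour / hours_per_day,
    }

# Wellington F, 80% recall: TAR 1.0 needs 60,161 documents; continuous ranking, 27,488.
print(review_savings(60_161, 27_488))
# -> 32,673 documents, $130,692, ~653 hours, ~81 review days (fractions dropped)
```

The same function applies to the other recall levels and case studies discussed below.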

How about for 95% recall?

If you followed my description of the ranking curve for 80% recall, you can see how we would come out if our goal was to achieve 95% recall. I have placed a summary of the numbers in the chart but I will recap them here.

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 77,731 documents in order to reach the 95% recall point.
  2. With TAR 2.0’s continuous ranking, the review team could drop the count to 36,215 documents for a savings of 41,516 documents. That comes to a 49% savings.
  3. At $4 a document, the savings from using continuous ranking instead of TAR 1.0 would be $166,064. The total savings over linear review would be $202,024.
  4. Using our review metrics from above, this would amount to saving 830 review hours or 103 review days.

The bottom line on this case is that continuous ranking saves a substantial amount on both review costs and review time.

Research study 2: Ocala M matter (responsive review)

This case involved a review of 57,612 documents. Of those, 11,037 were judged relevant. That translates to a prevalence rate of about 19%, a bit higher than in the Wellington F Matter.

Here is the resulting chart from our simulated review:

[Chart: Ocala M matter, simulated responsive review]

For an 80% recall threshold, the numbers are these:

  1. Using TAR 1.0 with 2,500 seeds and the ranking at that point, the team would have to review 29,758 documents in order to reach the 80% recall point.
  2. With TAR 2.0 and continuous ranking, the review team could drop the count to 23,706 documents for a savings of 6,052 documents. That would be an 11% savings.
  3. At $4 a document, the savings from the continuous ranking process would be $24,208.

Compared to linear review, continuous ranking would reduce the number of documents to review by 33,906, for a cost savings of $135,624.

For a 95% recall objective, the numbers are these:

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 46,022 documents in order to reach the 95% recall point.
  2. With continuous ranking, the TAR 2.0 review team could drop the count to 31,506 documents for a savings of 14,516 documents. That comes to a 25% savings.
  3. At $4 a document, the savings from the continuous ranking process would be $58,064.

Not surprisingly, the numbers and percentages in the Ocala M study are different from the numbers in Wellington F, reflecting different documents and review issues. However, the underlying point is the same. Continuous ranking can save a substantial amount on review costs as well as review time.

Research study 3: Wellington F matter (privilege review)

The team on the Wellington F Matter also conducted a privilege review against the 85,000+ documents. We decided to see how the continuous ranking hypothesis would work for finding privileged documents. In this case, the collection was sparse. Of the 85,000+ documents, only 983 were judged to be privileged. That represents a prevalence rate of just over 1%, which is relatively low and can cause a problem for some systems.

Here is the resulting chart using the same methodology:

[Chart: Wellington F matter, simulated privilege review]

For an 80% recall threshold, the numbers are these:

  1. The TAR 1.0 training would have finished the process after 2,104 training seeds. The team would have hit the 80% recall point at that time.
  2. There would be no gain from continuous ranking in this case because the process would be complete during the initial training.

The upshot from this study is that the team would have saved substantially over traditional means of reviewing for privilege (which would involve linear review of some portion of the documents).[11] However, there were no demonstrative savings from continuous ranking.

I recognize that most attorneys would demand a higher threshold than 80% for a privilege review. For good reasons, they would not be comfortable with allowing 20% of the privileged documents to slip through the net. The 95% threshold might bring them more comfort.

For a 95% recall objective, the numbers are these:

  1. Using 2,500 seeds and the ranking at that point, the TAR 1.0 team would have to review 18,736 documents in order to reach the 95% recall point.
  2. With continuous ranking, the TAR 2.0 review team could drop the count to 14,404 documents for a savings of 4,332 documents.
  3. At $4 a document, the savings from the continuous ranking process would be $17,328.

For actual privilege reviews, we recommend that our clients use many of the other analytics tools in Insight to make sure that confidential documents don’t fall through the net. Thus, for the documents that are not actually reviewed during the TAR 2.0 process, we would be using facets to check the names and organizations involved in the communications to help make sure there is no inadvertent production.

What about the subject matter experts?

In reading this, some of you may wonder what the role of a subject matter expert might be in a world of continuous ranking. Our answer is that the SME’s role is just as important as it was before but the work might be different. Instead of reviewing random documents at the beginning of the process, SMEs might be better advised to use their talents to find as many relevant documents as possible to help train the system. Then, as the review progresses, SMEs play a key role doing QC on reviewer judgments to make sure they are correct and consistent. Our research suggests that having experts review a portion of the documents tagged by the review team can lead to better ranking results at a much lower cost than having the SME review all of the training documents.

Ultimately, a continuous ranking process requires that the review team carry a large part of the training responsibility as they do their work. This sits well with most SMEs who don’t want to do standard review work even when it comes to relatively small training sets. Most senior lawyers that I know have no desire to review the large numbers of documents that would be required to achieve the benefits of continuous ranking. Rather, they typically want to review as few documents as possible. “Leave it to the review team,” I often hear. “That’s their job.”

Conclusion

As these three research studies demonstrate, continuous ranking can produce better results than the one-time ranking approach associated with traditional TAR. These cases suggest that potential savings can be as high as 49% over the one-time ranking process.

As you feed more seeds into the system, the system’s ability to identify responsive documents continues to improve, which makes sense. The result is that review teams are able to review far fewer documents than traditional methods require and achieve even higher rates of recall.

Traditional TAR systems give you one bite at the apple. But if you want to get down to the core, one bite won’t get you there. Continuous ranking lets one bite feed on another, letting you finish your work more quickly and at lower cost. One bite at the apple is a lot better than none, but why stop there?

[Author’s note: Thanks are due to Dr. Jeremy Pickens for the underlying work that led to this article, along with his patient help trying to explain these concepts to a dumb lawyer—namely me. Thanks also to Dr. William Webber for his extended comments on mistakes and misperceptions in my draft. Any mistakes remaining are mine alone. Further thanks to Ron Tienzo, who caught a lot of simple mistakes through careful proofing, and to Bob Ambrogi, a great editor and writer.]


[1] Relevant in this case means relevant to the issues under review. TAR systems are often used to find responsive documents, but they can also be used for other inquiries, such as identifying documents that are privileged, hot, or relevant to a particular issue.

[2] Our contextual diversity algorithm is designed to find documents that are different from those already seen and used for training. We use this method to ensure that we aren’t missing documents that are relevant but different from the mainstream of documents being reviewed.

[3] Determining when the review is complete is a subject for another day. Suffice it to say that once you determine the appropriate level of recall for a particular review, it is relatively easy to sample the ranked documents to determine when that recall threshold has been met.

[4] We make no claim that a test of three cases is anything more than a start of a larger analysis. We didn’t hand pick the cases for their results but would readily concede that more case studies would be required before you could draw a statistical conclusion. We wanted to report on what we could learn from these experiments and invite others to do the same.

[5] Our system ranks all of the documents each time we rank. We do not work off a reference set (i.e. a small sample of the documents).

[6] We recognize that IR scientists would argue that you only need to review 80% of the total population to reach 80% recall in a linear review. We could use this figure in our analysis but chose not to simply because the author has never seen a linear review that stopped before all of the documents were reviewed—at least based on an argument that they had achieved a certain recall level as a result of reaching a certain threshold. Clearly you can make this argument and are free to do so. Simply adjust the figures accordingly.

[7] This isn’t a fair comparison. We don’t have access to other TAR systems to see what results they might have after ingesting 2,500 seed documents. Nor can we simulate the process they might use to select those seeds for the best possible ranking results. But it is the data I have to work with. The gap between one-time and continuous ranking may be narrower but I believe the essential point is the same. Continuous ranking is like continuous learning: the more of it the better.

[8] In a typical review, the team would not know they were at the 80% mark without testing the document population. We know in this case because we have all the review judgments. In the real world, we recommend the use of a systematic sample to determine when target recall is being approached by the review.

[9] I chose this figure as a placeholder for the analysis. We have seen higher and lower figures depending on who is doing the review. Feel free to use a different figure to reflect your actual review costs.

[10] I used 50 documents per hour as a placeholder for this calculation. Feel free to substitute different figures based on your experience. But saving on review costs is only half the benefit of a TAR process.

[11] Most privilege reviews are not linear in the sense that all documents in a population are reviewed. Typically, some combination of searches is run to identify the likely privileged candidates. That number should be smaller than the total but can’t be specified in this exercise.