TAR 2.0 Capabilities Allow Use in Even More E-Discovery Tasks

Recent advances in Technology Assisted Review (“TAR 2.0”) include the ability to deal with low richness, rolling collections, and flexible inputs in addition to vast improvements in speed. [1] These improvements now allow TAR to be used effectively in many more discovery workflows than its traditional “TAR 1.0” use in classifying large numbers of documents for production.

To better understand this, it helps to begin by examining in more detail the kinds of tasks we face. Broadly speaking, document review tasks fall into three categories:[2]

  • Classification. This is the most common form of document review, in which documents are sorted into buckets such as responsive or non-responsive so that we can do something different with each class of document. The most common example here is a review for production.
  • Protection. This is a higher level of review in which the purpose is to protect certain types of information from disclosure. The most common example is privilege review, but this also encompasses trade secrets and other forms of confidential, protected, or even embarrassing information, such as personally identifiable information (PII) or confidential supervisory information (CSI).
  • Knowledge Generation. The goal here is learning what stories the documents can tell us and discovering information that could prove useful to our case. A common example of this is searching and reviewing documents received in a production from an opposing party or searching a collection for documents related to specific issues or deposition witnesses.

You’re probably already quite familiar with these types of tasks, but I want to get explicit and discuss them in detail because each of the three has distinctly different recall and precision targets, which in turn have important implications for designing your workflows and integrating TAR.


Let’s quickly review those two crucial metrics for measuring the effectiveness and defensibility of your discovery processes, “recall” and “precision.” Recall is a measure of completeness, the percentage of relevant documents actually retrieved. Precision measures purity, the percentage of retrieved documents that are relevant.

The higher each percentage, the better you’ve done. If you achieve 100 percent recall, you have retrieved all the relevant documents. If every document you retrieve is relevant, with no extra junk mixed in, you’ve achieved 100 percent precision. But recall and precision are not friends: typically, a technique that increases one will decrease the other.
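To make the two metrics concrete, here is a short sketch with hypothetical review counts (the numbers are invented for illustration, not drawn from any real matter):

```python
# Hypothetical review: 10,000 truly responsive documents in the collection.
relevant_retrieved = 8_000      # responsive documents the process found
relevant_missed = 2_000         # responsive documents it missed
irrelevant_retrieved = 4_000    # junk swept in alongside the responsive set

# Recall: completeness -- share of all relevant documents actually retrieved.
recall = relevant_retrieved / (relevant_retrieved + relevant_missed)

# Precision: purity -- share of retrieved documents that are relevant.
precision = relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

print(f"Recall:    {recall:.0%}")   # 8,000 of 10,000 -> 80%
print(f"Precision: {precision:.1%}")  # 8,000 of 12,000 -> 66.7%
```

Note the trade-off in the numbers: casting a wider net to capture the 2,000 missed documents would almost certainly sweep in more junk, raising recall while lowering precision.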

This engineering trade-off between recall and precision is why it helps to be explicit and think carefully about what we’re trying to accomplish. Because the three categories of document review have different recall and precision targets, we must choose and tune our technologies — including TAR — with these specific goals in mind so that we maximize effectiveness and minimize cost and risk. Let me explain in more detail.

Classification Tasks

Start with classification — the sorting of documents into buckets. We typically classify so that we can do different things with different subpopulations, such as review, discard, or produce.

Under the Federal Rules of Civil Procedure, and as emphasized by The Sedona Conference and any number of court opinions, e-discovery is limited by principles of reasonableness and proportionality. As Magistrate Judge Andrew J. Peck wrote in the seminal case, Da Silva Moore v. Publicis Groupe:[3]

The goal is for the review method to result in higher recall and higher precision than another review method, at a cost proportionate to the ‘value’ of the case.

As Judge Peck suggests, when we’re talking document production the goal is to get better results, not perfect results. Given this, you want to achieve reasonably high percentages of recall and precision, but with cost and effort that is proportionate to the case. Thus, a goal of 80 percent recall — a common TAR target — could well be reasonable when reviewing for responsive documents, especially when current research suggests that the “gold standard” of complete eyes-on review by attorneys can’t do any better than that at many times the cost.[4]

Precision must also be reasonable, but requesting parties are usually more interested in making sure they get as many responsive documents as possible. So recall usually gets more attention here.[5]

Protection Tasks

By contrast, when your task is to protect certain types of confidential information (most commonly privilege, but it could be trade secrets, confidential supervisory information, or anything else where the bell can’t be unrung), you need to achieve 100 percent recall. Period. Nothing can fall through the cracks. This tends to be problematic in practice, as the goal is absolute perfection and the real world seldom obliges.

So to approximate this perfection in practice, we usually need to use every tool in our toolkit to identify the documents that need to be protected — not just TAR but also keyword searching and human review — and use them effectively against each other. The reason for this is simple: Different review methods make different kinds of mistakes. Human reviewers tend to make random mistakes. TAR systems tend to make very systematic errors, getting entire classifications of documents right or wrong.[6] By combining different techniques into our workflows, one serves as a check against the others.
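The stacking logic described above can be sketched in a few lines (the document IDs and flags are hypothetical, purely for illustration): a document is held back for protection if any one of the methods flags it, so a systematic miss by one method can be caught by another.

```python
# Hypothetical flags from three independent methods in a privilege workflow.
tar_flagged = {"doc1", "doc2", "doc5"}      # TAR model predictions
keyword_flagged = {"doc2", "doc3"}          # privilege search-term hits
reviewer_flagged = {"doc1", "doc4"}         # human first-pass calls

# Stacking: withhold a document if ANY method flags it (set union),
# rather than replacing one method with another.
flagged_for_protection = tar_flagged | keyword_flagged | reviewer_flagged

print(sorted(flagged_for_protection))
# ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
```

Note that no single method above found all five documents; the union did. That is the whole point of stacking for recall-critical protection tasks.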

The best way to maximize recall is to stack techniques.

This is an important point about TAR for data protection tasks, and one I want to reemphasize. The best way to maximize recall is to stack techniques, not to replace them. Because TAR doesn’t make the same class of errors as search terms and human review, it makes an excellent addition to privilege and other data protection workflows — provided the technology can deal with low prevalence and be efficiently deployed.[7] (More on that in a later post.)

Precision, on the other hand, is somewhat less important when your task is to protect documents. It doesn’t need to be perfect, but because protection tasks typically demand attorney hours, they’re usually the most expensive part of review, and sweeping in unnecessary junk gets costly quickly. So you still want a fairly high level of precision (particularly to avoid logging documents unnecessarily if you are maintaining a privilege log), but recall remains the key metric here.

Knowledge Generation Tasks

The final task we described above is where we get the name “discovery” in the first place. What stories do these documents tell? What stories can my opponents tell with these documents? What facts and knowledge can we learn from them? This is the discovery task that is most Google-like.[8] For knowledge generation, we don’t really care about recall. We don’t want all the documents about a topic; we just want the best documents about a topic — the ones that will end up in front of deponents or used at trial.

Precision is therefore the most important metric here. You don’t want to waste your time going through junk — or even duplicative and less relevant documents. This is where TAR can also help, prioritizing the document population by issue and concentrating the most interesting documents at the top of the list so that attorneys can quickly learn what they need to litigate the case.

One nitpicky detail about TAR for issue coding and knowledge generation should be mentioned, though. TAR algorithms rank documents according to their likelihood of getting a thumbs-up or a thumbs-down from a human reviewer. They do not rank documents based on how interesting they are. For example, in a review for responsiveness, some documents could be very easy to predict as being responsive, but not very interesting. On the other hand, some documents could be extremely interesting, but harder to predict because they are so unusual.

On the gripping hand, however, the more interesting documents tend to cluster near the top of the ranking. Interesting documents sort higher this way because they tend to contain stronger terms and concepts as well as more of them.[9] TAR’s ability to concentrate the interesting documents near the top of a ranked list thus makes it a useful addition to knowledge-generation workflows.

What’s Next

With this framework for thinking about, developing, and evaluating different discovery workflows, we can now get into the specifics of how TAR 2.0 can best be used for the various tasks at hand. To help with this analysis, we have created a TAR checklist you can use to help organize your approach.

In the end, the critical factor in your success will be how effectively you use all the tools and resources you have at your disposal, and TAR is a powerful new addition to your toolbox.


[1] See The Five Myths of Technology Assisted Review, Revisited for more discussion of TAR 1.0 vs. TAR 2.0 and the new capabilities opened up by current tools and algorithms.

[2] By no means am I the first to suggest this taxonomy of e-discovery tasks, either. My former colleague Manfred Gabriel, now at KPMG, has been hammering this point for years. For another take on these three tasks you can check out several of his papers, including this one on using TAR for privilege review.

[3] Da Silva Moore v. Publicis Groupe, 2012 U.S. Dist. LEXIS 23350 (SDNY, Feb. 24, 2012).

[4] See, e.g., Maura R. Grossman & Gordon V. Cormack, Technology-Assisted Review in E-Discovery Can Be More Effective and More Efficient Than Exhaustive Manual Review, XVII RICH. J.L. & TECH. 11 (2011), at 10-15 (summarizing recent research on human review and citing results for maxima of 65% recall (Voorhees 2000) and 52.8% – 83.6% recall (Roitblat, Kershaw, & Oot 2010)).

[5] The differing importance of recall and precision both here and in other discovery tasks is one reason the F1 measure (the harmonic mean of recall and precision) is often problematic. While it may be a good single measure for information retrieval research, it prematurely blends two measures that often have to be considered and weighted separately in practical discovery tasks.
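To make this footnote’s point concrete, here is a small sketch (with illustrative numbers only) showing why F1 can mislead: because it is symmetric in recall and precision, two very different review outcomes score identically, even though one would be unacceptable for a protection task.

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)

# Two very different outcomes, identical F1 score:
high_recall = f1(recall=0.9, precision=0.5)   # plausible for a privilege review
low_recall = f1(recall=0.5, precision=0.9)    # unacceptable for a privilege review

print(high_recall == low_recall)  # True -- F1 cannot tell them apart
```

This is what it means for F1 to “prematurely blend” the two measures: the weighting between recall and precision has to be chosen per task, not fixed at equal importance.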

[6] See, e.g., Maura R. Grossman and Gordon V. Cormack, Inconsistent Responsiveness Determination in Document Review: Difference of Opinion or Human Error?, 32 Pace L. Rev. 267 (2012) (finding that coding inconsistencies by human reviewers are largely attributable to human error, and not to documents being “borderline” or any inherent ambiguity in the relevance judgments).

[7] Random training approaches such as those used by support vector machine algorithms tend to need prohibitively large samples in order to deal effectively with low richness, which is common in many actual cases. See, e.g., Gordon V. Cormack and Maura R. Grossman, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR ’14, July 6–11, 2014, Gold Coast, Queensland, Australia (evaluating different approaches to TAR training across eight data sets with prevalence (richness) ranging from 0.25% to 3.92%, with a mean of 1.18%).

[8] To be more nitpicky, this search is the most Google-like for the basic task of searching on a single topic. A more challenging problem is often figuring out all the different possible topics that a collection of documents could speak to – including those that we don’t know we need to look for – and then finding the best examples of each topic to review. This is another area where TAR and similar tools that model the entire document set can be useful, and it will be the topic of a more detailed follow-up post here.

[9] This is true in general, but not always. Consider an email between two key custodians who are usually chatty but that reads simply “Call me.” There are no key terms there for a ranking engine based on full text analysis to latch onto, though the unusual email could be susceptible to other forms of outlier detection and search.