Ask Catalyst: In TAR, What Is Validation And Why Is It Important?

[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]  

This week’s question:

In technology assisted review, what is validation and why is it important?

Today’s question is answered by John Tredennick, founder and CEO.

Validation is the “act of confirming that a process has achieved its intended purpose.”[1] It is important in TAR for at least two reasons: to confirm that the TAR algorithm has worked properly, and because Rule 26(g) requires counsel to certify that the process they used for producing discovery documents was reasonable and reasonably effective.[2] While courts have approved validation methods in specific cases,[3] no court has yet purported to set forth specific validation standards applicable to all cases or to all TAR review projects.

Every validation process involves some form of sampling, either judgmental or statistical.[4] A judgmental sample is based primarily on subjective choices, such as a keyword search looking for privileged documents missed in review.[5] In contrast, statistical sampling requires that the sample be drawn randomly from the entire document population.[6] The key benefit of a statistical sample is that it provides a defensible basis for extrapolating sample results to a larger document population.

Validating a Recall Estimate

The goal of TAR is to reduce the number of documents necessary for review. This is typically done through a review “cutoff,” meaning that you stop before all documents are reviewed.[7] In most cases, validation is required to demonstrate that the cutoff point is reasonable. For example, you may want to show that the TAR process led to review and production of 75 percent of the relevant documents.[8]

The proposition to be validated is that only 25 percent of the relevant documents were left in the un-reviewed population, often called the “null set”[9] or sometimes the “discard pile.”[10] To validate that proposition, you have to estimate recall (the percentage of relevant documents found during your review)[11] in a statistically sound manner.
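Expressed as a simple calculation (a minimal sketch with made-up counts, not figures from any actual matter), recall is just the ratio of relevant documents found to relevant documents that exist:

```python
# Recall: the share of all relevant documents that the review actually found.
def recall(relevant_found: int, total_relevant: int) -> float:
    return relevant_found / total_relevant

print(recall(75, 100))  # 0.75, i.e. 75 percent recall
```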

The simplest and arguably the most statistically sound method for estimating recall is what many call the Direct Method.[12] This approach involves a random sample drawn from the entire document population but requires that you continue to sample until you find a sufficient number of relevant documents to meet a required sample size.

Using a freely available sampling calculator,[13] we might determine that we need a sample size of 384 documents to achieve a 95 percent confidence level and a five percent margin of error.[14]

[Screenshot: Raosoft sample size calculator showing a recommended sample size of 384 documents]
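For readers who want to see the arithmetic behind a calculator like the one cited above, here is a rough sketch of the standard sample-size formula. It assumes a very large document population and the conservative 50 percent response distribution most calculators default to:

```python
# Standard sample-size approximation for estimating a proportion:
# n = z^2 * p * (1 - p) / e^2
def sample_size(z: float = 1.96, margin_of_error: float = 0.05, p: float = 0.5) -> float:
    return (z ** 2) * p * (1 - p) / (margin_of_error ** 2)

print(sample_size())  # roughly 384 documents for 95% confidence, +/- 5%
```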

Once the sample has been taken (and we have found 384 relevant documents), we can compare how many of those relevant documents were part of the already reviewed and produced set against how many came from the null set. If, for example, we determined that 288 were part of the reviewed/produced population, we would conclude that the TAR process found 75 percent of the relevant documents (288/384), for a point estimate of 75 percent recall.

Because a point estimate tells only part of the statistical story, we would also need to determine the exact confidence interval around our estimate.[15] For this purpose we would use what statisticians call a “binomial proportion confidence interval” calculator.[16] Using such a calculator, as you can see below, we would find that the TAR process promoted between 70 and 79 percent of the relevant documents for review and production, leaving between 21 and 30 percent of the relevant documents in the null set.

[Screenshot: binomial proportion confidence interval calculator showing an exact interval of roughly 70 to 79 percent]
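If you would rather reproduce the interval than rely on an online calculator, the sketch below computes the exact (Clopper-Pearson) binomial interval for the hypothetical above. The 288-of-384 counts simply mirror the example, and the scipy dependency is an assumption about your toolchain:

```python
from scipy.stats import beta

# Exact (Clopper-Pearson) binomial proportion confidence interval.
def clopper_pearson(successes: int, n: int, alpha: float = 0.05):
    lower = beta.ppf(alpha / 2, successes, n - successes + 1) if successes > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, successes + 1, n - successes) if successes < n else 1.0
    return lower, upper

low, high = clopper_pearson(288, 384)
print(f"point estimate: {288 / 384:.1%}")              # 75.0%
print(f"95% exact interval: {low:.1%} to {high:.1%}")  # roughly 70% to 79%
```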

We could use our statistical evidence to validate a claim that we achieved 75 percent recall and lay the groundwork for an argument that it was reasonable to stop the review at this point.

The problem with the Direct Method is the requirement that the sample be composed solely of relevant documents. If the document population is 1 percent rich,[17] you will need to review 100 documents on average for each relevant one found. In order to find 384 relevant documents in such a case, you might need to sample as many as 38,400 documents. If richness were even lower, say 0.1 percent, you would have to look at 384,000 documents, on average, in order to obtain a valid sample. As many have noted, that is a huge and arguably unreasonable burden simply to validate review results.[18]
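A back-of-the-envelope calculation makes that burden concrete; the richness figures are the hypothetical ones from the paragraph above:

```python
# Expected review effort under the Direct Method: to find a fixed number of
# relevant documents, you must review roughly sample_size / richness documents.
sample_size = 384
for richness in (0.01, 0.001):  # 1 percent and 0.1 percent rich
    print(f"richness {richness:.1%}: review about {sample_size / richness:,.0f} documents")
# richness 1.0%: review about 38,400 documents
# richness 0.1%: review about 384,000 documents
```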

There are several other approaches to estimating recall that have been proposed or used in practice. These have been labeled the “ratio methods”[19] because they compare either a known value with an estimate, or two estimates with each other, to calculate recall. For example:

  1. Comparing the number of relevant documents found during review with the estimated richness of the collection.[20]
  2. Comparing the number of relevant documents found during review with the estimated number of relevant documents in the null set.[21]
  3. Comparing estimated richness with the estimated number of relevant documents left in the null set.[22]

The ratio methods seem logical but can be statistically challenged if the proponent fails to take into account the confidence interval inherent in each point estimate.[23] Imagine a scenario where we found 75,000 relevant documents during the review but ended the review with two million documents left in the null set. If we took a sample of the null set for relevant documents, we might estimate that it contained only 25,000 relevant documents. This would support an argument that we found and produced 75 percent of the relevant documents during the review (75,000/100,000), which at least some courts have deemed reasonable.[24]

If, however, we take the associated confidence interval for the point estimate into account, the recall estimate could be quite different. Assume, for simplicity’s sake, that the upper bound of the confidence interval was five percentage points above the point estimate. That means the number of relevant documents in the null set could be as high as 125,000. In such a case, the review might have found only 37.5 percent of the relevant documents (75,000/200,000). That number may not be sufficient to meet the reasonableness obligations under Rule 26(g).
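To see how much the interval matters, here is a small sketch of the second ratio method using the hypothetical numbers above; the five-point spread on the null-set richness estimate is assumed purely for illustration:

```python
# Ratio-method recall: documents found in review versus an estimate of the
# relevant documents left in the null set. All figures are hypothetical.
found_in_review = 75_000
null_set_size = 2_000_000

null_richness_point = 25_000 / null_set_size       # 1.25% from a sample of the null set
null_richness_upper = null_richness_point + 0.05   # assumed upper bound, 5 points higher

for label, richness in (("point estimate", null_richness_point),
                        ("upper bound", null_richness_upper)):
    missed = null_set_size * richness
    print(f"{label}: recall about {found_in_review / (found_in_review + missed):.1%}")
# point estimate: recall about 75.0%
# upper bound: recall about 37.5%
```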

Validation Goals

Ultimately, the goal of a TAR validation process is to confirm that you achieved a certain result. In that regard, the burden in validating a TAR review should be no different than that faced by counsel supervising a collection or doing a linear review based on keyword searches.[25] Whichever approach is taken, Rule 26(g) requires that counsel follow a reasonable process to identify relevant documents and validate the results (including those not reviewed) in some statistically sound fashion. By practical necessity, the methodology chosen will require you to exercise judgment and balance the effort required against the benefits achieved through the validation process.


Footnotes

[1] M. Grossman, G. Cormack, The Grossman-Cormack Glossary of Technology-Assisted Review, 7 Fed. Cts. L. Rev. 1, 34 (2013) (hereinafter “TAR Glossary”). Put another way, the validity of a measurement tool is considered to be the degree to which the tool measures what it claims to measure. E.g. https://en.wikipedia.org/wiki/Validity_(statistics).

[2] Karl Schieneman & Thomas C. Gricks III, The Implications of Rule 26(g) on the Use of Technology-Assisted Review, 7 Fed. Cts. L. Rev. 239, 269 (2013) (hereinafter “Schieneman & Gricks”).

[3] Schieneman & Gricks at 270 (citing In re Biomet M2a Magnum Hip Implant Prods. Liab. Litig., 2013 U.S. Dist. LEXIS 84440 (N.D. Ind. Apr. 18, 2013) (accepting statistical samples having a confidence level of 99 percent and a maximum confidence interval of ±2 percent); In re Actos (Pioglitazone) Prods. Liab. Litig., 2012 U.S. Dist. LEXIS 187519, at *26-27 (W.D. La. July 27, 2012) (“The application’s estimates of richness use a confidence level of 95 percent… with an error margin of plus or minus 4.3 percent.”); Da Silva Moore v. Publicis Groupe, 287 F.R.D. 182, 186 (S.D.N.Y. 2012) (“The parties agreed to use a 95 percent confidence level [±2 percent] to create a random sample of the entire email collection . . . .”)).

[4] The alternative to sampling is to review all of the records subject to validation.

[5] TAR Glossary at 21.

[6] TAR Glossary at 27 (each document in the sample should have an equal chance of being drawn).

[7] See TAR Glossary at 13 (“Documents above the Cutoff are deemed to be Relevant and Documents below the Cutoff are deemed to be Non-Relevant.”)

[8] Reportedly, the court in Global Aerospace, Inc. v. Landow Aviation, L.P., No. CL 61040 (Va. Cir. Ct. Apr. 23, 2012), accepted a proposed standard of 75 percent recall at least in part on grounds that keyword search and manual review often achieved much lower recall. See Schieneman & Gricks at 264.

[9] TAR Glossary at 25; Schieneman & Gricks at 273.

[10] See, e.g., What Should You Do With the Discard Pile? and Your TAR Temperature is 98.6 – That’s A Pretty Hot Result.

[11] TAR Glossary at 27.

[12] Reportedly, the Direct Method was used with approval in several cases, including Kleen Prods., LLC v. Packaging Corp. of Am., No. 10-C-5711 (N.D. Ill. Feb. 21, 2012), and In re Actos (Pioglitazone) Prods. Liab. Litig., MDL No. 6:11-md-2299 (W.D. La. July 27, 2012). See M. Grossman, G. Cormack, Comments on “The Implications of Rule 26(g) on the Use of Technology-Assisted Review,” 7 Fed. Cts. Law Rev. 285, 306-307 (2014) (hereinafter “Grossman & Cormack Comments”); Schieneman & Gricks at 272.

[13] For example, the one at www.raosoft.com/samplesize.html

[14] Courts have accepted samples having a confidence level of either 95 percent or 99 percent, and a nominal confidence interval of between ±2 percent and ±5 percent as satisfying the reasonable inquiry requirements of Rule 26(g). Schieneman & Gricks, at 270. The only change to the analysis resulting from different choices on confidence levels and confidence intervals is to change the required sample size.

[15] The exact confidence interval is the range around the point estimate that could contain the true value of the feature being sampled. It is similar to, but not the same as, the margin of error that is used initially to determine your sample size. Once the sample is taken, we can calculate the exact confidence interval because it is dependent in part on the proportional results of the sample itself. You can read more about this at https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval.

[16] One commonly used calculator can be found here: http://statpages.org/confint.html. Another is here: www.danielsoper.com/statcalc3/calc.aspx?id=85.

[17] Also known as prevalence or yield, richness refers to the number of relevant documents in a given population. TAR Glossary at 26.

[18] E.g. Schieneman & Gricks at 273. “Since there are alternative means of calculating recall that do not require such a significant effort, whether this level of exactitude is required to satisfy the reasonable inquiry requirements of Rule 26(g) must be evaluated against the proportionality considerations in Rule 26(b)(2)(C)(iii).”

[19] Grossman & Cormack Comments at 306.

[20] This is sometimes called the Basic Ratio method. See Grossman & Cormack Comments at 308, citing Schieneman & Gricks at 273.

[21] This was reportedly used in the Global Aerospace case. See Grossman & Cormack Comments at 308 and Schieneman & Gricks at 273.

[22] Grossman & Cormack Comments at 308. See Herbert L. Roitblat, A Tutorial on Sampling in Predictive Coding (OrcaTec LLC 2013), at 3, and Herbert L. Roitblat, Measurement in eDiscovery: A Technical White Paper (OrcaTec LLC 2013), at 10.

[23] Noted blogger Ralph Losey points out this problem in his discussion of ei-Recall, which he proposes to use for recall validation in Introducing ‘ei-Recall’ – A New Gold Standard for Recall Calculations in Legal Search. Losey adds an “Accept on Zero Defect” process to his methodology along with a suggested stratified sampling approach for extremely low richness collections. See also Grossman & Cormack Comments at 308-310.

[24] See Schieneman & Gricks at 264.

[25] Compare Schieneman & Gricks at 273-274 (since every step of the technology-assisted review process impacts either the nature or efficacy of the search, or the level of inquiry, Rule 26(g) applies throughout the process, from collection through validation; as with all discovery, what is reasonable in the application of Rule 26(g) to technology-assisted review is governed primarily by the proportionality considerations of Rule 26(b)(2)(C)(iii)).


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” named to the FastCase 50 as a legal visionary, and named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.