Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize

A critical metric in Technology Assisted Review (TAR) is recall, which is the percentage of the relevant documents in the collection that the review actually finds. One of the most compelling reasons for using TAR is the promise that a review team can achieve a desired level of recall (say 75% of the relevant documents) after reviewing only a small portion of the total document population (say 5%). The savings come from not having to review the remaining 95% of the documents. The argument is that the remaining documents (the “discard pile”) contain so few relevant documents (buried among so many irrelevant ones) that further review is not economically justified.

As most legal professionals readily understand, this is a big deal. Let’s say you collected one million documents for review. Let’s also say that review costs (direct and QC) average $2 a document. Cutting out 95% of the review means not having to look at 950,000 documents. That equals savings of $1.9 million, which is nothing to sneeze at, and not atypical of what you might see with a good TAR product and proper protocol.

But how do we prove we have found a given percentage of the relevant documents at whatever point we stop the review? It turns out the answer is not as easy as some might think, at least when the percentage of relevant documents is low. Indeed, there is a heated debate going on right now among a number of our TAR-savvy colleagues about how we should go about proving that we have obtained a certain level of recall. Much of that debate turns on how many documents we need to sample for that proof.

A number of articles and blog posts have addressed this topic, although not all of them can fairly be characterized as debate.

If you read these posts, you may find yourself scratching your head. Some suggest you can prove recall by sampling only relatively few documents. The problem is that these approaches don’t seem to stand up to statistical scrutiny; they provide little or no statistical assurance that you have actually achieved a given level of recall.

Others suggest approaches which are more statistically valid, but require sampling a lot of documents to prove your point (as many as 34,000 in one case). Either way, this presents a problem. Legal professionals need a reasonable but also statistically reliable way to measure recall in order to justify review cutoff decisions.

In this article, the first of two parts, I will attempt to illustrate the difficulty and costs inherent in trying to prove a particular recall level using a simple hypothetical review population. In part two, I will discuss the pros and cons of different approaches offered by TAR experts to solve the problem. My goal, if possible, is to find a practical but valid answer to the problem we can all agree upon.[1]

A Hypothetical Review

To illustrate the problem, let’s conjure up a hypothetical review. Drawing from my introduction, assume we collected one million documents for review. Assume also that the percentage of relevant documents in the collection is 1%.[2] That suggests there are 10,000 relevant documents in our collection (1,000,000*.01).

Using Sampling to Estimate Richness

Typically we don’t know in advance how many relevant documents are in the collection. To find this information, we need to estimate the collection’s richness (aka prevalence) using statistical sampling, which is simply a method in which a sample of the document population is drawn at random, such that statistical properties of the sample may be extrapolated to the entire document population.

To create our sample we must randomly select a subset of the population and use the results of our sample to estimate the characteristics of the larger population. The degree of certainty around our estimate is a function of the number of documents we sample.

While this is not meant to be an article about statistical sampling, here are a few concepts you should know. Although there are many reference sources for these terms, I will draw from the excellent Grossman-Cormack Glossary of Technology-Assisted Review:

  1. Point Estimate: The most likely value for a population characteristic. Thus, when we estimate that a document population contains 10,000 relevant documents, we are offering a point estimate.
  2. Confidence Interval: A range of values around our point estimate which we believe contains the true value of the number being estimated. For example, if the confidence interval for our point estimate ranges from 8,000 to 12,000, that means we believe the true value will appear within that range.
  3. Margin of Error: The maximum amount by which a point estimate might deviate from the true value, typically expressed as a percentage. People often talk about a 5% margin of error, which simply means the expected confidence interval extends 5% above and below the point estimate.
  4. Confidence Level: The chance that our confidence interval will include the true value. For example, “95% confidence” means that if one were to draw 100 independent random samples of the same size, and compute the point estimate and confidence interval from each sample, about 95 of the 100 confidence intervals would contain the true value.
  5. Sample Size: The number of documents we have to sample in order to achieve a specific confidence interval and confidence level. In general, the higher the confidence level, the more documents we have to review. Likewise, if we want a narrower confidence interval, we will have to increase our sample size.

It might help to see these concepts displayed visually. Here is a chart showing what a 95% confidence level looks like against a “normal” distribution of document values as well as a specific confidence interval.

Point Estimate and Confidence Interval

In this case our point estimate was 500 relevant documents in our collection. Our confidence interval (shaded in red) suggests that the actual range of relevant documents could go from 460 at the lower end of our estimate to 540 at the higher end.

Part of the curve is not shaded in red. It covers the 5% chance that the actual number of relevant documents is either above (2.5%) or below (2.5%) our confidence interval range.

Our Hypothetical Estimate

We start our analysis with a sample of 600 documents, chosen randomly from the larger population. The sample size was based on a desired confidence level of 95% and a desired margin of error of 4%. You can use other numbers for this part of the exercise but these will do for our calculations.

How did we get 600? There are a number of handy sample calculators you can use to determine sample size based on your choices about confidence levels and margin of error. In past articles, I have recommended the Raosoft calculator because it is simple to use. I will use it again here as well.

As you can see, I entered the population size (1,000,000), a desired confidence level (95%), and a margin of error (4%). In turn, the calculator suggested that I look at 600 documents for my sample.
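For readers who want to check the arithmetic themselves, here is a minimal sketch of the textbook sample-size formula for estimating a proportion (worst-case richness of 50%, with a finite-population correction). I am assuming the Raosoft calculator implements something close to this; its results may differ by a handful of documents because of rounding and the exact z-value used.

```python
import math
from scipy.stats import norm

def sample_size(population, confidence=0.95, margin_of_error=0.04, richness=0.5):
    """Documents to sample to estimate a proportion to +/- margin_of_error
    at the given confidence level, with a finite-population correction.
    richness=0.5 is the conservative (largest-sample) assumption."""
    z = norm.ppf(1 - (1 - confidence) / 2)              # ~1.96 for 95% confidence
    n0 = (z ** 2) * richness * (1 - richness) / margin_of_error ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for moe in (0.04, 0.02, 0.01):
    print(f"{moe:.0%} margin of error -> {sample_size(1_000_000, 0.95, moe):,} documents")
# Prints roughly 600, 2,396 and 9,513 -- within a few documents of the Raosoft
# figures used in this article (600, 2,395 and 9,508).
```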

Initial Sampling Results

Let’s assume we found six relevant documents out of the 600 we sampled. That translates to 0.01 or 1% richness (6/600). We can use that percentage to estimate that there are 10,000 relevant documents in the total review population (1,000,000*.01). This becomes our point estimate.

What about the margin of error? In this case I chose a sample size which would give us up to a 4% margin of error. That means our estimate of richness could be off by as much as four percentage points in either direction.

As noted, there are a million documents in the collection. Four percent of one million comes to 40,000 documents. If we use that figure for our margin of error, it suggests that our confidence interval for relevant documents could range from as few as the six we found in our sample (since 10,000 minus 40,000 is negative, the low end is effectively the handful we actually saw) to as high as 50,000. That is an interesting spread.

Determining the Exact Confidence Interval

Dr. William Webber, a well-known expert in this field, explains that in practice we would use a more refined approach to calculate our confidence interval. It turns out that the “exact” confidence interval depends on the results of the random sample. He pointed me to a binomial calculator where we can use the survey results to determine our exact confidence interval.

Based on our planned sample size (600) and the number of relevant documents we found (6), our confidence interval (expressed as a decimal) ranges from 0.0037 (lower) to 0.0216 (upper). We multiply these decimal values against the total number of documents in our collection (1,000,000) to calculate our exact confidence interval. In this case, it runs from 3,700 to 21,600.
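Dr. Webber’s “exact” interval can also be computed directly. Here is a minimal sketch using the Clopper-Pearson method (via SciPy’s beta distribution); I am assuming this is what the online binomial calculator uses, since it reproduces the same 0.0037 to 0.0216 range for 6 relevant documents in a sample of 600.

```python
from scipy.stats import beta

def exact_confidence_interval(relevant, sample_size, confidence=0.95):
    """Clopper-Pearson ('exact') confidence interval for a sampled proportion."""
    alpha = 1 - confidence
    lower = beta.ppf(alpha / 2, relevant, sample_size - relevant + 1) if relevant > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, relevant + 1, sample_size - relevant) if relevant < sample_size else 1.0
    return lower, upper

lo, hi = exact_confidence_interval(6, 600)
print(f"richness: {lo:.4f} to {hi:.4f}")          # about 0.0037 to 0.0216
print(f"relevant documents: {lo * 1_000_000:,.0f} to {hi * 1_000_000:,.0f}")   # roughly 3,700 to 21,600
```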

(In the remainder of this article, I will refer to the margin of error given by the sample size calculator as the “nominal” margin of error since the actual width of the confidence interval depends on what we see in the sample.)

So, we have a start on the problem. We believe there are 10,000 relevant documents in our collection (our point estimate) but it could be as high as 21,600 (or as low as 3,700). Let’s move on to our review.

The Review

The team finds 7,500 relevant documents after looking at the first 50,000.[3] Based on our initial point estimate, we could reasonably conclude we have found 75% of the relevant documents. At that point, we might decide to shut down the review. Most courts would view stopping at 75% recall as more than reasonable.

Your argument to the court seems compelling. If there were only 2,500 relevant documents left in the discard pile, the cost of reviewing another 950,000 documents to find 2,500 relevant ones seems disproportionate. On average, you would have to look at 380 documents to find the next relevant document. At a cost of $2 per document for review, it would cost $760 for each additional relevant document found. If you continued until the end, the cost would be an extra $1.9 million.
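Here is a quick back-of-the-envelope version of that proportionality argument; it is just the arithmetic from the paragraph above, using the same figures already assumed in this hypothetical.

```python
# Marginal cost of continuing review past the cutoff, using the article's assumptions.
discard_pile = 950_000        # documents left unreviewed at the cutoff
relevant_left = 2_500         # assumes the 10,000-document point estimate is right
cost_per_doc = 2.00           # dollars per document reviewed

docs_per_relevant = discard_pile / relevant_left        # 380 documents per relevant one found
cost_per_relevant = docs_per_relevant * cost_per_doc    # $760 per additional relevant document
total_cost = discard_pile * cost_per_doc                # $1,900,000 to review the entire discard pile
print(docs_per_relevant, cost_per_relevant, total_cost)
```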

How Do We Know We Achieved 75% Recall?

Now comes the hard part. How do we know we actually found 75% of the relevant documents?

Remember that our initial point estimate was 10,000 documents, which seems to support this position. However, it had a confidence interval which suggested the real number of relevant documents could be as high as 21,600.

That means your recall estimate could be off by quite a bit. Here are the numbers for this simple mathematical exercise:

  1. We found 7,500 documents during the review.
  2. If there are only 10,000 relevant documents in the total population, it is easy to conclude we achieved 75% recall (7,500/10,000).
  3. However, if there were 21,600 relevant documents in the population (which was the upper range for the confidence interval), we achieved only a 35% recall of relevant documents (7,500/21,600).

Those numbers would provide grist for the argument that the producing party did not meet its burden to find a reasonable number of relevant documents. While the team may have found and reviewed 75% of the relevant documents, it is also possible that it found and reviewed only 35% of them. Most people would agree that the latter would not be enough to meet your duty as a producing party.

Sampling the Discard Pile

So what do we do about this problem? One answer is to sample the discard population to determine its richness (a measurement some call elusion). If we could show that there were only a limited number of relevant documents in the discard pile, that would help establish our bona fides.

Let’s make some further assumptions. We sample the discard pile (950,000 documents), again reviewing 600 documents based on our choice of a 95% confidence level and a 4% nominal margin of error.

This time we find two relevant documents, which suggests that the richness of the discard pile has dropped to about 0.33% (2/600). From there we can estimate that we would find only about 3,135 relevant documents in the discard pile (950,000*0.0033). Added to the 7,500 documents we found in review, that makes a total of 10,635 relevant documents in the collection.

Using that figure we calculate that the review team found about 71% of the relevant documents (7,500/10,635). While not quite 75%, this is still a number that most courts have accepted as reasonable and proportionate.

What about the Confidence Interval?

But how big is our exact confidence interval? Using our binomial calculator with two relevant documents found in a sample of 600, the upper end of the interval comes to 0.0120 (again expressed as a decimal).

Applying that figure to our discard pile, we estimate that there could be as many as 11,400 relevant documents left (0.0120*950,000).

If we add the 7,500 documents already found to the upper value of 11,400 documents from our sample, we get a much lower estimate of recall. Specifically, we are producing 7,500 out of what could be as many as 18,900 relevant documents. That comes to a recall rate of 40% (7,500/18,900).
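To make the recall arithmetic explicit, here is a small sketch of the calculation as I understand it: estimate the richness of the discard pile from the elusion sample (with a Clopper-Pearson interval, as before), then divide the documents found by the total implied relevant documents. Note that the code works from the raw 2/600 fraction rather than the rounded 0.33%, so its point estimate prints as roughly 70% rather than the 71% quoted above.

```python
from scipy.stats import beta

def recall_range(found, discard_size, relevant_in_sample, sample_size, confidence=0.95):
    """Recall point estimate and range, given an elusion sample of the discard pile."""
    alpha = 1 - confidence
    k, n = relevant_in_sample, sample_size
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0        # Clopper-Pearson bounds
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    point = found / (found + discard_size * k / n)
    worst = found / (found + discard_size * upper)     # most relevant documents left behind
    best = found / (found + discard_size * lower)      # fewest relevant documents left behind
    return point, worst, best

point, worst, best = recall_range(found=7_500, discard_size=950_000,
                                  relevant_in_sample=2, sample_size=600)
print(f"recall about {point:.0%}, but possibly as low as {worst:.0%}")   # about 70%, possibly as low as 40%
```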

Is that enough? Again, I suspect most readers—and courts—would say no. Producing just two out of five of the relevant documents in a population would not seem reasonable.

Increasing the Sample Size

What to do? One option is to try to narrow the margin of error (and ultimately the exact confidence interval) with a larger sample. If we narrow the nominal margin of error from 4% to 2%, it turns out we will have to sample 2,395 randomly selected documents.

Let’s assume we found eight relevant documents out of 2,395 in our sample. That again suggests a richness level of about 0.33% and a point estimate of 3,173, which is close to what we found with our earlier sample.[4] If we add that information to our calculator, we find the actual confidence interval narrows quite a bit.

Applying the exact confidence interval ranges to our discard pile we reach the following conclusions:

  1. We now have a point estimate of 3,173 relevant documents in the discard pile (950,000*(8/2395)).
  2. We estimate that the low range of relevant documents in the discard pile is 1,330 (0.0014*950,000).
  3. We estimate that the high range of relevant documents in the discard pile is 6,270 (0.0066*950,000).

Using the upper value in our actual confidence interval, we get an improved estimate of recall. Specifically, we found 7,500 relevant documents out of what could be as many as 13,770 relevant documents, which comes to a recall rate of 54%.
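Reusing the recall_range sketch from earlier with this larger sample reproduces essentially the numbers just described (small differences come from rounding):

```python
point, worst, best = recall_range(found=7_500, discard_size=950_000,
                                  relevant_in_sample=8, sample_size=2_395)
print(f"recall about {point:.0%}, between roughly {worst:.0%} and {best:.0%}")
# about 70%, between roughly 54-55% and 85% -- compare the 54% figure in the text
# and the 85% figure in footnote 5.
```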

Would producing 54% of the relevant documents be enough to meet our duty to use reasonable efforts?[5] What do you think?

Improving the Percentages

Let’s take one more step to see what the cost would be to narrow the confidence interval even further. This time we choose a nominal 1% margin of error.

Our calculator suggests we would have to sample 9,508 documents. Assume we find 31 relevant documents out of the 9,508 documents we sampled, which would again support our richness estimate of about 0.33% (31/9508).

We will enter the sampled richness into our binomial calculator to find out our exact confidence interval.

Applying the confidence interval figures to our discard pile we reach the following conclusions:

  1. We estimate there are 3,097 relevant documents in the discard pile, again, about the same as before (950,000*(31/9508)).
  2. The lower range of relevant documents is 2,090 (0.0022*950,000).
  3. The upper range of relevant documents is 4,370 (0.0046*950,000).

Using these values for our exact confidence interval, the recall range goes from 63% (7,500/11,870) to 78% (7,500/9,590).

I think most would agree that this type of confidence interval would be reasonable. It would suggest that you found about 70% of the relevant documents in your review, with the understanding that the figure might be as low as 63% or as high as 78%.

The Cost of Proving Recall

We have found a method to prove recall by sampling the discard pile. But at what cost? If we are satisfied with a recall rate of 54% for the lower bound of our confidence interval, we would have to sample 2,395 documents. At 100 documents an hour, the sample would take about 24 hours of review to complete.[6] At $2 per document, the cost would be $4,790.

If we feel we have to narrow the interval and reach a minimum recall rate of 63%, then the sample size quadruples to 9,508 documents. If we again assume 100 documents an hour, review time would go up to 95 hours, which is more than two weeks of effort. At $2 per document, the cost would jump to $19,016.
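These cost figures follow directly from the review-speed and per-document assumptions already stated; here is a minimal sketch that tabulates them for the three sample sizes discussed:

```python
def sample_cost(sample_size, docs_per_hour=100, cost_per_doc=2.00):
    """Review hours and dollars needed to complete a confirming sample."""
    return sample_size / docs_per_hour, sample_size * cost_per_doc

for n in (600, 2_395, 9_508):
    hours, dollars = sample_cost(n)
    print(f"{n:>6,} documents -> {hours:6.1f} review hours, ${dollars:,.2f}")
# 600 documents is about 6 hours ($1,200); 2,395 is about 24 hours ($4,790);
# 9,508 is about 95 hours ($19,016).
```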

To make matters worse, what happens if our confirming sample doesn’t support our initial estimate? At that point we would have to continue our review until we found a reasonable percentage. Then we would have to review another sample from the discard pile to confirm that we had indeed found 75% of the relevant documents or whatever number we end up at.

You now see the problem inherent in proving recall. In the next part of this article, I will take a look at suggestions proposed by others to solve this problem without requiring so many documents to be sampled.

Postscript: Summarizing the Numbers

Along with William Webber, Maura Grossman and Gordon Cormack, Tom Gricks and Karl Schieneman were kind enough to read my article and make helpful comments before publication. Tom sent me this spreadsheet which summarizes nicely the results of my analysis. Thanks to Tom for giving me permission to include it here:

The sheet focuses on the discard pile calculations. The terms used in the column headings are:

  • CL: Confidence Level.
  • CI: Confidence Interval.
  • Sample: Sample size.
  • RelSAMPLE: Number of relevant documents found in the sample.
  • %: Percentage of relevant documents found in the sample (rounded to 2 decimal places).
  • Pt Est: Point Estimate.
  • BICIL: Binomial Confidence Interval (Lower).
  • BICIU: Binomial Confidence Interval (Upper).
  • RelL: Number of relevant documents expected in discard pile (lower).
  • RelU: Number of relevant documents expected in discard pile (upper).
  • RecallMIN: Percentage recall (based on finding 7,500 documents) calculated from the upper end of the Confidence Interval (the most relevant documents left in the discard pile, and therefore the worst case).
  • RecallMax: Percentage recall (based on finding 7,500 documents) calculated from the lower end of the Confidence Interval (the best case).
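For anyone who wants to recreate Tom’s summary, here is a sketch that loops over the three discard-pile samples from this article and computes the same columns. I am using the Clopper-Pearson interval throughout, so small rounding differences from the spreadsheet are possible.

```python
from scipy.stats import beta

FOUND, DISCARD = 7_500, 950_000      # documents found in review; documents in the discard pile

# (nominal margin of error, sample size, relevant documents found in the sample)
scenarios = [(0.04, 600, 2), (0.02, 2_395, 8), (0.01, 9_508, 31)]

for moe, n, k in scenarios:
    bicil = beta.ppf(0.025, k, n - k + 1)          # BICIL: lower bound on discard-pile richness
    biciu = beta.ppf(0.975, k + 1, n - k)          # BICIU: upper bound on discard-pile richness
    rel_l, rel_u = bicil * DISCARD, biciu * DISCARD            # RelL, RelU
    recall_min = FOUND / (FOUND + rel_u)           # RecallMIN (worst case)
    recall_max = FOUND / (FOUND + rel_l)           # RecallMax (best case)
    print(f"MoE {moe:.0%}: sample {n:,}, relevant {k}, "
          f"discard {rel_l:,.0f}-{rel_u:,.0f}, recall {recall_min:.0%}-{recall_max:.0%}")
```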

Footnotes

[1] I would like to thank William Webber, PhD and respected TAR mathematician, for helping me understand many of the issues here and for his assistance with the statistical processes discussed in this article. Thanks also to Maura Grossman and Gordon Cormack for offering comments, suggestions and helping me get on what I hope is the right track with this analysis. Lastly, thanks to Jeremy Pickens for his help with the article and for his continuing leadership in the field. I am just the scrivener here.

[2] There is an ongoing debate about the level of richness one should expect in a discovery collection. There is also a difference in opinion about whether using keyword search to cull large collections (increasing their richness) is appropriate. I don’t intend to get into either debate in this article; rather, my focus is on proving recall rates for low richness collections. Whether from direct collections or third-party productions, low richness collections occur often enough to deserve our attention.

[3] Each TAR project is likely to achieve different results depending on the number of relevant documents in the population, the nature of the documents and the issue being investigated. Here, we are focusing on low-richness collections. In their recently published research, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery, SIGIR’14, July 6–11, 2014, Maura Grossman and Gordon Cormack reported achieving 75% recall after reviewing far less than 5% of the total document population. (Topics 202, 207, B and C to pick several examples.) For this article, I chose 5% for total review as a convenient and reasonable number. The logic of my discussion works regardless of the number of documents you actually have to review in a specific project.

[4] With a larger sample, we might expect to find a different number of relevant documents, which might suggest a different point estimate. For this article, I am keeping the numbers consistent with a richness level of 0.33%.

[5] Interestingly, if we used the lower value in our actual confidence interval (1,330), we would conclude that we found 85% of the relevant documents (7,500/8,830). Thus, our argument would be that we found between 54% and 85% of the total relevant documents with our point estimate being 70%.

[6] You can insert a different number here to match your estimated review speed.

4 thoughts on “Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize”

Dennis Kiker

    Very clear explanation of the challenges, John. Thank you. There are two questions, however, that have eluded me in this and other similar discussions. First, how do we know that the sample we chose is truly random? In other words, when we select the 600 out of 1 million documents, how do we select them to ensure that they are, in fact, random? Most document populations that I have dealt with are not uniform in content. I have email on a variety of topics, reports, memos, spreadsheets, etc. Given the variety of content, how do you select a truly random sample?

    Second is the issue of relevance. I’ve seen articles that demonstrate great disparity among even experienced reviewers as to what is relevant and what is not. How can we account for variations in what is deemed relevant, which is the basis for all of the results?

    Bear in mind that I am not an advocate of linear review. The costs are just too prohibitive, and, in my view, even with the challenges of selecting a truly random sample and dealing with disparate views on relevance, technology-assisted review yields far greater benefits than it does challenges. But I would like to understand these issues better so that I could respond to challenges on those points.

