Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?


One of the givens of traditional CAR (computer-assisted review)[1] in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.

Later in the process, we use simple random samples to determine whether our CAR succeeded. We sample the discards to ensure that we have not overlooked too many relevant documents.

But is that the best route for our CAR? Maybe not. Our road map leads us to believe that a process called systematic random sampling will get you to your destination faster and with fewer stops. In this post, I will tell you why.[2]

About Sampling

Even we simpleton lawyers (J.D.s rather than Ph.D.s) know something about sampling. Sampling is the process by which we examine a small part of a population in the hope that our findings will be representative of the larger population. It’s what we do for elections. It’s what we do for QC processes. It’s what we do with a box of chocolates (albeit with a different purpose).

Academics call this “probability sampling” because every element of the population has some known probability of being sampled, which in turn allows us to make probabilistic statements about how likely it is that the sample is representative of the larger population.

There are several ways to do this including simple random, systematic and stratified sampling. For this article, my focus is on the first two methods: simple random and systematic.

Simple Random Sampling

The most basic form of sampling is “simple random sampling.” The key here is to employ a sampling process that ensures that each member of the sampled population has an equal chance of being selected.[3] With documents, we do this with a random number generator and a unique ID for each file. The random number generator is used to select IDs in a random order.
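For the curious, here is a minimal Python sketch of the idea. The document IDs and sample size are invented for illustration; a real review platform would, of course, handle this inside its own database.

    import random

    # Hypothetical collection: one unique ID per document in the population.
    doc_ids = ["DOC-{:06d}".format(n) for n in range(1, 100001)]

    # Simple random sample: every document has an equal chance of being selected.
    # random.sample draws without replacement from the list of IDs.
    sample_size = 400
    simple_random_sample = random.sample(doc_ids, sample_size)

    print(simple_random_sample[:5])  # five of the randomly selected IDs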

I am not going to go into confidence intervals, margins of error or other aspects of the predictive side of sampling. Suffice it to say that the size of your random sample helps determine your confidence about how well the sample results will match the larger population. That is a fun topic as well, but my focus today is on types of sampling rather than on the sample size needed to draw different conclusions.

Systematic Random Sampling

A “systematic random sample” differs from a simple random sample in two key respects. First, you need to order your population in some fashion. For people, it might be alphabetically or by size. For documents, we order them by their relevance ranking. Any order works as long as the process is consistent and it serves your purposes.

The second step is to draw your sample in a systematic fashion. You do so by choosing every Nth person (or document) in the ranking from top to bottom. Thus, you might select every 10th person in the group to compose your sample. As long as you don’t automatically start with the first person on the list but instead select your starting point at random (say, from among the first ten people), your sample is a valid form of random sampling and can be used to estimate the characteristics of the larger population. You can read more about all of this at Wikipedia and many other sources. Don’t just take my word for it.
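Here is a similar Python sketch of that two-step procedure. The function and its inputs are mine, invented for illustration, and assume you already have a relevance score for each document.

    import random

    def systematic_sample(docs, sample_size):
        # Step 1: order the population -- here, by relevance score, highest first.
        ranked = sorted(docs, key=lambda d: d[1], reverse=True)

        # Step 2: take every Nth document, starting from a random offset within
        # the first interval so the selection is still a random sample.
        interval = max(1, len(ranked) // sample_size)
        start = random.randrange(interval)
        return ranked[start::interval]

    # Toy usage: 10,000 documents as (doc_id, relevance_score) pairs.
    docs = [("DOC-{:05d}".format(n), random.random()) for n in range(10000)]
    sample = systematic_sample(docs, 500)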

Why Would I Use a Systematic Random Sample?

This is where the rubber meets the road (to overuse the metaphor). For CAR processes, there are a lot of advantages to using a systematic random sample over a simple random sample. Those advantages include getting a better picture of the document population and increasing your chances of finding relevant documents.

Let me start by emphasizing an important point. When you’re drawing a sample, you want it to be “representative” of the population you’re sampling. For instance, you’d like each sub-population to be fairly and proportionally represented. This particularly matters if sub-populations differ in the quality you want to measure.

Drawing a simple random sample means that we’re not, by our selection method, deliberately or negligently under-representing some subset of the population. However, it can still happen that, due to random variability, we oversample one subset of the population and undersample another. If the sub-populations do differ systematically, then this may skew our results. We may miss important documents.

An Example: Sports Preferences for Airport Travelers

William Webber gave me a sports example to help make the point.

Say we are sampling travelers in a major international airport to see what sports they like (perhaps to help the airport decide what sports to televise in the terminal). Now, sports preference tends to differ among countries, and airline flights go between different countries (and at different times of day you’ll tend to find people from different areas traveling).

So it would not be a good idea to just sit at one gate and sample the first hundred people off the plane. Let’s say you’re in Singapore Airport. If you happen to pick a stop-over flight on the way from Australia to India, your sample will “discover” that almost all air travelers in the terminal are cricket fans. Or if there is a lawyers’ convention in Bali and you’ve picked a flight from the United States, your study might convince the airport to show American football around the clock.

Let’s say instead that you are able to draw a purely random sample of travelers (perhaps through boarding passes–let’s not worry about the practicality of getting to these randomly sampled individuals). You’ll get a better spread, but you might tend to bunch up on some flights and miss others–perhaps 50% more samples on the Australia-India flight, and 50% fewer on the U.S.-Bali one.

This might be particularly unfortunate if some individuals were more “important” than the others. To develop the scenario, let’s say the airport also wanted to offer sports betting for profit. Then maybe American football is an important niche market, and it would be unfortunate if your random sample happened to miss those well-heeled lawyers dying to bet on that football game I am watching as I write this post.

What you’d prefer to do (and again, let’s ignore practicalities) is to spread your sample out, so that you are assured of getting an even coverage of gates and times (and even seasons of the year). Of course, your selection will still have to be random within areas, and you still might get unlucky (perhaps the lawyer you catch hates football and is crazy about croquet). But you’re more likely to get a representative sample if your approach is systematic rather than simple random.

Driving our CAR Systematically

Let’s get back in our CAR and talk about the benefit of sampling against our document ranking. In this case, the value we’re trying to estimate is “relevance” (or more exactly, something about the distribution of relevance). Here, the population differentiation is a continuous one, from the highest relevance ranking to the lowest. This differentiation is going to be strongly correlated with the value we’re trying to measure.

Highly ranked documents are more likely to be relevant than lowly ranked ones (or so we hope). So if our simple random sample happened by chance to over-sample from the top of the ranking, we’re going to overstate the total number of relevant documents in the population.

Likewise, if our random sample happened by chance to oversample from the bottom of the ranking, our sample might understate the number of relevant documents in the population. By moving sequentially through the ranking from top to bottom, a systematic random sample removes the danger of this random “bunching,” and so makes our estimate more accurate overall.
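To make the danger of “bunching” concrete, here is a small, self-contained Python simulation using toy numbers of my own invention (not data from any real matter). It builds a ranked population in which the chance of relevance falls from top to bottom, then compares how much the prevalence estimates from simple random samples and from systematic samples wobble from run to run.

    import random
    import statistics

    random.seed(7)

    # Toy ranked population: the probability that a document is relevant decays
    # from about 90% at the top of the ranking to about 2% at the bottom.
    N = 20000
    population = [random.random() < (0.9 - 0.88 * rank / N) for rank in range(N)]

    SAMPLE_SIZE, TRIALS = 400, 2000

    def prevalence(sample):
        return sum(sample) / len(sample)

    simple_estimates, systematic_estimates = [], []
    for _ in range(TRIALS):
        # Simple random sample: positions chosen anywhere in the ranking.
        simple_estimates.append(prevalence(random.sample(population, SAMPLE_SIZE)))

        # Systematic sample: every Nth document, with a random starting offset.
        interval = N // SAMPLE_SIZE
        start = random.randrange(interval)
        systematic_estimates.append(prevalence(population[start::interval]))

    print("true prevalence:        {:.3f}".format(prevalence(population)))
    print("simple random  std dev: {:.4f}".format(statistics.stdev(simple_estimates)))
    print("systematic     std dev: {:.4f}".format(statistics.stdev(systematic_estimates)))

If the ordering really is correlated with relevance, the systematic estimates should cluster more tightly around the true figure, which is the stratifying effect described above.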

At different points in the process, we might also want information about particular parts of the ranking. First, we may be trying to pick a cutoff. That suggests we need good information about the area around our candidate cutoff point.

Second, we might wonder if relevant documents have managed to bunch in some lower part of the ranking. It would be unfortunate if our simple random sample happened not to pick any documents from this region of interest. It would mean that we might miss relevant documents.

With a systematic random sample, we are guaranteed that each area of the ranking is equally represented. That is the point of the sample: to draw from each segment of the ranking (each decile, for example) and see what kinds of documents live there. Indeed, if we are already determined to review the top-ranking documents, we might want to place more emphasis on the lower rankings. Or not, depending on our goals and strategy.
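As a rough sketch of what that segment-by-segment look might produce, here is a toy Python example (again, fabricated numbers of my own): it takes a systematic sample of coded documents and tallies the relevance rate within each decile of the ranking.

    import random
    from collections import defaultdict

    random.seed(3)

    # Fabricated inputs: a population of 20,000 ranked documents and a systematic
    # sample of (rank_position, is_relevant) pairs, as if every 50th document had
    # been reviewed and coded.
    POPULATION_SIZE = 20000
    reviewed = [
        (pos, random.random() < max(0.02, 0.9 - pos / POPULATION_SIZE))
        for pos in range(25, POPULATION_SIZE, 50)
    ]

    # Tally relevance by decile of the ranking to see where the documents "live".
    tallies = defaultdict(lambda: [0, 0])  # decile -> [relevant, reviewed]
    for pos, is_relevant in reviewed:
        decile = min(9, pos * 10 // POPULATION_SIZE)
        tallies[decile][0] += int(is_relevant)
        tallies[decile][1] += 1

    for decile in range(10):
        relevant, total = tallies[decile]
        print("decile {:2d}: {:3d}/{:3d} relevant ({:.0%})".format(
            decile + 1, relevant, total, relevant / total))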

Either way, the point of a systematic random sample is to ensure that we sample documents across the ranking–from top to bottom. We do so in the belief that it will provide a more representative look at our document population and give us a better basis to draw a “yield curve.”[4] To be fair, however, the documents selected from a particular region might not be representative of that region. Whether you choose simple random or systematic sampling, there is always the chance that you will miss important documents.

Does it Work?

Experiments have shown us that documents can bunch together in a larger population. Back in the paper days, I knew that certain custodians were likely to have the “good stuff” in their correspondence files and I always went there first in my investigation. Likewise, people generally kept types of documents together in boxes, which made review quicker. I could quickly dismiss boxes of receipts when they didn’t matter to my case while spending my time on research notebooks when they did.

Similarly, and depending on how they were collected, relevant documents are likely to be found in bunches across a digital population. After all, I keep files on my computer in folders much like I did before I had a computer. It helps with retrieval. Other people do as well. The same is true for email, which I dutifully folder to keep my inbox clear.

So, no problem if those important documents get picked up during a random sample, or even because they are similar to other documents tagged as relevant. However, sometimes they aren’t picked up. They might still be bunched together but simply fall toward the bottom of the ranking. Then you miss out on valuable documents that might be important to your case.

While no method is perfect, we believe that a systematic random sample offers a better chance that these bunches get picked up during the sampling process. The simple reason is that we are intentionally working down the ranking to make sure we see documents from all segments of the population.

From experiments, we have seen this bunching across the ranking (yield) curve. By adding training documents from these bunches, we can quickly improve the ranking, which means we find more relevant documents with less effort. Doing so means we can review fewer documents at a lower cost. The team is through more quickly as well, which is important when deadlines are tight.

Many traditional systems don’t support systematic random sampling. If that is the case with your CAR, you might want to think about an upgrade. There is no doubt that simple random sampling will get you home eventually but you might want to ride in style. Take a systematic approach for better results and leave the driving to us.

 


[1] I could use TAR (Technology Assisted Review) but it wouldn’t work as well for my title. So, today it is CAR. Either term works for me.

[2] Thanks are due to William Webber, Ph.D., who helped me by explaining many of the points raised in this article. Webber is one of a small handful of CAR experts in the marketplace and, fortunately for us, a member of our Insight Predict Advisory Board. I am using several of his examples with permission.

[3] Information retrieval scientists put it this way: In simple random sampling, every combination of elements from the population has the same probability of being chosen as the sample. The distinction here is probably above the level of this article (and my understanding).

[4] Yield curves are used to represent the effectiveness of a document ranking and are discussed in several other blog posts I have written (see, e.g., here, here, here and here). They can be generated from a simple random sample but we believe a systematic random sample–where you move through all the rankings–will provide a better and more representative look at your population.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation technology, a book (supplemented annually) on deposition techniques and several other widely-read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.