One of the givens of traditional CAR (computer-assisted review)^{[1]} in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.

Later in the process, we use simple random samples to determine whether our CAR succeeded. We sample the discards to ensure that we have not overlooked too many relevant documents.

But is that the best route for our CAR? Maybe not. Our road map leads us to believe a process called systematic random sampling will get you to your destination faster and with fewer stops. In this post, I will tell you why.^{[2]}

# About Sampling

Even we simpleton lawyers (J.D.s rather than Ph.D.s) know something about sampling. Sampling is the process by which we examine a small part of a population in the hope that our findings will be representative of the larger population. It’s what we do for elections. It’s what we do for QC processes. It’s what we do with a box of chocolates (albeit with a different purpose).

Academics call it “probability sampling” because every element has some known probability of being sampled, which in turn allows us to make probabilistic statements about how likely it is that the sample is representative of the larger population.

There are several ways to do this, including simple random, systematic, and stratified sampling. For this article, my focus is on the first two: simple random and systematic.

# Simple Random Sampling

The most basic form of sampling is “simple random sampling.” The key here is to employ a sampling process that ensures that each member of the sampled population has an equal chance of being selected.^{[3]} With documents, we do this with a random number generator and a unique ID for each file. The random number generator is used to select IDs in a random order.
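In code, drawing such a sample is only a few lines (a minimal sketch; the document IDs and sample size here are made up for illustration):

```python
import random

# Hypothetical collection: 10,000 documents with unique IDs.
doc_ids = list(range(1, 10_001))

def simple_random_sample(ids, sample_size, seed=None):
    """Each document has an equal chance of selection; the sample
    is drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(ids, sample_size)

sample = simple_random_sample(doc_ids, 400, seed=42)
```

Seeding is optional; it just makes the draw repeatable, which can be handy for QC.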

I am not going to go into confidence intervals, margins of error or other aspects of the predictive side of sampling. Suffice it to say that the size of your random sample helps determine your confidence about how well the sample results will match the larger population. That is a fun topic as well, but my focus today is on types of sampling rather than the sample size needed to draw different conclusions.

# Systematic Random Sampling

A “systematic random sample” differs from a simple random sample in two key respects. First, you need to order your population in some fashion. For people, it might be alphabetically or by size. For documents, we order them by their relevance ranking. Any order works as long as the process is consistent and it serves your purposes.

The second step is to draw your sample in a systematic fashion. You do so by choosing every Nth person (or document) in the ranking from top to bottom. Thus, you might select every 10^{th} person in the group to compose your sample. As long as you don’t start with the first person on the list but instead select your first person in the order randomly (say from the top ten people), your sample is a valid form of random sampling and can be used to determine the characteristics of the larger population. You can read more about all of this at Wikipedia and many more sources. Don’t just take my word for it.
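A systematic draw is just as easy to sketch (again with invented numbers; the only randomness is the starting position within the first interval):

```python
import random

def systematic_random_sample(ranked_ids, sample_size, seed=None):
    """Pick a random start within the first interval, then take
    every Nth document down the ranking."""
    interval = len(ranked_ids) // sample_size        # the "N" in "every Nth"
    start = random.Random(seed).randrange(interval)  # random start in first interval
    return ranked_ids[start::interval][:sample_size]

ranked = list(range(1, 10_001))                      # ordered by relevance rank
picks = systematic_random_sample(ranked, 100, seed=7)
gaps = {b - a for a, b in zip(picks, picks[1:])}     # spacing is constant
```

Because the start is random, every document still has an equal chance of selection, but the picks are guaranteed to be evenly spaced down the ranking.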

# Why Would I Use a Systematic Random Sample?

This is where the rubber meets the road (to overuse the metaphor). For CAR processes, there are a lot of advantages to using a systematic random sample over a random sample. Those advantages include getting a better picture of the document population and increasing your chances of finding relevant documents.

Let me start by emphasizing an important point. When you’re drawing a sample, you want it to be “representative” of the population you’re sampling. For instance, you’d like each sub-population to be fairly and proportionally represented. This particularly matters if sub-populations differ in the quality you want to measure.

Drawing a simple random sample means that we’re not, by our selection method, deliberately or negligently under-representing some subset of the population. However, it can still happen that, due to random variability, we oversample one subset of the population and undersample another. If the sub-populations do differ systematically, then this may skew our results. We may miss important documents.

# An Example: Sports Preferences for Airport Travelers

William Webber gave me a sports example to help make the point.

Say we are sampling travelers in a major international airport to see what sports they like (perhaps to help the airport decide what sports to televise in the terminal). Now, sports preference tends to differ among countries, and airline flights go between different countries (and at different times of day you’ll tend to find people from different areas traveling).

So it would not be a good idea to just sit at one gate and sample the first hundred people off the plane. Let’s say you’re in Singapore Airport. If you happen to pick a stop-over flight on the way from Australia to India, your sample will “discover” that almost all air travelers in the terminal are cricket fans. Or if there is a lawyers’ convention in Bali and you’ve picked a flight from the United States, your study might convince the airport to show American football around the clock.

Let’s say instead that you are able to draw a purely random sample of travelers (perhaps through boarding passes–let’s not worry about the practicality of getting to these randomly sampled individuals). You’ll get a better spread, but you might tend to bunch up on some flights, and miss others–perhaps 50% more samples on the Australia-India flight, and 50% fewer on the U.S.-Bali one.

This might be particularly unfortunate if some individuals were more “important” than the others. To develop the scenario, let’s say the airport also wanted to offer sports betting for profit. Then maybe American football is an important niche market, and it would be unfortunate if your random sample happened to miss those well-heeled lawyers dying to bet on that football game I am watching as I write this post.

What you’d prefer to do (and again, let’s ignore practicalities) is to spread your sample out, so that you are assured of getting an even coverage of gates and times (and even seasons of the year). Of course, your selection will still have to be random within areas, and you still might get unlucky (perhaps the lawyer you catch hates football and is crazy about croquet). But you’re more likely to get a representative sample if your approach is systematic rather than simple random.

# Driving our CAR Systematically

Let’s get back in our CAR and talk about the benefit of sampling against our document ranking. In this case, the value we’re trying to estimate is “relevance” (or more exactly, something about the distribution of relevance). Here, the population differentiation is a continuous one, from the highest relevance ranking to the lowest. This differentiation is going to be strongly correlated with the value we’re trying to measure.

Highly ranked documents are more likely to be relevant than lowly ranked ones (or so we hope). So if our simple random sample happened by chance to over-sample from the top of the ranking, we’re going to overstate the total number of relevant documents in the population.

Likewise, if our random sample happened by chance to oversample from the bottom of the ranking, our sample might understate the relevance population. By moving sequentially through the ranking from top to bottom, a systematic random sample removes the danger of this random “bunching,” and so makes our estimate more accurate overall.

At different points in the process, we might also want information about particular parts of the ranking. First, we may be trying to pick a cutoff. That suggests we need good information about the area around our candidate cutoff point.

Second, we might wonder if relevant documents have managed to bunch in some lower part of the ranking. It would be unfortunate if our simple random sample happened not to pick any documents from this region of interest. It would mean that we might miss relevant documents.

With a systematic random sample, we are guaranteed that each area of the ranking is equally represented. That is the point of the sample: to draw from each segment in the ranking (each decile, for example) and see what kinds of documents live there. Indeed, if we are already determined to review the top-ranking documents, we might want to place more emphasis on the lower rankings. Or not, depending on our goals and strategy.

Either way, the point of systematic random sampling is to ensure that we sample documents across the ranking–from top to bottom. We do so in the belief that it will provide a more representative look at our document population and give us a better basis for drawing a “yield curve.”^{[4]} To be fair, however, a document selected from a particular region might not be representative of that region. Whether you choose random or systematic, there is always the chance that you will miss important documents.
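As a toy illustration (everything here, including the collection size and the relevance model, is invented for the sketch): each sampled document stands in for its whole interval of the ranking, so the relevant count in the sample, scaled up by the interval, estimates the total.

```python
import random

rng = random.Random(0)
N = 10_000                                   # hypothetical collection size
# Unknown "truth": probability of relevance declines with rank.
truth = [rng.random() < max(0.0, 0.9 - rank / N) for rank in range(N)]

interval = 100                               # one sample per 100 ranks
start = rng.randrange(interval)              # random start, then every 100th
sampled = range(start, N, interval)

# Each sample represents its whole interval of the ranking.
estimated_relevant = sum(truth[r] for r in sampled) * interval
actual_relevant = sum(truth)
```

A yield curve is essentially this same tally accumulated from the top of the ranking down, and every segment of the ranking contributes exactly one observation, which is what keeps the estimate from bunching.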

# Does it Work?

Experiments have shown us that documents can bunch together in a larger population. Back in the paper days, I knew that certain custodians were likely to have the “good stuff” in their correspondence files and I always went there first in my investigation. Likewise, people generally kept types of documents together in boxes, which made review quicker. I could quickly dismiss boxes of receipts when they didn’t matter to my case while spending my time on research notebooks when they did.

Similarly, and depending on how they were collected, relevant documents are likely to be found in bunches across a digital population. After all, I keep files on my computer in folders much like I did before I had a computer. It helps with retrieval. Other people do as well. The same is true for email, which I dutifully folder to keep my inbox clear.

So, no problem if those important documents get picked up during a random sample, or even because they are similar to other documents tagged as relevant. However, sometimes they aren’t picked up. They might still be bunched together but simply fall toward the bottom of the ranking. Then you miss out on valuable documents that might be important to your case.

While no method is perfect, we believe that a systematic random sample offers a better chance that these bunches get picked up during the sampling process. The simple reason is that we are intentionally working down the ranking to make sure we see documents from all segments of the population.

From experiments, we have seen this bunching across the ranking (yield) curve. By adding training documents from these bunches, we can quickly improve the ranking, which means we find more relevant documents with less effort. Doing so means we can review fewer documents at a lower cost. The team is through more quickly as well, which is important when deadlines are tight.

Many traditional systems don’t support systematic random sampling. If that is the case with your CAR, you might want to think about an upgrade. There is no doubt that simple random sampling will get you home eventually but you might want to ride in style. Take a systematic approach for better results and leave the driving to us.

[1] I could use TAR (Technology Assisted Review) but it wouldn’t work as well for my title. So, today it is CAR. Either term works for me.

[2] Thanks are due to William Webber, Ph.D., who helped me by explaining many of the points raised in this article. Webber is one of a small handful of CAR experts in the marketplace and, fortunately for us, a member of our Insight Predict Advisory Board. I am using several of his examples with permission.

[3] Information retrieval scientists put it this way: In simple random sampling, every combination of elements from the population has the same probability of being the sample. The distinction here is probably above the level of this article (and my understanding).

[4] Yield curves are used to represent the effectiveness of a document ranking and are discussed in several other blog posts I have written (see, e.g., here, here, here and here). They can be generated from a simple random sample but we believe a systematic random sample–where you move through all the rankings–will provide a better and more representative look at your population.

Bill Dimm: I fully agree that spreading out the samples more uniformly produces better results (fewer samples needed to achieve the same amount of error), but systematic random sampling seems a bit flawed. To see the problem, consider a similar, but slightly different approach to taking M samples: Order the documents based on relevance ranking, divide the set up into M equal-sized buckets (or strata) of documents based on the ranking, and select one random sample from each of the M buckets.

This is the same as systematic random sampling except that you take a random document from each bucket instead of the Nth document from each bucket, so you don’t favor any particular position within each bucket — you get an unbiased estimate for each bucket individually, and errors due to picking documents high/low from within individual buckets tend to average out. In contrast, taking the Nth document from each bucket will always favor documents with either above- or below-average (for that particular bucket) relevance ranking for every single bucket, depending on that single initial random choice of N.

The systematic random sampling approach is unbiased, but only in a rather strange sense: your systematic error, which will tend to have the same sign for every single bucket, has an overall random factor applied to it that would tend to cancel out if you repeated the entire sampling experiment many times — like knowing the exact right answer and adding a single random +1 or -1 with equal probability to it, giving an unbiased (but always wrong) result. If the number of samples, M, is large, the error will be pretty small because the bucket is small so the position within the bucket won’t have much impact, but it seems like an unnecessary error.

The process is extremely similar to numerical integration, where simple random sampling is like Monte Carlo integration (with terrible 1/sqrt(M) error bar) and uniform spacing of samples is like applying the trapezoid rule (with 1/M^2 error bar; ignoring 0.5 weighting of endpoints). The random choice of the Nth document for every bucket is like shifting all of the sample points in the trapezoid rule, which introduces a 1/M error with size depending on how far N is from the center of the bucket.
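The bucket-based alternative described above (one random document per stratum) can be sketched as follows, with illustrative names and sizes rather than anyone's production code:

```python
import random

def stratified_one_per_bucket(ranked_ids, m, seed=None):
    """Divide the ranking into m equal-sized buckets and draw one
    random document from each, so no position within a bucket is
    favored."""
    rng = random.Random(seed)
    bucket = len(ranked_ids) // m
    return [ranked_ids[i * bucket + rng.randrange(bucket)]
            for i in range(m)]

picks = stratified_one_per_bucket(list(range(1_000_000)), 100, seed=3)
```

Each pick lands somewhere inside its own bucket, but the position within each bucket is independent of the others, which is the difference from the systematic draw.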

William Webber: Bill,

Hi! Sharply observed!

Systematic sampling is, I think, certainly lower variance than simple random sampling for sampling a ranking, due to possible bunching effects in the latter. I see your point, though, about systematic sampling tending to “err with the slope,” though in practice I think this would only be an issue at higher rankings (the slope being too slight to matter much further down the ranking), and with a reasonable sample size would only be a slight issue.

What you’re proposing is essentially stratified sampling with a single element per stratum. This would, I agree, avoid a “systematic” error with the slope. But if relevant documents tend to “cluster” in the ranking (as they would with near-dupes), then you run an (admittedly slight) risk that the observations in adjoining strata may fall close together, and within the same “cluster.” And it might be (again slightly) less accurate for selecting a cutoff point, since samples will not be evenly spread across all potential cutoff points.

Anyway, it would be interesting to see empirical (or analytical) results.

William

Jeremy Pickens: Bill,

Thank you for your thoughts.

I definitely see what you’re saying in terms of integration, the trapezoid rule, etc. But here is the thing: We’re not necessarily trying to get a perfectly accurate estimate of the shape of the function at every point along the curve. We’re trying to get an accurate estimate of where the n% recall point is.

So you could imagine making the stratified sample and the systematic sample conceptually equivalent by wrapping imaginary, post hoc, evenly-spaced buckets around your systematic sample, correct? In which case, the systematic sample is (conceptually) a stratified sample wherein you always pick the k^{th} ordered document in that bucket, non?

Now, please correct my thinking if you see this differently — I state the following by way of discussion not of absolutism — but even if you’re taking the k^th document from each bucket, because you’ve chosen k at random (even though it’s the same k every time), as you integrate across buckets you’ll still arrive at the bucket at which n% recall is achieved at the same time, as in the stratified, one document per bucket approach, will you not?

That is, your trapezoid might be consistently slightly high or slightly low within each of your buckets, but the cumulative sum across all the buckets does not slip further and further behind. You either stay slightly above or slightly below the whole way across, and arrive at the bucket that contains the kth percentile recall at essentially the same point.

And if that’s the case, then all that matters is your one document sample in that one, final, stopping-point, n% recall bucket, right? And from that perspective, I do not see a difference between the systematic approach and the approach that you describe. Why? Because if you’re in that same n% recall bucket in both cases, all that matters is that single document sample from within that one bucket. And with a single document sample from a single bucket, you still have an equal chance of being slightly high or slightly low as you do if you’d picked the kth starting point (within your imaginary buckets for the systematic sample) at the beginning of the process.

D’ya see what I’m saying?

Bill Dimm: Hi Jeremy,

I want to emphasize that this is a 1/M error, so it can be a little tricky to see unless you take the number of samples, M, to be somewhat small. If M is fairly large, the 1/M error is probably negligible compared to the noise in the data, so it’s not likely to cause a problem as long as you are aware that M needs to be large.

“…conceptually equivalent by wrapping imaginary, post hoc, evenly-spaced buckets around your systematic sample…”

You’re going to have documents at one end of the ranking that aren’t in a bucket, and at the other end a bucket that falls off the end of the documents (unless you just happen to pick your random N to be half the width of a bucket). It may sound like I’m nitpicking, but this is actually significant as you’ll see below.

“…if you’re in that same n% recall bucket in both cases, all that matters is that single document sample from within that one bucket…”

Not quite. You’ve effectively got an “off by approximately one” error in both the numerator (number of relevant documents before the cutoff) and the denominator (total number of relevant documents) of the recall, so (1+x)/(1+y) instead of x/y. If you are looking for the 100% recall level, the numerator and denominator are the same and the error cancels out, but for any other recall level the percentage impact on the numerator is bigger than the percentage impact on the denominator so it doesn’t cancel.

It’s usually easiest to see problems by taking an extreme example, so assume we have 1,000,000 documents and take only 100 samples (M=100). Compare results for k=1 and k=10,000 (two different random starting points that are as extreme as possible to magnify the error – I’m using k here to match your notation, while John and I used N earlier). For k=1 we are sampling documents with rank: 1, 10001, 20001, 30001, etc. For k=10,000 we sample documents with rank: 10000, 20000, 30000, etc. So, k=1 and k=10000 are virtually identical (we’ll assume samples with adjacent rank like 10000 and 10001 give the same result — ignoring all noise to avoid unnecessary complication) except k=10,000 trades a rank 1 document for a rank 1,000,000 document. Assume that the predictive coding algorithm does a perfect job, so we are actually just sampling a step function, and the step occurs at a point that is not near any sample (just to avoid unnecessary complication about edge effects).

For k=1 you find that documents with rank 1, 10001, and 20001 are relevant and the rest aren’t. For k=10000 you find that documents with rank 10000 and 20000 are relevant and the rest aren’t. So k=1 associates 67% recall (2/3) with rank 10001 while k=10000 associates 50% recall (1/2) with rank 10000. They would associate 100% recall with rank 20001 and 20000 respectively, so no significant disagreement there.

I took a naive approach to computing the recall in the previous paragraph — it doesn’t adjust for the fact that there is an excess of samples with rank <= 10001 for the k=1 case (two samples representing 10001 documents instead of one sample representing 10000 documents). You might employ a more sophisticated approach that would adjust for that by computing the fraction of samples above some rank that are responsive and applying that fraction to the total number of documents above that rank. So, for k=1, 100% of samples with rank <= 20001 are relevant, so we estimate a total of 20001 relevant documents (you could argue for more than 20001 relevant documents due to uncertainty about documents with rank in the range [20002,30000] but that’s a whole different can of worms). 100% of samples with rank <= 10001 are relevant, so we estimate 10001 relevant documents with rank <= 10001, so 50.0025% (yes, way too many significant digits – done for comparison) recall at rank 10001. With k=10000 the same approach gives 50.0000% recall at rank 10000, so they’re effectively identical. It may seem like we’ve eliminated the problem completely, but that’s really only the case if our predictive coding algorithm is so perfect that it gives a step function for the probability of relevance. If we take a more realistic case the error comes back, just somewhat reduced. If probability of relevance declines as rank increases, a calculation of the fraction of documents that are relevant above a certain rank will tend to be higher for k=1 than k=10000 due to inclusion of the rank=1 document.

Of course, the fact that our sample in the example above includes only 2 or 3 relevant documents should be a big red flag that M is too small, but the point was to make the effect large enough to be obvious.
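The arithmetic in this example is easy to reproduce (a sketch; the step is placed at rank 25,000, an arbitrary point between samples):

```python
# 1,000,000 documents, M=100 samples, relevant means rank <= 25,000.
N, M, STEP = 1_000_000, 100, 25_000

def sample_ranks(k):
    """Ranks sampled for starting point k: k, k+10000, k+20000, ..."""
    return list(range(k, N + 1, N // M))[:M]

def naive_recall_at(cutoff, k):
    """Fraction of the relevant samples found at or before the cutoff."""
    hits = [r for r in sample_ranks(k) if r <= STEP]
    return sum(1 for r in hits if r <= cutoff) / len(hits)

print(naive_recall_at(10_001, 1))        # k=1: 2/3
print(naive_recall_at(10_000, 10_000))   # k=10,000: 1/2
```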


Jeremy Pickens: Bill — ok, thank you for your thoughtful reply.

After rereading your comments three times to make sure I was on the right page before responding 🙂 and discussing with some others, I see what you’re saying. I just… I think that there has to be a perfect storm of small sample size, relevance step function perfection, etc. for it to make a huge difference relative to other options.

I think this line sums it up perfectly: “If probability of relevance declines as rank increases, a calculation of the fraction of documents that are relevant above a certain rank will tend to be higher for k=1 than k=10000 due to inclusion of the rank=1 document.”

And yes, probability of relevance does decline as rank increases. But more importantly, the actual relevant documents are fewer and farther between. And they typically tend to cluster a bit, score-wise, because they share certain similar terms with your training documents. So your relevant docs will have little runs in between wider swaths of nonrelevant docs. And this tends to occur more and more as you approach your proportionality target (70%? 80%? 90%?) recall.

And so here is where I’m still scratching my head: If we take your approach, and randomly pick 1 document from within each of the M buckets, I think I agree with you that at low recall, the randomness will even itself out, while in our approach, you might get a slight over- or under-estimate, i.e. the system might say you’re at 20% recall when you’re really at 15%.. or say that you’re at 10% recall when you’re really at 15%.

But you don’t stop when you’re at 15% recall. You stop when you’re at (let’s say for the sake of argument) 80% recall.

And at 80% recall, since you have this sparse, clustery behavior of the relevant docs, the systematic sample will actually have lower error than the 1 doc per bucket approach, won’t it? And that’s because the 1 doc per bucket approach will, itself, tend to be clumpy, to cluster. To go with the numbers in your example above, you might randomly happen to select rank 10,000 in bucket M_(i) and rank 1 in bucket M_(i+1). If that clumpiness happens to hit a little tail relevance pocket, then you’re going to essentially count two very similar relevant documents twice, and overestimate the fact that you’ve hit 80% recall. You might still only be at 75% recall. And if those two clumpy samples miss relevant docs completely, you’ll be underestimating your recall.

Even if one neighboring bucket sample hits a relevant doc and one doesn’t, what if your random samples happened to hit rank 1 in bucket M_(i) and rank 10,000 in bucket M_(i+1)? Then you’ll have a huge gap in between your estimates of where the last or next relevant document is found, which means you might go almost 2M documents too far, or not 2M documents far enough, right?

In those tail, high recall areas of the ranking, then, the approach that minimizes the clumpiness is the systematic random sample approach, because it guarantees the most regularity, the most evenness, between intervals.

And so which error is greater… being off a little at the beginning, but then being more even-handed as you reach your target… or being more even-handed at the beginning, and then off a little as you reach your target?

At Catalyst we have already done a number of experiments in which we compare the systematic approach against a simple random sample across the entire collection. And the error for the systematic approach was definitely lower than the simple random approach when it came to estimating whether you’d hit 80% recall.. I believe because of the natural clumpiness of simple random sampling (as well as the relevance clumpiness in the tail of the ranking).

I can see that your 1 random per bucket approach constrains some of that full-collection simple random clumpiness issue, but it doesn’t go away completely. Is the 1 random per bucket approach the correct middle ground, the sweet spot between the full clumpiness of the simple random sample and the full regularity of the systematic random sample? Maybe, maybe not. It would be very interesting to move out from theory and put it to the empirical test on real ranking data with full ground truth. And test it, not just using a 3 document sample (which I know you were just using for illustrative purposes) or even a 100 document sample. But a full 95/5 or 99/2-sized sample. The same number of documents in both approaches. And see what sort of variance each yields.

Bill Dimm: “…I think that there has to be a perfect storm of small sample size, relevance step function perfection, etc…”

Just to be clear, the error does not depend on the relevance being a step function in any way whatsoever. I chose the step function in hopes that it would make the explanation easier to follow by avoiding talk of ratios of random variables and calculating expectation values.

“If that clumpiness happens to hit a little tail relevance pocket, then you’re going to essentially count two very similar relevant documents twice, and overestimate the fact that you’ve hit 80% recall. You might still only be at 75% recall. And if those two clumpy samples miss relevant docs completely, you’ll be underestimating your recall.”

There are two things I need to comment on here. I’ll start with the more obvious one. Over-counting responsive documents by one has much less impact at high recall than at more moderate recall. This was discussed earlier — you have an extra +1 in both the numerator (only after encountering the “extra” point) and the denominator, so the impact shrinks as you approach the point where the numerator and denominator would have been equal without the extra +1, i.e. 100% recall. To get the recall estimate to jump to 80% when it should have been only 75% by over-counting just one point would imply that we should have had only 4 relevant documents in the entire sample but we over-counted to find 5 instead (3/4=75% becomes 4/5=80%). I don’t know if you intentionally picked an extreme example to mirror mine, but clearly you can’t claim that you can distinguish the difference between 75% recall and 80% recall with only 4 or 5 relevant documents — worrying about clumps in the tail would be like measuring microns with a yardstick.

How likely are we to hit the same clump twice? Presumably, a clump is smaller than a bucket, or you would risk hitting it more than once with your current approach (and be guaranteed to hit it at least twice with either approach if it was bigger than two buckets). If the clump falls entirely within one bucket, and we assume that each possible position of the clump within the bucket has equal probability, then the probability of hitting the clump with the sample is exactly the same for SRS (systematic random sampling) where you always pick the same location within the bucket as it is for SS (stratified sampling) where you pick a random position within the bucket, so there is no difference — both give a probability of hitting the clump equal to the number of documents in the clump divided by the number of documents in the bucket. So, you only have an opportunity for a difference when the clump straddles the boundary between two buckets. If the clump does straddle two buckets, you also need to hit it in both the left bucket (probability equal to the number of docs from the cluster in the left bucket divided by the size of the bucket) and the right bucket. If we take the number of documents in a bucket to be B, and the number of documents in the clump to be C, there are B possible, equally likely (probability 1/B), positions relative to the left edge of the left bucket where the left edge of the clump can start, and C-1 of them will have the clump straddling into the right bucket. So, the probability of hitting it twice is (i is the number of clump documents in the left bucket, so there are C-i documents from the clump in the right bucket):

P(2 hits) = sum[i=1 to C-1, i*(C-i)/B^3]

If I haven’t botched the math, that comes out to

P(2 hits) = (C^3 – C) / (6 * B^3)

The term linear in C will be negligible compared to the C^3 term — if we ignore it we have a result that depends only on the ratio (C/B). If C/B is 0.5, i.e. a clump is half the size of a bucket, we find the probability of a double hit to be less than 1/48. So you would need about 48 clumps to average just one double-hit, and I argued in the previous paragraph that over-counting by one would have very little impact on the recall in the neighborhood of 80% recall.
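The sum and the closed form are easy to check against each other numerically (a quick sketch using exact rational arithmetic):

```python
from fractions import Fraction

def p_two_hits_sum(C, B):
    # Direct sum: P(2 hits) = sum_{i=1}^{C-1} i*(C-i) / B^3
    return sum(Fraction(i * (C - i), B**3) for i in range(1, C))

def p_two_hits_closed(C, B):
    # Closed form: (C^3 - C) / (6 * B^3)
    return Fraction(C**3 - C, 6 * B**3)

# The two agree, and at C/B = 0.5 the probability is just under 1/48.
assert all(p_two_hits_sum(c, 100) == p_two_hits_closed(c, 100)
           for c in range(1, 101))
print(p_two_hits_closed(50, 100))        # 833/40000, about 0.0208
```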

As far as missing a clump goes, if the clump is bigger than a few buckets it won’t be missed, and if it is smaller than a bucket it will be missed with the appropriate probability and won’t have a huge impact on recall in the 80% range anyway.

“It would be very interesting to move out from theory and put it to the empirical test…”

With a large number of buckets, I would be very surprised if you could see any difference between SRS and SS beyond random noise. Nonetheless, I think it is worth pointing out that the SRS approach isn’t appropriate if the number of buckets isn’t large, so that a reader doesn’t misapply it. Furthermore, I don’t see where you gain anything (aside from questions) by adding the same random shift to all of the sample points rather than sampling from the center of each bucket if you want equal spacing.

Jeremy Pickens:

“I don’t see where you gain anything (aside from questions) by adding the same random shift to all of the sample points rather than sampling from the center of each bucket if you want equal spacing.”

What you gain is an advantage relative to the way the industry typically does validation, which is a simple random sample. Or rather, two simple random samples: one at the beginning (pre-TAR) across the entire collection to estimate collection-wide prevalence, and another at the end (post-TAR) from the set of documents that are not being produced, to ensure that (relative to what you started with) you’re not missing “too much.”

With the systematic sample approach to validation, what one is doing is a post-TAR only sample. As such, since a pre-TAR sample is not needed, this gives the reviewer doing the validation a chance to double the size of the systematic sample and still do no more effort than otherwise would have been done.

So if your systematic sample happens to hit at ranks 1, 10001, 20001, or happens to hit at ranks 10000, 20000, 30000, then doubling the size of that sample means you are now hitting at 1, 5001, 10001, 15001, 20001, 25001 (or at 5000, 10000, 15000, 20000, 25000, 30000). So in the worst case you’re hitting both the middle and the endpoints, and I really don’t see the issue here.
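The doubling argument can be sketched in a few lines (my illustration, using the same ranks quoted above and an assumed 30,000-document ranking):

```python
# Doubling a systematic sample halves the skip: every original sample point
# is kept, and each gap between points gains its midpoint.

def systematic_sample(start, skip, n_docs):
    """Ranks hit by a systematic sample starting at `start` with step `skip`."""
    return list(range(start, n_docs + 1, skip))

n_docs = 30000
original = systematic_sample(1, 10000, n_docs)  # [1, 10001, 20001]
doubled = systematic_sample(1, 5000, n_docs)    # [1, 5001, 10001, ..., 25001]

assert set(original) <= set(doubled)  # nothing is lost by doubling
print(doubled)
```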

And even if you choose not to set your sample size equal to the sum of the before+after samples that you otherwise would have done, and instead only do a regular-sized “after” sample, the software we wrap around our process, and the team of consultants we’ve got providing the service, ensure that the user of the service doesn’t misapply it, i.e., doesn’t take a sample that is too small.

If your concern is to the casual reader going out on their own and attempting to replicate this form of post-hoc only validation, then I agree with the caution. However, I think the bigger issue is not the differences between edge cases of the systematic vs “one random per bucket” approach. The bigger issue is not taking a sample size that is big enough.

That is, doing a 100-document systematic sample might be a bad idea. But doing a 100-document “one random per bucket” sample is also a bad idea. That has nothing to do with the one-random-per-bucket approach, and everything to do with the sample size of 100.

I still think it would make a very nice investigation to actually take some real data (full rankings, full collection judgments) and do an empirical exploration of what happens as the sample size gets bigger and bigger (or smaller and smaller). Start at a sample size of 1000 for both approaches (systematic and one per bucket), estimate the error, and then step the size down to 950, 900, 850, etc. all the way down to 50, and show the error for each. Just to take this out of the theoretical realm and into empirical practice. Again, not that I’m recommending validating your TAR process using a sample size of 50. Not at all. But the more one understands one’s tools, the better one gets at wielding them. Doing such a parameter sweep might yield really interesting insights.
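The parameter sweep described above is straightforward to prototype. Below is a minimal sketch (mine, using an assumed synthetic ranking whose relevance density decreases linearly with rank, not real TAR data) that estimates prevalence with both sampling schemes at decreasing sample sizes:

```python
import random

random.seed(7)
N = 100_000
# Synthetic ground truth: probability of relevance falls linearly with rank.
labels = [1 if random.random() < 1.0 - r / N else 0 for r in range(N)]
true_prevalence = sum(labels) / N

def systematic(n):
    """Systematic sample: one random start, then a fixed skip."""
    skip = N // n
    start = random.randrange(skip)
    return [start + i * skip for i in range(n)]

def one_per_bucket(n):
    """Stratified sample: an independent random position in each bucket."""
    skip = N // n
    return [i * skip + random.randrange(skip) for i in range(n)]

def estimate(idx):
    """Prevalence estimate from the sampled ranks."""
    return sum(labels[i] for i in idx) / len(idx)

for n in (1000, 500, 100, 50):
    err_sys = abs(estimate(systematic(n)) - true_prevalence)
    err_bkt = abs(estimate(one_per_bucket(n)) - true_prevalence)
    print(f"n={n:4d}  systematic err={err_sys:.4f}  per-bucket err={err_bkt:.4f}")
```

On synthetic data like this, the two schemes typically track each other closely at large n, with error growing as the sample shrinks, which is the kind of sweep the comment proposes running on real rankings.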

Jeremy Pickens:

And just for the record:

“…conceptually equivalent by wrapping imaginary, post hoc, evenly-spaced buckets around your systematic sample…”

“You’re going to have documents at one end of the ranking that aren’t in a bucket, and at the other end a bucket that falls off the end of the documents (unless you just happen to pick your random N to be half the width of a bucket). It may sound like I’m nitpicking, but this is actually significant as you’ll see below.”

What I meant here wasn’t that you are placing evenly-spaced buckets around your systematic sample, with the middle of each bucket centered on your randomly selected rank k. No, what I meant was that the buckets are evenly spaced and overlaid on top of the collection, such that the systematic random sample hits once in each bucket, though of course in the same place in every bucket.

In other words, my imaginary buckets are exactly the same (same size, same partitioning of the ranking) as your buckets. But my sample always hits at rank k within each bucket, whereas yours hits at a random position in each bucket.

Is that clearer?
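A concrete way to see the two schemes side by side (my toy illustration, with assumed sizes): same buckets in both cases, but the systematic sample reuses one shared offset k while the stratified sample draws a fresh offset for each bucket.

```python
import random

random.seed(0)
bucket_size = 100
n_buckets = 5
k = random.randrange(bucket_size)  # one shared offset for the whole sample

# Systematic: rank k within every bucket.
systematic = [b * bucket_size + k for b in range(n_buckets)]
# Stratified ("one random per bucket"): a new random offset per bucket.
stratified = [b * bucket_size + random.randrange(bucket_size)
              for b in range(n_buckets)]

# Every systematic pick sits at the same within-bucket offset.
assert len({i % bucket_size for i in systematic}) == 1
print("systematic:", systematic)
print("stratified:", stratified)
```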

Bill Dimm:

Yes, much clearer. Your buckets are exactly the same as my buckets have always been. Your goal is to estimate how many of the documents in each bucket are relevant. As you move from low rank to high rank within a bucket, you expect the density of relevant documents (i.e. the “probability” of the document being relevant, if you prefer, although there is really nothing random here — a document is either relevant or it isn’t) to decrease, reflecting the downward-sloping precision-recall curve.

You are going to estimate the number of relevant documents in the bucket in a very crude way, which is to review just one of them. The best you can hope for with that estimate (since you are going to get 0 or 1, i.e. a yes or a no, not 0.72) is that the probability of the document being relevant reflects the average density across the whole bucket. Which single document best reflects the average? To the extent that the density is linear within the bucket (non-linear terms will be less and less important relative to linear the smaller the bucket is), the best choice you can make without knowing/assuming a shape for the density is the center of the bucket.

If you pick k to be from the lower rank end of the bucket, you will get a 1 (a responsive doc) more often than you should, and if you pick it to be above the center you’ll get a 0 more often than you should. If you pick k randomly for each bucket, those too-high/too-low errors tend to cancel. If you pick k to be the center of the bucket, you avoid the issue as much as possible (without making strong assumptions about the shape of the density as a function of rank). If you pick k to always be the same non-center value, you introduce an error that can only be ignored if the number of buckets is big enough. Why go down that road?
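The center-of-bucket argument can be checked directly under an assumed linear density (my sketch; the numbers are arbitrary): the expected value of the 0/1 review outcome at offset j is just the density at j, so only an offset near the center reproduces the bucket's average.

```python
B = 100  # assumed bucket size
# Assumed linear, decreasing relevance density across the bucket.
density = [0.9 - 0.8 * j / (B - 1) for j in range(B)]

avg = sum(density) / B        # true fraction of relevant docs in the bucket
center = density[B // 2]      # expected outcome when sampling near the center
low_end = density[0]          # expected outcome when always sampling rank 1
random_k = sum(density) / B   # expected outcome over a uniformly random k

assert abs(center - avg) < 0.01   # near-unbiased
assert low_end - avg > 0.3        # strong overestimate of relevance
assert random_k == avg            # random k is unbiased in expectation
```

Here the fixed low-rank offset inflates the estimate by 0.4 while the center is off by under 0.005, matching the point that a fixed non-center offset needs many buckets to wash out.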

Jeremy Pickens:

“Your buckets are exactly the same as my buckets have always been.”

Or conversely, I could also say that your bucketed approach is the same as my systematic approach has always been, just with a little bit of the jitters: a limited-scope random perturbation around where each next sample in the sequence drops 🙂

“If you pick k to be the center of the bucket you avoid the issue as much as possible (without making strong assumptions about the shape of the density as a function of rank). If you pick k to always be the same non-center value you introduce an error that can only be ignored if the number of buckets is big enough. Why go down that road?”

I’d still like to see, empirically, the effect of that error. One thing that’s nice about the systematic approach, as opposed to the random-perturbation approach, is that by sliding your starting point from 1 all the way up to your skip size (aka imaginary bucket size), you can actually get a complete, non-sampled estimate of that error. I.e., there are a tractable number of possible sampling outcomes, due to the systematic nature of the sampling process, which allows us to test every single one, whether it starts at rank 1, at rank 10000, or in the very middle at rank 5000.
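That exhaustive enumeration is cheap to sketch: with skip size S there are only S possible systematic samples, so the whole error distribution can be computed rather than sampled. (My sketch, on synthetic labels; the sizes are assumed.)

```python
import random

random.seed(1)
N, S = 10_000, 100  # assumed collection size and skip (= bucket) size
labels = [1 if random.random() < 0.3 else 0 for _ in range(N)]
truth = sum(labels) / N

# Enumerate every possible systematic sample: start = 0 .. S-1.
estimates = [sum(labels[i] for i in range(start, N, S)) / (N // S)
             for start in range(S)]

worst_error = max(abs(e - truth) for e in estimates)
print(f"{S} possible samples, worst-case error {worst_error:.4f}")
```

The mean of all S estimates equals the true prevalence exactly, so this enumeration characterizes the scheme's full error distribution, something a random-offset-per-bucket scheme cannot offer.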

Jeremy Pickens:

Correction: “You either stay slightly above or slightly below the whole way across, and arrive at the bucket that contains the nth percentile recall at essentially the same point.”

Not kth percentile. Messed up my notation there.

Jeremy Pickens:

And one other issue that should be addressed. This is something that I’ve been discussing with William Webber recently. Whether or not there is a 1/M error, realize that there are two ways of expressing that error: (1) in terms of precision, and (2) in terms of recall.

By that I mean it’s important to keep in mind that, if one is indeed off, one should measure how much one is off not in terms of raw rank, but in terms of the final goal.

Because again, being off by 1/M documents is not the same as being off by 1/M responsive documents. So if your goal is a (proportionality-derived) 80% recall, and you’re in that Mth bucket where the 80th percentile is found, responsive documents are much more sparse than they are in the 20th-percentile-recall bucket — again because you’re ranking by descending responsiveness. Being sparse means that even if you miss, you’re not going to miss by as much as you would if responsiveness were evenly distributed throughout the list. Know what I’m saying?
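A toy calculation (my numbers, purely illustrative) makes this concrete: with per-bucket responsive counts falling as rank grows, a one-bucket miss near the 80%-recall point costs far fewer responsive documents than the same miss near the top of the ranking.

```python
# Assumed per-bucket responsive-document counts for a decreasing-density ranking.
buckets = [900, 700, 500, 300, 150, 80, 40, 20, 10, 5]
total = sum(buckets)

# Find the bucket in which cumulative recall first reaches 80%.
cum = 0
for m, count in enumerate(buckets):
    cum += count
    if cum / total >= 0.8:
        break

print("responsive docs in top bucket:", buckets[0])
print(f"responsive docs in the ~80%-recall bucket (index {m}):", buckets[m])
```

With these assumed counts, the 80%-recall bucket holds a third as many responsive documents as the top bucket, so a rank-level miss there translates into a much smaller recall-level miss.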

Anyway, just something to think about.
