Predictive Ranking (TAR) for Smart People

Predictive Ranking, aka predictive coding or technology-assisted review, has revolutionized electronic discovery–at least in mindshare if not actual use. It has dominated the dais at discovery programs since 2012, when the first judicial decisions approving the process came out. Its promise of dramatically reduced review costs is top of mind for general counsel today. For review companies, the worry is about declining business once these concepts really take hold.

While there are several “Predictive Coding for Dummies” books on the market, I still see a lot of confusion among my colleagues about how this process works. To be sure, the mathematics are complicated, but the techniques and workflow are not that difficult to understand. I write this article with the hope of clarifying some of the more basic questions about TAR methodologies.

I spent over 20 years as a trial lawyer and partner at a national law firm and another 15 at Catalyst. During that time, I met a lot of smart people–but few actual dummies. This article is for smart lawyers and legal professionals who want to learn more about TAR. Of course, you dummies are welcome to read it too.

What is Predictive Ranking?

Predictive Ranking is our name for an interactive process whereby humans train a computer algorithm to identify useful (relevant) documents. We call it Predictive Ranking because the goal of these systems is to rank documents in order of estimated relevance. Humans do the actual coding.

How does it work?

In its simplest form, it works like the Pandora Internet radio service. Pandora has thousands of songs in its archive but no idea what kind of music you like. Its goal is to play music from your favorite artists and also to introduce you to new songs you might like.

[Image: Pandora]

How does Pandora do this? For those who haven’t tried it, you start by giving Pandora the name of one or more artists you like, thus creating a “station.” Pandora begins by playing a song or two by the artists you have selected. Then, it chooses a similar song or artist you didn’t select to see if you like it. You answer by clicking a “thumbs up” or “thumbs down” button. Information retrieval (IR) scientists call this “relevance feedback.”

Pandora analyzes the songs you like, as well as the songs you don't, to make its suggestions. It looks at factors such as melody, harmony, rhythm, form, composition and lyrics to find similar songs. As you give it feedback on its suggestions, it takes that information into account to make better selections the next time. The IR people would call this “training.”

The process continues as you listen to your radio station. The more feedback you provide, the smarter the system gets. The end result is Pandora plays a lot of music you like and, occasionally, something you don’t like.

Predictive Ranking works in a similar way–only you work with documents rather than songs. As you train the system, it gets smarter about which documents are relevant to your inquiry and which are not.[1] It is as simple as that.

OK, but how does Predictive Ranking really work?

Well, it really is just like Pandora, although there are a few more options and strategies to consider. Also, different vendors approach the process in different ways, which can cause some confusion. But here is a start toward explaining the process.

1. Collect the documents you want to review and feed them to the computer.

To start, the computer has to analyze the documents you want to review (or not review), just like Pandora needs to analyze all the music it maintains. While approaches vary, most systems analyze the words in your documents in terms of frequency in the document and across the population.
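
For the technically curious, here is a minimal sketch of what that word analysis might look like, using TF-IDF weighting as one common example. The library and the toy documents are purely illustrative and are not meant to describe any particular vendor's engine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents; real collections run to the millions.
documents = [
    "merger agreement draft attached for review",
    "lunch order for the team meeting",
    "board approved the merger terms last night",
]

# TF-IDF weights each term by how often it appears in a document
# and how rare it is across the whole population.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(documents)

print(features.shape)                          # (3 documents, N distinct terms)
print(vectorizer.get_feature_names_out()[:5])  # a few of the terms it found
```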

Some systems require that you collect all of the documents before you begin training. Others, like our system, allow you to add documents during the training process. Either approach works. It is just a matter of convenience.

2. Start training/review.

You have two choices here. You can start by presenting documents you know are relevant (or non-relevant) to the computer or you can let the computer select documents for your consideration. With Pandora, you typically start by identifying an artist you like. This gives the computer a head start on your preferences. In theory, you could let Pandora select music randomly to see if you liked it but this would be pretty inefficient.

Either way, you essentially begin by giving the computer examples of which documents you like (relevant) and which you don’t like (non-relevant).[2] The system learns from the examples which terms tend to occur in relevant documents and which in non-relevant ones. It then develops a mathematical formula to help it predict the relevance of other documents in the population.
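
Here is a rough sketch of that training step for anyone who wants to see it in code. It assumes a simple TF-IDF representation and a logistic regression model; real systems are more sophisticated, and the documents, labels and model choice are just illustrations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training examples a reviewer has already coded.
labeled_docs = [
    "merger agreement draft",
    "lunch order for the team",
    "board approved the merger terms",
    "holiday party invitation",
]
labels = [1, 0, 1, 0]                       # 1 = relevant, 0 = non-relevant

unlabeled_docs = ["revised merger agreement", "parking garage closed friday"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(labeled_docs)
X_unseen = vectorizer.transform(unlabeled_docs)

# The "mathematical formula" is the fitted model; its output is an
# estimated probability of relevance for each unseen document.
model = LogisticRegression().fit(X_train, labels)
scores = model.predict_proba(X_unseen)[:, 1]

# Ranking is just sorting by that score, highest first.
for doc, score in sorted(zip(unlabeled_docs, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```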

There is an ongoing debate about whether training examples must be provided by subject matter experts (SMEs) to be effective. Our research suggests that review teams assisted by SMEs are just as effective as SMEs alone. See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? Others disagree. See, for example, Ralph Losey's posts about the need for SMEs to make the process effective.

[Infographic: the Predictive Ranking workflow]

3. Rank the documents by relevance.

This is the heart of the process. Based on the training you have provided, the system creates a formula which it uses to rank (order) your documents by estimated relevance.

4. Continue training/review (rinse and repeat).

Continue training using your SME or review team. Many systems will suggest additional documents for training, which will help the algorithm get better at understanding your document population. For the most part, the more training/review you do, the better the system will be at ranking the unseen documents.
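
How a given system picks the next documents to suggest varies by vendor. Uncertainty sampling, sketched below, is one approach from the information retrieval literature; it is offered purely as an illustration, not as a description of any particular product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def suggest_next(model, X_pool, pool_ids, batch_size=2):
    """Suggest the pool documents the current model is least certain about."""
    probs = model.predict_proba(X_pool)[:, 1]
    uncertainty = np.abs(probs - 0.5)          # closest to 0.5 = least certain
    return [pool_ids[i] for i in np.argsort(uncertainty)[:batch_size]]

docs = ["merger agreement draft", "lunch order", "merger closing checklist",
        "parking notice", "merger side letter", "company picnic"]
judgments = {0: 1, 1: 0}                       # seed examples: doc 0 relevant, doc 1 not

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

for round_no in range(2):                      # each round: retrain, then ask for more coding
    train_ids = sorted(judgments)
    model = LogisticRegression().fit(X[train_ids], [judgments[i] for i in train_ids])
    pool_ids = [i for i in range(len(docs)) if i not in judgments]
    for doc_id in suggest_next(model, X[pool_ids], pool_ids):
        # In practice a human reviewer supplies this judgment.
        judgments[doc_id] = 1 if "merger" in docs[doc_id] else 0
```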

5. Test the ranking.

How good a job did the system do on the ranking? If the ranking is “good enough,” move forward and finish your review. If it is not, continue your training.

Some systems view training as a process separate from review. Following this approach, your SMEs would handle the training until they were satisfied that the algorithm was fully trained. They would then let the review teams look at the higher-ranked documents, possibly discarding those below a certain threshold as non-relevant.

Our research suggests that a continuous learning process is more effective. We therefore recommend that you feed reviewer judgments back to the system for a process of continuous learning. As a result, the algorithm continues to get smarter, which can mean even fewer documents need to be reviewed. See: TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?

6. Finish the review.

The end goal is to finish the review as efficiently and cost-effectively as possible. In a linear review, you typically review all of the documents in the population. In a predictive review, you can stop well before then because the important documents have been moved to the front of the queue. You save on both review costs and the time it takes to complete the review.

Ultimately, “finishing” means reviewing down the ranking until you have found enough relevant documents, with the concept of proportionality taking center stage. Thus, you might stop after reviewing the first 20% of the ranking because you have found 80% of the relevant documents. Your argument is that the cost of reviewing the remaining 80% of the document population just to find the last 20% of the relevant documents is unduly burdensome.[3]
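
The arithmetic behind that argument is simple. All of the numbers below are hypothetical, but they show the shape of the trade-off.

```python
# All of these numbers are hypothetical.
population   = 100_000        # documents collected
prevalence   = 0.15           # estimated richness: 15% relevant
recall_goal  = 0.80           # fraction of relevant documents you aim to find
review_depth = 0.20           # fraction of the ranking reviewed to get there

relevant_total  = population * prevalence          # ~15,000 relevant documents
docs_reviewed   = population * review_depth        # 20,000 documents reviewed
relevant_found  = relevant_total * recall_goal     # 12,000 relevant found
docs_remaining  = population - docs_reviewed       # 80,000 documents left
relevant_missed = relevant_total - relevant_found  # 3,000 relevant left behind

print(f"Reviewed {docs_reviewed:,.0f} documents and found {relevant_found:,.0f} relevant.")
print(f"Finding the last {relevant_missed:,.0f} would mean wading through "
      f"up to {docs_remaining:,.0f} more documents.")
```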

That’s all there is to it. While there are innumerable choices in applying the process to a real case, the rest is just strategy and execution.

How do I know if the process is successful?

That, of course, is the million-dollar question. Fortunately, the answer is relatively easy.

The process succeeds to the extent that the document ranking places more relevant documents at the front of the pack than you might get when the documents are ordered by other means (e.g. by date or Bates number). How successful you are depends on the degree to which the Predictive Ranking is better than what you might get using your traditional approach.

Let me offer an example. Imagine your documents are represented by a series of cells, as in the below diagram. The orange cells represent relevant documents and the white cells non-relevant.

[Figure: Random distribution of relevant (orange) and non-relevant (white) documents]

What we have is essentially a random distribution, or at least no discernible pattern to where the relevant documents fall. In that regard, this might be similar to a review where you ordered documents by Bates number or date. In most cases, there is no reason to expect that relevant documents would appear at the front of the order.

This is typical of a linear review. If you review 10% of the documents, you likely will find 10% of the relevant documents. If you review 50%, you will likely find 50% of the relevant documents.

Take a look at this next diagram. It represents the outcome of a perfect ordering. The relevant documents come first followed by non-relevant documents.

[Figure: Perfect ranking, with all relevant documents first]

If you could be confident that the ranking worked perfectly, as in this example, it is easy to see the benefit of ordering by rank. Rather than review all of the documents to find relevant ones, you could simply review the first 20% and be done. You could confidently ignore the remaining 80% (perhaps after sampling them) or, at least, direct them to a lower-priced review team.

Yes, but what is the ranking really like?

Since this is directed at smart people, I am sure you realize that computer rankings are never that good. At the same time, they are rarely (if ever) as bad as you might see in a linear review.

Following our earlier examples, here is how the actual ranking might look using Predictive Ranking:

[Figure: Actual ranking produced by Predictive Ranking]

We see that the algorithm certainly improved on the random distribution, although it is far from perfect. We have 30% of the relevant documents at the top of the order, followed by an increasing mix of non-relevant documents. At about a third of the way into the review, you would start to run out of relevant documents.

This would be a success by almost any measure. If you stopped your review at the midway point, you would have seen all but one relevant document. By cutting out half the document population, you would save substantially on review costs.

How do I measure success?

If the goal of Predictive Ranking is to arrange a set of documents in order of likely relevance to a particular issue, the measure of success is the extent to which you meet that goal. Put as a question: “Am I getting more relevant documents at the start of my review than I would with my typical approach (often a linear review)?”[4] If the answer is yes, then how much better?

To answer these questions, we need to take two additional steps. First, for comparison purposes, we will want to measure the “richness” of the overall document population. Second, we need to determine how effective our ranking system turned out to be against the entire document population.

1. Estimating richness: Richness is a measure of how many relevant documents are in your total document population. Some people call this “prevalence,” a reference to how prevalent relevant documents are in the total population. For example, we might estimate that 15% of the documents are relevant, with 85% non-relevant. Or we might say document prevalence is 15%.

How do we estimate richness? Once the documents are assembled, we can use random sampling for this purpose. In general, a random sample allows us to look at a small subset of the document population, and make predictions about the nature of the larger set.[5] Thus, from the example above, if our sample found 15 documents out of a hundred to be relevant, we would project a richness of 15%. Extrapolating that to the larger population (100,000 for example), we might estimate that there were about 15,000 relevant documents to be found.
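
Here is that estimate expressed as a few lines of code. The population size and sample results are the hypothetical numbers from the example above.

```python
import random

population_ids = list(range(100_000))          # the full document population
sample_ids = random.sample(population_ids, 100)

# In practice a reviewer codes each sampled document. Here we simply assume
# 15 of the 100 came back relevant, matching the example in the text.
relevant_in_sample = 15
prevalence = relevant_in_sample / len(sample_ids)      # 0.15

estimated_relevant = prevalence * len(population_ids)  # ~15,000 documents
print(f"Estimated richness: {prevalence:.0%} "
      f"(roughly {estimated_relevant:,.0f} relevant documents)")
```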

For those really smart people who understand statistics, I am skipping a discussion about confidence intervals and margins of error. Let me just say that the larger the sample size, the more confident you can be in your estimate. But, surprisingly, the sample size does not have to be that large to provide a high degree of confidence.
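
For those who want a peek at the math anyway, here is the standard normal-approximation margin of error at roughly 95% confidence, applied to our hypothetical 15% sample prevalence. Notice how slowly the margin shrinks as the sample grows.

```python
import math

def margin_of_error(p, n, z=1.96):       # z = 1.96 for roughly 95% confidence
    """Normal-approximation margin of error for a sample proportion."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.15                                 # observed prevalence in the sample
for n in (100, 400, 1_600, 6_400):
    print(f"sample of {n:>5}: 15% ± {margin_of_error(p, n):.1%}")

# Quadrupling the sample only halves the margin of error, which is why
# modest samples already give a respectable level of confidence.
```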

[Figure: Systematic random sampling]

2. Evaluating the ranking: Once the documents are ranked, we can then sample the ranking to determine how well our algorithm did in pushing relevant documents to the top of the stack. We do this through a systematic random sample.

In a systematic random sample, we sample the documents in their ranked order, tagging them as relevant or non-relevant as we go. Specifically, we sample every Nth document from the top to the bottom of the ranking (e.g. every 100th document). Using this method helps ensure that we are looking at documents across the ranking spectrum, from highest to lowest.
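
In code, a systematic random sample is about as simple as sampling gets. The ranking size, step and random starting offset below are illustrative.

```python
import random

ranked_doc_ids = list(range(100_000))    # documents in ranked order, best first
step = 100                               # look at every 100th document
start = random.randrange(step)           # random offset keeps the sample unbiased

systematic_sample = ranked_doc_ids[start::step]
print(len(systematic_sample))            # 1,000 documents spread across the ranking
```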

As an aside, you can actually use a systematic random sample to determine overall richness/prevalence and to evaluate the ranking. Unless you need an initial richness estimate, say for review planning purposes, we recommend you do both steps at the same time.

You can read more about simple and systematic random sampling in an earlier article I wrote, Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?

Comparing the results

We can compare the results of the systematic random sample to the richness of our population by plotting what scientists call a “yield curve.” While this may sound daunting, it is really rather simple. It is the one diagram you should know about if you are going to use Predictive Ranking.

[Figure: Yield curve for a linear review]

A yield curve can be used to show the progress of a review and the results it yields, at least in the number of relevant documents found. The X axis shows the percentage of documents reviewed (or to be reviewed). The Y axis shows the percentage of relevant documents found (or that you would expect to find) at any given point in the review.

Linear review: Knowing that the document population is 15% rich (give or take) provides a useful baseline against which we can measure the success of our Predictive Ranking effort. We plot the expected progress of a linear review as a diagonal line running from zero to 100%. It reflects the fact that, in a linear review, we expect the percentage of relevant documents found to track the percentage of total documents reviewed.

Following that notion, we can estimate that if the team were to review 10% of the document population, they would likely see 10% of the relevant documents. If they were to look at 50% of the documents, we would expect them to find 50% of the relevant documents, give or take. If they wanted to find 80% of the relevant documents, they would have to look at 80% of the entire population.

Predictive Review: Now let’s plot the results of our systematic random sample. The purpose is to show how the review might progress if we reviewed documents in a ranked order, from likely relevant to likely non-relevant. We can easily compare it to a linear review to measure the success of the Predictive Ranking process.

[Figure: Yield curve for a predictive review compared to a linear review]

You can quickly see that the line for the Predictive Review goes up more steeply than the one for linear review. This reflects the fact that in a Predictive Review the team starts with the most likely relevant documents. The line continues to rise until you hit the 80% relevant mark, which happens after a review of about 10-12% of the entire document population. The slope then flattens, particularly as you cross the 90% relevant line. That reflects the fact that you won’t find as many relevant documents from that point onward. Put another way, you will have to look through a lot more documents before you find your next relevant one.
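
If you are curious how a yield curve comes together from the systematic sample, here is a sketch. The sample tags are fabricated to roughly echo the curve described above: about 15% prevalence, with most of the relevant documents concentrated near the top of the ranking.

```python
# A 1,000-document systematic sample in ranked order; True = relevant.
sample_tags = ([True] * 120                    # the top of the ranking
               + ([True] + [False] * 4) * 15   # a thinning middle
               + ([True] + [False] * 51) * 15  # a long, sparse tail
               + [False] * 25)

total_relevant = sum(sample_tags)              # 150 of 1,000, i.e. 15% prevalence
found = 0
for depth, tag in enumerate(sample_tags, start=1):
    found += tag
    if tag and found in (int(0.80 * total_relevant), int(0.90 * total_relevant)):
        print(f"{found / total_relevant:.0%} recall at {depth / len(sample_tags):.0%} "
              f"of the review (a linear review would need {found / total_relevant:.0%}).")
```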

We now have what we need to measure the success of our Predictive Ranking project. To recap, we needed:

  1. A richness estimate so we have an idea of how many relevant documents are in the population.
  2. A systematic random sample so we can estimate how many relevant documents got pushed to the front of the ordering.

It is now relatively easy to quantify success. As the yield curve illustrates, if I engage in a Predictive Review, I will find about 80% of the relevant documents after reviewing only about 12% of the total documents. If I wanted to find 90% of the relevant documents, I could stop after reviewing just over 20% of the population. My measure of success would be the savings achieved over a linear review.[6]
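
Put in concrete (and, again, hypothetical) numbers, the savings look something like this.

```python
# Hypothetical numbers: reaching 80% recall takes ~12% of the ranking in a
# predictive review versus ~80% of the population in a linear review.
population   = 100_000
cost_per_doc = 2.00                     # assumed, purely illustrative, per-document cost

predictive_docs = int(population * 0.12)
linear_docs     = int(population * 0.80)
saved_docs      = linear_docs - predictive_docs

print(f"Documents not reviewed: {saved_docs:,}")
print(f"Review cost avoided:    ${saved_docs * cost_per_doc:,.0f}")
```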

At this point we move into proportionality arguments. What is the right stopping point for our case? The answer depends on the needs of your case, the nature of the documents and any stipulated protocols among the parties. At the least, the yield curve helps you frame the argument in a meaningful way.

Moving to the advanced class

My next post will take this discussion to a higher level, addressing some of the advanced questions that dog our industry. For a sneak peek at my thinking, take a look at a few of the articles we have already posted on the results of our research. I think you now have a foundation for understanding these and just about any other article on the topic you might find.

I hope this was helpful. Post your questions below. I will try to answer them (or pass them on to our advisory board for their thoughts).


[1] IR specialists call these documents “relevant” but they do not mean relevant in a legal sense. They mean important to your inquiry even though you may not plan on introducing them at trial. You could substitute hot, responsive, privileged or some other criterion depending on the nature of your review.

[2] I could use “irrelevant” but that has a different shade of meaning for the IR people so I bow to their use of non-relevant here. Either word works for this discussion.

[3] Sometimes at the meet-and-confer, the parties agree on Predictive Ranking protocols, including the relevance score that will serve as the cut-off for review.

[4] I will use a linear review (essentially a random relevance ordering) as a baseline because that is the way most reviews are done. If you review based on conceptual clusters or some other method, your baseline for comparison would be different.

[5] Note that an estimate based on a random sample is not valid unless you are sampling against the entire population. If you get new documents, you have to redo your sample.

[6] In a separate post we will argue that the true measure of success with Predictive Ranking is the total amount saved on the review, taking into consideration software and hardware along with human costs. Time savings is also an important factor. IR scientist William Webber has touched on this point here: Total annotation cost should guide automated review.

3 thoughts on “Predictive Ranking (TAR) for Smart People”

  1. Tim Slattery

    Thank you, Mr. Tredennick, for this clear, useful explanation. Your Pandora analogy and simple graphics are especially effective. I have used Catalyst and know that such systems can increase litigator productivity by orders of magnitude. My question concerns the effects of time and human learning on the specification of relevance. Suppose that a new discovery matter and a trial attorney SME exist in parallel universes: In our universe, SME-a’s knowledge of the case is limited to pleadings and preliminary information provided by Client. In the alternate universe, SME-b is capable of time travel and has returned from the future (post-discovery) with complete knowledge of the case. Assuming identical doc populations were processed in both universes, would the prevalence / richness rate be the same? Would the predictive coding yield curve be different? If Client’s disclosure is challenged as incomplete (at or after the close of discovery), will accuracy of the process be judged from the perspective of SME-a or SME-b?

    1. John Tredennick

      Thanks for the kind words, Tim.

      I believe that the answer to your question is that the results should be similar with the two SMEs, for two reasons:

      1. The judgment is on the document. While having extra knowledge would always be helpful, I am not convinced it would make that big a difference in how you view the individual documents.

      2. The yield curves are built around judgments on a lot of documents and are not dramatically affected by a few differences in judgment. You can read more about that in our research on whether SMEs are needed for training.

      Assuming the SMEs in your hypothetical were equally competent, I believe they would do similar jobs on the training.

      And, ultimately, nobody is or should be judged based on the benefit of hindsight.
