Last month, two of the leading experts on e-discovery, Maura R. Grossman and Gordon V. Cormack, presented a peer-reviewed study on continuous active learning, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” at the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).
In the study, they compared three TAR protocols, testing them across eight different cases. Two of the three protocols, Simple Passive Learning (SPL) and Simple Active Learning (SAL), are typically associated with early approaches to predictive coding, which we call TAR 1.0. The third, continuous active learning (CAL), is a central part of a newer approach to predictive coding, which we call TAR 2.0.
Based on their testing, Grossman and Cormack concluded that CAL demonstrated superior performance over SPL and SAL, while avoiding certain other problems associated with these traditional TAR 1.0 protocols. Specifically, in each of the eight case studies, CAL reached higher levels of recall (finding relevant documents) more quickly and with less effort than the TAR 1.0 protocols.
Not surprisingly, their research caused quite a stir in the TAR community. Supporters heralded its common-sense findings, particularly the conclusion that random training was the least efficient method for selecting training seeds. (See, e.g., Latest Grossman and Cormack Study Proves Folly of Using Random Search for Machine Training, by Ralph Losey.) Detractors challenged their results, arguing that using random seeds for training worked fine with their TAR 1.0 software and eliminated bias. (See, e.g., Random Sampling as an Effective Predictive Coding Training Strategy, by Herbert L. Roitblat.) We were pleased that it confirmed our earlier research and legitimized what for many is still a novel approach to TAR review.
So why does this matter? The answer is simple. CAL matters because saving time and money on review is important to our clients. The more the savings, the more it matters.
TAR 1.0: Predictive Coding Protocols
To better understand how CAL works and why it produces better results, let’s start by taking a look at TAR 1.0 protocols and their limitations. A typical TAR 1.0 review proceeds as follows:
- A subject matter expert (SME), often a senior lawyer, reviews and tags a random sample (500+ documents) to use as a control set for training.
- The SME then begins a training process using Simple Passive Learning or Simple Active Learning. In either case, the SME reviews documents and tags them relevant or non-relevant.
- The TAR engine uses these judgments to build a classification/ranking algorithm that will find other relevant documents. It tests the algorithm against the already-tagged control set to gauge its accuracy in identifying relevant documents.
- Depending on the testing results, the SME may be asked to do more training to help improve the classification/ranking algorithm.
- This training and testing process continues until the classifier is “stable.” That means its search algorithm is no longer getting better at identifying relevant documents in the control set. There is no point in further training relative to the control set.
The next step is for the TAR engine to run its classification/ranking algorithm against the entire document population. The SME can then review a random sample of ranked documents to determine how well the algorithm did in pushing relevant documents to the top of the ranking. The sample will help tell the review administrator how many documents will need to be reviewed to reach different recall rates.
The review team can then be directed to look at documents with relevance scores higher than the cutoff point. Documents below the cutoff point can be discarded.
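The cutoff calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's implementation: the function name depth_for_recall and the sample data are hypothetical, and the estimate ignores sampling error, which a real review would need to account for.

```python
import math

def depth_for_recall(sample, target_recall):
    """Estimate how deep into the ranking a team must review to hit target_recall.

    sample: list of (rank, is_relevant) pairs for randomly sampled documents,
            where rank 1 is the top of the ranking (hypothetical input format).
    Returns the rank of the sampled document at which the cumulative share of
    sampled relevant documents first reaches target_recall, or None if the
    sample contains no relevant documents.
    """
    ordered = sorted(sample)
    total_relevant = sum(1 for _, relevant in ordered if relevant)
    if total_relevant == 0:
        return None
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for rank, relevant in ordered:
        if relevant:
            found += 1
            if found >= needed:
                return rank
    return None

# Hypothetical sample drawn from a ranked population.
sample = [(10, True), (50, True), (200, False), (400, True),
          (900, False), (1500, True), (5000, False)]
cutoff = depth_for_recall(sample, 0.75)  # 3 of the 4 sampled relevant docs sit at rank 400 or above
```

Under these assumptions, documents ranked at or above the returned depth go to the review team and the rest fall below the cutoff.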
Even though training is initially iterative, it is a finite process. Once your classifier has learned all it can about the 500+ documents in the control set, that’s it. You simply turn it loose to rank the larger population (which can take hours to complete) and then divide the documents into categories to review or not review.
The goal, to be sure, is for the review population to be smaller than the remainder. Savings come from not having to review all of the documents.
SPL and SAL: Simple TAR 1.0 Training Protocols
Grossman and Cormack tested two training protocols used in the TAR 1.0 methodology: Simple Passive Learning and Simple Active Learning.
Simple Passive Learning uses random documents for training. Grossman and Cormack did not find this approach to be particularly effective:
The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents.
Common sense supports their conclusion. The quicker you can present relevant documents to the system, the faster it should learn about your documents.
We have also written about this issue and made similar arguments about the efficacy of random training. See Is Random the Best Road for Your Car? Or is there a Better Route to Your Destination? and Comparing Active Learning to Random Sampling using Zipf’s Law to Evaluate Which is More Effective for TAR.
Simple Active Learning does not rely on random documents. Instead, it suggests starting with whatever relevant documents you can find, often through keyword search, to initiate the training. From there, the computer presents additional documents designed to help train the algorithm. Typically the system selects documents it is least sure about, often from the boundary between relevance and non-relevance. In effect, the machine learning algorithm is trying to figure out where to draw the line between the two based on the documents in the control set you created to start the process.
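Selecting the documents the system is "least sure about" is often called uncertainty sampling. Here is a minimal sketch, assuming the classifier emits a relevance score between 0 and 1 for each document so that 0.5 marks the boundary; the function name and the scores are hypothetical.

```python
def uncertainty_sample(scores, batch_size):
    """Return the ids of the batch_size documents whose scores are closest to 0.5.

    scores: dict of doc_id -> relevance score in [0, 1] (assumed format).
    Ties are broken by doc_id so the result is deterministic.
    """
    by_uncertainty = sorted(
        scores, key=lambda doc_id: (abs(scores[doc_id] - 0.5), doc_id)
    )
    return by_uncertainty[:batch_size]

# Hypothetical scores from a partially trained classifier.
scores = {"d1": 0.97, "d2": 0.52, "d3": 0.08, "d4": 0.45, "d5": 0.71}
next_batch = uncertainty_sample(scores, 2)  # the two documents nearest the boundary
```

Note that the confidently relevant document d1 is never chosen, which is why the SME ends up reading marginal documents rather than the most likely relevant ones.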
As Grossman and Cormack point out, this means that the SME spends a lot of time looking at marginal documents in order to train the classifier. And keep in mind that the classifier is training against a relatively small number of documents chosen by your initial random sample. There is no statistical reason to think these documents are representative of the larger population, and they likely are not. We have written recently about the issue of topical coverage of random samples here: Comparing Active Learning to Random Sampling using Zipf’s Law to Evaluate Which is More Effective for TAR.
Grossman and Cormack concluded that Simple Active Learning performed better than Simple Passive Learning. However, Simple Active Learning was found to be less effective than continuous active learning.
Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of “stabilization” – determining when training is adequate, and therefore may stop.
Thus, both of the TAR 1.0 protocols, SPL and SAL, were found to be less effective at finding relevant documents than CAL.
Practical Problems with TAR 1.0 Protocols
Whether you use the SPL or the SAL protocol, the TAR 1.0 process comes with a number of practical problems when applied to “real world” discovery.
One Bite at the Apple: The first, and most relevant to a discussion of continuous active learning, is that you get only “one bite at the apple.” (See TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?) Once the team gets going on the review set, there is no opportunity to feed back their judgments on review documents and improve the classification/ranking algorithm. Improving the algorithm means the review team will have to review fewer documents to reach any desired recall level.
SMEs Required: A second problem is that TAR 1.0 generally requires a senior lawyer or subject-matter expert (SME) for training. Expert training requires the lawyer to review thousands of documents to build a control set, to train and then test the results. Not only is this expensive, but it delays the review until you can convince your busy senior attorney to sit still and get through the training. I wrote about these problems in this post.
Rolling Uploads: Going further, the TAR 1.0 approach does not handle rolling uploads well and does not work well for low richness collections, both of which are common in e-discovery. New documents render the control set invalid because they were not part of the random selection process. That typically means going through new training rounds.
Low Richness: The problem with low richness collections is that it can be hard to find good training examples based on random sampling. If richness is below 1%, you may have to review several thousand documents just to find enough relevant ones to train the system. Indeed, this issue is sufficiently difficult that some TAR 1.0 vendors suggest their products shouldn’t be used for low richness collections.
TAR 2.0 Predictive Coding Protocols
With TAR 2.0, these real-world problems go away, partly due to the nature of continuous learning and partly due to the continuous ranking process required to support continuous learning. Taken together, continuous learning and continuous ranking form the basis of the TAR 2.0 approach, not only saving on review time and costs but making the process more fluid and flexible in the bargain.
Our TAR 2.0 engine is designed to rank millions of documents in minutes. As a result, we rank every document in the collection each time we run a ranking. That means we can continuously integrate new judgments by the review team into the algorithm as their work progresses.
Because our engine can rank all of the documents, there is no need to use a control set for training. Training success is based on ranking fluctuations across the entire set, rather than a limited set of randomly-selected documents. When document rankings stop changing, the classification/ranking algorithm has settled, at least until new documents arrive.
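One simple way to picture "ranking fluctuations" is to compare each document's position in two successive rankings. The sketch below is an illustration only: the text does not say how fluctuation is actually measured, so the metric (average absolute change in rank position) and the function name are assumptions.

```python
def mean_rank_shift(previous, current):
    """Average absolute change in position for documents present in both rankings.

    previous, current: lists of doc ids ordered from most to least relevant.
    Documents that appear only in one ranking (for example, newly uploaded
    ones) are ignored by this hypothetical metric.
    """
    prev_pos = {doc: i for i, doc in enumerate(previous)}
    shifts = [
        abs(i - prev_pos[doc]) for i, doc in enumerate(current) if doc in prev_pos
    ]
    return sum(shifts) / len(shifts) if shifts else 0.0

before = ["a", "b", "c", "d"]
after_more_review = ["b", "a", "c", "d"]       # two documents swapped places
after_upload = ["a", "e", "b", "c", "d"]       # new document "e" joined the ranking
```

A value that trends toward zero over successive rankings would suggest the ranking has settled; a jump after an upload would reflect the newcomers pushing existing documents around.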
This solves the problem of rolling uploads. Because we don’t use a control set for training, we can integrate rolling document uploads into the review process. When you add new documents to the mix, they simply join in the ranking process and become part of the review.
Depending on whether the new documents are different from or similar to documents already in the population, they may integrate into the rankings immediately or instead fall to the bottom. In the latter case, we pull samples from the new documents through our contextual diversity algorithm for review. As the new documents are reviewed, they integrate further into the ranking.
You can see an illustration of the initial fluctuation of new documents in this example from Insight Predict. The initial review moved forward until the classification/ranking algorithm was pretty well trained.
New documents were added to the collection midway through the review process. Initially the population rankings fluctuated to accommodate the newcomers. Then, as representative samples were identified and reviewed, the population settled down to stability.
For more on contextual diversity, see below or our recent article comparing contextual diversity with random sampling: Comparing Active Learning to Random Sampling using Zipf’s Law to Evaluate Which is More Effective for TAR.
Continuous Active Learning
There are two aspects to continuous active learning. The first is that the process is “continuous.” Training doesn’t stop until the review finishes. The second is that the training is “active.” That means the computer feeds documents to the review team with the goal of making the review as efficient as possible (minimizing the total cost of review).
Although our software will support a TAR 1.0 process, we have long advocated continuous learning as the better alternative. Simply put, as the reviewers progress through documents in our system, we feed their judgments back to the system to be used as seeds in the next ranking process. Then, when the reviewers ask for a new batch, the documents are presented based on the latest completed ranking. To the extent the ranking has improved by virtue of the additional review judgments, they receive better documents than they otherwise would had the learning stopped after “one bite at the apple.”
In effect, the reviewers become the trainers and the trainers become reviewers. Training is review, we say. And review is training.
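The loop described above (rank, review the top of the ranking, feed the judgments back, rank again) can be sketched as a toy simulation. Everything here is a simplifying assumption for illustration: the term-counting scorer stands in for a real classifier, the reviewer is a lookup table rather than a person, and none of the names come from our software or from the Grossman and Cormack study.

```python
from collections import Counter

def score(doc_terms, weights):
    """Sum the learned weights of the terms in a document (toy scorer)."""
    return sum(weights[term] for term in doc_terms)

def continuous_active_learning(docs, seeds, reviewer, batch_size=2):
    """Toy CAL loop: rank, review the top batch, feed judgments back, repeat.

    docs: dict of doc_id -> set of terms
    seeds: dict of doc_id -> bool, judgments known before review starts
           (every seed id must also be a key in docs)
    reviewer: function doc_id -> bool, standing in for a human judgment
    Returns the order in which documents were reviewed.
    """
    weights = Counter()
    judged = {}

    def learn(doc_id, relevant):
        # Every judgment, seed or review, immediately updates the model.
        judged[doc_id] = relevant
        for term in docs[doc_id]:
            weights[term] += 1 if relevant else -1

    for doc_id, relevant in seeds.items():
        learn(doc_id, relevant)

    review_order = []
    while len(judged) < len(docs):
        unreviewed = [d for d in docs if d not in judged]
        # Relevance feedback: highest-scoring unreviewed documents come first.
        unreviewed.sort(key=lambda d: (-score(docs[d], weights), d))
        for doc_id in unreviewed[:batch_size]:
            learn(doc_id, reviewer(doc_id))
            review_order.append(doc_id)
    return review_order

docs = {
    "d1": {"merger", "price"}, "d2": {"merger", "board"},
    "d3": {"lunch", "friday"}, "d4": {"price", "board", "offer"},
    "d5": {"lunch", "menu"}, "d6": {"offer", "merger"},
}
truth = {"d1": True, "d2": True, "d3": False, "d4": True, "d5": False, "d6": True}
order = continuous_active_learning(docs, {"d1": True}, truth.get)
```

In this toy run the three remaining relevant documents (d2, d4, d6) are reviewed before either non-relevant one, because each batch of judgments re-ranks what is left.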
Indeed, review team training is all but required for a continuous learning process. It makes little sense to expect a senior attorney to do the entire review, which may involve hundreds of thousands of documents. Rather, SMEs should focus on finding (through search or otherwise) relevant documents to help move the training forward as quickly as possible. They can also be used to monitor the review team, using our QC algorithm designed to surface documents likely to have been improperly tagged. We have shown that this process is as effective as using senior lawyers to do the training and can be done at a lower cost. And, like CAL itself, our QC algorithm also continues to learn as the review progresses.
What are the Savings?
Grossman and Cormack quantified the differences between the TAR 1.0 and 2.0 protocols by measuring the number of documents a team would need to review to get to a specific recall rate. Here, for example, is a chart showing the difference in the number of documents a team would have to review to achieve a 75% level of recall comparing continuous active learning and simple passive learning:
The test results showed that the review team would have to look at substantially more documents using the SPL (random seeds) protocol than CAL. For matter 201, the difference would be 50,000 documents. At $2 a document for review and QC, that would be a savings of $100,000. For matter 203, which is the extreme case here, the difference would be 93,000 documents. The savings from using CAL based on $2 a document would be $186,000.
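The arithmetic behind those figures is easy to check. In this short sketch the helper name is hypothetical; the document counts and the $2 per-document cost are the ones quoted above.

```python
def review_savings(documents_avoided, cost_per_document=2.0):
    """Dollar savings from not having to review documents_avoided documents."""
    return documents_avoided * cost_per_document

matter_201 = review_savings(50_000)  # 50,000 fewer documents at $2 each
matter_203 = review_savings(93_000)  # 93,000 fewer documents at $2 each
```

The same helper can be rerun with your own per-document cost to see how the savings scale.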
Here is another chart that compares all three protocols over the same test set. In this case Grossman and Cormack varied the size of the training sets for SAL and SPL to see what impact it might have on the review numbers. You can see that the results for both of the TAR 1.0 protocols improve with additional training but at the cost of requiring the SME to look at as many as 8,000 documents before beginning training. And, even using what Grossman and Cormack call an “ideal” training set for SAL and SPL (which cannot be identified in advance), CAL beat or matched the results in every case, often by a substantial margin.
We presented our research on the benefits of continuous active learning as well. Like Grossman and Cormack, we found there were substantial savings to be had by continuing the training through the entire review. You can see it in this example:
To read about our research on this issue and the savings that can be achieved by a continuous learning process, see:
- TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?
- 5 Myths About Technology-Assisted Review (Law Technology News).
- Predictive Ranking (TAR) for Smart People.
- The Five Myths of Technology Assisted Review, Revisited.
What about Review Bias?
Grossman and Cormack constructed their CAL protocol by starting with seeds found through keyword search. They then presented documents to reviewers based on “relevance feedback.”
Relevance feedback simply means that the system feeds the highest-ranked documents to the reviewers for their judgment. Of course, what is highly ranked depends on what you tagged before.
Some argue that this approach opens the door to bias. If your ranking is based on documents you found through keyword search, what about other relevant documents you didn’t find? “You don’t know what you don’t know,” they say.
Random selection of training seeds raises the chance of finding relevant documents that are different from the ones you have already found. Right?
Actually, everyone seems to agree on this point. Grossman and Cormack point out that they used relevance feedback because they wanted to keep their testing methods simple and reproducible. As they note in their conclusion:
There is no reason to presume that the CAL results described here represent the best that can be achieved. Any number of feature engineering methods, learning algorithms, training protocols, and search strategies might yield substantive improvements in the future.
In an excellent four-part series (which starts here), Ralph Losey suggests using a multi-modal approach to combat fears of bias in the training process. From private discussions with the authors, we know that Grossman and Cormack also use additional techniques to improve the learning process for their system.
We combat bias in our active learning process by including contextual diversity samples as part of our active training protocol. Contextual diversity uses an algorithm we developed to present the reviewer with documents that are very different from what the review team has already seen. We wrote about it extensively in a recent blog post.
Our ability to do contextual diversity sampling comes from the fact that our DRE engine ranks all of the documents every time. Because we rank all the documents, we know something about the nature of the documents already seen by the reviewers and the documents not yet reviewed. The contextual diversity algorithm essentially clusters unseen documents and then presents a representative sample of each group as the review progresses. And, like our relevance and QC algorithms, contextual diversity also keeps learning and improving as the review progresses.
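To give a feel for diversity sampling in general, here is a minimal sketch that uses a different and much simpler technique than the one described above: greedy farthest-first selection over Jaccard distance between term sets. It is not our contextual diversity algorithm, it does no clustering, and every name and data value in it is hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus the share of terms two documents have in common."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def diverse_sample(unseen, seen, k):
    """Greedily pick up to k unseen documents that are least like anything
    already reviewed or already picked (farthest-first selection).

    unseen, seen: dicts of doc_id -> set of terms. Neither is modified.
    """
    reference = list(seen.values())
    candidates = dict(unseen)
    picked = []
    while candidates and len(picked) < k:
        def distance_to_nearest(doc_id):
            # With nothing to compare against, treat the document as maximally novel.
            return min(
                (jaccard_distance(candidates[doc_id], ref) for ref in reference),
                default=1.0,
            )
        # sorted() makes ties resolve to the smallest doc id.
        best = max(sorted(candidates), key=distance_to_nearest)
        picked.append(best)
        reference.append(candidates.pop(best))
    return picked

seen = {"s1": {"merger", "price"}, "s2": {"merger", "board"}}
unseen = {
    "u1": {"merger", "offer"}, "u2": {"lunch", "menu"},
    "u3": {"lunch", "friday"}, "u4": {"patent", "license"},
}
sample = diverse_sample(unseen, seen, 2)
```

In this toy data the sample skips u1, which resembles what reviewers have already seen, and also skips u3 once the similar u2 has been picked, so each pick covers a different topic.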
The picture below, from our earlier blog post on this subject, illustrates our approach. Each yellow circle indicates a contextual cluster and the red dot in each circle indicates the most representative sample document the algorithm can find.
The Continuous Learning Process
Backed by our continuous ranking engine and contextual diversity, we can support a simple and flexible TAR 2.0 process for training and review. Here are the basic steps:
- Start by finding as many relevant documents as possible. Feed them to the system for initial ranking. (Actually, you could start with no relevant documents and build off of the review team work. Or, start with contextual diversity sampling to get a feel for different types of documents in the population.)
- Let the review team begin review. They get an automated mix including highly relevant documents and others selected by the computer based on contextual diversity and randomness to avoid bias. Our mix is a trade secret but most are highly ranked documents to maximize review-team efficiency over the course of the entire review.
- As the review progresses, QC a small percentage of the documents at the senior attorney’s leisure. Our QC algorithm will present documents that are most likely mistagged.
- Continue until you reach the desired recall rate. Track your progress through our progress chart (shown above) and an occasional systematic sample, which will generate a yield curve.
The process is flexible and can progress in almost any way you desire. You can start with tens of thousands of tagged documents if you have them, or start with just a few or none at all. Just let the review team get going either way and let the system balance the mix of documents included in the dynamic, continuously iterative review queue. As they finish batches, the ranking engine keeps getting smarter. If you later find relevant documents through whatever means, simply add them. It just doesn’t matter when your goal is to find relevant documents for review rather than train a classifier.
This TAR 2.0 process works well with low richness collections because you are encouraged to start the training with any relevant documents you can find. As review progresses, more relevant documents rise to the top of the rankings, which means your trial team can get up to speed more quickly. It also works well for early case assessment (ECA) and third-party productions, where time is short. (Read a case study on using TAR for third-party productions here.)
As Grossman and Cormack point out:
This study highlights an alternative approach – continuous active learning with relevance feedback – that demonstrates superior performance, while avoiding certain problems associated with uncertainty sampling and passive learning. CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.
If your TAR product is integrated into your review engine and supports continuous ranking, there is little doubt they are right. Keep learning, get smarter and save more. That is a winning combination.