This past weekend I received an advance copy of a new research paper prepared by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” They have posted an author’s copy here.
The study attempted to answer one of the more important questions surrounding TAR methodology:
Should training documents be selected at random, or should they be selected using one or more non-random methods, such as keyword search or active learning?
Their conclusion was unequivocal:
The results show that entirely non-random training methods, in which the initial training documents are selected using a simple keyword search, and subsequent training documents are selected by active learning, require substantially and significantly less human review effort (P < 0.01) to achieve any given level of recall, than passive learning, in which the machine-learning algorithm plays no role in the selection of training documents.
Among passive-learning methods, significantly less human review effort (P < 0.01) is required when keywords are used instead of random sampling to select the initial training documents. Among active-learning methods, continuous active learning with relevance feedback yields generally superior results to simple active learning with uncertainty sampling, while avoiding the vexing issue of “stabilization” – determining when training is adequate, and therefore may stop.
The seminal paper is slated to be presented in July in Australia at the annual conference of the Special Interest Group on Information Retrieval (SIGIR), a part of the Association for Computing Machinery (ACM).
Why is This Important?
Their research replicates the findings of much of our own research and validates many of the points about TAR 2.0 which we have made in recent posts here and in an article published in Law Technology News. (Links to the LTN article and our prior posts are collected at the bottom of this post.)
Specifically, Cormack and Grossman conclude from their research that:
- Continuous active learning is more effective than the passive learning (one-time training) used by most TAR 1.0 systems.
- Judgmental seeds and review using relevance feedback are more effective than random seeds, particularly for sparse collections.
- Subject matter experts aren’t necessary for training; review teams and relevance feedback are just as effective for training.
Their findings open the door to a more fluid approach to TAR, one we have advocated and used for many years. Rather than have subject matter experts click endlessly through randomly selected documents, let them find as many good judgmental seeds as possible. The review team can get going right away and the team’s judgments can be continuously fed back into the system for even better ranking. Experts can QC outlying review judgments to ensure that the process is as effective as possible.
While I will summarize the paper, I urge you to read it for yourself. At eight pages, it is one of the easier-to-read academic papers I have run across. Cormack and Grossman write in clear language and their points are easy to follow (for us non-Ph.D.s). That isn’t always true of other SIGIR/academic papers.
Cormack and Grossman chose eight different review projects for their research. Four came from the 2009 TREC Legal Track Interactive Task program. Four others came from actual reviews conducted in the course of legal proceedings.
The review projects under study ranged from a low of 293,000 documents to a high of just over 1.1 million. Prevalence (richness) was generally low, which is often the case in legal reviews, ranging from 0.25% to 3.92% with a mean of 1.18%.
The goal here was to compare the effectiveness of three TAR protocols:
- SPL: Simple Passive Learning.
- SAL: Simple Active Learning.
- CAL: Continuous Active Learning (with Relevance Feedback).
The first two protocols are typical of TAR 1.0 training. Simple Passive Learning trains on randomly selected documents. Simple Active Learning uses judgmental seeds for the first round of training, but then lets the algorithm select subsequent training documents (through uncertainty sampling) to further improve the classifier.
Continuous Active Learning also starts with judgmental seeds (like SAL), but then trains using review teams working primarily with the documents ranked most likely to be relevant after each ranking. Catalyst uses a CAL-like approach in Predict, but we further supplement the relevance feedback with a balanced, dynamically selected mixture that includes both relevance feedback and additional documents selected using Predict’s contextual diversity engine.
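The practical difference between the three protocols comes down to how each one picks the next batch of training documents. Here is a minimal sketch of that selection step (my own illustration, not the authors’ code), assuming a hypothetical classifier has already assigned each unreviewed document an estimated probability of relevance:

```python
import random

def select_next_batch(protocol, scores, unreviewed, batch_size=1000):
    """Choose the next training batch under each protocol (illustrative sketch).

    `scores` maps document id -> the classifier's estimated probability of
    relevance; `unreviewed` is the pool of documents not yet judged.
    """
    docs = list(unreviewed)
    if protocol == "SPL":
        # Simple Passive Learning: random selection; the model plays no role.
        return random.sample(docs, batch_size)
    if protocol == "SAL":
        # Simple Active Learning: uncertainty sampling -- the documents the
        # classifier is least sure about (probability closest to 0.5).
        return sorted(docs, key=lambda d: abs(scores[d] - 0.5))[:batch_size]
    if protocol == "CAL":
        # Continuous Active Learning: relevance feedback -- the documents
        # ranked most likely to be relevant come next.
        return sorted(docs, key=lambda d: scores[d], reverse=True)[:batch_size]
    raise ValueError(f"unknown protocol: {protocol}")
```

Note how SAL steers reviewers toward marginal documents near the decision boundary, while CAL steers them toward the likely-relevant documents they need to find anyway.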
As the authors explain:
The underlying objective of CAL is to find and review as many of the responsive documents as possible, as quickly as possible. The underlying objective of SAL, on the other hand, is to induce the best classifier possible, considering the level of training effort.
For each of the eight review projects, Cormack and Grossman ran simulated reviews using each of the three protocols, treating the review judgments already rendered for each project as “ground truth.” They then simulated training and review in batches of 1,000 documents. In a couple of cases they ran their experiments using batches of 100 documents, but this proved impractical for the entire project.
(As a side note, we have done experiments in which the size of the batch is varied. Generally, the faster and tighter the iteration, the higher the recall for the exact same amount of human effort. Rather than delve further into this here, this topic deserves and will shortly receive its own separate blog post.)
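To make the simulation methodology concrete, here is a toy sketch of a CAL-style simulated review (my own illustration under simplifying assumptions, not the authors’ implementation): documents are token sets, the pre-existing judgments serve as ground truth, and a crude token-overlap scorer stands in for a real learning algorithm.

```python
def simulate_cal(documents, labels, batch_size, seed_query):
    """Toy CAL review simulation (illustrative sketch, not the study's code).

    `documents` is a list of token sets; `labels` is a parallel list of bools
    (the pre-existing review judgments used as ground truth); `seed_query` is
    a set of keywords used to pick the initial training batch.
    Returns a gain curve: (documents reviewed, recall) after each batch.
    """
    reviewed, relevant_found = set(), 0
    total_relevant = sum(labels)  # assumes at least one relevant document
    gain_curve = []

    # Initial training batch: documents matching a simple keyword search.
    batch = [i for i, d in enumerate(documents) if seed_query & d][:batch_size]
    while batch:
        for i in batch:  # "review" the batch against ground truth
            reviewed.add(i)
            relevant_found += labels[i]
        gain_curve.append((len(reviewed), relevant_found / total_relevant))
        if relevant_found == total_relevant:
            break
        # "Train": a mock classifier profile built from tokens seen in
        # reviewed relevant documents (a real system would learn a model).
        if relevant_found:
            profile = set().union(*(documents[i] for i in reviewed if labels[i]))
        else:
            profile = seed_query
        # Relevance feedback: next batch = unreviewed docs ranked by overlap.
        candidates = [i for i in range(len(documents)) if i not in reviewed]
        candidates.sort(key=lambda i: len(documents[i] & profile), reverse=True)
        batch = candidates[:batch_size]
    return gain_curve
```

The study’s simulations follow this same loop shape at scale: judge a batch, re-rank, select the next batch, and record recall as a function of review effort.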
Here are the key conclusions Cormack and Grossman reached:
The results show SPL to be the least effective TAR method, calling into question not only its utility, but also commonly held beliefs about TAR. The results also show that SAL, while substantially more effective than SPL, is generally less effective than CAL, and as effective as CAL only in a best-case scenario that is unlikely to be achieved in practice.
Our primary implementation of SPL, in which all training documents were randomly selected, yielded dramatically inferior results to our primary implementations of CAL and SAL, in which none of the training documents were randomly selected.
In summary, the use of a seed set selected using a simple keyword search, composed prior to the review, contributes to the effectiveness of all of the TAR protocols investigated in this study.
Perhaps more surprising is the fact that a simple keyword search, composed without prior knowledge of the collection, almost always yields a more effective seed set than random selection, whether for CAL, SAL, or SPL. Even when keyword search is used to select all training documents, the result is generally superior to that achieved when random selection is used. That said, even if passive learning is enhanced using a keyword-selected seed or training set, it is still dramatically inferior to active learning.
While active-learning protocols employing uncertainty sampling are clearly more effective than passive-learning protocols, they tend to focus the reviewer’s attention on marginal rather than legally significant documents. In addition, uncertainty sampling shares a fundamental weakness with passive learning: the need to define and detect when stabilization has occurred, so as to know when to stop training. In the legal context, this decision is fraught with risk, as premature stabilization could result in insufficient recall and undermine an attorney’s certification of having conducted a reasonable search under (U.S.) Federal Rule of Civil Procedure 26(g)(1)(B).
Their article includes several Yield/Gain charts illustrating their findings. I won’t repost them all here, but their first chart, showing comparative results for the three protocols on TREC Topic 201, is a good example. You can easily see that Continuous Active Learning resulted in a higher level of recall after review of fewer documents, which is the key to keeping review costs in check.
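Reading a gain curve in cost terms is straightforward: fix a target recall and find how many documents each protocol requires you to review to reach it. A small sketch, using hypothetical gain-curve numbers (not figures from the paper):

```python
def effort_to_reach_recall(gain_curve, target):
    """Documents that must be reviewed to reach a target recall.

    `gain_curve` is a list of (docs_reviewed, recall) points in review order.
    Returns None if the target recall is never reached.
    """
    for reviewed, recall in gain_curve:
        if recall >= target:
            return reviewed
    return None

# Hypothetical curves for illustration only:
cal = [(1000, 0.40), (2000, 0.75), (3000, 0.92)]
spl = [(1000, 0.10), (2000, 0.22), (3000, 0.41), (10000, 0.75)]
effort_to_reach_recall(cal, 0.75)  # 2000 documents
effort_to_reach_recall(spl, 0.75)  # 10000 documents
```

The gap between those two numbers, multiplied across an entire matter, is the review-cost difference the charts depict.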
No doubt some people will challenge their conclusions, but they cannot be ignored as we move from TAR 1.0 to the next generation.
As the authors point out:
This study highlights an alternative approach – continuous active learning with relevance feedback – that demonstrates superior performance, while avoiding certain problems associated with uncertainty sampling and passive learning. CAL also offers the reviewer the opportunity to quickly identify legally significant documents that can guide litigation strategy, and can readily adapt when new documents are added to the collection, or new issues or interpretations of relevance arise.
From the beginning, we argued that continuous ranking/continuous learning is more effective than the TAR 1.0 approach of a one-time cutoff. We have also argued that clicking through thousands of randomly selected seeds is less effective for training than actively finding relevant documents and using them instead. And, lastly, we have published our own research strongly suggesting that subject matter experts are not necessary for TAR training, and can be put to better use finding good documents for training and performing QC on outlier review-team judgments, continuously and on the fly, tracking where the outlier pool is shifting as the review continues.
It is nice to see that others agree and are providing even more research to back up these important points. TAR 2.0 is here to stay.
Further reading on Catalyst’s research and findings about TAR 2.0:
- 5 Myths About Technology-Assisted Review (Law Technology News).
- Predictive Ranking (TAR) for Smart People.
- The Five Myths of Technology Assisted Review, Revisited.
- In TAR, Wrong Decisions Can Lead to the Right Documents.
- Is Random the Best Road for Your Car? Or is there a Better Route to Your Destination?
- Are Subject Matter Experts Really Required for TAR Training? (A Follow-Up on TAR 2.0 Experts vs. Review Teams).
- Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?
- TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?