In a recent blog post, Ralph Losey tackles the issue of expertise and TAR algorithm training. The post, as is characteristic of Losey’s writing, is densely packed. He raises a number of different objections to doing any sort of training using a reviewer who is not a subject matter expert (SME). I will not attempt to unpack every single one of those objections. Rather, I wish to cut directly to the fundamental point that underlies the belief in the absolute necessity that an SME, and only an SME, should provide the judgments, the document codings, that get used for training:
Quality of SMEs is important because the quality of input in active machine learning is important. A fundamental law of predictive coding as we now know it is GIGO, garbage in, garbage out. Your active machine learning depends on correct instruction.
Active machine learning depends on a number of different factors, including but not limited to the type of features (aka “signals”) that are extracted from the data and the complexity of the data itself (how “separable” the data is), even if perfect and complete labeling of every document in the collection were available. All of these factors have an effect on the quality of the output. But yes, one of those factors is the labels on the documents, the human-given coding.
Coding quality is indeed important. However, what I question is this seemingly “common sense objection” of garbage in, garbage out (GIGO). In offline discussion with Ralph, I was able to distill this common sense objection into an even purer form, a part of this conversation which I reprint here with permission:
Sorry, but wrong decisions to find the right docs sounds like alchemy to me, lead to gold.
This is the essence of the entire conundrum: Can wrong decisions be used to find the right documents? If it can be shown that they can, then while that does not automatically answer every one of Ralph’s objections, it provides a solid foundation on which to do so as the industry continues to refine its understanding of TAR’s capabilities. Thus, the purpose of this post is to show that this can be done. Future posts will then apply the principle to real-world TAR workflows.
You may also want to read our prior posts related to this topic:
- TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?
- Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?
- Are Subject Matter Experts Really Required for TAR Training? (A Follow-Up on TAR 2.0 Experts vs. Review Teams)
Pseudo Relevance Feedback
The first manner in which we show that wrong decisions can be used to find right documents is to turn to an old information retrieval concept known as pseudo relevance feedback (PRF), also called blind feedback. Imagine running a search on a collection of documents and getting back a list of results, some relevant and some not. Ideally, all the relevant results would sit toward the top of the list and the non-relevant ones at the bottom. We all know that doesn’t happen. So the technique of pseudo relevance feedback is employed to improve the quality of the ranking. PRF operates in the following manner:
- The top k (usually a couple dozen) results are selected from the top of the existing ranking.
- All top k documents are blindly judged to be relevant. That is, they’re automatically coded as relevant, whether or not they truly are.
- Those top k documents, with their relevant=true coding, are then fed back to the machine learner, and the ranking is altered based on this blind, or pseudo-relevant, feedback.
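The three steps above can be sketched with a Rocchio-style query update over bag-of-words vectors. The toy corpus, query, and parameter values (k, alpha, beta) below are illustrative assumptions, not any particular product’s implementation:

```python
# A minimal sketch of the PRF loop: rank, blindly mark the top k as
# relevant, then move the query toward their centroid (Rocchio update).
import numpy as np

corpus = [
    "contract breach damages settlement",
    "breach of contract litigation damages",
    "quarterly earnings report revenue",
    "merger agreement contract terms",
    "vacation photos from the beach",
]
query = "contract damages"

vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

def vectorize(text):
    """Simple term-frequency vector over the corpus vocabulary."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

docs = np.array([vectorize(d) for d in corpus])
q = vectorize(query)

# Step 1: rank by dot-product score and take the top k results.
k = 2
scores = docs @ q
top_k = np.argsort(scores)[::-1][:k]

# Step 2: blindly treat all top-k documents as relevant.
# Step 3: Rocchio update moves the query toward their centroid;
# no human ever checks whether the top-k documents really are relevant.
alpha, beta = 1.0, 0.75
q_expanded = alpha * q + beta * docs[top_k].mean(axis=0)

# Re-rank the collection with the expanded query.
reranked = np.argsort(docs @ q_expanded)[::-1]
```

Even in this tiny example, the expansion pulls terms like “breach” into the query, so documents sharing vocabulary with the blindly-trusted top results rise in the re-ranking.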
In the PRF regimen, many of the documents in this top k set are truly not relevant, and yet they are coded as relevant and used for training, as if by a human coder so naïve that (s)he codes every document blindly. Under the GIGO principle, these “garbage” wrong judgments should cause the quality of the ranking, the number of truly responsive documents at the top of the list, to go down.
And yet that turns out to not be the case.
As far back as 20 years ago (1994), information retrieval researchers were reporting that using nonrelevant documents to find relevant ones yielded better results. For example, see Automatic Query Expansion Using SMART: TREC 3, by Buckley, Salton, Allan, and Singhal:
Massive query expansion also works in general for the ad-hoc experiments, where expansion and weighting are based on the top initially retrieved documents instead of known relevant documents. In the ad-hoc environment this approach may hurt performance for some queries (e.g. those without many relevant documents in the top retrieved set), but overall proves to be worthwhile with an average 20% improvement.
The astute reader will note, and might complain, that while performance does improve on average, and for the vast majority of queries, it is not universal. “Why risk making some things worse,” one might ask, “even if most things get better?” There are two answers to that.
The first answer is that, because PRF is a decades-old, established technique in the information retrieval world, there is a large, active body of research around it. There are indeed researchers who have explored the trade-off between risk and reward (Estimation and Use of Uncertainty in Pseudo-relevance Feedback) and have learned to optimize around it (Accounting for Stability of Retrieval Algorithms using Risk-Reward Curve). These are but a few of many available papers that address the topic.
The second answer is simply to note that the goal here is not (yet) to address detailed issues of workflow, risk-mitigation, or total annotation cost. Those deserve separate, full length treatises. Rather, the goal here is simply to dispel the notion that “wrong” decisions cannot lead to “right” documents. By and large the body of literature on pseudo relevance feedback shows that they can. Full stop.
Experiments Using Only Wrong Documents
However, readers might raise the objection that PRF doesn’t use “wrong” judgments so much as it uses diluted “right” judgments. After all, there are some truly relevant documents in the top k set that gets used for training, and so even while some truly non-relevant documents get blindly marked as relevant, some truly relevant documents also get marked as relevant, in the same way that even a broken watch correctly tells the time twice a day. However, even if some truly relevant documents are mixed in with the blind feedback, that doesn’t change the fact that wrong decisions are still leading to right documents. I also argue that this is a realistic parallel to what a non-SME would do, which is to make a lot of right decisions with some wrong decisions mixed in. Nevertheless, because of this potential objection, I will take things one step further.
This leads us to the second manner in which we show that wrong decisions can be used to find right documents. And this one, I giddily foreshadow, is going to be a little more extreme in its demonstration. These are some experiments that we’ve done using the proprietary Catalyst algorithms, so I will not talk about the algorithms, only the outputs. The setup is as follows: As part of some of the earlier TREC Legal Track runs, ground truth (human judgments on documents) was established by TREC analysts. However, teams that participated in the runs were allowed to submit documents that they felt were incorrectly judged, and the topic authority for the matter then made the final adjudication. In some cases, the original judgment was upheld. In some cases, it was overturned, and the topic authority made the final, correct call for a document that was different than the original non-authoritative reviewer had given.
For our experiment, we collected the docids of all those documents with topic authority overturns. For the training of our system, we used only those docids, no more and no less. However, we established two conditions. In the first, the docids were given the coding value of responsive or non-responsive based on the final, topic authority judgment of that document. In the second, the exact same docids were given the coding value based on the original, non-authoritative reviewer. That is to say, in the second condition, the judgments weren’t just slightly wrong, they were 100% wrong. In this second condition, all documents marked by the topic authority as responsive were given the training value of non-responsive, and vice versa.
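The construction of the two conditions can be illustrated schematically. The docids and labels below are invented placeholders, not the actual TREC adjudications:

```python
# Sketch of the two training conditions: the topic authority's final
# calls, and the same docids with every judgment inverted.
# Docids and labels here are made-up examples (assumptions).
overturned = {
    "doc-014": "responsive",       # topic authority's final call
    "doc-027": "non-responsive",
    "doc-033": "responsive",
}

def flip(label):
    """Invert a responsiveness judgment."""
    return "non-responsive" if label == "responsive" else "responsive"

# Condition 1: train on the authoritative (correct) judgments.
condition_1 = dict(overturned)

# Condition 2: the same docids with the original reviewer's calls,
# which are 100% wrong by construction, since every one was overturned.
condition_2 = {doc: flip(label) for doc, label in overturned.items()}
```

The key property is that both conditions train on exactly the same documents; only the labels differ, and in the second condition every single label is wrong.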
The algorithms then used this training data from each condition separately to produce a ranking over the remainder of the collection. The quality of this ranking was determined using test data consisting of the remaining judgments for which there was no disagreement, i.e. documents that either every team participating in the TREC evaluation felt were correctly judged in the first place, or that the topic authority personally adjudicated and upheld. These results were visualized in the form of a yield curve. The x-axis is the depth in the ranking, and the y-axis is the cumulative number of truly responsive documents available at that depth. We do not show raw counts, but we do show a blue line representing the expected cumulative rate of discovery for manual linear review, i.e. what would happen on average if you were to review documents in a random order.
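The yield-curve computation itself is straightforward. The ranking and labels below are synthetic stand-ins (a noisy score correlated with the true label substitutes for a real ranker’s output), used only to show the mechanics:

```python
# Sketch: computing a yield curve and the linear-review baseline.
# Collection size, richness, and scores are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_responsive = 1000, 100
labels = np.zeros(n_docs, dtype=int)
labels[:n_responsive] = 1
rng.shuffle(labels)

# A noisy score correlated with the true label stands in for the
# output of an actual trained ranker.
scores = labels + rng.normal(0, 0.8, n_docs)
order = np.argsort(scores)[::-1]           # review in ranked order

depth = np.arange(1, n_docs + 1)
yield_curve = np.cumsum(labels[order])     # responsive docs found by depth d
linear_review = depth * (n_responsive / n_docs)  # the "blue line" baseline
```

Plotting `yield_curve` against `depth`, with `linear_review` as the diagonal baseline, reproduces the kind of chart described above: any ranking that concentrates responsive documents near the top bows above the diagonal.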
The perpetually astute reader might be tempted at this point to shout out, “Aha! See, I knew it! The authoritative user’s judgments lead to better outcomes than the non-authoritative user’s. Garbage in, garbage out. Our position is vindicated!” Let me, however, remind that reader of one fact: the “non-authoritative” training set in this case consists of documents with 100% wrong judgments. Not 10% wrong, 25% wrong, or even 50% wrong, as you might expect from a non-expert but trained contract reviewer, but 100% wrong. Keep that in mind as you compare these yield curves against manual linear review (blue line). What this experiment shows is that even when the training data is 100% wrong, the rate at which you are able to discover responsive documents, at least using the Catalyst algorithm with its proprietary algorithmic compensation factors, significantly outperforms manual linear review.
Let me remind the reader of the goal of this exercise, which is to show that wrong decisions can be used to find right documents. How we deal with various wrong decisions to mitigate risk, to maximize yield, etc. is a secondary question. And it is one that is proper for the reader to ask. However, that question cannot be asked unless one first is willing to accept the notion that wrong decisions can lead to right documents. That is the primary question, and the foundation on which we will be able to build further discussion of how exactly to deal with various kinds of wrongness, and to what extent it does or does not affect the overall outcome.
Lest the reader believe that this is an unrepeatable example, let us show another topic, with the experiment similarly designed:
Now the yield curve for this experiment was lower than in the previous experiment, which has a lot to do with training set size, characteristics of the data, etc. But the story that it tells is similar: Even training using documents that are 100% wrong in their labeling gives a yield that outperforms manual linear review. All else aside, wrong decisions can and do lead to right documents.
Wrongness Indeed Leads to Rightness
I suppose one might also note that in this particular case, not only did wrong decisions lead to right documents, but those wrong decisions led to more right documents (higher yield) at various points than did the right decisions. Again, however, as I noted in the previous experiment, the goal here is not to compare, nor to delve into the workflow details of how to use wrong or right decisions. The goal is simply to show, as a first step, that wrongness can indeed lead to rightness.
We’ve repeated this experiment on a number of additional TREC matters, as well as on some of our own matters, and have consistently found the same outcome. The common sense objection of “garbage in, garbage out” masks a host of underlying realities and algorithmic workarounds. I believe that there is a common — I think even unconscious — assumption in the industry that anything that is not 100% correct is “garbage.” What I hope is that this post opens the door to the possibility that there is a wide spectrum in between garbage and perfection.
When it comes to producing documents, we as an industry often talk about the standard of reasonableness, rather than perfection. So why is it that when it comes to coding our training documents, we have a blind spot (yes, that’s a PRF pun) to the idea that reasonable coding calls can also lead to reasonable outcomes? It is a false dichotomy to assume that the only two choices are garbage and complete expertise. This post has shown that imperfect inputs, wrong decisions, are capable of leading to right documents. That by itself does not wipe away every objection raised by Losey’s post – more discussion and experimental evidence is required – but it does undermine the foundation of those objections.