In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey)

In a recent blog post, Ralph Losey tackles the issue of expertise and TAR algorithm training.  The post, as is characteristic of Losey’s writing, is densely packed.  He raises a number of different objections to doing any sort of training using a reviewer who is not a subject matter expert (SME).  I will not attempt to unpack every single one of those objections.  Rather, I wish to cut directly to the fundamental point that underlies the belief in the absolute necessity that an SME, and only an SME, should provide the judgments, the document codings, that get used for training:

Losey writes:

Quality of SMEs is important because the quality of input in active machine learning is important. A fundamental law of predictive coding as we now know it is GIGO, garbage in, garbage out. Your active machine learning depends on correct instruction.

Active machine learning depends on a number of different factors, including but not limited to the type of features (aka “signals”) that are extracted from the data and the complexity of the data itself (how “separable” the data is), even if perfect and complete labeling of every document in the collection were available.  All of these factors have an effect on the quality of the output.  But yes, one of those factors is the labels on the documents, the human-given coding.

Coding quality is indeed important.  However, what I question is this seemingly “common sense objection” of garbage in, garbage out (GIGO).  In offline discussion with Ralph, I was able to distill this common sense objection into an even purer form, a part of this conversation which I reprint here with permission:

Sorry, but wrong decisions to find the right docs sounds like alchemy to me, lead to gold.

This is the essence of the entire conundrum: Can wrong decisions be used to find the right documents?  If it can be shown that wrong decisions can indeed be used to find the right documents, then while that does not automatically answer every single one of Ralph’s objections, it provides a solid foundation on which to do so as the industry continues to iterate on its understanding of TAR’s capabilities.  Thus, the purpose of this post is to focus on the fact that this can be done.  Future posts will then apply the principle to real world TAR workflows.

Pseudo Relevance Feedback

The first manner in which we show that wrong decisions can be used to find right documents is to turn to an old information retrieval concept known as pseudo relevance feedback, or PRF (aka blind feedback).  Imagine running a search on a collection of documents, and getting back a list of results, some of which are relevant and some of which are not.  Ideally, you would want all the relevant ones to be toward the top, and the non-relevant ones at the bottom.  We all know that doesn’t happen.  So the technique of pseudo relevance feedback is employed to improve the quality of the ranking.  PRF operates in the following manner (a minimal sketch in code follows the list):

  1. The top k (usually a couple dozen) results are selected from the top of the existing ranking.
  2. All top k documents are blindly judged to be relevant. That is, they’re automatically coded as relevant, whether or not they truly are.
  3. Those top k documents, with their relevant=true coding, are then fed back to the machine learner, and the ranking is altered based on this blind, or pseudo-relevant, feedback.
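
For readers who want to see the mechanics, here is a minimal sketch of one common realization of PRF, using TF-IDF vectors and a Rocchio-style query update.  The tiny document collection, the query, and the parameter values (k, alpha, beta) are hypothetical illustrations of my own, not anything drawn from Catalyst’s system.

```python
# Minimal pseudo relevance feedback (blind feedback) sketch:
# rank by TF-IDF cosine similarity, blindly treat the top k as relevant,
# expand the query toward their centroid, and re-rank.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "payment schedule for the merger agreement",
    "quarterly earnings and payment terms",
    "holiday party planning email",
    "merger negotiation payment milestones",
    "fantasy football league standings",
]
query = "merger payment"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)
q_vec = vectorizer.transform([query])

# Initial ranking by cosine similarity to the query.
scores = cosine_similarity(q_vec, doc_vecs).ravel()
initial_ranking = scores.argsort()[::-1]

# Blindly treat the top k results as relevant, whether or not they truly are,
# and move the query toward their centroid (Rocchio-style update).
k, alpha, beta = 2, 1.0, 0.75
top_k = initial_ranking[:k]
centroid = doc_vecs[top_k].mean(axis=0)
expanded_q = alpha * q_vec.toarray() + beta * np.asarray(centroid)

# Re-rank the collection with the expanded query.
new_scores = cosine_similarity(expanded_q, doc_vecs).ravel()
new_ranking = new_scores.argsort()[::-1]
print("initial:", initial_ranking, "after PRF:", new_ranking)
```

The only step that matters for the present argument is the blind one: the top k documents are folded into the feedback signal whether or not they are truly relevant.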

In the PRF regimen, there are many documents in this top k set that are truly not relevant, and yet they are coded as relevant and used for training, as if by a human coder so naïve that (s)he codes those documents blindly.  Under the GIGO principle, these “garbage” wrong judgments should cause the quality of the ranking, the number of truly responsive documents at the top ranks of the list, to go down.

And yet that turns out to not be the case.

As far back as 20 years ago (1994), information retrieval researchers were reporting that using nonrelevant documents to find relevant ones yielded better results.  For example, see Automatic Query Expansion Using SMART: TREC 3, by Buckley, Salton, Allan, and Singhal:

Massive query expansion also works in general for the ad-hoc experiments, where expansion and weighting are based on the top initially retrieved documents instead of known relevant documents. In the ad-hoc environment this approach may hurt performance for some queries (e.g. those without many relevant documents in the top retrieved set), but overall proves to be worthwhile with an average 20% improvement.

The astute reader will note, and might complain, that while performance does improve on average, and for the vast majority of queries, it is not universal.  “Why risk making some things worse,” one might ask, “even if most things get better?”  There are two answers to that.

The first answer is that, because PRF is a decades-old, established technique in the information retrieval world, there is a large, active body of research around it.  There are indeed researchers who have explored the trade-off between risk and reward (Estimation and Use of Uncertainty in Pseudo-relevance Feedback) and have learned to optimize around it (Accounting for Stability of Retrieval Algorithms using Risk-Reward Curve). These are but a few of many available papers that address the topic.

The second answer is simply to note that the goal here is not (yet) to address detailed issues of workflow, risk-mitigation, or total annotation cost.  Those deserve separate, full length treatises.  Rather, the goal here is simply to dispel the notion that “wrong” decisions cannot lead to “right” documents. By and large the body of literature on pseudo relevance feedback shows that they can. Full stop.

Experiments Using Only Wrong Documents

However, readers might feel like raising the objection that PRF doesn’t use “wrong” judgments so much as it uses diluted “right” judgments.  After all, there are some truly relevant documents in the top k set that gets used for training, and so even while some truly non-relevant documents get blindly marked as relevant, some truly relevant documents also get marked as relevant, in the same way that even a broken watch correctly tells the time twice a day.  However, even if some truly relevant documents are mixed in with the blind feedback, that doesn’t change the fact that wrong decisions are still leading to right documents.  I also argue that this is a realistic parallel to what a non-SME would do, which is to still make a lot of right decisions, with some wrong decisions mixed in.  However, because of this potential objection, I will take things one step further.

This leads us to the second manner in which we show that wrong decisions can be used to find right documents.  And this one, I giddily foreshadow, is going to be a little more extreme in its demonstration.  These are some experiments that we’ve done using the proprietary Catalyst algorithms, so I will not talk about the algorithms, only the outputs.  The setup is as follows: As part of some of the earlier TREC Legal Track runs, ground truth (human judgments on documents) was established by TREC analysts.  However, teams that participated in the runs were allowed to submit documents that they felt were incorrectly judged, and the topic authority for the matter then made the final adjudication.  In some cases, the original judgment was upheld.  In other cases, it was overturned, and the topic authority made a final, correct call that was different from the one the original, non-authoritative reviewer had given.

For our experiment, we collected the docids of all those documents with topic authority overturns.  For the training of our system, we used only those docids, no more and no less.  However, we established two conditions.  In the first, the docids were given the coding value of responsive or non-responsive based on the final, topic authority judgment of that document.  In the second, the exact same docids were given the coding value assigned by the original, non-authoritative reviewer.  That is to say, in the second condition, the judgments weren’t just slightly wrong, they were 100% wrong.  In this second condition, all documents marked by the topic authority as responsive were given the training value of non-responsive, and vice versa.

The algorithms then used this training data from each condition separately to produce a ranking over the remainder of the collection.  The quality of this ranking was determined using test data consisting of the remainder of the judgments for which there was no disagreement, i.e. documents that either every single team participating in the TREC evaluation felt were correctly judged in the first place, or that the topic authority personally adjudicated and kept the original marking.  These results were visualized in the form of a yield curve.  The x-axis is the depth in the ranking, and the y-axis is the cumulative number of truly responsive documents available at that depth in the ranking.  We do not show raw counts, but we do show a blue line which represents the expected cumulative rate of discovery for manual linear review, i.e. what would happen on average if you were to review documents in a random order.
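
To make the experimental bookkeeping concrete, here is a minimal sketch of the two training conditions and the yield-curve evaluation, assuming a generic TF-IDF plus logistic regression ranker in place of Catalyst’s proprietary algorithms.  All names (train_texts, authority_labels, and so on) are hypothetical placeholders for the TREC Legal Track data described above, and a vanilla learner trained this way will not necessarily reproduce the compensation behavior discussed below; the sketch shows only the design of the experiment.

```python
# Sketch of the flipped-label experiment: train on the overturned documents
# under two labeling conditions, rank the undisputed remainder, and compute
# yield curves plus the expected manual linear review baseline.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def rank_remainder(train_texts, train_labels, test_texts):
    """Train on the overturned documents and rank the rest of the collection."""
    vec = TfidfVectorizer()
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)
    model = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    scores = model.predict_proba(X_test)[:, 1]
    return np.argsort(scores)[::-1]  # review order: highest score first


def yield_curve(review_order, test_truth):
    """Cumulative count of truly responsive documents at each review depth."""
    return np.cumsum(np.asarray(test_truth)[review_order])


def run_both_conditions(train_texts, authority_labels, test_texts, test_truth):
    authority_labels = np.asarray(authority_labels)  # 1 = responsive, 0 = not
    # Condition 1: train on the topic authority's (correct) labels.
    right = yield_curve(rank_remainder(train_texts, authority_labels, test_texts), test_truth)
    # Condition 2: train on the original reviewer's labels, i.e. every label flipped.
    wrong = yield_curve(rank_remainder(train_texts, 1 - authority_labels, test_texts), test_truth)
    # Expected yield of manual linear review (random order): a straight line.
    linear = np.mean(test_truth) * np.arange(1, len(test_truth) + 1)
    return right, wrong, linear
```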

Experiment Yield Curve

The perpetually astute reader might be tempted at this point to shout out, “Aha! See, I knew it! The authoritative user’s judgments lead to better outcomes than the non-authoritative user’s. Garbage in, garbage out. Our position is vindicated!”  Let me, however, remind that reader of one fact: The “non-authoritative” training data in this case consists of documents with 100% wrong judgments.  Not 10% wrong, 25% wrong, or even 50% wrong, as you might expect from a non-expert (but trained) contract reviewer.  But 100% wrong.  Keep that in mind as you compare these yield curves against manual linear review (blue line).  What this experiment shows is that even when the training data is 100% wrong, the rate at which you are able to discover responsive documents — at least using the Catalyst algorithm with its proprietary algorithmic compensation factors — significantly outperforms manual linear review.

Let me remind the reader of the goal of this exercise, which is to show that wrong decisions can be used to find right documents.  How we deal with various wrong decisions to mitigate risk, to maximize yield, etc. is a secondary question.  And it is one that is proper for the reader to ask.  However, that question cannot even be asked until one is first willing to accept the notion that wrong decisions can lead to right documents.  That is the primary question, and the foundation on which we will be able to build further discussion of how exactly to deal with various kinds of wrongness, and to what extent it does or does not affect the overall outcome.

Lest the reader believe that this is an unrepeatable example, let us show another topic, with the experiment similarly designed:

Experiment Yield Curve 2

Now the yield curve for this experiment was lower than in the previous experiment, which has a lot to do with training set size, characteristics of the data, etc.  But the story that it tells is similar: Even training using documents that are 100% wrong in their labeling gives a yield that outperforms manual linear review.  All else aside, wrong decisions can and do lead to right documents.

Wrongness Indeed Leads to Rightness

I suppose one might also note that in this particular case, not only did wrong decisions lead to right documents, but those wrong decisions led to more right documents (higher yield) at various points than did the right decisions.  Again, however, as I noted for the previous experiment, the goal here is not to compare the two, nor to delve into the workflow details about how to use wrong or right decisions.  The goal is simply to show, as a first step, that wrongness can indeed lead to rightness.

We’ve repeated this experiment on a number of additional TREC matters, as well as on some of our own matters, and have consistently found the same outcome.  The common sense objection of “garbage in, garbage out” masks a host of underlying realities and algorithmic workarounds.  I believe that there is a common — I think even unconscious — assumption in the industry that anything that is not 100% correct is “garbage.” What I hope is that this post opens the door to the possibility that there is a wide spectrum in between garbage and perfection.

When it comes to producing documents, we as an industry often talk about the standard of reasonableness, rather than perfection.  So why is it that when it comes to coding our training documents, we have a blind spot (yes, that’s a PRF pun) to the idea that reasonable coding calls can also lead to reasonable outcomes?  It is a false dichotomy to assume that the only two choices are garbage and complete expertise.  This post has shown that imperfect inputs, wrong decisions, are capable of leading to right documents.  That by itself does not wipe away every objection that was raised by Losey’s post – more discussion and experimental evidence is required – but it does undermine the foundation of those objections.

About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As senior applied research scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.

10 thoughts on “In TAR, Wrong Decisions Can Lead to the Right Documents (A Response to Ralph Losey)”

  1. Pingback: Can you train a useful model with incorrect labels? « Evaluating E-Discovery

  2. Gerard Britton

    Jeremy,

    Does the above address the issue of incorrect coding of relevant documents as non-relevant, which would dink recall? If I am reading the above correctly, the analysis was limited to non-relevant documents being coded as relevant, which one would think would dink precision only.

    Isn’t the incorrect coding by training reviewers of relevant items as non-relevant what concerns people because it would tend to drop relevant items out of the higher ranking positions or lead to inconsistent output?

    Gerry

    Reply
    1. Jeremy Pickens

      Gerard,

      First of all, thank you for acknowledging the fact that there might indeed be different kinds of outputs depending on the type of coding wrongness. Most folks with whom I talk think that wrong is simply wrong, because it somehow destroys consistency. We’ve looked at this empirically, though, and found that, indeed and in general, it’s worse to flip a relevant to a nonrelevant than vice versa.

      But that’s also the nice thing about trained contract reviewers: if you look at the kinds of mistakes that they tend to make, they tend to overmark rather than undermark for responsiveness. In other words, the mistakes that they’re making are in the right direction for keeping your recall high.

      Not only that, but I think many folks tend to worry too much about consistency, and not enough about coverage. So what if your SME is 100% consistent, if that person only has the capability (due to time, money, etc.) to judge a few thousand documents? In other words, there might not be a huge difference between a truly relevant document that gets marked as nonrelevant, and a truly relevant document that doesn’t get marked at all, because the single SME ran out of time. This is an empirical question, rather than a philosophical one. And it is of even greater concern when you get into the long tail of responsive documents, which is where you tend to be when you’re approaching a defensible 70%, 80%, 90% recall.

      Anyway, I’ll get down off of that hobby horse. To answer your question, yes, in this experiment, the wrongnesses go in both directions, simultaneously. What I wrote was:

      “…in the second condition, the judgments weren’t just slightly wrong, they were 100% wrong. In this second condition, all documents marked by the topic authority as responsive were given the training value of non-responsive, and vice versa.”

      The “vice versa” was intended to communicate the fact that all documents marked by the topic authority as non-responsive were given the training value of responsive, too. Both ways. 100% wrong.

      Reply
  3. Pingback: Beware of the TAR Pits! – Part Two | e-Discovery Team ®

  4. Jim Caitlan

    Jeremy –

    Thanks so much. Very intriguing example of a non-intuitive outcome which should be considered for integration into evolving TAR methodology.

    Of course non-intuitive outcomes increase the psychological resistance to new technology and methods. Attorneys are already wary of approving TAR protocols that appear reasonable on their face. I’m trying to imagine the meet and confer where the producing party proposes a protocol based on deliberate mis-coding of the training set.

    Jim

    Reply
    1. Jeremy Pickens

      Jim,

      You’re welcome. And yes, you’re absolutely correct that there is and will continue to be much psychological resistance. I am hopeful, however, that empiricism will win out in the end.

      Let me correct one notion: We are certainly not recommending a protocol in which the producing party deliberately miscodes training documents. We recommend that all of your reviewers, no matter who those reviewers are (whether senior attorney or contract reviewer), make judgments to the best of their ability and knowledge, at all times. What our experiment showed, though, is that even when the contract reviewers’ efforts aren’t completely up to snuff, at least when compared to the topic authority’s understanding of the matter, all is not lost. Those “wrong” judgments will still lead to a helluva lot of responsive documents. Sometimes they will even lead to more than your topic authority’s judgments do.

      There will still need to be protocols to finesse those “wrong” judgments into the overall workflow, which protocols we have indeed designed. We just don’t cover them in this post. What we’re essentially covering is the worst case scenario, and we’re showing that even in this worst case scenario, wrong decisions lead to right documents.

      Reply
  5. Pingback: Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review

  6. Pingback: Continuous Active Learning for Technology Assisted Review (How it Works and Why it Matters for E-Discovery) |

  7. Pingback: Predictive Ranking (TAR) for Smart People |
