Thinking Through the Implications of CAL: Who Does the Training?

Before joining Catalyst in 2010, my entire academic and professional career revolved around basic research. I spent my time coming up with new and interesting algorithms, ways of improving document rankings and classification. However, in much of my research, it was not always clear which algorithms would have immediate application. It is not that the algorithms were not useful; they were. They just did not always have immediate application to a live, deployed system.

Since joining Catalyst, however, my research has become much more applied. I have come to discover that this doesn’t just mean that the algorithms I design have to be more narrowly focused on the task at hand. It also means that I have to design those algorithms to be aware of the larger real-world contexts in which they will be deployed and the limitations that may exist therein.

So it is with keen interest that I have been watching the eDiscovery world react to the recent (SIGIR 2014) paper from Maura Grossman and Gordon Cormack on the CAL (continuous active learning) protocol, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.

At Catalyst, we have been speaking publicly about the benefits of continuous learning ever since our LegalTech 2012 panel session three years ago. At the time, we labeled the dimensions “dynamic” and “static,” rather than “continuous” and “simple.” But the idea is the same: Is your review protocol, and the algorithms used to support that protocol, defined by a fixed, finite, ultimately static learning period, followed by batch contract review of the predicted top documents in which no learning takes place (simple learning)? Or is the algorithm capable of continuous updating, dynamically aware of every new document judgment that is made, no matter who makes it? And not only can it use those judgments to improve, but does it?

Let’s be clear: Cormack and Grossman were talking to the e-discovery community about the benefits of the continuous mindset long before our continuous-based offerings came on the market. Some of Cormack’s earliest writings on this go back to 2009. [See, e.g., G. V. Cormack and M. Mojdeh, Machine learning for information retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks, The Eighteenth Text REtrieval Conference (TREC 2009)]. So we were not “first,” nor are we attempting to represent anything otherwise. But we have been doing it for long enough that it has given us the chance to reflect on some of the practical issues that arise when adopting the continuous approach.

These practical concerns have led us to write about what we call the “Five Myths of TAR” (see here and here). However, when reading a lot of the public reaction to Grossman and Cormack’s CAL work, as well as when engaging in private conversations about algorithms and workflow, it occurred to me that even though we had listed what we saw as myths, and explained why we thought each of the myths was indeed a myth, we had not done one thing. We had not explained how the myths relate to each other.

So I would like to write a bit in this and in future posts, not just about CAL, nor just about the myths, but about how everything fits together. I would like to connect the dots by going into more depth about the implications of applying a continuous learning protocol to your TAR project. My first dot-connection is this: In a CAL protocol, who does your training?

Who does your CAL training?

Under a simple protocol, whether Simple Passive Learning (SPL) or Simple Active Learning (SAL), the number of training documents presented to the system is limited. From various (non-Catalyst) vendor white papers as well as from informal conversations, that number seems to hover somewhere around 3,000 documents. However, in some of my conversations, industry practitioners have indicated to me that in their experience, simple training has occasionally gone as high as 17,000 documents. Nevertheless, somewhere around 3,000 documents seems to be typical. Cormack and Grossman tested simple training set sizes of 2,000, 5,000 and 8,000 documents.

When the training set sizes are (relatively) low like this, it is possible for one or two subject matter experts (SMEs) to get through all 2,000, 5,000, 8,000 and maybe even 17,000 documents themselves. A high average document review rate is around 100 docs/hour, but let’s assume that a really good SME can get through 250 docs/hour. That means that 2,000 documents are only eight hours of work, 5,000 are half a week, 8,000 are the better part of one week, and even 17,000, with a second SME helping, can be done in just under a week. The 2,000 to 17,000 worth of training can be done, and then the remainder of predicted relevant documents can be batched out to the contract reviewers for confirmation before final production.
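For readers who want to plug in their own numbers, the arithmetic above can be sketched as a small calculator. This is just a back-of-the-envelope estimate using the same assumptions as the text (8-hour days, 5-day weeks, 250 docs/hour for a fast SME); the function name and defaults are mine, not an established tool.

```python
def review_time(num_docs, docs_per_hour=250, hours_per_day=8, days_per_week=5):
    """Estimate SME review effort as (hours, days, weeks).

    Assumes a single reviewer at a constant rate; halve the result
    per additional SME working in parallel.
    """
    hours = num_docs / docs_per_hour
    days = hours / hours_per_day
    weeks = days / days_per_week
    return hours, days, weeks

# The training-set sizes discussed above:
for n in (2_000, 5_000, 8_000, 17_000):
    h, d, w = review_time(n)
    print(f"{n:>6} docs: {h:5.1f} hours = {d:4.1f} days = {w:.2f} weeks")
```

Running this reproduces the figures in the text: 2,000 documents is eight hours, 5,000 is half a week, and 17,000 split between two SMEs is just under a week each.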

Vendors who advocate an SPL or SAL protocol typically also recommend, indeed practically necessitate, that your SMEs do all of the training. “Only an SME can make the absolute right and wrong calls” is a common refrain that one hears in the industry, “and if the SME is not doing the training, it will mess up the predictions and throw off your entire process.”

With CAL, on the other hand, there is no distinction between a training phase and a (batched out) review phase. Training is review. Review is training. All reviewed documents get used for training. All training documents are used to further improve the predictive quality on as-yet unseen review documents.

What Does this Mean for CAL?

So what does this mean for CAL? Specifically, what does it mean for the warning that your SMEs must do all of the training? When review is training, your SMEs have to do the entire review themselves. That is what CAL is. Every document that has a judgment on it gets used for further training. So if you believe that only your SME can train the system, then if you run a CAL protocol, your SME must do the entire review themselves.

If your collection is small and rich, or if your collection is bigger and sparse, you’re OK, because the total number of responsive docs is still small enough for one person, maybe two, to get through. But if your collection is bigger and rich, or if your collection is huge and sparse, you might have a problem. The issue is the total number of documents that you would have to look at to get eyeballs on everything that you’re producing, which in a CAL regimen you do.

For example, say you have 700,000 documents in the collection, and 1 percent of them are responsive, i.e. 7,000 total responsive documents. With CAL (and loosely extrapolating from Cormack-Grossman SIGIR numbers to estimate the total number of documents that will have eyeballs on them by the time the process is finished) you might only have to go through 11,000 of those documents to have put eyeballs on a defensible 75 percent of the responsive documents. That means that your one 250 doc/hour SME could do the entire review (because review = training, training = review) themselves in just a little over one full week.

Or, suppose that you only have 40,000 documents in your collection, but 20 percent of them are responsive, i.e. 8,000 total responsive documents. Again, using Cormack-Grossman approximate numbers, you could put SME eyeballs on a defensible 75 percent of them after having to only go through 12,000 of the 40,000 documents. And again, this could be done in about six full days of 250 doc/hour work. It’s not ideal, but it’s doable.

So the warning common in the industry about requiring your SMEs to do training is not a problem for CAL, if your collection happens to be small or sparse.

However, what if your collection is large or rich? Or both? For example, suppose you have a 5 million document collection, and it is 1 percent rich. That is a total of 50,000 responsive docs, or (using the same assumptions we’ve been using above) about 75,000 total reviewed (= training = review) documents to get to a defensible production point. Or, suppose that you have a 1 million document collection, and it is 10 percent rich. That’s 100,000 responsive docs, for an estimated total review (= training = review) cost of 160,000 documents. That number of documents would take an SME, one working at 250 docs/hour, 37.5 days (7.5 weeks) and 80 days (16 weeks), respectively. Even if you added a second, highly trusted (translation: costly) SME to the process, you’re still requiring those two people to sit and review documents eight hours a day, five days a week, for 3.75 weeks or eight weeks respectively. Non-stop. With no other work performed in between.
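The examples above all follow the same pattern: richness determines the responsive count, and the total documents reviewed under CAL runs somewhat higher than that count. A rough sketch, with the caveat that the review multiplier (about 1.5x to 1.6x the responsive count in the examples here) is loosely extrapolated from the Cormack-Grossman numbers and is not a published constant:

```python
def cal_review_estimate(collection_size, richness, review_multiplier=1.6,
                        docs_per_hour=250, hours_per_day=8, days_per_week=5):
    """Rough CAL effort estimate for a single SME doing all review.

    review_multiplier is an assumption: total docs reviewed as a
    multiple of the responsive count (~1.5-1.6x in the text's examples).
    Returns (responsive_docs, total_reviewed, sme_weeks).
    """
    responsive = collection_size * richness
    total_reviewed = responsive * review_multiplier
    weeks = total_reviewed / docs_per_hour / hours_per_day / days_per_week
    return responsive, total_reviewed, weeks

# The two large-collection scenarios discussed above:
for size, rich in ((5_000_000, 0.01), (1_000_000, 0.10)):
    r, t, w = cal_review_estimate(size, rich)
    print(f"{size:,} docs @ {rich:.0%} rich: {r:,.0f} responsive, "
          f"~{t:,.0f} reviewed, ~{w:.1f} SME-weeks")
```

The second scenario (1 million documents, 10 percent rich) comes out to roughly 160,000 reviewed documents and 16 SME-weeks, matching the figures in the text; halving the weeks for a second SME still leaves two experts doing nothing else for two months.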

Clearly that’s a problem. I don’t know many people that will want to sit, reviewing 250 docs per hour, for 16 weeks straight, for just a single case. It seems that one of the implications of CAL is the problem that it creates for this expert training requirement. Or vice versa. Correct? Maybe. Let’s read on.

Some Algorithms Do Not Require SMEs

If the requirement for all training to be done by an SME is to be believed, but that SME doesn’t want to work for 16 weeks non-stop reviewing documents for a single case, then there are two options available:

  1. Switch to an SPL or SAL protocol, in which all machine learning stops, all improvements to the ranking or classification of as-yet unseen documents cease the moment the subject matter expert stops reviewing documents (after 2,000 or 5,000 or whenever they decide that they’d rather be doing something else) and some highly ranked proportion of the remaining documents are batched out to your contract reviewers. Essentially, this first choice is to not do CAL.
  2. Continue with the CAL protocol by letting your contract reviewer document coding be fed into the machine learning algorithms. That is, turn over the review (and therefore training) to your contract reviewers. With some SME supervision, to be sure. The contract reviewers are not working in a vacuum. But relax the requirement that the SME has to judge every single TRAINING document.

Now, if your vendor has been admonishing you that their system does not work with non-SME training, because “if the SME is not doing the training, it will mess up the predictions, and will throw off your entire process,” then you really don’t have any options. You cannot do CAL. It does not matter if your vendor is technologically capable of incessant iteration. If that technology is thrown off by non-expert training, then you have to stick with the old, simple (SPL or SAL) way of doing things.

However, not all algorithms are so fragile that non-expert training will throw them off. Some are naturally more robust, some have been explicitly designed to be more robust. And not just designed, but tested. Starting a few years ago, we at Catalyst have been doing that testing. And we’ve been writing about it, here and here. We have found that, contrary to popular industry perception, you can train with your non-expert reviewers. At least with our technology.

(Aside: As a scientist, and having empirically observed the various reasons why non-expert training works, I have to be completely open and mention that I suspect that other vendors’ technology might also support non-expert reviewers as well. But if so, one has to wonder why they may have been saying otherwise for such a long time.)

Nevertheless, the upshot is that if the technology is capable of supporting not only continuous iteration and continuous learning, but also non-expert training, then you have a choice between the two options above: switch to SPL or SAL when collection sizes are large or richness is high, or continue doing CAL, using mostly (but not all) contract reviewer judgments.

Or perhaps you still don’t have a choice. As CAL is the more effective option, you would simply go with CAL. Otherwise, you are of necessity stuck with the less effective SPL or SAL option.

The Bottom Line

In conclusion and to reconnect the dots: The first two Myths of TAR that we’ve written about were about continuous vs. simple learning (continuous wins) and expert vs. non-expert training (non-expert works just as well, if not sometimes better). Those two myths don’t exist in isolation. Sure, I suppose it would be possible to do non-expert training with SPL or SAL. And sometimes, if the conditions are just right, if the moon is half full and there is a light north-by-northeasterly breeze rustling across the milkweed and not disturbing the monarch butterflies in their slumber, you can do CAL with nothing other than a single SME. But why?

It’s when you can combine the advantages of continuous learning with the flexibility that non-expert training gives you that TAR really starts to come alive. CAL means a lower total number of documents reviewed. Non-expert training means flexibility about how and when you can start the process, not to mention the ability to be massively parallel and cut down total elapsed clock time. Instead of having to wait, as you do in SAL and SPL, for your expert to have free time in order to train documents, with these two busted Myths you can hit the ground running, and be done long before your SPL or SAL may have even started.


About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As Chief Scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.