Ask Catalyst: If TAR 2.0 Discourages Using Random Documents for Training, Aren’t the Results Biased?

[Editor’s note: This is another post in our “Ask Catalyst” series, in which we answer your questions about e-discovery search and review. To learn more and submit your own question, go here.]

We received this question:

TAR 2.0 seems to discourage using randomly selected documents for training. Doesn’t this bias the results? How do I know what I don’t know?

Today’s question is answered by Thomas Gricks, managing director of professional services. 

It isn’t that TAR 2.0 discourages the use of randomly selected documents. Continuous active learning (CAL), which focuses first and foremost on the documents most likely to be responsive, is simply more efficient. And empirical studies have shown that, rather than induce a bias, CAL algorithms accurately locate even tangentially related facets of a multi-faceted primary topic. Catalyst’s Insight Predict goes one step further by incorporating a contextual diversity algorithm into TAR 2.0 to deliberately search for “unknown” documents. With TAR 2.0, there’s virtually nothing that you don’t know.

Nothing about TAR 2.0 prevents you from using randomly selected documents to train the algorithm. You can use any positive (i.e., responsive) documents you want to train the tool. But limiting your training to randomly selected documents limits the effectiveness of the tool. For example, in one of my recent projects, the richness of the collection was 4.7 percent. If you were to randomly select documents to train the system, roughly five out of every 100 documents would likely be responsive. And that rate would never improve: every batch of 100 documents would contain only about five responsive documents with which to train the tool.

By using Insight Predict and reviewing and coding the highest-ranked documents (i.e., those most likely to be responsive), we achieved an average batch richness of 38.1 percent. That means that 38 out of every 100 documents reviewed were responsive, roughly an eight-fold increase in training efficiency compared with randomly selected training documents.
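To put those figures side by side, here is a minimal arithmetic sketch using only the numbers quoted above (4.7 percent collection richness and 38.1 percent average batch richness). These come from that single project and will of course vary from matter to matter.

```python
# Illustrative arithmetic only, using the figures quoted in the post:
# a 4.7% rich collection and a 38.1% average CAL batch richness.
collection_richness = 0.047   # responsive share of the whole collection
cal_batch_richness = 0.381    # average responsive share of CAL review batches
batch_size = 100

random_hits = collection_richness * batch_size   # expected responsive docs per random batch
cal_hits = cal_batch_richness * batch_size       # average responsive docs per CAL batch
improvement = cal_batch_richness / collection_richness

print(f"Random batch of {batch_size}: ~{random_hits:.1f} responsive documents")
print(f"CAL batch of {batch_size}:    ~{cal_hits:.1f} responsive documents")
print(f"Training efficiency gain:     ~{improvement:.1f}x")
```

Run as written, this prints roughly five responsive documents per random batch, 38 per CAL batch, and an efficiency gain of about 8x.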

It is not at all unusual to see this level of improved efficiency when using the general relevance feedback methodology associated with CAL. This is precisely how CAL is intended to operate.

This enhanced efficiency does not compromise or bias the results of a CAL review. In fact, it does exactly the opposite. By the time a CAL algorithm reaches an acceptable level of recall on a primary topic, the algorithm will generally have located and achieved a reasonable level of recall for any ancillary topics as well.

Maura R. Grossman and Gordon V. Cormack studied this exact issue in Multi-Faceted Recall of Continuous Active Learning for Technology-Assisted Review, Proceedings of the 38th International ACM SIGIR Conference (2015). They evaluated the notion that “CAL’s emphasis on the most-likely relevant documents may bias it to prefer documents like the ones it finds first, causing it to fail to discover one or more important, but dissimilar, classes of relevant documents.” They specifically concluded otherwise:

For all experiments, our results are the same: CAL achieves high overall recall, while at the same time achieving high recall for the various facets of relevance, whether topics or file properties. While early recall is achieved for some facets at the expense of others, by the time high overall recall is achieved — as evidenced by a substantial drop in overall marginal precision — all facets (except for a single outlier case that we attribute to mislabeling) also exhibit high recall. Our findings provide reassurance that CAL can achieve high recall without excluding identifiable categories of relevant information.

Their results shouldn’t be that surprising. Since the entire collection is always available for training, it is easy to see how the words that make a document responsive to a primary topic will eventually point the tool at documents containing other words associated with ancillary topics. And that cascading process will continue until the chain of responsive word patterns runs dry.
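For readers who want to see that feedback cycle concretely, here is a bare-bones sketch of a generic CAL loop. It is not Insight Predict’s implementation; the documents, the ask_reviewer callback, the seed_ids and the batch size are hypothetical stand-ins, and scikit-learn is used purely for illustration.

```python
# A minimal, generic continuous active learning (CAL) loop. This is NOT
# Insight Predict's algorithm; it only sketches the relevance feedback
# cycle described above. `documents` is a list of document texts and
# `ask_reviewer` is a hypothetical callback returning True/False.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, ask_reviewer, seed_ids, batch_size=100, rounds=10):
    """Iteratively review the highest-ranked unreviewed documents."""
    vectorizer = TfidfVectorizer()
    features = vectorizer.fit_transform(documents)    # the whole collection is always rankable
    labels = {}                                       # doc index -> True/False (responsive?)

    # Seed the model with a handful of reviewed documents (however selected).
    for i in seed_ids:
        labels[i] = ask_reviewer(documents[i])

    for _ in range(rounds):
        reviewed = sorted(labels)
        if len(set(labels.values())) < 2:
            break                                     # need both classes to train
        model = LogisticRegression(max_iter=1000)
        model.fit(features[reviewed], [labels[i] for i in reviewed])

        # Rank every unreviewed document by its predicted likelihood of responsiveness.
        unreviewed = [i for i in range(len(documents)) if i not in labels]
        if not unreviewed:
            break
        scores = model.predict_proba(features[unreviewed])[:, 1]
        batch = [i for _, i in sorted(zip(scores, unreviewed), reverse=True)[:batch_size]]

        # Human review of the batch feeds straight back into the next training round.
        for i in batch:
            labels[i] = ask_reviewer(documents[i])

    return labels
```

The key point is that every round retrains on all coded documents and re-ranks the entire remaining collection, which is what lets responsive language on one topic pull in documents about related, ancillary topics.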

With TAR 2.0, Catalyst adds one more arrow to your quiver, contextual diversity, to reach even the most remotely related documents. Contextual diversity is a companion algorithm that looks for groups of documents about which the CAL algorithm knows the least.

Typical rolling collections provide one of the best examples of how contextual diversity works. Imagine an initial collection of documents from the client’s sales department. Those documents will be littered with sales terminology such as product and customer names, invoicing terms, and pricing and profit margin analyses. Consequently, a CAL algorithm will train on, and in turn rank, the initial collection on the basis of words relating primarily to the sales function.

Imagine then adding a subsequent collection from the client’s engineering group. Engineering documents are likely to contain terms that are much more technical than the sales documents — chemical compounds and processes, research and development concepts, suppliers as opposed to customers. Since none of these highly technical terms were included in the sales documents, the CAL algorithm initially knows very little about the value of the engineering documents in the review process. And, because CAL ranks documents loosely on the basis of their similarity to the documents that have been used to train the algorithm, the bulk of the engineering documents are likely to be ranked at the low end of the spectrum.

To overcome the lack of knowledge and training based on this disparity in terms, the contextual diversity algorithm proactively seeks groups of documents about which the tool knows the least. In our example, the contextual diversity algorithm would identify some group of engineering documents as relative unknowns. The algorithm would then select the most representative document — the document from which the CAL algorithm can learn the most — from among the engineering documents. That representative document would then be included with the most-likely responsive documents for review. Once that representative document has been reviewed and coded, those engineering documents are no longer “unknown” and the CAL algorithm can begin to effectively rank them. This process of seeking contextually diverse groups of documents continues throughout the review to penetrate deeper and deeper into the collection to locate groups of unknown documents.
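As a rough illustration only (this is not Catalyst’s contextual diversity algorithm), the sketch below shows one way such a step could work: score unreviewed documents by how much of their vocabulary the reviewed set has never seen, cluster the least-known ones, and surface a representative of the largest unknown group. The function name and parameters are invented for the example.

```python
# Illustrative only: a crude "contextual diversity" style selection step.
# This is NOT Catalyst's algorithm; it simply demonstrates the idea of
# finding a group of documents whose vocabulary the reviewed set has seen
# least, and surfacing one representative of that group for review.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def pick_diversity_document(documents, reviewed_ids, n_groups=10):
    """Return the index of one representative document from the least-known group."""
    vectorizer = CountVectorizer(binary=True)
    term_matrix = vectorizer.fit_transform(documents)            # docs x terms (0/1)

    # Terms already "seen" in reviewed documents.
    seen_terms = np.asarray(term_matrix[list(reviewed_ids)].sum(axis=0) > 0).ravel()

    # For each unreviewed document, what fraction of its terms is unseen?
    unreviewed = [i for i in range(len(documents)) if i not in reviewed_ids]
    if not unreviewed:
        return None
    rows = term_matrix[unreviewed].toarray().astype(bool)
    totals = rows.sum(axis=1).clip(min=1)
    unseen_fraction = (rows & ~seen_terms).sum(axis=1) / totals

    # Focus on the most "unknown" documents and cluster them into groups.
    order = np.argsort(unseen_fraction)[::-1]
    candidates = [unreviewed[i] for i in order[: n_groups * 5]]
    k = min(n_groups, len(candidates))
    km = KMeans(n_clusters=k, n_init=10).fit(term_matrix[candidates].toarray())

    # Pick the document closest to the centroid of the largest unknown cluster.
    biggest = int(np.bincount(km.labels_).argmax())
    members = [i for i, lab in enumerate(km.labels_) if lab == biggest]
    dists = km.transform(term_matrix[candidates].toarray())[members, biggest]
    return candidates[members[int(np.argmin(dists))]]
```

In practice, as described above, the selected representative would simply be folded into the review queue alongside the most-likely-responsive documents, so that coding it teaches the ranking algorithm about the previously unknown group.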

As you can see, TAR 2.0 doesn’t bias the results at all. TAR 2.0 is far more efficient at locating responsive documents than random selection. And, by combining CAL with contextual diversity, TAR 2.0 penetrates much more deeply into the collection to effectively let you know precisely what you don’t know.


About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.