Does Your TAR System Have My Favorite Feature? A Primer on Holistic Thinking

I have noticed that certain popular document-based systems in the e-discovery marketplace tout a particular feature (a capability). Although I am a research scientist at Catalyst, I have been on enough sales calls with my fellow Catalyst team members to have heard numerous users of document-based systems ask whether we have the capability to automatically remove common headers and footers from email. Document-based systems showcase this capability as a feature that is good to have, so clients often include it in the checklist of capabilities they are seeking.

This leads me to ask: Why?

For the longest time, this request confused me. It is a capability that many declare they need simply because they have seen it elsewhere. That leads me to the topic of holistic thinking about one's technology assisted review (TAR) algorithms and processes.

Too often, the selection of a TAR system boils down to a checklist of capabilities, with little to no consideration about how these capabilities interoperate or the joint effect that they have on the outcome as a whole. Instead, it is assumed that if a TAR platform has a lot of features, it must be a good one, right?

I do not buy into that line of thinking. As a research scientist, my primary interest is in measuring and improving the outcome as a whole. My raison d’être is to come up with algorithms and processes around those algorithms that save the TAR client as much money, time, and (because of real-world challenges) headaches as possible. It is not to fulfill as many checklist items as possible. And the only way I do this is by developing capabilities that significantly move the needle on the client’s TAR result.

It is through that holistic lens, focused on the final outcome, that I now pose the question about whether one really needs, in general, the capability of automatically removing email headers and footers.

Document-Based vs. Corpus-Based Algorithms

What started me thinking about this was a post over at the blog Bits in the Balance. Joshua Rubin has an interesting writeup on the dichotomy between two broad classes of TAR algorithms. Rubin calls these two classes “document-based” and “corpus-based.” The post is worth a read, but the two approaches can be briefly summarized in the following manner: Document-based systems pre-determine the similarity between documents in a collection.  When a document is then manually coded as relevant or non-relevant during TAR training, that coding is then propagated to the predetermined “neighbor” documents.

Corpus-based systems, on the other hand, are characterized by global functions that are inferred from the entire set of manually coded training documents. Different functions have different mathematical forms, but at the risk of oversimplifying, they assign importance weights to the various vocabulary terms in the collection. These weights tend to be positive when the term is an indicator of relevance, negative when the term is an indicator of non-relevance, and tend toward neutral when the term has little to no predictive power, i.e. is found relatively equally in both relevant and non-relevant documents.
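To make this concrete, here is a minimal from-scratch sketch of a corpus-based learner. It is not any vendor's actual algorithm; it is just a simple perceptron over a tiny hypothetical training set, with "FOOTER" standing in for the boilerplate footer language. The point is that a term appearing equally in relevant and non-relevant documents sees its weight pushed back toward zero:

```python
# Hypothetical training set: (+1 = relevant, -1 = non-relevant).
# "FOOTER" is a stand-in token for the common email footer text.
training = [
    ("hartback account meeting FOOTER", +1),
    ("hartback account update FOOTER", +1),
    ("bowling game tonight FOOTER", -1),
    ("bowling league night FOOTER", -1),
]

weights = {}                              # one weight per vocabulary term
for _ in range(10):                       # a few passes over the corpus
    for text, label in training:
        terms = text.split()
        score = sum(weights.get(t, 0.0) for t in terms)
        if score * label <= 0:            # misclassified: nudge all terms
            for t in terms:
                weights[t] = weights.get(t, 0.0) + label

print(weights)
# "hartback" ends positive, "bowling" negative, and "FOOTER" is pulled
# back to zero because it occurs equally in both classes.
```

Real corpus-based systems use far more sophisticated functions, but the qualitative behavior is the same: the weight on a term is inferred from its distribution across the whole training set, not fixed in advance.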

Rubin also mentions that document-based systems have a high sensitivity to initial training conditions. This is because document similarities are (generally) fixed a priori and do not change or adapt as TAR training continues. Therefore, it is of critical importance not only that the judgment on the "seed" document be absolutely correct, but also that the similarity relationships, the local "neighbors," be correct, i.e. that a local neighbor is about the same thing that the seed document is about.
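The propagation mechanism can be sketched in a few lines. This is a deliberately simplified illustration, not any vendor's implementation; the document names and the single-nearest-neighbor table are hypothetical:

```python
# Similarities are fixed before training begins: each document is mapped
# to its precomputed closest neighbor.
precomputed_neighbors = {
    "doc_a": "doc_b",
    "doc_b": "doc_a",
    "doc_c": "doc_d",
    "doc_d": "doc_c",
}

def propagate(codings):
    """Copy each manual coding decision to the document's fixed neighbor."""
    predictions = dict(codings)
    for doc, label in codings.items():
        neighbor = precomputed_neighbors[doc]
        predictions.setdefault(neighbor, label)  # neighbor inherits the label
    return predictions

print(propagate({"doc_a": "relevant"}))
# → {'doc_a': 'relevant', 'doc_b': 'relevant'}
```

Because the neighbor table never changes, an incorrect seed judgment, or a bad neighbor link computed up front, propagates unchecked through training.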

A Necessity for Document-based Algorithms?

Now let us return to the question of whether one really needs the capability of automatically removing email headers and footers. Recall from the discussion above that the primary distinction between document-based and corpus-based systems is that the former pre-determine document similarities whereas the latter infer a function from (runtime) training data and adjust the relative importance of vocabulary terms as training proceeds.

What this means is that if document similarity is predetermined, it becomes very important to the final outcome to pre-select the perfect terms for use in determining document-to-document similarity. If the wrong terms are selected, if extraneous terms are included in the similarity determination, then two documents could be predetermined to be similar when in fact they are not. For example, imagine two emails, each from person X to person Y, and on the same day. The first email says:

Please come talk to me privately about the Hartback account.

Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of Acme Widgets Inc. are expressly required not to make defamatory statements and not to infringe or authorize any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising.

The second email says:

How did your bowling game go last night?

Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of Acme Widgets Inc. are expressly required not to make defamatory statements and not to infringe or authorize any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising.

In both emails, the vast majority of the text is made up of the footer. If that footer is not removed, these two documents run a high risk of being each other's closest neighbors in a document-based system. In that case, if the second were (correctly) marked non-relevant, the first would (incorrectly) be predicted non-relevant as well; and if the first were (correctly) marked relevant, the second would (incorrectly) be predicted relevant.

Thus in a document-based system, it becomes very important for the terms extracted from those documents to be about what the document is truly about. That’s why so much emphasis is placed on automatically removing common headers and footers. With the footers removed, these two documents become:

Please come talk to me privately about the Hartback account.

How did your bowling game go last night?

With the footers removed, it becomes much clearer to the algorithm that these two documents are not related. Thus, it appears to be a necessity for document-based systems to remove these footers.
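The footer's distorting effect is easy to demonstrate numerically. Below is a rough sketch using plain bag-of-words cosine similarity, a stand-in for whatever similarity measure a real document-based system actually computes; the footer is abbreviated to its first sentence:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)

FOOTER = ("Any views or opinions presented in this email are solely those "
          "of the author and do not necessarily represent those of the company.")

email1 = "Please come talk to me privately about the Hartback account."
email2 = "How did your bowling game go last night?"

print(f"with footer:    {cosine(email1 + ' ' + FOOTER, email2 + ' ' + FOOTER):.2f}")
print(f"without footer: {cosine(email1, email2):.2f}")
# With the shared footer, the two unrelated emails look highly similar;
# without it, they share no terms at all and the similarity drops to zero.
```

The longer the boilerplate relative to the real message, the more the similarity score is dominated by text that says nothing about what either document is actually about.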

How a Corpus-based Algorithm Differs

On the other hand, how would a corpus-based TAR algorithm handle these emails? Again, at the risk of oversimplifying, let’s imagine some of the weights that are inferred from the entire set of training documents:

hartback +5

privately +2

account +1

necessarily represent 0

defamatory statements 0

opinions 0

personally liable 0

game -1

night -3

bowling -6

The TAR system has inferred the weights on these terms and phrases because it has seen many documents, both relevant and non-relevant—not just a single closest neighbor. And in so doing, it found that mentions of “Hartback” are strong indicators of relevance, mentions of “bowling” are strong indicators of non-relevance, and mentions of the phrases “personally liable” and “defamatory statements” are neutral in that they are found uniformly in both relevant and non-relevant training documents.

When the algorithm makes predictions about which unjudged documents are relevant and non-relevant, documents that mention the Hartback account get a strong positive response. Documents that mention bowling get a strong negative response. And documents with the email footer in them are unaffected, i.e. they will neither rise nor fall in the predictive ranking, because the algorithm has learned from global, corpus-wide experience to ignore those terms. Thus, for corpus-based approaches, leaving the headers and footers in a document does not change the final outcome. It is not a problem.

In fact, in certain circumstances, not only might leaving in generic headers and footers not present a problem for the final outcome, but it is possible that they might even be predictive of relevance. For example, imagine a custodian who has no email footer at all, but then right before beginning a series of suspicious behaviors decides to add an email footer that claims that his or her messages are confidential, privileged and protected by work product immunity.

Now, a particular email may or may not be privileged, and adding the footer does not necessarily make it so. But in this circumstance, the very existence of the footer indicates the custodian’s change in behavior, and documents that contain that footer have a slightly higher probability of being relevant to the matter at hand than do documents without that footer.  Corpus-based TAR algorithms can pick up on that signal, and use it as one among many pieces of evidence to predict document relevance.

Conclusion

The question we started with is whether it is necessary for a system to remove headers and footers in order to be able to properly make predictions about the relevance of unjudged documents. Some systems have this capability, so shouldn’t all of them? The answer is no.

The exact form of the TAR algorithm itself has an effect on whether or not this particular header and footer removal pre-processing step is necessary. Document-based systems, with their emphasis on pre-determining document similarity before a single training document has been seen, suffer from the problems of extraneous header and footer text. Corpus-based systems, on the other hand, learn globally (as training continues) which terms and phrases are predictive and which are not, and set term weights accordingly. Thus, corpus-based systems essentially learn to ignore not only extraneous headers and footers, but any text, anywhere in the document, that does not help make better predictions.

Naturally, the next question one would want to ask is whether document-based systems with header and footer removal work better than corpus-based systems without. I'll leave that question for another day. In the meantime, I hope that I have been able to clearly communicate the admonition that TAR systems should not be checklists of capabilities, but should be regarded holistically. Not to mention evaluated holistically as well. I hope in the coming weeks to elaborate on a few more examples of checklisted versus holistic thinking.


About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As Chief Scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.