Category Archives: Search

An Open Look at Keyword Search vs. Predictive Analytics

Can keyword search be as effective as, or even more effective than, technology assisted review at finding relevant documents?

A client recently asked me this question and it is one I frequently hear from lawyers. The issue underlying the question is whether a TAR platform such as our Insight Predict is worth the fee we charge for it.

The question is a fair one and it can apply to a range of cases. The short answer, drawing on my 20-plus years of experience as a lawyer, is unequivocally, “It depends.” Continue reading

Thinking Through the Implications of CAL: Who Does the Training?

Before joining Catalyst in 2010, my entire academic and professional career revolved around basic research. I spent my time coming up with new and interesting algorithms and ways of improving document ranking and classification. However, in much of my research, it was not always clear which algorithms would have immediate application. It is not that the algorithms were not useful; they were. They just did not always have immediate application to a live, deployed system.

Since joining Catalyst, however, my research has become much more applied. I have come to discover that this doesn't just mean the algorithms I design have to be more narrowly focused on the task at hand. It also means I have to design those algorithms to be aware of the larger real-world contexts in which they will be deployed and the limitations that may exist there.

So it is with keen interest that I have been watching the eDiscovery world react to the recent (SIGIR 2014) paper from Maura Grossman and Gordon Cormack on the CAL (continuous active learning) protocol, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery. Continue reading

A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems

In July 2014, attorney Maura Grossman and professor Gordon Cormack introduced a new protocol for Technology Assisted Review that they showed could cut review time and costs substantially. Called Continuous Active Learning (“CAL”), this new approach differed from traditional TAR methods because it employed continuous learning throughout the review, rather than the one-time training used by most TAR technologies.

Barbra Streisand in ‘A Star is Born’

Their peer-reviewed research paper, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” also showed that using random documents was the least effective method for training a TAR system. Overall, they showed that CAL solved a number of real-world problems that had bedeviled review managers using TAR 1.0 protocols.
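
The excerpt above describes the protocol only at a high level. Purely as an illustration (not the authors' or Catalyst's actual implementation), here is a minimal sketch of a CAL-style loop in Python; the vectorizer, model, batch size, and the review() stand-in for a human reviewer are all assumptions.

```python
# Minimal sketch of a Continuous Active Learning (CAL) style loop.
# Illustrative only -- not the Grossman/Cormack or Catalyst implementation.
# Assumes `docs` is a list of document texts and `review()` is a stand-in
# for a human reviewer returning 1 (relevant) or 0 (not relevant).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(docs, review, seed_ids, batch_size=10, max_batches=100):
    X = TfidfVectorizer().fit_transform(docs)
    labeled = {i: review(docs[i]) for i in seed_ids}   # judgmental seeds
    for _ in range(max_batches):
        ids = list(labeled)
        y = [labeled[i] for i in ids]
        if len(set(y)) < 2:                # need both classes to train
            break
        model = LogisticRegression(max_iter=1000).fit(X[ids], y)
        unlabeled = [i for i in range(len(docs)) if i not in labeled]
        if not unlabeled:
            break
        scores = model.predict_proba(X[unlabeled])[:, 1]
        # CAL: always review the highest-ranked unreviewed documents,
        # and feed every new judgment back into the next training round.
        batch = [unlabeled[j] for j in np.argsort(-scores)[:batch_size]]
        for i in batch:
            labeled[i] = review(docs[i])
    return labeled
```

The defining feature is the selection step: every round reviews the current best-ranked documents and folds the judgments straight back into training, rather than freezing the model after a one-time training phase.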

Not surprisingly, their research caused a stir. Some heralded its common-sense findings about continuous learning and the inefficiency of using random seeds for training. Others challenged the results, arguing that one-time training is good enough and that using random seeds eliminates bias. We were pleased that it confirmed our earlier research and legitimized our approach, which we call TAR 2.0. Continue reading

Measuring Recall in E-Discovery Review, Part Two: No Easy Answers

In Part One of this two-part post, I introduced readers to statistical problems inherent in proving the level of recall reached in a Technology Assisted Review (TAR) project. Specifically, I showed that the confidence intervals around an asserted recall percentage can be large enough, at typical sample sizes, to undercut the basic assertion used to justify your TAR cutoff.

In our hypothetical example, we had to acknowledge that while our point estimate suggested we had found 75% of the relevant documents in the collection, the true figure could be far lower. For example, with a sample size of 600 documents, the lower bound of our confidence interval was 40%. If we increased the sample size to 2,400 documents, the lower bound rose only to 54%. And if we upped our sample to 9,500 documents, the lower bound reached just 63%.

Even assuming a 63% lower bound is enough, we would have a lot of documents to sample. Using basic assumptions about cost and productivity, we concluded that we might spend 95 hours to review our sample at a cost of about $20,000. If the sample didn't prove out our hoped-for recall level (or if we received more documents to review), we might have to run the sample several times. That is a problem.
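
For readers who want to check this kind of arithmetic themselves, the sketch below computes exact (Clopper-Pearson) confidence bounds on a sampled proportion using SciPy. The counts are placeholders, and the recall estimates discussed above also depend on collection richness, so this is intuition for why the intervals are wide rather than a re-derivation of the exact figures.

```python
# Sketch: exact (Clopper-Pearson) confidence bounds for a sampled proportion.
# Illustrative only; the sample figures below are placeholder assumptions,
# not the numbers used in the post.
from scipy.stats import beta

def clopper_pearson(k, n, confidence=0.95):
    """Exact two-sided bounds for observing k successes in n draws."""
    alpha = 1 - confidence
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

# E.g., 45 relevant documents found in a 600-document sample:
lo, hi = clopper_pearson(45, 600)
print(f"point estimate {45/600:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```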

Is there a better and cheaper way to prove recall in a statistically sound manner? In this Part Two, I will take a look at some of the other approaches people have put forward and see how they match up. However, as Maura Grossman and Gordon Cormack warned in “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review’” and Bill Dimm amplified in a later post on the subject, there is no free lunch. Continue reading

TAR 2.0 Capabilities Allow Use in Even More E-Discovery Tasks

Recent advances in Technology Assisted Review (“TAR 2.0”) include the ability to deal with low richness, rolling collections, and flexible inputs in addition to vast improvements in speed. [1] These improvements now allow TAR to be used effectively in many more discovery workflows than its traditional “TAR 1.0” use in classifying large numbers of documents for production.

To better understand this, it helps to begin by examining in more detail the kinds of tasks we face. Broadly speaking, document review tasks fall into three categories (a short sketch after the list shows how each category implies a different review depth):[2]

  • Classification. This is the most common form of document review, in which documents are sorted into buckets such as responsive or non-responsive so that we can do something different with each class of document. The most common example here is a review for production.
  • Protection. This is a higher level of review in which the purpose is to protect certain types of information from disclosure. The most common example is privilege review, but this also encompasses trade secrets and other forms of confidential, protected, or even embarrassing information, such as personally identifiable information (PII) or confidential supervisory information (CSI).
  • Knowledge Generation. The goal here is learning what stories the documents can tell us and discovering information that could prove useful to our case. A common example of this is searching and reviewing documents received in a production from an opposing party or searching a collection for documents related to specific issues or deposition witnesses. Continue reading
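
As noted above, each category implies a different operating point on the same ranked list: a production review might stop once a negotiated recall target is met, while a privilege review typically pushes much deeper. A minimal sketch, with made-up scores, judgments, and recall targets:

```python
# Sketch: one ranked list, different operating points for different tasks.
# The scores, labels, and recall targets below are made-up assumptions.
import numpy as np

def cutoff_for_recall(scores, labels, target_recall):
    """Smallest review depth (taking top-ranked docs first) that captures
    at least `target_recall` of the known relevant documents."""
    order = np.argsort(-scores)
    hits = np.cumsum(labels[order])
    needed = target_recall * labels.sum()
    return int(np.searchsorted(hits, needed) + 1)

scores = np.random.rand(10_000)                        # stand-in classifier scores
labels = (np.random.rand(10_000) < 0.05).astype(int)   # stand-in judgments

# Classification (production): a recall target of, say, 80% may suffice.
# Protection (privilege): you may want something much closer to 100%.
print("production review depth:", cutoff_for_recall(scores, labels, 0.80))
print("privilege review depth:", cutoff_for_recall(scores, labels, 0.99))
```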

How Much Can I Save with CAL? A Closer Look at the Grossman/Cormack Research Results

As most e-discovery professionals know, two leading experts in technology assisted review, Maura R. Grossman and Gordon V. Cormack, recently presented the first peer-reviewed scientific study on the effectiveness of several TAR protocols, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).

Perhaps the most important conclusion of the study was that an advanced TAR 2.0 protocol, continuous active learning (CAL), proved to be far more effective than the two standard TAR 1.0 protocols used by most of the early products on the market today—simple passive learning (SPL) and simple active learning (SAL). Continue reading
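
The practical difference among the three protocols is which documents each one routes to human reviewers during training. The sketch below paraphrases the paper's definitions; the probability scores and batch size are assumptions:

```python
# Sketch: the document-selection step that distinguishes the three protocols,
# paraphrasing the Grossman/Cormack definitions. `scores` are assumed to be
# classifier probabilities of relevance for the unreviewed documents.
import numpy as np

def next_training_batch(protocol, scores, rng, batch_size=10):
    if protocol == "SPL":   # simple passive learning: random selection
        return rng.choice(len(scores), size=batch_size, replace=False)
    if protocol == "SAL":   # simple active learning: uncertainty sampling
        return np.argsort(np.abs(scores - 0.5))[:batch_size]
    if protocol == "CAL":   # continuous active learning: relevance feedback
        return np.argsort(-scores)[:batch_size]
    raise ValueError(protocol)

rng = np.random.default_rng(0)
scores = rng.random(1_000)
for p in ("SPL", "SAL", "CAL"):
    print(p, next_training_batch(p, scores, rng)[:5])
```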

The Seven Percent Solution: The Case of the Confounding TAR Savings


“Which is it to-day,” [Watson] asked, “morphine or cocaine?”

[Sherlock] raised his eyes languidly from the old black-letter volume which he had opened. 
“It is cocaine,” he said, “a seven-per-cent solution. Would you care to try it?”

– The Sign of the Four, Sir Arthur Conan Doyle (1890)

Back in the mid-to-late 1800s, many touted cocaine as a wonder drug, providing not only stimulation but a wonderful feeling of clarity as well. Doctors prescribed the drug in a seven percent solution of water. Although Watson did not approve, Sherlock Holmes felt the drug helped him focus and shut out the distractions of the real world. He came to regret his addiction in later novels, as cocaine moved out of the mainstream.

This story is about a different type of seven percent solution, with no cocaine involved. Rather, we will be talking about the impact of another kind of stimulant, one that saves a surprising amount of review time and costs. This is the story of how a seemingly small improvement in review richness can make a big difference for your e-discovery budget. Continue reading
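
The full numbers are in the post, but the underlying arithmetic is easy to sketch. On the simplifying assumption that reviewers must find a fixed number of relevant documents and that queue richness stays roughly constant, review cost scales inversely with richness (all figures below are invented for illustration):

```python
# Back-of-the-envelope sketch: how review-queue richness drives cost.
# All figures below are made-up assumptions for illustration.
def review_cost(relevant_needed, queue_richness, docs_per_hour=50, rate_per_hour=60):
    docs_reviewed = relevant_needed / queue_richness
    hours = docs_reviewed / docs_per_hour
    return docs_reviewed, hours * rate_per_hour

for richness in (0.07, 0.14):
    docs, cost = review_cost(relevant_needed=10_000, queue_richness=richness)
    print(f"richness {richness:.0%}: review {docs:,.0f} docs, ~${cost:,.0f}")
# Doubling richness from 7% to 14% halves both the documents reviewed
# and the review bill.
```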

Case Study: Using TAR to Find Hot Docs for Depositions

A common belief is that technology assisted review is useful only for making productions. In fact, it is also highly effective for reviewing productions from an opposing party. This is especially true when imminent depositions create an urgent need to identify hot documents.

A recent multi-district medical device litigation dramatizes this. The opposing party’s production was a “data dump” containing garbled OCR and little metadata. As a result, keyword searching was virtually useless. But by using TAR, the attorneys were able to highlight hot documents and prepare for the depositions with time to spare. Continue reading

Comparing Active Learning to Random Sampling: Using Zipf’s Law to Evaluate Which is More Effective for TAR

Maura Grossman and Gordon Cormack just released another blockbuster article, “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review,’” 7 Federal Courts Law Review 286 (2014). The article was in part a response to an earlier article in the same journal by Karl Schieneman and Thomas Gricks, in which they asserted that Rule 26(g) imposes “unique obligations” on parties using TAR for document productions and suggested using techniques we associate with TAR 1.0, including: Continue reading
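
The excerpt does not reproduce the Zipf analysis itself, but the intuition is easy to simulate: if topical clusters in a collection follow a Zipf-like distribution, a few clusters hold most of the documents and the rest are tiny, which is precisely where random training samples struggle to reach the rare material. A toy simulation, with every parameter assumed:

```python
# Toy illustration of Zipf-distributed cluster sizes (all parameters assumed);
# not the analysis from the article, just intuition for why random samples
# concentrate in the few largest clusters.
import numpy as np

rng = np.random.default_rng(42)
# 500 topical clusters with Zipf-distributed sizes (capped to keep the toy small)
sizes = np.sort(np.minimum(rng.zipf(a=1.5, size=500), 10_000))[::-1]
docs = np.repeat(np.arange(len(sizes)), sizes)      # doc index -> cluster id

sample = rng.choice(docs, size=min(1_000, len(docs)), replace=False)

print(f"collection: {len(docs):,} docs across {len(sizes)} clusters")
print(f"top 10 clusters hold {sizes[:10].sum() / sizes.sum():.0%} of the docs")
print(f"a random 1,000-doc sample touches {len(np.unique(sample))} clusters")
```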

Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review

This past weekend I received an advance copy of a new research paper prepared by Gordon Cormack and Maura Grossman, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery.” They have posted an author’s copy here.

The study attempted to answer one of the more important questions surrounding TAR methodology: Continue reading