Category Archives: Predictive Ranking

Your TAR Temperature is 98.6 — That’s A Pretty Hot Result

Our Summit partner, DSi, has a large financial institution client that had allegedly been defrauded by a borrower. The details aren’t important to this discussion, but assume the borrower employed a variety of creative accounting techniques to make its financial position look better than it really was. And, as is often the case, the problems were missed by the accounting and other financial professionals conducting due diligence. Indeed, there were strong factual suggestions that one or more of the professionals were in on the scam.

As the fraud came to light, litigation followed. Perhaps in retaliation, or simply to mount a counteroffensive, the defendant borrower hit the bank with lengthy document requests. After collection and best-efforts culling, our client was still left with over 2.1 million documents that might be responsive. Neither the deadlines nor the budget allowed for manual review of that volume of documents. Keyword search offered some help, but the problem remained. What to do with 2.1 million potentially responsive documents? Continue reading

Thinking Through the Implications of CAL: Who Does the Training?

Before I joined Catalyst in 2010, my entire academic and professional career revolved around basic research. I spent my time coming up with new and interesting algorithms, ways of improving document ranking and classification. In much of that research, however, it was not always clear which algorithms would have immediate application. It is not that the algorithms were not useful; they were. They just did not always have an immediate application in a live, deployed system.

Since joining Catalyst, however, my research has become much more applied. I have come to discover that this doesn’t just mean the algorithms I design have to be more narrowly focused on the task at hand. It also means I have to design those algorithms with an awareness of the larger real-world contexts in which they will be deployed and the limitations that exist there.

So it is with keen interest that I have been watching the eDiscovery world react to the recent (SIGIR 2014) paper from Maura Grossman and Gordon Cormack on the CAL (continuous active learning) protocol, Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery. Continue reading

A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems

In July 2014, attorney Maura Grossman and professor Gordon Cormack introduced a new protocol for Technology Assisted Review that they showed could cut review time and costs substantially. Called Continuous Active Learning (“CAL”), this new approach differed from traditional TAR methods because it employed continuous learning throughout the review, rather than the one-time training used by most TAR technologies.

Barbra Streisand in ‘A Star is Born’

Their peer-reviewed research paper, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” also showed that using random documents was the least effective method for training a TAR system. Overall, they showed that CAL solved a number of real-world problems that had bedeviled review managers using TAR 1.0 protocols.

Not surprisingly, their research caused a stir. Some heralded its common-sense findings about continuous learning and the inefficiency of using random seeds for training. Others challenged the results, arguing that one-time training is good enough and that using random seeds eliminates bias. We were pleased that it confirmed our earlier research and legitimized our approach, which we call TAR 2.0. Continue reading
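
For readers who want to see the mechanics, here is a minimal sketch of a CAL-style loop: train a model on the documents judged so far, rank the rest, send the top-ranked batch to reviewers, fold their judgments back in, and repeat. Everything below (the toy documents, the logistic regression classifier, the batch size, and the stopping rule) is an illustrative assumption, not a description of the Grossman/Cormack protocol's internals or of any particular product.

```python
# Minimal sketch of a continuous-active-learning (CAL) style review loop.
# All data, batch sizes, and the stopping rule here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy collection: in practice this would be the full document population.
docs = [
    "quarterly revenue was overstated in the loan application",
    "lunch menu for the company picnic",
    "auditor raised concerns about the borrower's collateral",
    "holiday schedule and parking reminders",
    "memo on restating accounts receivable before the audit",
    "notes from the softball game",
] * 50
true_labels = [1, 0, 1, 0, 1, 0] * 50   # stand-in for reviewer judgments

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

reviewed = {0, 1}    # a couple of seed judgments (one relevant, one not)
BATCH = 5            # documents sent to reviewers per round

while len(reviewed) < len(docs):
    # 1. Train on everything reviewed so far.
    model = LogisticRegression(max_iter=1000)
    train_idx = sorted(reviewed)
    model.fit(X[train_idx], [true_labels[i] for i in train_idx])

    # 2. Rank the unreviewed documents by predicted relevance.
    unreviewed = [i for i in range(len(docs)) if i not in reviewed]
    scores = model.predict_proba(X[unreviewed])[:, 1]
    ranked = [i for _, i in sorted(zip(scores, unreviewed), reverse=True)]

    # 3. "Review" the top-ranked batch (here we just look up the toy label).
    batch = ranked[:BATCH]
    reviewed.update(batch)

    # 4. Hypothetical stopping rule: quit when a batch yields nothing relevant.
    if sum(true_labels[i] for i in batch) == 0:
        break

found = sum(true_labels[i] for i in reviewed)
total = sum(true_labels)
print(f"Reviewed {len(reviewed)} of {len(docs)} docs, found {found}/{total} relevant")
```

In a real review, of course, step 3 is a human coding decision, and the decision to stop is tied to a defensible recall estimate rather than to a single batch with no relevant documents.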

Measuring Recall in E-Discovery Review, Part Two: No Easy Answers

In Part One of this two-part post, I introduced readers to statistical problems inherent in proving the level of recall reached in a Technology Assisted Review (TAR) project. Specifically, I showed that the confidence intervals around an asserted recall percentage could be sufficiently large with typical sample sizes as to undercut the basic assertion used to justify your TAR cutoff.

In our hypothetical example, we had to acknowledge that while our point estimate suggested we had found 75% of the relevant documents in the collection, it was possible that we had found a far lower percentage. For example, with a sample size of 600 documents, the lower bound of our confidence interval was 40%. If we increased the sample size to 2,400 documents, the lower bound only increased to 54%. And if we upped our sample to 9,500 documents, we got the lower bound to 63%.
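
To make the arithmetic concrete, here is a small sketch of the kind of calculation behind such bounds, using a Wilson score interval on the share of the sampled relevant documents that the review actually found. It is a simplified model (the figures above depend on collection richness and on the interval method used, and this sketch does not try to reproduce them), and the sample counts below are hypothetical.

```python
# Sketch of a recall confidence interval. Deliberately simplified: we assume
# a simple random sample from the collection and treat the relevant documents
# in that sample as a binomial sample for recall (found vs. missed).
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Two-sided Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0, 1.0
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical sample: of 600 randomly sampled documents, 40 turn out to be
# relevant, and the review had already found 30 of those 40 (a 75% point estimate).
relevant_in_sample = 40
already_found = 30

low, high = wilson_interval(already_found, relevant_in_sample)
print(f"Point estimate of recall: {already_found / relevant_in_sample:.0%}")
print(f"Approximate 95% interval: {low:.0%} to {high:.0%}")
```

The width of the interval is driven almost entirely by how few relevant documents the sample happens to contain, which is why low-richness collections force such large samples.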

Even assuming a 63% lower bound is good enough, we would have a lot of documents to sample. Using basic assumptions about cost and productivity, we concluded that we might spend 95 hours reviewing our sample, at a cost of about $20,000. If the sample didn’t prove out our hoped-for recall level (or if we received more documents to review), we might have to run the sample several times. That is a problem.

Is there a better and cheaper way to prove recall in a statistically sound manner? In this Part Two, I will take a look at some of the other approaches people have put forward and see how they match up. However, as Maura Grossman and Gordon Cormack warned in “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review’” and Bill Dimm amplified in a later post on the subject, there is no free lunch. Continue reading

TAR 2.0 Capabilities Allow Use in Even More E-Discovery Tasks

Recent advances in Technology Assisted Review (“TAR 2.0”) include the ability to deal with low richness, rolling collections, and flexible inputs in addition to vast improvements in speed. [1] These improvements now allow TAR to be used effectively in many more discovery workflows than its traditional “TAR 1.0” use in classifying large numbers of documents for production.

To better understand this, it helps to begin by examining in more detail the kinds of tasks we face. Broadly speaking, document review tasks fall into three categories:[2]

  • Classification. This is the most common form of document review, in which documents are sorted into buckets such as responsive or non-responsive so that we can do something different with each class of document. The most common example here is a review for production (see the sketch after this list).
  • Protection. This is a higher level of review in which the purpose is to protect certain types of information from disclosure. The most common example is privilege review, but this also encompasses trade secrets and other forms of confidential, protected, or even embarrassing information, such as personally identifiable information (PII) or confidential supervisory information (CSI).
  • Knowledge Generation. The goal here is learning what stories the documents can tell us and discovering information that could prove useful to our case. A common example of this is searching and reviewing documents received in a production from an opposing party or searching a collection for documents related to specific issues or deposition witnesses. Continue reading
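
As promised above, here is a small, purely hypothetical sketch of how scores from a TAR engine might drive the first two tasks in a production workflow. Every threshold and field name is an assumption made for illustration, not a description of any product; knowledge generation is omitted because it typically runs against a different collection, such as a production received from the opposing party.

```python
# Hypothetical sketch: using TAR scores to route documents for the first two
# tasks above. Thresholds and field names are illustrative assumptions only.

def route(doc):
    """Decide what happens next to a scored document in a production workflow."""
    # Protection: anything that looks privileged gets a human privilege review
    # before it can leave the building, regardless of responsiveness.
    if doc["privilege_score"] >= 0.5:
        return "privilege review"
    # Classification: responsive documents above the cutoff head toward production.
    if doc["responsiveness_score"] >= 0.6:
        return "production candidate"
    # Everything else falls below the cutoff (and gets QC-sampled later).
    return "below cutoff"

sample_docs = [
    {"id": "DOC-001", "responsiveness_score": 0.92, "privilege_score": 0.04},
    {"id": "DOC-002", "responsiveness_score": 0.71, "privilege_score": 0.83},
    {"id": "DOC-003", "responsiveness_score": 0.15, "privilege_score": 0.02},
]
for d in sample_docs:
    print(d["id"], "->", route(d))
```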

TAR in the Courts: A Compendium of Case Law about Technology Assisted Review

Magistrate Judge Andrew Peck

It is less than three years since the first court decision approving the use of technology assisted review in e-discovery. “Counsel no longer have to worry about being the ‘first’ or ‘guinea pig’ for judicial acceptance of computer-assisted review,” U.S. Magistrate Judge Andrew J. Peck declared in his groundbreaking opinion in Da Silva Moore v. Publicis Groupe.

Judge Peck did not open a floodgate of judicial decisions on TAR. To date, there have been fewer than 20 such decisions and not one from an appellate court.

However, what he did do — just as he said — was to set the stage for judicial acceptance of TAR. Not a single court since has questioned the soundness of Judge Peck’s decision. To the contrary, courts uniformly cite his ruling with approval.

That does not mean that every court orders TAR in every case. The one overarching lesson of the TAR decisions to date is that each case stands on its own merits. Courts look not only to the efficiency and effectiveness of TAR, but also to issues of proportionality and cooperation.

What follows is a summary of the cases to date involving TAR. Each includes a link to the full-text decision, so that you can read for yourself what the court said. Continue reading

How Corporate Counsel are Integrating E-Discovery Technologies to Help Manage Litigation Costs

The newsletter Digital Discovery & e-Evidence just published an article by Catalyst founder and CEO John Tredennick, “Taking Control: How Corporate Counsel are Integrating eDiscovery Technologies to Help Manage Litigation Costs.” In the article, John explains why savvy corporate counsel are using the multi-matter repository and technology assisted review to manage cases and control costs. Continue reading

How Much Can I Save with CAL? A Closer Look at the Grossman/Cormack Research Results

As most e-discovery professionals know, two leading experts in technology assisted review, Maura R. Grossman and Gordon V. Cormack, recently presented the first peer-reviewed scientific study on the effectiveness of several TAR protocols, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).

Perhaps the most important conclusion of the study was that an advanced TAR 2.0 protocol, continuous active learning (CAL), proved to be far more effective than the two standard TAR 1.0 protocols used by most of the early products on the market today—simple passive learning (SPL) and simple active learning (SAL). Continue reading

The Seven Percent Solution: The Case of the Confounding TAR Savings


“Which is it to-day,” [Watson] asked, “morphine or cocaine?”

[Sherlock] raised his eyes languidly from the old black-letter volume which he had opened. 
“It is cocaine,” he said, “a seven-per-cent solution. Would you care to try it?”

- The Sign of the Four, Sir Arthur Conan Doyle (1890)

Back in the mid-to-late 1800s, many touted cocaine as a wonder drug, providing not only stimulation but a wonderful feeling of clarity as well. Doctors prescribed the drug in a seven percent solution of water. Although Watson did not approve, Sherlock Holmes felt the drug helped him focus and shut out the distractions of the real world. He came to regret his addiction in later novels, as cocaine moved out of the mainstream.

This story is about a different type of seven percent solution, with no cocaine involved. Rather, we will be talking about the impact of another kind of stimulant, one that saves a surprising amount of review time and costs. This is the story of how a seemingly small improvement in review richness can make a big difference for your e-discovery budget. Continue reading
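
Before the full story, a back-of-the-envelope sketch of why richness moves the budget so much. Every number here is hypothetical (the throughput, billing rate, and two richness levels are all assumptions for illustration): if reviewers need to find a fixed number of relevant documents, the documents they must read scale roughly with one over richness.

```python
# Hypothetical back-of-the-envelope: how review richness drives cost.
# None of these figures come from the article; they are illustrative only.
target_relevant = 10_000   # relevant documents the review needs to surface
docs_per_hour = 50         # assumed reviewer throughput
rate_per_hour = 60         # assumed blended review rate, in dollars

for richness in (0.07, 0.14):
    docs_reviewed = target_relevant / richness   # roughly relevant / richness
    hours = docs_reviewed / docs_per_hour
    cost = hours * rate_per_hour
    print(f"richness {richness:.0%}: ~{docs_reviewed:,.0f} docs, "
          f"~{hours:,.0f} hours, ~${cost:,.0f}")
```

Doubling the richness of what reviewers see roughly halves both the hours and the dollars, which is why even seemingly small richness gains matter.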

Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize

A critical metric in Technology Assisted Review (TAR) is recall, the percentage of the relevant documents in the collection that are actually found. One of the most compelling reasons for using TAR is the promise that a review team can achieve a desired level of recall (say, 75% of the relevant documents) after reviewing only a small portion of the total document population (say, 5%). The savings come from not having to review the remaining 95% of the documents. The argument is that the remaining documents (the “discard pile”) contain so few relevant documents, scattered among so many irrelevant ones, that further review is not economically justified. Continue reading
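
To see why the discard-pile argument works, consider a deliberately simple, hypothetical example (all of the counts below are invented; only the 75% recall and the 5% reviewed echo the illustration above).

```python
# Hypothetical illustration of the recall/cutoff trade-off described above.
# All counts are invented for illustration.
collection = 1_000_000
relevant_total = 40_000        # assume 4% of the collection is relevant

reviewed = 50_000              # stop after reviewing the top-ranked 5%
relevant_found = 30_000        # suppose that slice contains these relevant docs

recall = relevant_found / relevant_total
discard_pile = collection - reviewed
relevant_missed = relevant_total - relevant_found
discard_richness = relevant_missed / discard_pile

print(f"Recall achieved: {recall:.0%}")                                      # 75%
print(f"Discard pile: {discard_pile:,} documents")                           # 950,000
print(f"Relevant documents left in the discard pile: {relevant_missed:,}")   # 10,000
print(f"Residual richness of the discard pile: {discard_richness:.2%}")      # about 1.05%
```

At roughly 1% residual richness, reviewers would have to read about 95 more documents, on average, to find each additional relevant one, which is the economic argument for stopping at the cutoff.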