Category Archives: Sampling

Citing TAR Research, Court OKs Production Using Random Sampling

Citing research on the efficacy of technology assisted review over human review, a federal court has approved a party’s request to respond to discovery using random sampling.

Despite a tight discovery timeline in the case, the plaintiff had sought to compel the defendant hospital to manually review nearly 16,000 patient records. Continue reading

Why Control Sets are Problematic in E-Discovery: A Follow-up to Ralph Losey

In a recent blog post, Ralph Losey lays out a case for the abolition of control sets in e-discovery, particularly when one is following a continuous learning protocol. Here at Catalyst, we could not agree more. From the moment we rolled out our TAR 2.0 continuous learning engine, we have not only recommended against the use of control sets, we decided never to implement them in the first place, so we never risked steering clients awry.

Losey points out three main flaws with control sets, which may be summarized as (1) knowledge issues, (2) sequential testing bias, and (3) representativeness. In this blog post I offer my own take on, and evidence for, these three points, and add a fourth difficulty with control sets: rolling collections. Continue reading
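Of these, sequential testing bias is the easiest to see numerically: if you repeatedly check a training run against the same fixed control set and keep the checkpoint that looks best, the measured score will, on average, overstate true performance. The toy Python simulation below sketches the effect; all of the numbers are invented for illustration and are not drawn from Losey's post or from any Catalyst matter.

```python
# Toy simulation of sequential testing bias. Every checkpoint below truly
# achieves 70% recall; the only thing that varies is binomial noise in the
# control-set measurement. Picking the best-looking of many peeks at the same
# control set overstates performance. All numbers are invented.
import random

random.seed(0)
TRUE_RECALL = 0.70       # true recall of every checkpoint
CONTROL_RELEVANT = 50    # relevant documents in the control set
PEEKS = 20               # times we check progress against the control set
TRIALS = 2000            # simulation repetitions

def measured_recall():
    # recall as measured on the control set: binomial noise around the truth
    hits = sum(random.random() < TRUE_RECALL for _ in range(CONTROL_RELEVANT))
    return hits / CONTROL_RELEVANT

single = sum(measured_recall() for _ in range(TRIALS)) / TRIALS
best_of_peeks = sum(max(measured_recall() for _ in range(PEEKS))
                    for _ in range(TRIALS)) / TRIALS

print(f"Average single measurement:        {single:.2f}")          # about 0.70
print(f"Average best-of-{PEEKS} measurements: {best_of_peeks:.2f}")  # noticeably higher
```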

Your TAR Temperature is 98.6 — That’s A Pretty Hot Result

Our Summit partner, DSi, has a large financial institution client that had allegedly been defrauded by a borrower. The details aren’t important to this discussion, but assume the borrower employed a variety of creative accounting techniques to make its financial position look better than it really was. And, as is often the case, the problems were missed by the accounting and other financial professionals conducting due diligence. Indeed, there were strong factual suggestions that one or more of the professionals were in on the scam.

As the fraud came to light, litigation followed. Perhaps in retaliation, or simply to mount a counteroffensive, the defendant borrower hit the bank with lengthy document requests. After collection and best-efforts culling, our client was still left with over 2.1 million documents that might be responsive. Neither the deadlines nor the budget allowed for manual review of that volume of documents. Keyword search offered some help, but the problem remained: what to do with 2.1 million potentially responsive documents? Continue reading

Measuring Recall in E-Discovery Review, Part Two: No Easy Answers

In Part One of this two-part post, I introduced readers to statistical problems inherent in proving the level of recall reached in a Technology Assisted Review (TAR) project. Specifically, I showed that the confidence intervals around an asserted recall percentage could be sufficiently large with typical sample sizes as to undercut the basic assertion used to justify your TAR cutoff.

In our hypothetical example, we had to acknowledge that while our point estimate suggested we had found 75% of the relevant documents in the collection, it was possible we had found a far lower percentage. With a sample size of 600 documents, the lower bound of our confidence interval was 40%. If we increased the sample size to 2,400 documents, the lower bound only rose to 54%. And if we upped our sample to 9,500 documents, the lower bound reached just 63%.

Even assuming a 63% lower bound is good enough, we would have a lot of documents to sample. Using basic assumptions about cost and productivity, we concluded that we might spend 95 hours reviewing our sample, at a cost of about $20,000. If the sample didn't prove out our hoped-for recall level (or if we received more documents to review), we might have to run the sample several times. That is a problem.
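To see why these intervals stay so wide, remember that when recall is estimated from a simple random sample of the whole collection, the effective sample size is the number of relevant documents that happen to fall in the sample, not the sample size itself. The Python sketch below (using statsmodels, with an assumed 1% richness and 75% true recall) illustrates the effect; the assumptions are mine for illustration and the results will not exactly match the figures discussed above.

```python
# Why recall confidence intervals stay wide when richness is low: the interval
# is driven by the number of relevant documents in the sample, which is small.
# Richness and recall below are assumptions for illustration only.
from statsmodels.stats.proportion import proportion_confint

richness = 0.01      # assume 1% of the collection is relevant
true_recall = 0.75   # the recall we hope the review achieved

for sample_size in (600, 2400, 9500):
    relevant_in_sample = round(richness * sample_size)          # relevant docs we expect to see
    found_in_sample = round(true_recall * relevant_in_sample)   # of those, how many the review found
    low, high = proportion_confint(found_in_sample, relevant_in_sample,
                                   alpha=0.05, method="beta")   # exact (Clopper-Pearson) interval
    print(f"sample={sample_size:>5}  relevant seen={relevant_in_sample:>3}  "
          f"recall 95% CI = [{low:.2f}, {high:.2f}]")
```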

Is there a better and cheaper way to prove recall in a statistically sound manner? In this Part Two, I will take a look at some of the other approaches people have put forward and see how they match up. However, as Maura Grossman and Gordon Cormack warned in “Comments on ‘The Implications of Rule 26(g) on the Use of Technology-Assisted Review’” and Bill Dimm amplified in a later post on the subject, there is no free lunch. Continue reading

How Much Can I Save with CAL? A Closer Look at the Grossman/Cormack Research Results

As most e-discovery professionals know, two leading experts in technology assisted review, Maura R. Grossman and Gordon V. Cormack, recently presented the first peer-reviewed scientific study on the effectiveness of several TAR protocols, “Evaluation of Machine-Learning Protocols for Technology-Assisted Review in Electronic Discovery,” to the annual conference of the Special Interest Group on Information Retrieval, a part of the Association for Computing Machinery (ACM).

Perhaps the most important conclusion of the study was that an advanced TAR 2.0 protocol, continuous active learning (CAL), proved to be far more effective than the two standard TAR 1.0 protocols used by most of the early products on the market today—simple passive learning (SPL) and simple active learning (SAL). Continue reading
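For readers who want a feel for what "continuous active learning" means in practice, here is a highly simplified Python sketch of a CAL-style loop built on scikit-learn: retrain on everything reviewed so far, then send the documents the current model ranks highest back to the reviewers. It is a sketch of the general idea only, not the protocol or implementation evaluated in the Grossman/Cormack study; the function and parameter names are hypothetical.

```python
# A highly simplified sketch of a continuous active learning (CAL) loop:
# retrain on everything reviewed so far, then route the top-ranked unreviewed
# documents to human reviewers. Illustrative only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, review_fn, seed_labels, batch_size=100, budget=2000):
    """documents: list of document texts.
    review_fn(i) -> 0/1, a human reviewer's judgment on document i.
    seed_labels: dict {doc_index: 0/1}; must contain at least one relevant
    and one non-relevant example so the model can be trained."""
    X = TfidfVectorizer(stop_words="english").fit_transform(documents)
    labels = dict(seed_labels)
    while len(labels) < min(budget, len(documents)):
        reviewed = sorted(labels)
        model = LogisticRegression(max_iter=1000)
        model.fit(X[reviewed], [labels[i] for i in reviewed])
        scores = model.predict_proba(X)[:, 1]                # relevance scores for all docs
        ranked = [i for i in np.argsort(-scores) if i not in labels]
        for i in ranked[:batch_size]:                        # review the highest-ranked docs next
            labels[i] = review_fn(i)
    return labels
```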

Measuring Recall in E-Discovery Review, Part One: A Tougher Problem Than You Might Realize

A critical metric in Technology Assisted Review (TAR) is recall, the percentage of the relevant documents in the collection that the review actually found. One of the most compelling reasons for using TAR is the promise that a review team can achieve a desired level of recall (say, 75% of the relevant documents) after reviewing only a small portion of the total document population (say, 5%). The savings come from not having to review the remaining 95% of the documents. The argument is that the remaining documents (the “discard pile”) contain so few relevant documents, relative to so many irrelevant ones, that further review is not economically justified. Continue reading
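A quick back-of-the-envelope calculation shows how that argument runs. The numbers below are hypothetical, a one-million-document collection with an assumed 1% richness, chosen only to make the arithmetic concrete.

```python
# Hypothetical numbers to make the recall arithmetic concrete; they are not
# drawn from any actual matter. Recall = relevant documents found / total relevant.
collection = 1_000_000
richness = 0.01                                       # assume 1% of the collection is relevant
total_relevant = int(collection * richness)           # 10,000 relevant documents

reviewed = int(collection * 0.05)                     # stop after reviewing 5% (50,000 docs)
relevant_found = int(total_relevant * 0.75)           # 75% recall -> 7,500 relevant found

discard_pile = collection - reviewed                  # 950,000 unreviewed documents
relevant_missed = total_relevant - relevant_found     # 2,500 relevant documents left behind
print(f"Discard-pile richness: {relevant_missed / discard_pile:.2%}")  # about 0.26%
```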

Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?

One of the givens of traditional CAR (computer-assisted review)[1] in e-discovery is the need for random samples throughout the process. We use these samples to estimate the initial richness of the collection (specifically, how many relevant documents we might expect to see). We also use random samples for training, to make sure we don’t bias the training process through our own ideas about what is and is not relevant.

Later in the process, we use simple random samples to determine whether our CAR succeeded. Continue reading
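As a concrete illustration of the first of those uses, estimating richness from a simple random sample is just a proportion estimate with a confidence interval. The Python sketch below uses statsmodels; the sample size, counts, and collection size are invented for illustration.

```python
# Illustrative sketch: estimating collection richness (the prevalence of
# relevant documents) from a simple random sample. All counts are invented.
from statsmodels.stats.proportion import proportion_confint

sample_size = 1500          # documents drawn at random and reviewed
relevant_in_sample = 45     # documents the reviewer marked relevant

richness = relevant_in_sample / sample_size
low, high = proportion_confint(relevant_in_sample, sample_size,
                               alpha=0.05, method="wilson")
print(f"Estimated richness: {richness:.1%} (95% CI {low:.1%} to {high:.1%})")

collection_size = 500_000   # hypothetical collection
print(f"Expected relevant documents: about {richness * collection_size:,.0f} "
      f"(range {low * collection_size:,.0f} to {high * collection_size:,.0f})")
```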

Courts Should Consider Search Technology, Say New Penn. E-Discovery Rules

The Supreme Court of Pennsylvania

The Supreme Court of Pennsylvania has adopted new e-discovery rules that expressly distance the state’s courts from federal e-discovery jurisprudence and instead emphasize “traditional principles of proportionality under Pennsylvania law.” Notably, the new rules provide that, when weighing proportionality, parties and courts should consider electronic search and sampling technology, among other factors.

The court promulgated the new e-discovery rules June 6 as amendments to the Pennsylvania Rules of Civil Procedure. They take effect Aug. 1, 2012. Continue reading

Catalyst’s Jim Eidelman Discusses Predictive Coding in ‘Law Technology News’

Now that U.S. District Judge Andrew L. Carter Jr. has affirmed the groundbreaking predictive coding order issued by U.S. Magistrate Judge Andrew J. Peck in Da Silva Moore v. Publicis Groupe, Law Technology News reporter Evan Koblentz went back and spoke to leading professionals in the legal technology field for their reactions. You can read his story here: Take Two: Reactions to ‘Da Silva Moore’ Predictive Coding Order.

One of the people Koblentz quotes is Catalyst’s own Jim Eidelman, senior consultant on the Catalyst Search & Analytics Consulting team. These court decisions gave predictive coding “a legitimacy that was needed,” Eidelman told Koblentz. Continue reading

Judge Peck Provides a Primer on Computer-Assisted Review

Magistrate Judge Andrew J. Peck issued a landmark decision in Monique Da Silva Moore v. MSL Group, filed on Feb. 24, 2012. This much-blogged-about decision made headlines as the first judicial opinion to approve the process of “predictive coding,” one of the many terms people use to describe computer-assisted coding.

Well, Judge Peck did just that. As he hinted during his presentations at LegalTech, this was the first time a court had the opportunity to consider the propriety of computer-assisted coding. Without hesitation, Judge Peck ushered us into the next generation of e-discovery review—people assisted by a friendly robot. That set the e-discovery blogosphere buzzing, as Bob Ambrogi pointed out in an earlier post. Continue reading