Last week, I attended the DESI IV workshop at the International Conference on AI and Law (ICAIL). This workshop brought together a diverse array of lawyers, vendors, and academics, and even featured a special guest appearance by the courts (Magistrate Judge Paul W. Grimm). The purpose of the workshop was, in part:
…to provide a platform for discussion of an open standard governing the elements of a state-of-the-art search for electronic evidence in the context of civil discovery. The dialog at the workshop might take several forms, ranging from a straightforward discussion of how to measure and improve upon the “quality” of existing search processes; to discussing the creation of a national or international recognized standard on what constitutes a “quality process” when undertaking e-discovery searches.
Hot on the list of topics, of course, was predictive coding. Much of the discussion centered on exactly what standards were needed, not only to convince users of such systems that non-linear, smart review would save them time and money, but also to convince the courts (and lawyers who don’t want to receive sanctions from the courts) that such technology can be safely applied to a matter at hand while still meeting all the legal requirements of discovery.
So it was with keen interest that I noted the press release from a vendor, Recommind, that it had obtained a patent on the process of predictive coding itself. Having been involved in writing a few patents in my time, my immediate thought was, “What exactly was patented, what are the specific claims? Is this going to be a broad patent, covering a high level process? Or is it going to be a narrow patent, covering one or two specific ways of doing predictive coding?”
According to Craig, the press release is “about more than terminology: it is about a process patent covering ‘systems and processes’ for iterative, computer-assisted review. Recommind believes it has long been on the record as to exactly what predictive coding is, and as a result of this patent, it expects competing vendors to follow suit accordingly, and stop claiming predictive coding capabilities they do not have.” Clearly, Recommind feels it has pioneered the concept of predictive coding and doesn’t want any competitors riding on its coattails.
Second, from the explanation:
Predictive Coding seeks to automate the majority of the review process. Using a bit of direction from someone knowledgeable about the matter at hand, Predictive Coding uses sophisticated technology to extrapolate this direction across an entire corpus of documents – which can literally “review” and code a few thousand documents or many terabytes of ESI at a fraction of the cost of linear review. …
The technology aspect of Predictive Coding is not trivial and cannot be discounted; it is not easy to do, which is why linear review has continued to outlive its useful lifespan. But what makes Predictive Coding so defensible and effective are the processes, workflows and documentation of which it is an integral part. Although technology is at its core, Predictive Coding includes all of these parts as one integrated whole.
OK, so predictive coding as a whole (and therefore the patent on predictive coding) is not a single technology so much as it is a “process, workflow, and documentation.” Fine; I’ll accept that. However, nowhere in this post entitled “Predictive Coding Explained” were the process, workflow, and documentation ever really explained. Great pains were taken to say what predictive coding was not (e.g. threading, clustering, etc., points on which I agree). But no actual logical sequence of steps was given as to what predictive coding, at least from the perspective of this patent, was supposed to be.
For that, I had to turn to the patent itself. See Figure 5 in the patent (above), labeled “Predictive Coding Workflow.” See also Claim #1 (the top level independent patent claim). That claim says that the patent covers a method for analyzing a plurality of documents, comprising:
(1) Receiving the plurality of documents via a computing device
(2) Receiving user input from the computing device, the user input including hard coding [aka labeling] of a subset of the plurality of documents, the hard coding based on an identified subject or category [e.g. responsiveness, privilege, or issue]
(3) Executing instructions stored in memory, that:
(a) generates an initial control set based on the subset of the plurality of documents and the received user input on the subset
(b) analyzes the initial control set to determine at least one seed set parameter associated with the identified subject or category
(c) automatically codes a first portion of the plurality of documents, based on the initial control set and the at least one seed set parameter associated with the identified subject or category
(d) analyzes the first portion of the plurality of documents by applying an adaptive identification cycle, the adaptive identification cycle being based on the initial control set, user validation of the automatic coding of the first portion of the plurality of documents and confidence threshold validation
(e) retrieves a second portion of the plurality of documents based on a result of the application of the adaptive identification cycle on the first portion of the plurality of documents
(f) adds further documents to the plurality of documents on a rolling load basis, and conducts a random sampling of initial control set documents both on a static basis and the rolling load basis
(4) receiving user input via the computing device, the user input comprising inspection, analysis and hard coding of the randomly sampled initial control set documents, and
(5) executing instructions stored in memory, wherein execution of the instructions by the processor automatically codes documents based on the received user input regarding the randomly sampled initial control set documents
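Read as software rather than legal language, Claim #1 describes an iterative loop. Purely to set up the comparison that follows, here is a toy sketch of that loop; every name, the keyword-overlap “classifier,” and the fixed round count are my own illustrative assumptions, not Recommind’s implementation:

```python
import random

def train(control_set):
    """Toy stand-in for step (3b): derive "seed set parameters" (here,
    just the vocabulary of responsive documents) from the control set."""
    terms = set()
    for text, label in control_set:
        if label == "responsive":
            terms.update(text.split())
    return terms

def auto_code(text, terms):
    """Toy stand-in for step (3c): automatic coding by keyword overlap."""
    return "responsive" if terms & set(text.split()) else "nonresponsive"

def predictive_coding_sketch(documents, expert_label, seed_size=2, rounds=2):
    random.seed(0)                                      # deterministic demo
    pool = list(documents)                              # step (1)
    seed = random.sample(pool, seed_size)               # step (2): hard coding
    control_set = [(d, expert_label(d)) for d in seed]  # step (3a)
    codes = {}
    for _ in range(rounds):                             # adaptive cycle, (3d)-(3e)
        terms = train(control_set)
        codes = {d: auto_code(d, terms) for d in pool}
        # steps (4)-(5): expert reviews a random sample, expanding the control set
        for d in random.sample(pool, seed_size):
            control_set.append((d, expert_label(d)))
    return codes
```

Even in this deliberately crude form, the shape is familiar: label a subset, train, auto-code the rest, validate a sample, repeat.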
So that appears to be the primary workflow, the primary patented claim. Let’s compare and contrast that workflow with that of traditional relevance feedback. Though relevance feedback dates back to the early 1970s, here is a passage from the Introduction to Information Retrieval (published in 2008) describing the basic workflow:
The idea of relevance feedback is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:
- The user issues a (short, simple) query.
- The system returns an initial set of retrieval results.
- The user marks some returned documents as relevant or nonrelevant.
- The system computes a better representation of the information need based on the user feedback.
- The system displays a revised set of retrieval results.
Relevance feedback can go through one or more iterations of this sort.
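The “computes a better representation” step is classically implemented with the Rocchio algorithm, described in that same book. A minimal sketch over bag-of-words vectors represented as dicts, using common textbook default weights (the weight values here are conventional choices, not prescribed by the patent or the book excerpt):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query vector toward the centroid of the
    documents judged relevant and away from the nonrelevant centroid."""
    terms = set(query)
    for doc in list(relevant) + list(nonrelevant):
        terms.update(doc)
    revised = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        non = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        revised[t] = alpha * query.get(t, 0.0) + beta * rel - gamma * non
    return revised
```

After one round of feedback, terms from documents marked relevant gain weight in the revised query, and terms from nonrelevant documents are pushed down, which is exactly the “better representation of the information need” the excerpt describes.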
In other words, the relevance feedback workflow seems to do everything that the predictive coding workflow does. It starts with a collection of documents. It selects a subset of those documents in some manner. It presents those documents to a human annotator for expert labeling. Based on the labels provided by the human, the algorithm goes through an “adaptive identification cycle” in which it modifies itself to better align with the human understanding of the document labels. And, based on this adapted algorithm, it revises the set of results. That is, it recomputes the probabilities of the labels (relevant or nonrelevant, responsive or nonresponsive) for all the results. Finally, it should be noted that the traditional, decades-old relevance feedback workflow is also capable of iteration.
So what is the difference? I don’t ask this just rhetorically. I see a very strong similarity between the overall workflows of predictive coding and relevance feedback, so I would honestly and transparently like to understand where the crucial differences are. If we are to understand what Recommind believes predictive coding to be, and if this understanding is going to help the courts set the legal precedent for defensible use of these technologies (a goal on which I fully agree with Recommind), then we really need to understand the process as a whole and what makes it unique.
The only thing I can think of is that the claimed predictive coding workflow integrates random sampling at a few points, most likely to ensure that the process is defensible. If that is the case, then how does it differ from active learning? Here is an example of an active learning workflow that incorporates uncertainty-based sampling, from a 2007 academic research paper by Andreas Vlachos, “A Stopping Criterion for Active Learning”:
seed labelled data L, unlabelled data U, batch size b
Train a model on L
Active Learning Loop:
    Until a stopping criterion is satisfied:
        Apply the trained model classifier on U
        Rank the instances in U using the uncertainty of the model
        Annotate the top b instances and add them to L
        Train the model on the expanded L
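That pseudocode translates nearly line-for-line into running code. Here is a toy sketch in which a nearest-centroid model stands in for the classifier; the centroid model, the distance-margin uncertainty score, and the fixed round count as a stopping criterion are my own illustrative choices, not Vlachos’s:

```python
def train_centroids(labeled):
    """Toy model: the centroid of the 2-D feature vectors in each class."""
    sums, counts = {}, {}
    for (x0, x1), y in labeled:
        s = sums.setdefault(y, [0.0, 0.0])
        s[0] += x0
        s[1] += x1
        counts[y] = counts.get(y, 0) + 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def uncertainty(x, centroids):
    """Margin-based uncertainty: a small gap between the distances to the
    two nearest class centroids means the model is unsure how to label x."""
    dists = sorted(((cx - x[0]) ** 2 + (cy - x[1]) ** 2) ** 0.5
                   for cx, cy in centroids.values())
    return -(dists[1] - dists[0])  # higher value = more uncertain

def active_learn(L, U, annotate, b=1, rounds=3):
    for _ in range(rounds):                  # stand-in stopping criterion
        model = train_centroids(L)           # "Train a model on L"
        U.sort(key=lambda x: uncertainty(x, model), reverse=True)  # rank U
        for x in U[:b]:                      # "Annotate the top b instances..."
            L.append((x, annotate(x)))       # "...and add them to L"
        del U[:b]
    return train_centroids(L)                # final model on the expanded L
```

Each round spends the annotator’s effort on the instances the current model finds hardest, which is the essential contrast with simply reviewing the highest-scoring documents.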
That is, instead of just presenting the expert user (e.g. a lawyer) with the documents that have the highest probability of responsiveness, or of privilege, or of whatever issue is being coded, an active learning process or workflow explicitly seeks out those document instances about which the learning algorithm is most uncertain. That could mean documents whose predicted label distribution is relatively even or undistinguished across all classes (highest entropy, in the case of generative machine learning models), or documents that lie nearest to a decision boundary (in the case of discriminative machine learning models).
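For the generative case, “relatively even across all classes” is usually quantified as the Shannon entropy of the model’s predicted label distribution; a minimal example:

```python
from math import log2

def entropy(probs):
    """Shannon entropy (in bits) of a predicted label distribution; it is
    maximal when probability is spread evenly across the classes."""
    return -sum(p * log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])    # 1.0 bit: maximally uncertain between two labels
entropy([0.99, 0.01])  # about 0.08 bits: a confident prediction
```

An uncertainty sampler built on this score would queue the documents with the highest entropy for expert review first.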
However, it could also mean that a document doesn’t lie near any boundary or have any probability estimate associated with it, because the appropriate signals have not yet been added to the model. In such cases, the best way (indeed, the only way) of doing uncertainty sampling is to randomly sample from the collection, as random sampling helps you discover those documents, and therefore those decision boundaries, that you otherwise would not be aware of. Thus, active learning as a general workflow pattern also incorporates random sampling.
So again, it is still not clear to me exactly what makes the Recommind predictive coding workflow unique, what distinguishes it from methods that have gone before, what its core characteristics are. That isn’t to say that they don’t exist. However, I believe further discussion is warranted, both in public as well as at workshops such as DESI (http://www.umiacs.umd.edu/~oard/desi4/), as this will serve to advance the market as a whole. That is, I agree with Barry Murphy over at eDiscovery Journal that:
No matter what, this is good news for the eDiscovery market as a whole. One could say that Recommind is doing prospects a favor by throwing down the gauntlet and forcing competitors to transparently define exactly what “predictive coding” capabilities they do/do not have. While that might be a side-effect, it’s more likely that Recommind is trying to take the heat around predictive coding and have it warm up the vendor’s prospects more than anything else. We at eDJ take this as a call to better define what predictive coding is and what solutions need to offer to be valuable.
I take this as a call for vendors not only to define exactly what “predictive coding” capabilities they do/do not have, but for the industry as a whole to begin to set court-friendly guidelines around what predictive coding truly is.