Deep learning. The term seems to be ubiquitous these days. Everywhere from self-driving cars and speech transcription to victories in the game “Go” and cancer diagnosis. If we measure things by press coverage, deep learning seems poised to make every other form of machine learning obsolete.
Recently, Catalyst’s founder and CEO John Tredennick interviewed Catalyst’s chief scientist, Dr. Jeremy Pickens (who we at Catalyst call Dr. J), about how deep learning works and how it might be applied in the legal arena.
JT: Good afternoon Dr. J. I have been reading about deep learning and would like to know more about how it works and what it might offer the legal profession.
Dr. J: We could spend hours talking about the various ins and outs of deep learning but let me provide an overview for starters.
We start with a machine learning model known as a neural network (sometimes also referred to as an artificial neural network, or ANN). A neural network is set up as a set of nodes and edges between those nodes, similar to neurons connecting to other neurons in the brain.
A neural network is trained by looking at how well the “firing of all the neurons” predicts some variables of interest, and then adjusting the magnitudes and/or thresholds of the firing of various neurons until the neural network can accurately predict the labels in the training data.
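To make that training loop concrete, here is a minimal sketch of a single artificial neuron learning a toy rule. It is an illustration only and not any particular product's implementation: the sigmoid activation, the learning rate, the epoch count and the logical-OR data are all assumptions chosen to keep the example small. The weights play the role of the "magnitudes" and the bias plays the role of the "threshold."

```python
import math

def sigmoid(z):
    """Squash a raw score into a firing strength between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(examples, labels, epochs=2000, learning_rate=0.5):
    """Train one sigmoid neuron with plain stochastic gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            # Forward pass: how strongly does the neuron fire for this input?
            fired = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # How far the prediction is from the training label.
            error = fired - y
            # Nudge each weight and the bias to shrink that error.
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Return 1 if the neuron fires at 0.5 or above, else 0."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0

# Toy training data (an assumption for illustration): logical OR.
examples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 1]

weights, bias = train_neuron(examples, labels)
predictions = [predict(weights, bias, x) for x in examples]
```

A deep network stacks many layers of such units and adjusts all of their weights together, which is why it needs so much more training data than this four-example toy.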
Neural networks have been around for half a century, but only recently has computational power grown large enough and, more importantly, only recently have training data sizes grown large enough, that neural networks could be made sufficiently large to make state-of-the-art predictions in certain domains. The way in which neural networks have grown large is by adding layers to them.
For starters, you can read more about them here.
Early networks had a single input layer connected to an intermediate hidden layer, connected to an output layer. With more training data, some deep learning models have been able to add over a dozen hidden layers. The more hidden layers, and the more data to train those layers, the more complex and powerful are the functions that can be learned by the model, which is the essence of deep learning.
JT: Thanks for the starting point. How might deep learning be useful for e-discovery professionals?
Dr. J: Let’s take a look at it from the perspective of document review in e-discovery, which is a supervised machine learning perspective. In general, the algorithms that make up a technology-assisted document review come in three forms: (1) feature (aka signal) extraction algorithms, which then feed into (2) some sort of supervised machine learning algorithm, which then gets (3) controlled by some sort of process algorithm.
JT: OK, but hold on there. Can you give me a little more about these three categories, please?
Dr. J: Well a feature could be a lot of different text items including unigrams (individual words), phrases (n-grams), named entities (people, places, things), dates, sentiment, and so forth.
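As a rough illustration of the simplest of these features, the sketch below counts unigrams and contiguous n-grams in a piece of text. The function name `extract_features`, the whitespace tokenizer and the sample sentence are assumptions made for the example; a real extraction pipeline would also handle punctuation, named entities, dates and so on.

```python
from collections import Counter

def extract_features(text, max_n=2):
    """Count every contiguous n-gram of 1 to max_n words in the text."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    features = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            features[" ".join(tokens[start:start + n])] += 1
    return features

# Hypothetical sample sentence, not taken from any real matter.
features = extract_features("the merger agreement was signed")
```

Here `features` holds five unigrams such as "merger" and four bigrams such as "merger agreement", each with a count of one. Counts like these are what a supervised learner receives as input.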
Supervised machine learning algorithms include such things as Support Vector Machines, Logistic Regression and Naive Bayes, as well as a technique that we’ve developed here at Catalyst.
Processes, or the way in which the supervised machine learning algorithms are wielded, are typically talked about in terms of either continuous (TAR 2.0) or simple (TAR 1.0) learning.
JT: That helps, so how does deep learning figure in here?
Dr. J: What deep learning does is munge together the first two stages, feature extraction and supervised machine learning. It goes a step further than such standard feature extraction techniques — which primarily involve contiguous words and phrases as features — by not distinguishing between feature extraction and prediction. Features are learned on the fly, during supervised training, by adjusting weights at different layers in the neural network, creating combined signals at higher layers of the network from features at lower layers in the neural network, capturing non-linear, non-contiguous relationships between raw words.
It’s a clever technique, but as mentioned previously, it takes a lot of data to jointly learn features and predictions. This observation is our introduction to moving beyond the hype of deep learning. First, I note that when it comes to features, deep learning is not required to have non-contiguous word or phrase features. As I explored in the very first paper that I ever published as a young grad student back in 1999, features may include non-contiguous sets of terms.
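One simple way to build non-contiguous features without a neural network is to pair up terms that occur near each other, adjacent or not. The sketch below is an assumption for illustration and is not the method from the 1999 paper mentioned above; the function name, the window size and the sample sentence are all hypothetical.

```python
def cooccurrence_features(text, window=5):
    """Return unordered pairs of distinct terms that occur within
    `window` tokens of each other, whether or not they are adjacent."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    pairs = set()
    for i, left in enumerate(tokens):
        # Look ahead at up to `window` following tokens.
        for right in tokens[i + 1:i + 1 + window]:
            if left != right:
                pairs.add(tuple(sorted((left, right))))
    return pairs

# Hypothetical sample sentence.
pairs = cooccurrence_features("board approved the secret payment")
```

The pair ("approved", "payment") is captured even though the two words are separated by two others, which is the kind of relationship a contiguous n-gram extractor would miss.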
Second, I would like to point out something about all the tasks that one sees reported in the media for which deep learning has shown success: They all involve really large quantities of (typically) lesser semantically structured data, i.e. images, video, audio, and sensor data.
JT: What do you mean about large quantities? Are we talking “big data” here? Does that equate to a big case in the legal world?
Dr. J: A self-driving car probably collects more video, LiDAR (rangefinder information), and other sensor data in an hour than most law firms likely deal with in a year. In the game of Go, massive amounts of training data can be created by having the algorithm play against different versions of itself, i.e. hundreds of millions of simulated games.
Image recognition comes on the heels of the global ubiquity of digital cameras in consumer hands combined with web-scale image hosting services. Home automation, which often involves significant amounts of audio processing, from voice commands to behavioral analysis, not to mention the aggregation of this data from millions of homes, also involves collecting massive amounts of data.
Cancer diagnosis, with its high-resolution MRI and other scans, involves massive numbers of pixels. Crime prevention draws on everything from national databases of incidents to audio monitoring of gunshots and other events in cities like New York. Both stand to benefit immensely from deep learning, because audio, video and LiDAR are semantically unstructured.
In contrast, while the meaning of textual information might not always be clear (e.g. “jaguar” the car vs “jaguar” the animal), even the much-maligned raw “keyword” itself carries an incredible amount of semantic, conceptual information in comparison to a pixel from an image or frame of video, or to a DFT (discrete Fourier transform) from a few milliseconds of audio. Furthermore, because these tasks are indeed some of the world’s biggest challenges, big data is available for them in a way that it is not in a typical e-discovery matter.
So the question: Can e-discovery similarly benefit from deep learning — outperform everything else in existence — the way self-driving cars can? At the risk of going out on a limb, I don’t think so. E-discovery is different from the aforementioned “world’s biggest challenges” in two ways: (1) despite toy counter-examples like jaguar/jaguar, most words by themselves are filled with oodles of semantic meaning. There is much less room for deep learning to add value to words than there is for deep learning to add value to pixels and LiDAR scans; (2) despite what marketing likes to claim, our industry does not necessarily have a big data problem.
Do we have a lot of data? Yes. More than is feasible for linear review? Yes. Is it big data? Well, 50,000 documents is small. Even 3 million documents is small when you compare against the scale of text available on the web, which, as of the moment this is being written, is estimated to be at least 4.5 billion pages (http://www.worldwidewebsize.com/), and that doesn’t even count the amount of general natural language (text) available via tweets, Facebook posts, and other social media. “Too many documents to review” is not the same thing as “big data.” This has consequences for the broader applicability of deep learning to the e-discovery domain.
JT: I get your point about big data but the legal industry certainly has a lot of data to deal with.
Dr. J: Let’s take a closer look at this by reading an open letter posted on LinkedIn, which was addressed to Yann LeCun, one of the fathers of deep learning. LeCun has been working on neural networks and/or deep learning (recurrent neural networks, convolutional neural networks, etc.) for over 30 years, from when they were popular in the 1980s, to when they were massively unpopular in the 1990s and 2000s, to when they became popular again in the early 2010s.
In this letter, the research scientist Shalini Ananda lays out a quadrant, with big vs. small data on one dimension, and generalized (supervised machine learning, aka TAR in e-discovery parlance) vs. specialized deep learning (unsupervised deep learning, aka clustering in e-discovery parlance) on the other dimension.
When you have big data, truly big data, generalized (supervised) deep learning works well. When you have small data, specialized deep learning (clustering) works well. But when you have small data and want to do generalized (supervised) deep learning, you are in a dead zone. It doesn’t work. Er, correction: It’s not that it doesn’t work. You can still get decent predictions. But the predictions are no better, and ofttimes even worse, than other state-of-the-art supervised machine learning algorithms, such as logistic regression or decision trees, that have been combined with robust feature selection algorithms. In the dead zone, deep learning is just a buzzword.
In his response to this open letter, LeCun agrees. He writes (emphasis mine):
“In the early 2000s, the ‘standard’ dataset for object recognition was Caltech-101, which had only 30 training samples per category (and 101 categories). Convolutional nets [deep learning] didn’t work very well compared with more conventional methods because the dataset was so small.”
What is interesting is that Ananda notes that the rough split between big and small data is a petabyte. “However, for industries that have Small Data sets (less than a petabyte), a Specialized Deep Learning approach based on unsupervised learning is necessary.”
JT: Many of our readers are familiar with gigabytes and terabytes even. But take a moment and give us a frame of reference for petabytes please.
Dr. J: How large is a petabyte? The naming convention goes mega, giga, tera, peta. Which means that a petabyte is 1,000 terabytes, or 1 million gigabytes. At Catalyst, from the cases we’ve dealt with over the past three years, a rough estimate is that there are on average 3,810 documents in a gigabyte. Perhaps if your case contains more than 3,810 * 1,000,000 = 3.8 billion documents, then you’ve got a big data problem on your hands and might consider looking into deep learning. Otherwise, you are probably in the dead zone, and may very well be better off going with more conventional supervised machine learning. Again, it’s not that deep learning won’t give you predictions. It will. Maybe even pretty decent ones. But it’s not going to be auto-magically orders of magnitude better on your matter the way it has been for self-driving cars or image recognition.
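The arithmetic above is easy to rerun for your own matter. The sketch below uses the 3,810 documents-per-gigabyte average and the petabyte threshold quoted in this interview; the function name and the 3 million document example are assumptions for illustration, and your own documents-per-gigabyte figure may well differ.

```python
GIGABYTES_PER_PETABYTE = 1_000_000  # peta = 1,000 tera = 1,000,000 giga
DOCS_PER_GIGABYTE = 3_810           # rough average quoted in the interview

# Number of documents that would fill one petabyte at that average.
docs_per_petabyte = DOCS_PER_GIGABYTE * GIGABYTES_PER_PETABYTE

def fraction_of_petabyte(doc_count):
    """Estimate what share of a petabyte a collection represents."""
    return doc_count / docs_per_petabyte

# Hypothetical example: a 3 million document matter.
share = fraction_of_petabyte(3_000_000)
print(f"{docs_per_petabyte:,} documents per petabyte")
print(f"A 3 million document matter is about {share:.4%} of that")
```

By this estimate, even a 3 million document matter is well under a tenth of one percent of the petabyte threshold.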
JT: Well, I haven’t seen many cases with a billion documents let alone 3.8 billion. What are you learning from other scientists in the field?
Dr. J: As part of my job as chief scientist at Catalyst, I travel to and occasionally speak at various scientific conferences on information retrieval (search) and machine learning.
Last summer at the world’s top information retrieval (search engine) conference, SIGIR, I attended the Neu-IR workshop, which had as its topic the problem of applying neural networks to the (supervised) document ranking problem, i.e. the same sort of thing that we deal with in document review. This workshop is attended by some of the top research scientists in the world, from some of the biggest machine learning companies in the world (e.g. Facebook, Microsoft, Google, etc.).
In the report from the workshop, the organizers note the following from one of the keynote speakers, Hang Li:
In recent years, deep learning has become the key technology of state-of-the-art systems in areas of computer science, such as computer vision, speech processing, and natural language processing. A question that naturally arises is whether deep learning will also become important in information retrieval. In fact, there has been a large amount of effort made to address the question and significant progress has been achieved. Yet there is still doubt about whether it is the case.
In this talk, Li argued that, if we take a broad view on IR, then deep learning can indeed greatly boost IR. It has already been observed that deep learning can make great improvements on some hard problems in IR such as question answering from knowledge bases, image retrieval, etc. On the other hand, for some traditional IR tasks, in some sense easy tasks, such as document retrieval, the improvements might not be so notable. Li introduced some of the work on deep learning for IR conducted at Huawei’s Noah’s Ark Lab, to support his claim.
So yes, broadly, if we’re interested in playing Jeopardy (i.e. question answering), or in ranking images, then deep learning has something to offer. But prioritized document review in e-discovery is more like traditional IR, in that it concerns the ranking of text documents. And there is doubt as to whether (Deep) Neural Networks work well for this kind of task. This sentiment was mirrored in an open discussion among the participants of the workshop:
The final session of the day was a group discussion among all attendees. The goal of the discussion was to identify key challenges and opportunities in the area of neural IR. A popular topic during this session focused on the lack of positive results from deep neural network (DNN) models on the ad-hoc document retrieval tasks. One view from participants was that there is insufficient training data, and larger data will be required before DNN models can succeed on document ranking.
More recently, information retrieval research scientists Eugene Yang, David Grossman, Ophir Frieder, and Roman Yurchak published a paper at the International Conference on Artificial Intelligence and Law (ICAIL, June 2017) entitled Effectiveness Results for Popular e-Discovery Algorithms. That work tested a number of different e-discovery prediction and feature extraction algorithms, not all of which I will get into here. However, I do note that these authors tested a deep learning neural network against four other, more standard machine learning algorithms, and their results confirmed the experience of the Neu-IR workshop participants: Deep learning ranked either second to last or dead last on every single topic tested.
JT: What kind of folks attend these conferences?
Dr. J: What is interesting to me about this is that many of the participants at the Neu-IR workshop were not from “small data” industries like e-discovery. They were from large web companies. Of all the places that one would expect deep learning for document ranking to work, it would be at a large web search company, because they have so much more (supervised) training data than any e-discovery matter will ever have.
So if the web giants are struggling to get their deep learning algorithms to outperform existing conventional techniques with web scale data, it stands to reason that deep learning on the much smaller e-discovery-scale data would not necessarily fare much better.
JT: So, where do you come out on deep learning?
Dr. J: Is deep learning bad? No, of course not. If I’ve left you with the impression that it is, I have not done my job here. Instead, what I hope is that this discussion has encouraged our readers to think a little more critically about claims of not only deep learning, but any algorithm — and think about how one might go about evaluating such claims for oneself. I hope that I’ve encouraged you to think past the current hype and buzz, and understand that, like any tool, deep learning isn’t always or even necessarily better. Rather, that much depends on the nature of the task, the amount of data available, and the nature of that data. Not all tools are appropriate to all tasks. Just because deep learning is having a lot of success with image recognition and self-driving cars does not necessarily mean that it translates into every domain with equal disruptive potential.
JT: Any closing thoughts here, Dr. J.?
Dr. J: The astute reader should have a number of questions for me at this point. For example, one question might be, “You’ve mentioned a couple of different deep learning architectures, such as recurrent neural networks and convolutional neural networks. There are many more neural network architectures, some of which are deep and some of which are not. Maybe deep learning for document ranking doesn’t work for many of these architectures, but how do you know that there isn’t an architecture for which it does work?” Another question might be, “Well, ok, maybe deep learning doesn’t yet help ranking for the web giants, but how do you know it won’t help my case?”
I have no desire to give a BS answer to either of these questions. I’m not interested in marketing-speak. So the answer to both questions is: I don’t know. There may indeed be some amazing, unique, innovative architecture that out-predicts, out-prioritizes all other supervised machine learning algorithm + feature extraction approaches on supervised, small data (i.e. prioritized document review). Or it may be that for your particular case, and just that case, some standard deep learning architecture works best. So I will never say never, even if all the evidence to this point in time points to deep learning for e-discovery as not much more than a buzzword riding on the coattails of other classes of problems.
So why have this discussion if I am only going to hint in certain directions, but not give a definitive answer? The reason is because I would rather you, the reader, not accept what I’m saying at face value, but also not accept other claims at face value, either. I would rather that you find out for yourself what works and what doesn’t work, and by how much.
How would you do that? Simulation. Since early 2013, just over four years now, in blog posts and white papers that I’ve written, in discussions with current and potential clients, in CLEs and conference keynote speeches that I’ve given, I have encouraged those working in the industry to run their own studies, using their own data, by getting the vendor(s) with whom they work to run simulations on their data.
A simulation involves collecting all the judgments that were made on a particular matter, plus the documents from that matter, and running them through a vendor tool to show the order in which those documents would have been reviewed, had that tool been used to run the review. A simulation will discover no new relevant documents. A simulation (well, this kind of simulation) will not overturn the judgments on any existing documents. Instead, a simulation simply shows the order, and therefore how much effort it would have taken to achieve various target recall points. Not on someone else’s data. Not on a case that doesn’t look like yours. Not on a hand-picked marketing claim of effectiveness. But on your own data.
By running the simulation one way on a deep learning system, and another way on another system (oh, let’s say Catalyst’s, for example), one can see whether one approach yields higher recall for less overall effort. Furthermore, all this can be done with zero additional human effort (cost), because one simply reuses the judgments that were already made on an existing document review.
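The comparison described above can be sketched in a few lines. This is a simplified illustration under stated assumptions and is not Catalyst's or any vendor's simulation code: the function name `effort_to_recall`, the document IDs, the judgments and the two rankings are all hypothetical. Each ranking stands in for the order in which a given tool would have presented the documents for review.

```python
def effort_to_recall(ranked_doc_ids, judgments, target_recall):
    """Return how many documents must be reviewed, in ranked order,
    before the target share of relevant documents has been found."""
    total_relevant = sum(1 for relevant in judgments.values() if relevant)
    if total_relevant == 0:
        raise ValueError("the judgments contain no relevant documents")
    found = 0
    for reviewed, doc_id in enumerate(ranked_doc_ids, start=1):
        if judgments[doc_id]:
            found += 1
        if found / total_relevant >= target_recall:
            return reviewed
    raise ValueError("ranking never reaches the target recall")

# Existing reviewer judgments from a finished matter (hypothetical).
judgments = {
    "d1": True, "d2": False, "d3": True, "d4": False,
    "d5": True, "d6": False, "d7": True, "d8": False,
}

# Review orders that two different tools would have produced (hypothetical).
ranking_tool_a = ["d1", "d3", "d5", "d2", "d7", "d4", "d6", "d8"]
ranking_tool_b = ["d2", "d1", "d4", "d3", "d6", "d5", "d8", "d7"]

effort_a = effort_to_recall(ranking_tool_a, judgments, target_recall=0.75)
effort_b = effort_to_recall(ranking_tool_b, judgments, target_recall=0.75)
```

In this toy case the first ranking reaches 75 percent recall after three documents and the second after six. No new judgments are needed, because the function only replays decisions that reviewers already made.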
Deep learning is a hot buzzword right now. The evidence I’ve seen in the research is that while it works well for certain classes of problems (image recognition, self-driving cars), it has not achieved state-of-the-art performance in other classes (text document ranking). This may change in the future, or it may not, given that true big data may never be available in an e-discovery setting. But rather than overly engage in conjecture, either negative or positive, the best approach to evaluate a buzzword is empirically. And for document review, that empirical evaluation can be done via simulation. Anyone serious about saving money in e-discovery should be actively preparing their data, and all the proper NDAs, for simulation, instead of relying solely on press releases or blog posts, including but not limited to this one.
This is an aside, but many e-discovery vendors spend a lot of time worrying about the “meaning” of a word, as if document search problems would go away if we could only distinguish Jaguar from jaguar. Seriously, how many e-discovery collections are there out there that have collected documents from both car manufacturers and zoo keepers at the same time? Or from fruit growers (apple) and computer companies (Apple) at the same time?
 In fact, it has often been said that part of the reason neural networks fell out of favor between the 1980s and today is because there simply was not enough data to train them. It is only since hitting web-scale collections that deep neural architectures have started to have some success.
Note the problem domains that Li mentions here: computer vision, speech, natural language processing. These are problems for which there is either very low semantic content (images, audio) or for which there exist petabytes of data (e.g. web-scale text pages) or both. In other words, these are the sorts of problems that deep learning works well on. These are large, general problems, with large, general answers, where data could be collected from nearly every person on the planet with a smart phone. These are all of not just a quantitatively, but a qualitatively different scale than e-discovery.