Catalyst’s Report from TREC 2016: ‘We Don’t Need No Stinkin Training’

One of the bigger, and still enduring, debates among Technology Assisted Review experts revolves around the method and amount of training you need to get optimal[1] results from your TAR algorithm. Over the years, experts prescribed a variety of approaches including:

  1. Random Only: Have a subject matter expert (SME), typically a senior lawyer, review and judge several thousand randomly selected documents.
  2. Active Learning: Have the SME review several thousand marginally relevant documents chosen by the computer to assist in the training.
  3. Mixed TAR 1.0 Approach: Have the SME review and judge a mix of randomly selected documents, some found through keyword search and others selected by the algorithm to help it find the boundary between relevant and non-relevant documents.

Indeed, in the early TAR 1.0 days, the hot issue was whether you could use two SMEs for training without confusing the system. The idea was to reduce the burden on one senior lawyer, who otherwise would have to spend days reviewing thousands of marginally relevant documents before review could begin. See: Subject Matter Experts: What Role Should They Play in TAR 2.0 Training? (November 2013)

Readers of the Catalyst blog will know that from the beginning we challenged the notion that you needed a senior lawyer to do the training as well as the idea that training should be done using randomly selected or marginally relevant documents.[2] Specifically, we shared our research showing that with an advanced continuous active learning protocol (CAL), SMEs didn’t materially improve training.

Instead, we advocated a simpler approach. Find a handful of relevant documents using whatever method you like and get going. The CAL protocol and a few good reviewers would take care of the rest.

TREC 2016

This year, our team again participated in the Text Retrieval Conference (TREC) program, sponsored by the National Institute of Standards and Technology (NIST). TREC brings together academics and software developers (along with our friend Ralph Losey) to try different algorithms and approaches against a standard set of documents. Although some tout it as a competition, it is not, nor should it be viewed as such. Rather, the real goal of the program is to advance research in this important area of artificial intelligence.

We and several other e-discovery types participated in the Total Recall track, which was administered by Maura Grossman and Gordon Cormack and a team of volunteers. The track involved a collection of about 290,000 emails obtained from Jeb Bush’s years as governor of Florida. Based on these documents, the TREC coordinators developed 34 topics, providing a title and description to the participants for each.

Here is an example:

401: Summer Olympics: All documents concerning a bid to host the Summer Olympic Games in Florida.

Each team was tasked with finding as many relevant documents for each topic as possible using the least amount of effort.[3] One of the requirements was to record every document viewed—whether relevant or not. You did so by submitting its document ID to the TREC system for a judgment as to its official relevance. This process allowed the TREC administrators to determine how many documents each team had to view to reach a given level of recall.
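
To make that accounting concrete, here is a minimal sketch of how recall at a given level of effort could be computed from a submission log and a set of relevance judgments. This is purely illustrative; it is not TREC's scoring code, and the document IDs and function names are our own placeholders.

```python
# Illustrative sketch only; not the official TREC Total Recall scoring tool.
# Document IDs and judgments below are made-up placeholders.

def recall_at_effort(submitted_ids, relevant_ids, effort):
    """Fraction of all relevant documents found within the first
    `effort` documents a team submitted (i.e., viewed)."""
    seen = set(submitted_ids[:effort])
    found = len(seen & relevant_ids)
    return found / len(relevant_ids) if relevant_ids else 0.0

# Example: 3 of the 4 relevant documents appear in the first 5 submissions.
submitted = ["d7", "d2", "d9", "d4", "d1", "d8"]   # in submission order
relevant = {"d2", "d4", "d1", "d8"}                # official judgments
print(recall_at_effort(submitted, relevant, effort=5))   # 0.75
```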

Catalyst’s Experiment

This year, we wanted to compare two different approaches to initial training (aka seeding) and see what effect they might have on our algorithm. With 34 topics to run, you can imagine that none of us wanted to spend much time on the training process. Instead, we decided to limit our efforts to a few minutes per topic and see how well our algorithm would do with limited input.

The team consisted of four people from Catalyst:

  1. Tom Gricks: Licensed attorney, former senior partner and one of the first to use TAR in a legal case.
  2. Bayu Hardi: Project consultant at Catalyst, with over 10 years of experience managing e-discovery cases/matters.
  3. Jeremy Pickens: Our senior research scientist who created the algorithm we are using.
  4. John Tredennick: Trial lawyer and former partner at Holland & Hart who founded Catalyst before there was an “e” in front of the word discovery.

We used two modes for training: 1) one-shot (single query) and 2) iterative (multiple queries).

For the one-shot mode, our task was to generate a single query based on our reading of only the topic title and description. The first 25 documents retrieved (relevant or non-relevant) were submitted as training seeds. We were not allowed to review the documents for relevance or modify our initial query after checking the results. We just searched and sent in the first 25 results regardless of their content.

For the iterative mode, we were free to create several queries and check the results to see whether the searches were effective—with the requirement that we submit to TREC every document viewed in the process. In general, we had about 15 minutes to read the topic, think about it, formulate queries, review documents and formulate additional queries, and find about 25 seeds. Again, we ended up submitting everything we looked at in the process (relevant, non-relevant, totally off topic) as seeds for initial training.
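
To make the two protocols concrete, here is a schematic sketch of how the seed-gathering step might look in code. The search function, document IDs and parameter names are hypothetical stand-ins, not part of TREC's tooling or our own software.

```python
# Schematic sketch of the two seeding protocols; the `search` callable,
# document IDs and parameter names are hypothetical stand-ins.

def one_shot_seeds(search, query, k=25):
    """Single query, no relevance review, no query revision:
    the first k hits become the training seeds as-is."""
    return search(query)[:k]

def iterative_seeds(search, queries, budget=25):
    """Several queries, keeping every document viewed (relevant,
    non-relevant or off topic) until roughly `budget` seeds exist."""
    viewed = []
    for query in queries:
        for doc_id in search(query):
            if doc_id not in viewed:
                viewed.append(doc_id)
            if len(viewed) >= budget:
                return viewed
    return viewed

# Toy usage against a fake keyword index.
index = {"olympics": ["d3", "d8", "d1"], "olympic bid florida": ["d8", "d5"]}
search = lambda q: index.get(q, [])
print(one_shot_seeds(search, "olympics", k=2))                        # ['d3', 'd8']
print(iterative_seeds(search, ["olympics", "olympic bid florida"], budget=4))
```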

Each of us worked independently on our assignments, recording the searches we ran and the time spent on each topic. (For the record, John Tredennick did most of his work on the back porch enjoying a late summer day.) We divided the 34 topics such that, for each topic, two of us worked in one-shot mode and two in iterative mode. Thus, each of us did 17 topics as one-shot queries and 17 in iterative mode.

You can read our longer, more detailed report here: An Exploration of Total Recall with Multiple Manual Seedings. Figure 6 in the report shows the amount of time each team member spent in the iterative mode (the one-shot mode took only a couple of minutes) and the number of queries run.

John Tredennick was Reviewer 2. Being lazy by nature, he rarely spent more than 10 or 11 minutes on an iterative topic. Tom Gricks was Reviewer 4. Probably because hunting season had not started in Pennsylvania, he spent as much as 30 minutes or so on his iterative topics.

What Did We Learn?

Would the type of training we did, or who did the training, make a difference in our results? The answer was no: neither factor made a difference. The results across the board were similar in almost all cases, and all but identical in many, regardless of who did the training and how it was done (one-shot or iterative).

You can judge this for yourself. For each of our 34 topics, we plotted what the industry calls a “gain” curve showing the rate at which we would have found relevant documents for each of our reviewers. A gain curve is a simple tool to show how fast the algorithm produced relevant documents and to compare the efforts of different reviewers or different runs of the algorithm.

As an example, here is the gain curve for Topic 401 (described above):

Topic 401: Results from four reviewers using either one-shot or iterative training, plotted against the full document population.

The X axis shows the number of documents reviewed, expressed as a percentage of the total documents available for review. In this case, there were 290,000 documents in the collection. The Y axis shows the percentage of relevant documents found during the course of the simulated review.

The diagonal line in the middle represents what might be expected in a linear review. It shows that if a reviewer finished 20 percent of the documents in the collection, she would likely have seen 20 percent of the relevant documents. At 50 percent, she would have seen 50 percent of the relevant documents. And so on.

The colored lines show the progress for each reviewer based on the initial seeds submitted. Reviewer 1 is red, Reviewer 2 is blue, Reviewer 3 is green and Reviewer 4 is brown.

A solid line means that the reviewer did a one-shot search for this topic. A dashed or dotted line means the reviewer did iterative searching.
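
For readers who want to draw this kind of chart from their own review logs, here is a minimal matplotlib sketch of a gain curve plotted against the linear-review diagonal. The review depths and recall values are made-up placeholders, not the actual Topic 401 data.

```python
# Minimal gain-curve sketch; the numbers are made-up placeholders,
# not the actual Topic 401 results.
import matplotlib.pyplot as plt

collection_size = 290_000                          # documents in the collection
docs_reviewed = [0, 200, 400, 800, 2_000, 5_000]   # hypothetical review depths
recall = [0.0, 0.40, 0.65, 0.80, 0.88, 0.93]       # hypothetical recall at each depth

pct_reviewed = [100 * d / collection_size for d in docs_reviewed]

plt.plot(pct_reviewed, [100 * r for r in recall], label="CAL review (sketch)")
plt.plot([0, 100], [0, 100], linestyle="--", label="Linear review baseline")
plt.xlabel("Percent of total documents reviewed")
plt.ylabel("Percent of relevant documents found (recall)")
plt.legend()
plt.show()
```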

For this topic, all of the review lines rocketed upward to about 85 percent recall immediately. In essence, a review using our algorithm and a handful of starter seeds would have found 85 percent of the relevant documents after reviewing only a small fraction of the population. From there, the lines diverged a bit, with Reviewer 3 (green) having slightly better results than, say, Reviewer 1 (red). In this case, however, the review would likely have stopped long before the individual lines diverged. And the differences in the numbers of documents viewed were minimal.

Zoomed In View

Using the same data from Topic 401, we decided to zoom in on the top of the ranked list to make it easier to see the distinctions between the individual reviewers as they reviewed and submitted their first few documents. In this case, the X axis runs from zero to about 0.28 percent of the total population, meaning up to the first 800 documents (out of 290,000) in the ranked list.

Topic 401: Zoomed-in results from four reviewers using either one-shot or iterative training, plotted against the full document population. Shows approximately the first 800 documents reviewed (out of 290,000).

The chart shows that each team member would have reached the 80 percent recall mark after reviewing only 800 (out of 290,000) documents.

That number is far smaller than the training alone required in a TAR 1.0 process (thousands of documents), even if you don’t count the 500 random documents required to create the initial control set.
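
As a rough back-of-the-envelope comparison (the 2,000-document TAR 1.0 training figure below is our own assumption standing in for “thousands”; the 500-document control set and the roughly 800-document CAL result come from the discussion above):

```python
# Back-of-the-envelope comparison. The 2,000-document TAR 1.0 training
# figure is an assumed placeholder for "thousands", not a measured number.
tar1_training_review = 2_000   # assumed SME training review (TAR 1.0)
tar1_control_set = 500         # random control-set documents (from the text)
cal_review_to_80 = 800         # documents reviewed to reach ~80% recall (from the text)

print(f"TAR 1.0 setup alone: {tar1_training_review + tar1_control_set:,} documents")
print(f"CAL to ~80% recall:  {cal_review_to_80:,} documents")
```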

How About the Other Topics?

We plotted our results for all 34 of the TREC topics, which you can see in our report.

Here is how the first four topics came out:

Four Topics Plotted

You can see the natural variation from topic to topic, but in each case the reviewers found the bulk of the relevant documents (well over 80 percent in almost every case, and sometimes more than 90 percent) after reviewing only a fraction of the total population.

Conclusions

We offer the following summary conclusions based on our TREC experiment:

  1. Not much initial training is required for a CAL system to be effective.

There are a lot of different views out there about how one should train a TAR system. Our experience and this TREC experiment suggest that you don’t need to invest a lot of time in initial training. Give the system as many good examples as you can reasonably find (and one may be enough) and get going with your review.

  2. Who does the training doesn’t seem to matter, at least for a continuous active learning process.

Our team consisted of people from several different walks of life, with academics, seasoned trial lawyers and a non-lawyer legal professional all reaching the same level of results. While our training efforts were constrained, it didn’t seem to matter who was finding the initial seeds.

Here was a conclusion from our report:

That said, one general observation is that, no matter who the reviewer doing the initial seeding was, high recall is achieved by almost every reviewer on almost every topic after relatively little review of the entire collection of documents. There remains a belief among many industry practitioners engaged in high recall tasks that only experts may select seed documents, that only experts have the capability to initiate a recall-oriented review by selecting the initial training documents.

  3. TAR finds relevant documents quickly and efficiently.

This was only one round of 34 experiments against one set of documents. Nonetheless, you can’t help but realize that TAR, and particularly TAR built around a continuous learning process, works. The savings over a linear review, or even a carefully crafted keyword search protocol, make it a no-brainer whenever you need to find relevant documents.

Gordon Cormack once analogized a CAL system to a bloodhound. You wave a piece of clothing in front of the dog’s nose and he is off to the races. It seems to be the same with TAR 2.0.

Part Two

We plan to write more about our thoughts on the TREC experiment in a follow-on post. Look for it shortly.

Training? We don’t need much stinkin training.

[1] The word “optimal” in this context means finding the desired number of relevant documents with the least amount of effort.

[2] See, e.g., Why are People Talking About CAL? Because it Lets Lawyers Get Back to Practicing Law; Your TAR Temperature is 98.6 — That’s A Pretty Hot Result; Thinking Through the Implications of CAL: Who Does the Training?; A TAR is Born: Continuous Active Learning Brings Increased Savings While Solving Real-World Review Problems; How Much Can I Save with CAL? A Closer Look at the Grossman/Cormack Research Results; Continuous Active Learning for Technology Assisted Review (How it Works and Why it Matters for E-Discovery); Pioneering Cormack/Grossman Study Validates Continuous Learning, Judgmental Seeds and Review Team Training for Technology Assisted Review; The Five Myths of Technology Assisted Review, Revisited; Is Random the Best Road for Your CAR? Or is there a Better Route to Your Destination?; Are Subject Matter Experts Really Required for TAR Training? (A Follow-Up on TAR 2.0 Experts vs. Review Teams); Subject Matter Experts: What Role Should They Play in TAR 2.0 Training?; and TAR 2.0: Continuous Ranking – Is One Bite at the Apple Really Enough?

[3] “Effort” really means total effort. It includes everything you do, not only for review but for training as well. Every document viewed at any stage of the process counts, not just those viewed during review. The training process is not “free”; it is part of the total cost of review.

About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named by the American Lawyer as one of the top six “E-Discovery Trailblazers,” being named to the FastCase 50 as a legal visionary and being named one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.

About Thomas Gricks

Managing Director, Professional Services, Catalyst. A prominent e-discovery lawyer and one of the nation's leading authorities on the use of TAR in litigation, Tom advises corporations and law firms on best practices for applying Catalyst's TAR technology, Insight Predict, to reduce the time and cost of discovery. He has more than 25 years’ experience as a trial lawyer and in-house counsel, most recently with the law firm Schnader Harrison Segal & Lewis, where he was a partner and chair of the e-Discovery Practice Group.

About Jeremy Pickens

Jeremy Pickens is one of the world’s leading information retrieval scientists and a pioneer in the field of collaborative exploratory search, a form of information seeking in which a group of people who share a common information need actively collaborate to achieve it. Dr. Pickens has seven patents and patents pending in the field of search and information retrieval. As Chief Scientist at Catalyst, Dr. Pickens has spearheaded the development of Insight Predict. His ongoing research and development focuses on methods for continuous learning, and the variety of real world technology assisted review workflows that are only possible with this approach. Dr. Pickens earned his doctoral degree at the University of Massachusetts, Amherst, Center for Intelligent Information Retrieval. He conducted his post-doctoral work at King’s College, London. Before joining Catalyst, he spent five years as a research scientist at FX Palo Alto Lab, Inc. In addition to his Catalyst responsibilities, he continues to organize research workshops and speak at scientific conferences around the world.