How Good is That Keyword Search? Maybe Not As Good As You Think

Despite advances in machine learning over the past half-decade, many lawyers still use keyword search as their primary tool to find relevant documents. Most e-discovery protocols are built around reaching agreement on keywords but few require testing to see whether the keywords are missing large numbers of relevant documents. Rather, many seem to believe that if they frame the keywords broadly enough they will find most of the relevant documents, even if the team is forced to review a lot of irrelevant ones.

Precision vs. Recall: It Can Be Deceiving

Using terms borrowed from the science of information retrieval, keyword advocates believe they are achieving high recall (the percentage of all relevant documents that the search finds) while hoping that precision (the percentage of retrieved documents that are actually relevant) also stays high. Most know there is a tradeoff between the two: broaden a search to improve recall and precision tends to drop. What they may not realize is that the tradeoff cuts both ways: tighten a search to improve precision and recall tends to suffer. As a result, when the keywords seem to bring back a lot of relevant documents (i.e., the search is precise), they become convinced they have found most of the relevant documents in the bargain (i.e., they also have high recall).

Is that true? Most search experts would say no. If your search goal is to find a nearby Italian restaurant and your search brings back several, that feels like a good result. However, if your goal is to find everything responsive to your search, the result is not so good if there are many local Italian restaurants your search did not return. High precision does not often come with high recall.
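To make the two measures concrete, here is a minimal sketch of how precision and recall are computed. The counts are invented for illustration and are not drawn from any real matter:

```python
def precision(relevant_retrieved, total_retrieved):
    """Share of retrieved documents that are actually relevant."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Share of all relevant documents that the search retrieved."""
    return relevant_retrieved / total_relevant

# Hypothetical matter: 10,000 relevant documents exist in the collection.
# A keyword search returns 5,000 documents, 4,000 of which are relevant.
p = precision(4_000, 5_000)    # 0.80 -- the search "feels" good
r = recall(4_000, 10_000)      # 0.40 -- yet most relevant documents were missed

print(f"precision = {p:.0%}, recall = {r:.0%}")
```

A reviewer who sees eight of every ten retrieved documents come back relevant may assume the search worked, even though in this toy example it left 60% of the relevant documents behind.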

Consider this Case Study

Our client collected over 1.4 million documents and loaded them into Insight. Using keyword terms agreed upon by both sides, we culled the population down to around 250,000 documents. These documents comprised the initial review set. The thought, at least from those who created the keyword set, was that most of the relevant documents would be found here.

Our team was concerned (rightly so it turned out) about leaving so many documents out of the review population. To be sure, there were a good number of relevant documents to be found after running search terms. However, what did the keyword searches leave out? What percentage of relevant documents (recall) did the keyword searches deliver?

To find out, we pulled a random sample from this “discard pile.” That sample turned out to be 36% responsive, which translates into over 420,000 responsive documents out of the nearly 1.2 million documents that were not culled into the original review set. This means that the number of responsive documents left behind would have been nearly double the size of the entire keyword-culled review population.

Put another way, even if we assumed that all of the keyword hits were responsive (which was not the case), the keyword searches found only about 38% of the responsive documents. That doesn’t seem adequate by anyone’s standards.
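The recall arithmetic from this case study can be reproduced in a few lines. The inputs below are the article's rounded figures, so the outputs land near, not exactly on, the numbers quoted above:

```python
collected = 1_400_000                    # documents collected and loaded
review_set = 250_000                     # keyword-culled review set (rounded)
discard_pile = collected - review_set    # roughly 1.15 million left behind

sample_rate = 0.36                       # responsive rate from the discard-pile sample
missed = discard_pile * sample_rate      # responsive documents left behind

# Best case for the keywords: assume every keyword hit is responsive.
best_case_recall = review_set / (review_set + missed)

print(f"estimated responsive docs missed: {missed:,.0f}")
print(f"best-case keyword recall: {best_case_recall:.0%}")
```

With these rounded inputs the missed count comes out a little under the article's 420,000 figure, because the actual set sizes were not exact round numbers; the best-case recall still lands in the high 30s.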

In light of this alarming information, the client decided to add most of the discard pile back into the review set. They will obviously need to review many more documents than they had originally planned, but with Predict the review will be both defensible and efficient.

The Problem with Keywords

This problem is similar to one that we wrote about several years ago: An Open Look at Keyword Search vs. Predictive Analytics. And earlier we reported on a client team who seemed upset that our predictive ranking tool suggested they needed to review more documents than came back from their keyword searches: My Key Word Searches are Better than Your Predictive Ranking Technology!.

The problem with keywords can be illustrated with this four quadrant chart:

In this case, the client had used keyword searches to narrow a relatively small document population. Using keywords, they found just over 11,000 documents to review. Of those, it appeared that 7,267 were likely responsive. About 4,000 were likely not responsive. Thus, search precision was about 66%, which is pretty good. That means you could expect that more than six of 10 documents reviewed would be responsive.

However, when we ran Insight Predict, we found another 11,000 documents that seemed likely to be responsive. Assuming they were, that would suggest that keywords found only 39 percent of the responsive documents. Once again, that isn’t a good result. It certainly surprised the client.
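The same kind of check applies to the quadrant numbers above. Using the rounded counts from the text, the computed percentages come out close to, though not exactly at, the figures quoted:

```python
keyword_responsive = 7_267       # keyword hits likely responsive
keyword_nonresponsive = 4_000    # keyword hits likely not responsive (rounded)
predict_found = 11_000           # additional likely-responsive docs from Predict (rounded)

kw_precision = keyword_responsive / (keyword_responsive + keyword_nonresponsive)
kw_recall = keyword_responsive / (keyword_responsive + predict_found)

print(f"keyword precision ~ {kw_precision:.0%}")   # roughly two-thirds
print(f"keyword recall ~ {kw_recall:.0%}")         # roughly 40%
```

The small differences from the 66% and 39% figures in the text come from rounding the "about 4,000" and "another 11,000" counts.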

How Can I Improve Keyword Searches?

There is nothing wrong with using keywords or metadata features (e.g., doc date, custodian, filetype) to cull a data set before review. But if you are going to pare down a data set, you must do so in a principled manner that at the very least includes sampling the data you are proposing to leave behind. Below are a few tips to help you get the most out of your keyword searches:

1. Cover all relevant topics.

Far too often we see search term lists that cover only a few key areas, leaving several relevant topics underrepresented or completely neglected. Examine the complaint, the responsiveness criteria, or any other document that specifies the relevant subject matter. Outline the various topics and plan searches to cover all the necessary areas. Without complete coverage, you are almost certain to end up with an impoverished keyword set and an incomplete result set.

2. Start broad and fine-tune as you go.

It’s best to start with broad terms that over-capture and work on the precision later if needed. Always keep in mind that recall is much more important than precision for keyword culling, so err on the side of over-inclusion. When you’re working on a new topic, a good way to begin is to think, “A document isn’t likely to be relevant if it doesn’t contain one of these terms.” That list is your starter set. Those extremely precise terms that everyone wants to include are great for identifying hot documents, but they’re usually far too specific for culling.

3. Include word variants (stem search).

Some review platforms have stem search capabilities, but they don’t all work the same and they aren’t always transparent about which variants they include in searches. It’s a good idea to think of all possible relevant variants and string them together with ORs. If you want to include variants of taxes, you could compose a search such as (“tax” OR “taxes” OR “taxing” OR “taxation” OR “taxable”).

Be careful with wildcards; overinclusion is better than underinclusion, but if your database contains lots of references to taxi cabs, taxiing airplanes, or taxidermy, you might not want to use tax* to capture all its variants without some testing (see Step 5).
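If your platform lacks a reliable stem search, the OR-chaining above is easy to generate, and a quick check of what a wildcard would sweep in can tell you whether `tax*` is safe. This is a sketch; the index term list below is invented for illustration:

```python
def or_query(variants):
    """Join explicit word variants into a quoted OR search string."""
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"

print(or_query(["tax", "taxes", "taxing", "taxation", "taxable"]))
# ("tax" OR "taxes" OR "taxing" OR "taxation" OR "taxable")

# What would the wildcard tax* actually match? Check it against a list of
# indexed terms (hypothetical here) before committing to it.
index_terms = ["taxes", "taxation", "taxable", "taxi", "taxiing", "taxidermy"]
wildcard_hits = [t for t in index_terms if t.startswith("tax")]
print(wildcard_hits)  # includes taxi, taxiing, taxidermy -- likely noise
```

Spelling out the variants keeps the query transparent, which matters when you have to explain to the other side exactly what your culling searches did.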

4. Include synonyms for important terms.

An online thesaurus is a great resource for synonyms. Simply typing “synonyms for X” into a Google search can give you fairly comprehensive results. For industry-specific terms, you might have to do a bit more digging or ask a subject matter expert. Even if it takes a little time to compile a list of insider jargon for a term, the extra work is worth it since the closer a person is to a particular subject matter, the less likely they are to refer to it by its official name.

5. Test, revise, and re-test your search terms.

There’s both an art and a science to developing a comprehensive search term list, and as with any scientific process, some testing is required. Start off by running your searches individually or in small topically related groups. It’s fine to run all your searches together to get an idea of the total hit count, but this is not a good tactic for the testing and revision process. If the results indicate that you need to revise, only change one element at a time when you’re re-testing. Otherwise it can be difficult to tell which of the changes you made had an impact on the results.

When analyzing the results, take a random sample if there are too many to look through. A random sample of 50 documents can tell you a lot more than looking through the first few hundred results. Review tools normally display results ordered by Bates number or doc date, which tends to group similar documents together and will skew your impression if you haphazardly look through documents instead of taking a random sample.
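Drawing the random sample is straightforward. The sketch below uses Python's standard library rather than relying on whatever order the review tool displays; the document IDs are hypothetical:

```python
import random

def sample_hits(doc_ids, n=50, seed=42):
    """Return a reproducible random sample of search hits to review.

    Sampling avoids the bias of eyeballing the first page of results,
    since Bates-number or date ordering clusters similar documents
    together by custodian and time period.
    """
    rng = random.Random(seed)   # fixed seed so the sample is repeatable
    return rng.sample(doc_ids, min(n, len(doc_ids)))

hits = [f"DOC-{i:06d}" for i in range(1, 12_001)]   # pretend 12,000 keyword hits
sample = sample_hits(hits)
print(f"{len(sample)} documents drawn for review, e.g. {sample[0]}")
```

Fixing the seed means the same sample can be re-pulled later if the review needs to be defended or audited.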

6. Test both the culled-in and culled-out document sets.

Pull random samples from both the keyword hits (the proposed review set) and the documents not hit by the keywords (the discard pile). You will not be able to adequately measure the effectiveness of your keyword culling unless you compare the relevance rates of both sets. If you only test the discard pile and get a relevance rate of 3%, you might be tricked into a false sense of security if it turns out that the relevance rate of the review set is only slightly higher at 5%. Similarly, if you only test the review set and get a high relevance rate, that tells you nothing about the number of relevant documents you left behind in the discard pile. As we mentioned earlier in the case study, our client would have excluded an estimated 420,000 relevant documents had they not done the right thing, which was to thoroughly test the discard pile before permanently setting it aside.
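The both-sides comparison reduces to simple arithmetic. Here is a sketch using the hypothetical 3% and 5% rates from the paragraph above; the set sizes are invented for illustration:

```python
def estimate_recall(review_size, review_rate, discard_size, discard_rate):
    """Estimate keyword recall from relevance rates sampled in both sets."""
    relevant_in_review = review_size * review_rate
    relevant_in_discard = discard_size * discard_rate
    return relevant_in_review / (relevant_in_review + relevant_in_discard)

# A 5% relevance rate in the review set sounds better than 3% in the
# discard pile -- until you notice the discard pile is four times larger.
est = estimate_recall(review_size=200_000, review_rate=0.05,
                      discard_size=800_000, discard_rate=0.03)
print(f"estimated keyword recall: {est:.0%}")   # well under half
```

Even though the review set's relevance rate is higher, most of the relevant documents sit in the larger discard pile, which is exactly the false sense of security the tip warns about.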

The Bottom Line

Whether you are using keyword searching alone or in some combination with machine learning, your goal should be to get it right. No matter how carefully you craft your search terms, keyword searching is imprecise. The only way to be sure your searches are sufficiently comprehensive is by testing your results.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as Chair of the ABA's Law Practice Section and edited its flagship magazine for six years. John's legal and technology acumen has earned him numerous awards, including being named one of the top six "E-Discovery Trailblazers" by the American Lawyer, named to the FastCase 50 as a legal visionary and named one of the "Top 100 Global Technology Leaders" by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.


About Andrew Bye

Andrew is the director of machine learning and analytics at Catalyst, and a search and information retrieval expert. Throughout his career, Andrew has developed search practices for e-discovery, and has worked closely with clients to implement effective workflows from data delivery through statistical validation. Before joining Catalyst, Andrew was a data scientist at Recommind. He has also worked as an independent data consultant, advising legal professionals on workflow and search needs. Andrew has a bachelor’s degree in linguistics from the University of California, Berkeley and a master’s in linguistics from the University of California, Los Angeles.