Using TAR Across Borders: Myths & Facts

As the world gets smaller, legal and regulatory compliance matters increasingly encompass documents in multiple languages. Many legal teams involved in cross-border matters, however, still hesitate to use technology assisted review (TAR), questioning its effectiveness and ability to handle non-English document collections. They perceive TAR as a process that involves “understanding” documents. If the documents are in a language the system does not understand, then TAR cannot be effective, they reason.

The fact is that, done properly, TAR can be just as effective for non-English documents as it is for English documents. This is true even for complex Asian languages such as Chinese, Japanese and Korean (CJK). Although these languages do not delimit words with spaces or Western-style punctuation, they are nonetheless candidates for the successful use of TAR.

Of course, computers don’t actually “understand” anything (so far, at least). They are “dumb as a box of rocks,” as someone once told me. Rather, TAR programs simply catalog the words in documents and apply mathematical algorithms to identify relationships among them. To be more precise, we call what TAR algorithms recognize “tokens,” because often the fragments are not even words, but numbers, acronyms, misspellings or even gibberish.

The question, then, is whether computers can recognize and analyze tokens (words or otherwise) when they appear in non-English languages. The simple answer is yes. To understand why TAR can work with non-English documents, you need to know two basic points:

  1. TAR doesn’t understand English or any other language. It uses an algorithm to associate tokens (words or otherwise) with relevant or non-relevant documents.
  2. To use the process for non-English documents, particularly those in CJK languages, the system has to first tokenize the document text so it can identify individual words. (We recently wrote about how to avoid common pitfalls associated with the challenges of CJK character tokenization.)

We will hit these topics in order, and then provide a case study.

1. TAR Doesn’t Understand English

It is beyond the scope of this post to provide a detailed explanation of how TAR works, but a basic explanation will suffice for our purposes. Let us start with this: TAR doesn’t understand English or the actual meaning of documents. Rather, it simply analyzes words algorithmically according to their frequency in relevant documents compared to their frequency in other documents.

Think of it this way: We train the system by marking documents as relevant or non-relevant. When we mark a document relevant, the computer algorithm analyzes the words in that document and ranks them based on frequency, proximity or some other such basis. When we mark a document non-relevant, the algorithm does the same, this time giving the words a negative score. As the review progresses, the computer sums up the analysis from the individual training documents and uses that information in its continued ranking.

While each algorithm may work differently, think of a TAR system as creating huge searches using the words found in the documents. It might use 10,000 positive terms (found in relevant documents), with each ranked for importance. It might similarly use 10,000 negative terms (found in non-relevant documents), with each ranked in a similar way. The search results would come back sorted by importance, with the most likely relevant documents coming first. None of this requires that the computer know English, the meaning of the documents or even the words in them. All the computer needs to know is which words are contained in which documents.
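To make the idea concrete, here is a minimal Python sketch of that kind of token-based ranking. It is a toy, not any vendor’s actual algorithm: it simply weights each token by how often it appears in relevant versus non-relevant training documents, then sorts the collection by the summed weights.

```python
# A toy token-weight ranker: the "engine" knows nothing about English,
# only which tokens appear in which documents. Illustrative only.
from collections import Counter

def train_weights(training_docs):
    """training_docs: list of (tokens, is_relevant) pairs."""
    relevant, non_relevant = Counter(), Counter()
    for tokens, is_relevant in training_docs:
        (relevant if is_relevant else non_relevant).update(tokens)
    # Positive weight for tokens seen in relevant documents,
    # negative weight for tokens seen in non-relevant ones.
    return {t: relevant[t] - non_relevant[t]
            for t in set(relevant) | set(non_relevant)}

def score(tokens, weights):
    """Sum the token weights; higher means more likely relevant."""
    return sum(weights.get(t, 0) for t in tokens)

training = [
    (["patent", "claim", "infringement"], True),   # marked relevant
    (["lunch", "menu", "friday"], False),          # marked non-relevant
]
weights = train_weights(training)

collection = {
    "doc1": ["patent", "filing", "claim"],
    "doc2": ["friday", "lunch", "plans"],
}
# Present the collection most-likely-relevant first.
ranked = sorted(collection, key=lambda d: score(collection[d], weights),
                reverse=True)
for doc_id in ranked:
    print(doc_id, score(collection[doc_id], weights))  # doc1 2, doc2 -2
```

Notice that nothing in the sketch cares what language the tokens are in; they are just strings being counted.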

2. If Documents Are Properly Tokenized, the TAR Process Will Work

Tokenization may be an unfamiliar term to you but it is not difficult to understand. When a computer processes documents for search, it pulls out all of the words and places them in a combined index. When you run a search, the computer doesn’t go through all of your documents one by one. Rather, it goes to an ordered index of terms to find out which documents contain which terms.

That’s why search works so quickly. Even Google works this way, using huge indexes of words. As we mentioned, however, the computer doesn’t understand words or even that a word is a word. Rather, for English documents it identifies a word as a series of characters separated by spaces or punctuation marks. Thus, it recognizes the words in this sentence because each has a space (or a comma) before and after it. Because not every group of characters is necessarily an actual “word,” information retrieval scientists call these character groupings “tokens,” and the act of identifying these tokens for the index “tokenization.”

All of these are tokens:

  1. Bank
  2. door
  3. 12345
  4. barnyard
  5. mixxpelling

They would likely all be kept in a token index for fast search and retrieval.
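Here is what such an index looks like in miniature, sketched in Python (real search engines add normalization, stemming and much more):

```python
# A miniature inverted index: map each token to the documents containing
# it, so a search consults the index instead of scanning every document.
from collections import defaultdict

docs = {
    "doc1": "Bank door 12345",
    "doc2": "barnyard door mixxpelling",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    # English-style tokenization: split on whitespace, lowercase the
    # result. (Real engines also strip punctuation and more.)
    for token in text.lower().split():
        index[token].add(doc_id)

print(sorted(index["door"]))   # ['doc1', 'doc2']
print(sorted(index["12345"]))  # ['doc1'] -- numbers are tokens too
```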

Certain languages, such as the CJK languages, don’t delineate words with spaces or western punctuation. Rather, their characters run together, often with no breaks at all. It is up to the reader to tokenize the individual words or phrases in order to understand their meaning.

Many early English-language search systems couldn’t tokenize Asian text, which often made for disappointing search results. More advanced search systems have special tokenization engines designed to index Asian languages, as well as other languages that don’t follow Western conventions (e.g., Arabic). They provide more accurate search results than their less-advanced counterparts.
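A quick illustration of the problem and the fix, using jieba, a widely used open-source Chinese segmenter (shown purely as an example of the technique, not as the tokenizer inside any particular product):

```python
# Why whitespace tokenization fails for CJK text: a Chinese sentence has
# no spaces, so naive splitting yields a single useless "token." A
# segmentation library recovers the individual words.
import jieba  # pip install jieba

sentence = "我来到北京清华大学"  # "I came to Tsinghua University in Beijing"

print(sentence.split())      # ['我来到北京清华大学'] -- one giant token
print(jieba.lcut(sentence))  # ['我', '来到', '北京', '清华大学']
```

Once the text is segmented into real words, the indexing and ranking machinery described above works exactly as it does for English.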

Similarly, the first TAR systems were focused on English-language documents and could not process Asian text. A few more advanced systems (like Catalyst Insight) had a text tokenizer built in to make sure that different languages were handled properly. As a result, these systems can analyze Chinese and Japanese documents just as if they were in English. Word frequency counts work just as well for these documents, and the resulting rankings are just as effective.

A Case Study to Prove the Point: Using TAR Continuous Active Learning to Cut Costs by Over 85% for Japanese Patent Litigation

In a recent project, our client was a multinational Japanese company facing a large document production in an international patent dispute. The initial review collection exceeded 2 million documents. After a series of rolling uploads, which continued throughout the review, the population slated for review grew to 3.6 million. Facing millions in review costs, the client sought an alternative to linear review.

The client’s goal was to finish the review in four weeks with a small team handling the project. The documents were primarily in Japanese, with some English in the mix, and many involved highly technical subject matter.

Tokenizing Japanese Documents

TAR 2.0 systems such as Catalyst’s Insight Predict employ special software to tokenize Japanese and similar languages. They are able to analyze the text and break out actual words and word phrases, not just arbitrary groups of characters. Once the Japanese documents were properly tokenized, the TAR 2.0 process could index and analyze them more effectively.
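For illustration, here is what that segmentation step looks like using Janome, an open-source Japanese morphological analyzer (an example of the general technique, not the tokenizer built into Insight Predict):

```python
# Japanese segmentation with Janome: breaking unspaced text into the
# words and phrases an index can use.
from janome.tokenizer import Tokenizer  # pip install janome

t = Tokenizer()
text = "特許侵害訴訟に関する文書"  # "documents concerning patent infringement litigation"

# wakati=True yields just the surface forms -- the "words" an index needs.
print(list(t.tokenize(text, wakati=True)))
# e.g. ['特許', '侵害', '訴訟', 'に関する', '文書']
```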

Estimating Richness and Training

Even though the client had taken steps to remove junk and other documents not subject to production, the collection’s estimated richness was still minuscule. An initial systematic random sample of 1,000 documents (97% confidence with a 3.5% margin of error) suggested that a linear review would surface fewer than six relevant documents in every thousand. As is often the case in litigation, richness (or prevalence) was low, at about 0.6%.
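For readers who want to check the arithmetic, here is a back-of-the-envelope version of that estimate, assuming a simple binomial model; the hit count shown is illustrative (six hits in 1,000 is consistent with the 0.6% figure):

```python
# Back-of-the-envelope check of the sample statistics under a simple
# binomial model. Hit count is illustrative, not the matter's raw data.
import math

n = 1000   # documents in the systematic random sample
hits = 6   # relevant documents found in the sample
z = 2.17   # z-score for a 97% confidence level

p = hits / n                          # point estimate of richness
moe = z * math.sqrt(p * (1 - p) / n)  # margin of error at the observed p
worst = z * math.sqrt(0.25 / n)       # the quoted 3.5% assumes p = 0.5

print(f"richness ~ {p:.1%} +/- {moe:.2%}")   # richness ~ 0.6% +/- 0.53%
print(f"worst-case margin: +/- {worst:.1%}") # +/- 3.4%
```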

Before Catalyst was engaged, a team of lawyers had reviewed about 10,000 documents found through keyword search. For some TAR 1.0 engines, which require randomly selected training seeds, these judgments would have been of no use. Because Predict is a TAR 2.0 engine that uses continuous learning and continuous ranking, we could use these judgments to start the ranking.

As you can see from the yield curve below, the initial training using the 10,000 seeds indicated that almost all of the relevant documents could be found after reviewing just 17% of the total review population. This meant that the review team could immediately exclude most of the non-relevant documents and start finding relevant documents many times faster than using other methods. There was no need for the team to spend nonproductive hours looking at largely non-relevant files selected randomly for initial training.

Optimizing Review with Continuous Active Learning

The initial training worked. Richness in the documents presented to the review team jumped from 0.6% to as much as 35%, a nearly 60-fold improvement in review efficiency. At the same time, the reviewers received a mix of documents selected for contextual diversity. This proprietary feature, which we developed for Predict, allows the algorithm to keep finding and training against documents that are different from those already found through keyword search or seen by the reviewers in their initial rounds, resulting in a far more accurate review of the data.
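The contextual diversity algorithm itself is proprietary, but the underlying idea can be approximated generically: keep serving up the unreviewed documents least similar to anything the reviewers have already seen. A toy farthest-first sketch:

```python
# A generic farthest-first sketch of diversity-based selection. This is
# NOT Catalyst's proprietary contextual diversity algorithm -- just one
# simple way to favor documents unlike anything already reviewed.
def jaccard(a, b):
    """Similarity of two token sets: 0 = disjoint, 1 = identical."""
    return len(a & b) / len(a | b) if a | b else 0.0

def most_diverse(unreviewed, reviewed):
    """Pick the unreviewed doc least similar to every reviewed doc."""
    def nearest_similarity(doc_id):
        return max((jaccard(unreviewed[doc_id], r) for r in reviewed),
                   default=0.0)
    return min(unreviewed, key=nearest_similarity)

reviewed = [{"patent", "claim"}, {"license", "royalty"}]
unreviewed = {
    "doc9": {"patent", "filing"},      # resembles what's been seen
    "doc7": {"factory", "schematic"},  # nothing like it reviewed yet
}
print(most_diverse(unreviewed, reviewed))  # doc7
```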

The review continued while the collection team added more documents. Since Predict can continually rank all the documents in the collection, there is no problem adding new documents during the review. As they are added, the documents are ranked and mixed into the total collection.

To the extent they are similar to already-ranked documents, they join the ranking in their proper place. To the extent they are different from what has already been collected, they become candidates for contextual diversity and can be included in the review sets for hands-on evaluation by the reviewers.1
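Putting the pieces together, the continuous active learning loop can be sketched generically like this (an illustration of the technique described above, not Predict’s actual implementation):

```python
# A skeleton continuous-active-learning (CAL) loop. Generic sketch only,
# not Insight Predict's implementation.
from collections import Counter

def train_weights(training):
    """Same toy token-weight scorer sketched earlier in this post."""
    pos, neg = Counter(), Counter()
    for tokens, rel in training:
        (pos if rel else neg).update(tokens)
    return {t: pos[t] - neg[t] for t in set(pos) | set(neg)}

def score(tokens, weights):
    return sum(weights.get(t, 0) for t in tokens)

def cal_review(collection, reviewer, training, batch_size=100):
    """Rank, review the top batch, retrain -- and repeat."""
    reviewed = set()
    while True:
        weights = train_weights(training)
        # Rolling collections need no special handling: documents added
        # to `collection` are simply ranked into the pool on this pass.
        ranked = sorted((d for d in collection if d not in reviewed),
                        key=lambda d: score(collection[d], weights),
                        reverse=True)
        batch = ranked[:batch_size]
        if not batch:
            break
        hits = 0
        for doc_id in batch:
            rel = reviewer(doc_id)                    # human judgment
            training.append((collection[doc_id], rel))
            reviewed.add(doc_id)
            hits += rel
        if hits == 0:
            # Relevant documents have dried up. In practice, confirm
            # with a systematic random sample before stopping.
            break
    return reviewed
```

Note that there is no reference set anywhere in the loop, which is why rolling uploads cause no disruption: new documents are simply scored on the next pass.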

Rolling Collections and Continuous Learning

Through rolling collections over the course of several weeks, the Predict population grew to 3.6 million unique, rankable documents. As the review team found new types of responsive documents and learned more about the case, they could also use a number of Insight search and analytics tools to explore the document population. Every decision they made was continuously fed back into Predict to improve its ranking. When the review team ran out of relevant documents, they stopped the review and conducted a further systematic random sample of the entire population. Here is what they learned:

As you can see from the resulting yield curve, Predict was still pushing relevant documents to the top of the review pile, even after multiple rolling collections were added while the review was in progress.

Ultimately, the total review effort was about 500,000 documents, out of 3.6 million collected for review. Predict allowed the review team to achieve the requisite recall after reviewing only a small fraction of the population, which met the client’s needs for both speed and efficiency.

Results

This case presented a number of challenges. The collection was mostly in Japanese and contained a number of highly technical documents. The richness of the collection was low and it contained a lot of junk. The client was on a tight timeline for review, but collections kept arriving on a rolling basis.

Despite these challenges, we were able to make use of the 10,000 documents the legal team had already reviewed to jumpstart the ranking process and accelerate the review. Even with the collection’s low richness, the team was able to find highly relevant documents many times faster than with any other approach. And because Predict never stopped learning from newly reviewed documents, it continued to improve and help attorneys explore the collection even as new documents were constantly being added.

In the end, using Predict, the client was able to cut the time and cost of its review by over 85% and meet its production deadline.

Conclusion

Today, TAR is successfully applied in many contexts, and it increasingly knows no geographic boundaries. With the proper technology and expertise, TAR can be used with any language, even challenging Asian languages such as Chinese, Japanese and Korean.

Whether for English or non-English documents, the benefits of TAR are the same. By using computer algorithms to rank documents by relevance, lawyers can review the most important documents first, review far fewer documents overall, and ultimately cut both the cost and time of review. In the end, that is something their clients will understand, no matter what language they speak.

_________________________________________________________

1. TAR 1.0 systems typically train against a reference set, which makes handling rolling collections difficult. To be representative, the reference set must be chosen randomly from the entire population and then carefully tagged by a subject matter expert. If new documents are collected during the TAR 1.0 process, you have two options, neither of which is ideal. Either you hope that the new documents are similar enough to those already collected that you don’t need to address them, or you start training again from scratch, discarding the initial reference set and its related training for a new round.


About John Tredennick

A nationally known trial lawyer and longtime litigation partner at Holland & Hart, John founded Catalyst in 2000. Over the past four decades he has written or edited eight books and countless articles on legal technology topics, including two American Bar Association best sellers on using computers in litigation, a book (supplemented annually) on deposition techniques and several other widely read books on legal analytics and technology. He served as chair of the ABA’s Law Practice Section and edited its flagship magazine for six years. John’s legal and technology acumen has earned him numerous awards, including being named one of the top six “E-Discovery Trailblazers” by the American Lawyer, one of the FastCase 50 legal visionaries, and one of the “Top 100 Global Technology Leaders” by London Citytech magazine. He has also been named the Ernst & Young Entrepreneur of the Year for Technology in the Rocky Mountain Region, and Top Technology Entrepreneur by the Colorado Software and Internet Association. John regularly speaks on legal technology to audiences across the globe. In his spare time, you will find him competing on the national equestrian show jumping circuit or playing drums and singing in a classic rock jam band.


About David Sannar

A veteran e-discovery executive with extensive experience in Asia and the Pacific, Dave Sannar is responsible for all Catalyst operations and business growth throughout Japan and Asia, including our operations and data center in Tokyo. Dave has been immersed in the e-discovery industry since 2004, when he became president and COO of AccessData Corp., the second-largest computer forensics software company in the world. Dave spearheaded a restructuring of AccessData that grew its workforce by 200 percent and its sales revenues from $4.2 million to over $10 million in just two years.