How to Avoid Asian Language Pitfalls in Discovery

A surge in cross-border litigation and in enforcement of antitrust law and the Foreign Corrupt Practices Act is subjecting many Asia-based companies to U.S. discovery obligations. While e-discovery is “business as usual” in the U.S., discovery involving companies in Asia is still relatively new and rife with potential pitfalls.

When parties involved in cross-border litigation or investigations are faced with multi-language documents subject to discovery, including the challenging Chinese, Japanese and Korean (CJK) languages, they must understand how to accurately process and index CJK documents for proper search, review and analysis. Many Western search and review systems were not designed to capture the nuances of CJK language complexities. As a result, they offer sub-optimal search results, sometimes finding too many documents and sometimes missing important ones. An understanding of CJK differences can help you select the right technology and experts.

In this article, I will address language identification, character encoding and tokenization. These are the three critical steps necessary for accurate processing, search and review of Asian language documents.  Selecting e-discovery software that is optimized for Asian language reviews, like Catalyst’s Insight platform, will lead to time and cost savings.

1. Language Identification

When dealing with Asian language documents, language identification is the most important first step. Without proper language identification, you will end up with gibberish (think Wingdings font). Catalyst’s Insight discovery platform, for example, recognizes over 270 different languages for accurate language identification.

Western European languages are written using the Latin (or Roman) alphabet. It is the most widely used alphabet and writing system in the world today. As such, most e-discovery software has been designed to handle the Latin alphabet properly.

Unlike Latin-based languages, each Asian language has a unique writing system. While the three CJK languages share some characters, they differ significantly in syntax. This can make a difference for both search and analysis.

Japanese, for example, uses three scripts: hiragana, katakana and kanji. The first two are phonetic; each character represents a syllable. The third, kanji, is a logographic system in which a character represents a word or part of a word. Kanji derive from the characters used in written Chinese, where they are called hanzi. Written Chinese uses these characters both as logograms and for their sound values, as when writing a foreign word such as a person’s name. Korean is written mostly in Hangul, a phonetic alphabet of 40 letters (24 basic letters plus compound forms) that are combined into syllable blocks. Chinese characters, called hanja in Korean, also appear in many older Korean documents.
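
The script differences described above suggest a rough first-pass heuristic, sketched here in Python. This is only an illustration under stated assumptions, not how Insight or any other product identifies languages: the function names are hypothetical, only the main Unicode blocks are checked, and Han-only text is labeled Chinese even though short Japanese or older Korean passages can also consist entirely of Chinese characters.

```python
def script_of(ch):
    """Classify one character by its Unicode block (main blocks only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"          # precomposed Hangul syllables
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"             # CJK Unified Ideographs (hanzi/kanji/hanja)
    return "other"


def guess_cjk_language(text):
    """Hypothetical heuristic: guess a language from the scripts present."""
    scripts = {script_of(ch) for ch in text}
    if "hiragana" in scripts or "katakana" in scripts:
        return "Japanese"        # kana is used only for Japanese
    if "hangul" in scripts:
        return "Korean"
    if "han" in scripts:
        return "Chinese"         # assumption: Han-only text is Chinese
    return "unknown"


print(guess_cjk_language("荷物は東京都に到着"))
print(guess_cjk_language("안녕하세요"))
print(guess_cjk_language("中文文件"))
```

A production system would rely on statistical models rather than script ranges alone, because script presence cannot separate languages that share a script.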

2. Character Encoding

As noted above, CJK content is composed of sequences of characters in Hangul (the Korean phonetic alphabet), hiragana and katakana (the Japanese phonetic scripts) and Chinese characters (the logographic writing used in Chinese, in Japanese as kanji and, to a lesser extent, in Korean as hanja). To understand how a computer represents the many thousands of different characters in these writing systems, you need to understand the difference between single byte and double byte characters.

A byte is made up of 8 bits; each bit is either 1 or 0. There are 256 different possible combinations of bits in a byte, so a single byte character set is limited to 256 different characters. (ASCII itself uses only 7 bits, for 128 characters; 8-bit extensions such as ISO 8859-1 use the full range.) This works well for English and other Western European languages that are based upon the Latin alphabet. However, for Asian writing systems with thousands of characters, there are not enough options.

A solution is to combine bytes to create “double byte” characters. Double byte characters have 16 bits which provide for 65,536 possible combinations, enough to satisfy CJK writing systems.
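
The arithmetic above can be checked with a short Python sketch. The sample characters and the choice of legacy codecs are ours, picked only for illustration; each codec name is one that Python's standard library provides.

```python
# Number of distinct values that fit in one byte and in two bytes.
single_byte = 2 ** 8      # 256
double_byte = 2 ** 16     # 65,536

# One sample character per legacy double byte encoding (illustrative choices).
samples = {
    "shift_jis": "東",    # Japanese
    "gb2312": "中",       # Simplified Chinese (Guobiao)
    "big5": "中",         # Traditional Chinese
    "euc_kr": "한",       # Korean
}

# A Latin letter fits in one byte; each CJK sample needs two.
print("A", len("A".encode("ascii")))
byte_lengths = {codec: len(ch.encode(codec)) for codec, ch in samples.items()}
print(byte_lengths)
```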

As double byte character sets were developed to address the CJK character problem, diverse “encoding” systems were developed to handle each language. These character encoding systems were created by different vendors and standards bodies for each language, resulting in competing encoding sets such as Guobiao (GB) and Big5 for Chinese. Likewise, competing encoding systems were developed for Japanese and Korean.

When a document is created, the computer program selects an encoding set to create written characters. As you can imagine, the differing double byte encoding systems were not always compatible, resulting in non-legible gibberish if you were to use the wrong computer system to interpret shared documents. Over time, conversion tables were established so the dissimilar encoded data sets could be shared between systems, but the complexity of encoding systems and languages made for messy data sharing.
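
The incompatibility described above is easy to reproduce. The following Python sketch, using sample strings of our own choosing, encodes Japanese text as Shift_JIS and then reads the bytes back with the wrong and the right encoding. It also shows that the same Chinese character is stored as different bytes under GB2312 and Big5.

```python
text = "東京"

# Bytes as a Japanese system using Shift_JIS would store them.
sjis_bytes = text.encode("shift_jis")

# Reading those bytes as Latin-1 yields unrelated characters (gibberish).
wrong = sjis_bytes.decode("latin-1")
# Reading them with the encoding they were written in restores the text.
right = sjis_bytes.decode("shift_jis")

# The same character has different byte values in competing encodings.
gb_bytes = "中".encode("gb2312")
big5_bytes = "中".encode("big5")

print(repr(wrong), right)
print(gb_bytes.hex(), big5_bytes.hex())
```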

Eventually Unicode was adopted as a standard encoding set for all languages. However, the combination of legacy data sets and current computer programs that allow the user to choose the language encoding means that we still see new data created in non-Unicode encoding formats.
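
Because legacy encodings persist, processing typically involves converting documents to Unicode. Below is a minimal Python sketch of that idea, assuming the caller supplies a short list of candidate encodings. The `to_utf8` function and its candidate list are hypothetical, and a successful strict decode does not prove the guess is right, since some byte sequences are valid in more than one encoding.

```python
def to_utf8(raw, candidates=("utf-8", "shift_jis")):
    """Try each candidate encoding in order; return (utf8_bytes, codec_used).

    Hypothetical helper: real systems combine this kind of check with
    statistical encoding detection and document metadata.
    """
    for codec in candidates:
        try:
            return raw.decode(codec).encode("utf-8"), codec
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding; try the next
    raise ValueError("none of the candidate encodings matched")


legacy = "東京".encode("shift_jis")
converted, used = to_utf8(legacy)
print(used, converted.decode("utf-8"))
```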

3. Tokenization

Tokenization means parsing characters into meaningful units, such as words or numbers.  Computers parse data into logical “tokens” for indexing and then searching data. Accurate indexing results in accurate searching. Some systems, like Insight, are optimized for Asian language reviews, with intelligent tokenization capabilities.

Most text extraction programs use the spaces between words and standard punctuation characters (like a period or semicolon) as the breakpoints for “tokenizing” data for indexing and search.  As such, tokenization occurs naturally for many languages because their writing systems place spaces between words or punctuation at the beginning or end of words.

In Chinese and Japanese, words are not separated by spaces or Western punctuation characters, and in Korean, spaces separate phrase units with attached particles rather than individual words. This creates a challenge for computer systems, which must intelligently recognize the right words within the context of the written language. When the system cannot parse the data into intelligent tokens (words), the result is inaccurate indexing, which leads to inaccurate searching.

Here’s an example in English, where punctuation and spaces can be used to delineate separation between tokens/words.

Input: “Friends, Romans, Countrymen, lend me your ears.”
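
A minimal Python sketch of space-and-punctuation tokenization applied to this input follows. The `tokenize` function is a hypothetical stand-in for what a text extraction program does, not the behavior of any particular product.

```python
import re


def tokenize(text):
    """Split on anything that is not a letter, digit or underscore."""
    return re.findall(r"\w+", text)


english = "Friends, Romans, Countrymen, lend me your ears."
print(tokenize(english))
# ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']
```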



Challenges of tokenization are language-specific.

Here is a sentence in Japanese, which combines three and sometimes four different writing systems:

荷物は東京都に到着予定が明日の3時ぐらいと報告されたのです。
(It was reported the luggage will arrive in Tokyo tomorrow at around 3:00.)

Without either spacing or an understanding of the language, we don’t know where words start and stop.
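
To see the problem concretely, the Python sketch below runs a simple space-and-punctuation tokenizer over the Japanese sentence. The tokenizer is a hypothetical example. With no spaces to split on, it returns the entire sentence, minus the final full stop, as a single token.

```python
import re

japanese = "荷物は東京都に到着予定が明日の3時ぐらいと報告されたのです。"

# Split on anything that is not a letter, digit or underscore.
tokens = re.findall(r"\w+", japanese)

print(len(tokens))   # 1
print(tokens[0])     # the whole sentence without the closing 。
```

An index built this way would match only a query for the complete sentence, so a search for an individual word such as 荷物 (luggage) would find nothing.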

E-discovery software with intelligent tokenization can accurately determine the logical parsing of the Japanese sentence:

荷物|は|東京都|に|到着|予定|が|明日|の| 3 |時|ぐらい|と|報告|された|の|です|。

Accurate tokenization thus results in correct indexing, which in turn supports accurate searching.
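
To illustrate the link between tokenization and search, here is a minimal Python sketch of a word-level index built from the segmented sentence above. The `build_index` helper and the single-document corpus are hypothetical. The example also shows one way false positives can arise: a raw substring search for 京都 (Kyoto) matches inside 東京都 (Tokyo Metropolis), while the token index does not.

```python
from collections import defaultdict

# Tokens copied from the segmented sentence above, stored as document 1.
tokens = ["荷物", "は", "東京都", "に", "到着", "予定", "が", "明日", "の",
          "3", "時", "ぐらい", "と", "報告", "された", "の", "です", "。"]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc_tokens in docs.items():
        for token in doc_tokens:
            index[token].add(doc_id)
    return index


index = build_index({1: tokens})
raw_text = "".join(tokens)

tokyo_hits = index.get("東京都", set())   # word-level hit on document 1
kyoto_hits = index.get("京都", set())     # empty: 京都 is not a token here
naive_kyoto_hit = "京都" in raw_text      # True: substring false positive

print(tokyo_hits, kyoto_hits, naive_kyoto_hit)
```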


In our experience, e-discovery software that does not properly tokenize CJK characters can not only miss key documents but also produce false-positive rates of up to 50 percent, which leads to excessive review costs.

Putting It All Together

Accurate search and review of Asian language documents requires proper language identification and character encoding recognition followed by intelligent tokenization. We have seen a lot of garbled gibberish in the many years we’ve handled Asian data—but when we apply the right technology, gibberish can be translated into clear language.

In Sum: Select the Right Technology and Team

When you know that a cross-border case will involve CJK and mixed language documents, it’s important to select the technology and provider with expertise and on-the-ground support for Asian language e-discovery.


About David Sannar

A veteran e-discovery executive with extensive experience in Asia and the Pacific, Dave Sannar is responsible for all Catalyst operations and business growth throughout Japan and Asia, including our operations and data center in Tokyo. Dave has been immersed in the e-discovery industry since 2004, when he became president and COO of AccessData Corp., the second-largest computer forensics software company in the world. Dave spearheaded a restructuring of AccessData that grew its workforce by 200 percent and its sales revenues from $4.2 million to over $10 million in just two years.