Search Term Translation for eDiscovery: The Science of Text Tokenization Meets the Art of Language

July 7, 2025

The Implications of Text Segmentation & Tokenization: Logographic vs. Syllabic vs. Alphabetic Writing Systems

Logographic and alphabetic languages represent two distinct global writing systems. Logographic languages, such as Chinese and Japanese Kanji, use symbols (logographs) to represent words, with each character conveying meaning rather than sound. For example, the Chinese character 木 represents "tree" regardless of pronunciation. Logographic systems contain vast character sets, often requiring thousands of symbols for literacy.

In contrast, alphabetic languages like English and Spanish use letters to represent sounds, which combine to form words. For instance, the English word "tree" is assembled from the letters T, R, E, E, each contributing to the word's pronunciation. Alphabetic systems have smaller symbol sets (often fewer than 30), making them easier to learn but reliant on phonetic rules. Both systems effectively convey language but differ in complexity.

Syllabic writing systems like Japanese Katakana represent a middle ground between logographic and alphabetic systems. In syllabic systems, characters correspond to a whole syllable rather than a single sound. These characters convey more information than individual letters in an alphabetic system (like English), yet they’re less complex than full words represented by characters in a logographic system (like Chinese).

The critical point: Many of the rules and conventions that apply to search in alphabetic languages don’t hold in logographic and syllabic languages. In fact, strategies that work for a language like English often require a complete rethink—setting aside long-held assumptions entirely.

Let’s dive into why, starting with text segmentation.

Text Segmentation: Alphabetic’s Boundaries vs. Logographic’s Absent Boundaries. The Implications.

Text segmentation is the process of dividing written text into units such as words, sentences, or paragraphs. In the eDiscovery world, indexes are typically "word"-oriented, but the definition of a word varies significantly depending on whether we're talking about an alphabetic or logographic language.

In alphabetic languages like English, French, or Russian, text segmentation is relatively straightforward. That’s because these languages rely on whitespace between words and clear punctuation to mark boundaries, so tokenization follows the principle of separating words based on spaces and punctuation marks. While challenges still exist—such as compound words in German or contractions in English—"word" boundaries are pretty clear from an indexing perspective.

In contrast, logographic languages like Chinese and Japanese don’t use spaces to separate words. Further deepening the challenge, each individual character often carries meaning and functions as a word in its own right.

Example:

Take this Japanese Kanji compound. The name 東芝 is composed of two characters: 東, meaning "east," and 芝, meaning "lawn." On their own, they have distinct meanings, but when combined, they form the proper noun Toshiba, the well-known electronics company. There are no spaces or markers to signal that these characters belong together or that their combined meaning differs from their individual parts.

When looking at the sentence level:

JAPANESE: 東芝側が上告 不正会計巡る旧経営陣への賠償請求訴訟

1 whitespace // 23 “words”

ENGLISH: Toshiba appeals lawsuit seeking damages from former management over accounting fraud.

10 whitespaces // 11 “words”

The presence or absence of whitespace shapes not only how we read and write but also how machines interpret language. As a result of these multi-layered complexities, eDiscovery systems working with Chinese or Japanese text are forced to index at the character level, rather than the whitespace-defined tokens we’re familiar with in English. This drastically alters how search terms need to be structured. Everything from wildcards and brackets to proximity rules gets instantly turned on its head. It’s impossible to overstate the significance this distinction carries.
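To make the mechanics concrete, here is a minimal Python sketch of the two tokenization models described above. It is a simplification for illustration only, not the indexing logic of Relativity or any other eDiscovery platform.

# Simplified sketch of the two tokenization models discussed above; real
# eDiscovery indexes are more sophisticated, but the contrast is the same.
english = "Toshiba appeals lawsuit seeking damages from former management over accounting fraud."
japanese = "東芝側が上告 不正会計巡る旧経営陣への賠償請求訴訟"

# Alphabetic model: whitespace and punctuation mark the word boundaries.
english_tokens = english.rstrip(".").split()

# Logographic model: no word boundaries, so every character becomes a token.
japanese_tokens = [ch for ch in japanese if not ch.isspace()]

print(len(english_tokens), english_tokens[:3])    # 11 tokens, e.g. ['Toshiba', 'appeals', 'lawsuit']
print(len(japanese_tokens), japanese_tokens[:3])  # one token per character, e.g. ['東', '芝', '側']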

Without language-aware and carefully customized Chinese and Japanese search terms, search results will inevitably be incomplete and over-inclusive—usually both at the same time. Failing to use language-aware terms strips away all pretence of an accurate and defensible eDiscovery process. An awareness of segmentation rules, reflected in the term translations, is what separates translations that hit the target from those that miss.

To illustrate, let’s explore how German compound words and Chinese text generally behave in Relativity.

Wildcards and German Compound Words

Let’s start in the West with an English-adjacent language system: German. One of the hallmarks of the German language is the frequent use of ad hoc compound words formed by combining two or more smaller words into a single, cohesive term. This allows for the creation of highly specific and descriptive words on the fly.

A classic example fuses krank ("sick") and Haus ("house") to form Krankenhaus (hospital).

The key tokenization detail here is the lack of whitespace between the component words. In Relativity (and most search systems), everything between spaces is treated as a single token or "word," meaning that relevant search terms might well appear within a larger compound—potentially buried mid-word. This poses a unique challenge for search and eDiscovery, as traditional English keyword structures will miss important hits unless segmentation or compound-splitting logic is applied.

For example:

ENGLISH: water*

TRANSLATED AS wasser*, MISSES: Löschwassereinspeisung (firefighting water supply) // Trinkwasser (drinking water) // Brauchwasser (non-potable water) // Süßwasser (fresh water) // etc.

LINGUIST-LED GERMAN TRANSLATION REQUIRED: *wasser*
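To see the difference in practice, here is a quick Python sketch of the prefix-versus-contains behaviour. It uses fnmatch-style patterns rather than Relativity's actual dtSearch syntax, and Wasserwerk ("waterworks") is added purely to show that the prefix pattern does catch something.

from fnmatch import fnmatch

# Simplified wildcard matching; real engines have their own syntax, but the
# prefix-versus-contains distinction works the same way. Terms are lower-cased,
# as a search index would typically normalize case.
compounds = [
    "löschwassereinspeisung",  # firefighting water supply
    "trinkwasser",             # drinking water
    "brauchwasser",            # non-potable water
    "süßwasser",               # fresh water
    "wasserwerk",              # waterworks (added for illustration)
]

prefix_only = "wasser*"    # naive rendering of the English water*
contains = "*wasser*"      # linguist-led rendering

for term in compounds:
    print(term, fnmatch(term, prefix_only), fnmatch(term, contains))
# Only 'wasserwerk' matches wasser*; every compound matches *wasser*.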

While German is best known for compounding, it’s hardly alone in high-frequency compound word usage. Dutch and the Nordic languages are also big fans of compounding—with all the same implications for search and text analysis.

When it comes to wildcards, Chinese and Japanese completely flip the script. Let’s look at how.

Wildcards and Chinese/Japanese Text

Because Chinese and Japanese text is tokenized and indexed at the character level, wildcards are wholly unnecessary. Wait—what?!

In contrast to English, where tokenization relies on whitespace for delineating words, Chinese and Japanese text is tokenized at the character level, with each individual character indexed as a discrete searchable token. Character-level indexing eliminates the need for wildcard syntax because the root matching typically achieved via wildcarding in English is inherently built into the tokenization process for these languages.

Example:

Searching for the single Chinese character 账 (money-related record keeping) will match words such as 账户 (account) or 记账 (bookkeeping). It will also match 记账员 (bookkeeper), as the smaller term is included in the larger text unit. No wildcards are required for any of those results.
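A rough Python sketch of the idea, with substring containment standing in for character-level index lookups (real CJK indexing is more sophisticated, but the principle is the same):

# Substring containment as a stand-in for character-level index matching.
documents = ["账户", "记账", "记账员"]  # account, bookkeeping, bookkeeper

query = "账"  # a single character, no wildcard anywhere

hits = [doc for doc in documents if query in doc]
print(hits)  # ['账户', '记账', '记账员'] -- every term hits without a wildcard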

The Yin and Yang

This subtle distinction can be hugely beneficial when negotiating with opposing counsel or trawling for information from their evidence pool. Microscopic tweaks to translations can significantly expand a review pool in very advantageous ways.

But wherever a yin dwells, the yang lurks not far away, offering a similarly stark flipside. Character-level tokenization can generate hundreds of thousands of false hits in the review pool. This "lesser includes the greater" principle is a critical strategic consideration—particularly when negotiating English terms with the DOJ, for instance. Agreeing on a term like "account" could result in a Chinese translation that inadvertently pulls in vast volumes of unrelated hits. There are many offensive and defensive search techniques, which we’ll be covering in a future post.
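The same mechanics show the downside. In the sketch below, the character 户 stands in for a short rendering tied to "account" (an illustrative choice, not a recommended translation), and the tiny document set is invented: unrelated words that merely contain the character are swept in as hits.

# "Lesser includes the greater": a short translated term is found inside
# longer, unrelated words. Document set is invented for illustration.
documents = ["账户", "用户", "窗户", "户外活动"]  # account, user, window, outdoor activity

query = "户"  # short character-level rendering connected to "account"

hits = [doc for doc in documents if query in doc]
print(hits)  # all four documents hit, though only 账户 relates to accounts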

This raises another question: Are wildcards harmful with Chinese or Japanese terms?

In a word: no. On a technical level, they’re completely neutral.

If wildcards are, technically speaking, “neutral,” why does it matter if they’re present on a Japanese term? Because they reveal a great deal about the translator’s knowledge: wildcards are a tell-tale sign that those involved in the translation process don’t understand the implications of tokenization, and they point to an almost certain occurrence of over- and under-translation.

Over-translation occurs when additional meaning is introduced in the translation that’s not present in the source term. Conversely, under-translation happens when key information or relevant nuance is omitted. Both issues distort intent and compromise the accuracy of search results in the form of missed relevant documents and false hits—often with both at play simultaneously, and in significant volume.

Under-translation (consultant vs. linguist-led) example:

ENGLISH: account* (i.e., financial domain: account (n.); account (v.); accountant (n.); etc.)

CONSULTANT CHINESE TRANSLATION: 账*

MISSES: 会计 (accountant), because it shares no base character with 账
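A short sketch of why the wildcard doesn't help here. In a character-level index, 账* behaves the same as 账, so both are modelled below as plain containment; the combined term set at the end is an illustrative assumption, not the post's recommended translation.

# In a character-level index the trailing wildcard on 账* is redundant, so
# both 账* and 账 are modelled as a plain containment check.
documents = ["账户", "记账员", "会计"]  # account, bookkeeper, accountant

print([d for d in documents if "账" in d])  # ['账户', '记账员'] -- 会计 is missed

# 会计 shares no character with 账, so it must be enumerated as its own term
# (an illustrative fix only):
terms = ["账", "会计"]
print([d for d in documents if any(t in d for t in terms)])  # all three documents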

At the most basic level, wildcards on Japanese terms indicate that the translator was unaware of the parallel technical translation that needed to take place alongside the linguistic translation—with all the serious repercussions for your case that come with the territory. The presence of a wildcard is a simple but powerful signal: even if you don’t understand a single word of either language, the mere appearance of wildcards should prompt scrutiny of the linguistic work. But this also highlights a deeper truth: this kind of technical detail renders legal colleagues in Tokyo and Beijing no better equipped to assist than linguists. Tokenization isn’t a term most lawyers encounter regularly—let alone something they’re trained or expected to navigate.

Conclusion: Why Search Term Translation for eDiscovery Requires Precision

In the complex world of non-English eDiscovery search, text segmentation and tokenization reveal a critical yet often overlooked reality: effective search requires linguistic precision customized to the writing system, paired with the technical translation and notation necessitated by database systems.

Alphabetic languages thrive on clear boundaries, while logographic systems like Chinese and Japanese require character-level finesse to avoid pitfalls like over-translation or vast pools of missed documents. These errors, often signalled by wildcards, sharply erode accuracy and defensibility.

A tailored, language-specific translation strategy is critical for ensuring precise and comprehensive search results.

Need help navigating non-English eDiscovery? Speak to an expert about how tailored translation strategies can improve search accuracy and defensibility.

By Robert Wagner, Global Director of Multilingual eDiscovery, TransPerfect Legal