Search Term Translation for eDiscovery: Proximity Distance is the Leading Unseen Failure Point

September 2, 2025

This article is the third in a series of “Search Term Translation for eDiscovery” blogs by Robert Wagner.

Today we discuss the humble proximity search operator (W/###) and its role as a crucial tool for boosting search result relevance by defining how close two terms must appear in text. While widely used in the legal industry, it remains one of the least understood search features in multilingual data scenarios. That’s because, unknown to most users, the operator behaves very differently from one language to the next, yielding highly unintended outcomes.

Short Background

Our earlier article delved into text tokenization, which is the process of splitting text into units, commonly called "words." However, a “word” in the technical sense is a murky and complex concept. In essence, accurate proximity searches hinge on two key factors:

  1. Language-specific rules that must be considered, and
  2. How the system counts these “word” units.

Both factors demand a deep understanding of the particular language(s) in play alongside the indexing system’s technical particulars. This article explores both of those points, along with the critical challenges they present in multilingual contexts. If you’ve ever used a proximity operator in English data, prepare to have your mind blown by the non-English realms.

Tokenization & Proximity Distance

Let’s dive in by looking at Japanese versus English, which provides the most extreme contrast between languages. In Relativity, proximity searches in both English and Japanese work by counting the number of “words” between the search terms. But what is a “word”? In English, it’s the text unit between white spaces. But Japanese doesn’t use white spaces; instead, the system counts each character as a “word.”

“TransPerfect” in English and Japanese:

  • ENGLISH: TransPerfect = The index counts this as 1 “word”
  • JAPANESE: トランスパーフェクト = The index counts this as 10 “words”

“Robert Wagner’s” in English and Japanese:

  • ENGLISH: Robert Wagner’s = The index counts 3 “words”
  • JAPANESE: ロバート・ワグナーの = The index counts 9 “words”
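
To make the counting concrete, here is a minimal Python sketch of a tokenizer that reproduces the four counts above. It is not Relativity’s actual tokenizer: the function names, the simplified Unicode ranges, and the rule that punctuation (including the apostrophe and the Japanese middle dot) breaks a token without being counted are all assumptions chosen to match the figures in this article.

```python
def is_cjk(ch: str) -> bool:
    """True for Hiragana, Katakana, and common CJK ideographs (simplified ranges)."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def count_tokens(text: str) -> int:
    """Count index "words": one per CJK character, one per run of other letters/digits."""
    count = 0
    in_run = False  # True while inside a run of non-CJK letters/digits
    for ch in text:
        if not ch.isalnum():
            in_run = False   # spaces and punctuation end the current run
        elif is_cjk(ch):
            count += 1       # every CJK character is its own token
            in_run = False
        elif not in_run:
            count += 1       # first character of a new Latin-script run
            in_run = True
    return count


print(count_tokens("TransPerfect"))         # 1
print(count_tokens("トランスパーフェクト"))   # 10
print(count_tokens("Robert Wagner’s"))      # 3
print(count_tokens("ロバート・ワグナーの"))   # 9
```

A real index may treat punctuation, numbers, and mixed scripts differently, so counts like these should always be checked against the platform in use.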

It’s hard to overstate the significance of this difference in proximity searching. As mentioned in a previous article, the proximity distance in a translated term should rarely match the English, and it’s a red flag when it does. From English to European languages, the adjustment usually falls somewhere between 25% shorter and 25% longer than the English distance, depending on the language.

Proximity distance in European languages grows or shrinks purely for linguistic reasons - for instance, Spanish often requires 20% more words than English to say the same thing, so the proximity needs a 20% uplift.

Example:

  • English term: “TransPerfect” W/50 “translation”
  • Spanish translation: “TransPerfect” W/60 “traducción”

With Asian languages, proximity distances grow purely for technical reasons - every character is counted as a “word.” This means the effective proximity distance for Chinese should be approximately two to three times that of English (a 2:1 or 3:1 ratio) to yield comparable semantic coverage. For Japanese, the ratio is even greater, falling somewhere between 5:1 and 7:1 (or even higher in certain circumstances).

Example:

  • English: “TransPerfect” W/50 “translation”
  • Japanese: “トランスパーフェクト” W/300 “翻訳”

Just imagine the impact of applying English proximity rules to privilege terms translated into Japanese. At a 5:1 ratio, that scenario delivers an effective proximity window that’s 80% tighter, which ironically translates into a flood of missed privileged documents in the review pool.
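
The arithmetic behind these figures can be sketched in a few lines of Python. The ratios in the table are illustrative picks from the ranges quoted above (the upper end for Chinese, the midpoint for Japanese), and the function names are hypothetical; treat this as a rough calculator, not a rule.

```python
# Illustrative ratios picked from the ranges quoted in this article; tune per matter.
RATIO_TO_ENGLISH = {
    "zh": 3.0,  # Chinese: upper end of the 2:1 to 3:1 range
    "ja": 6.0,  # Japanese: midpoint of the 5:1 to 7:1 range
}


def scale_proximity(english_distance: int, ratio: float) -> int:
    """Scale an English W/n distance by a language ratio."""
    return round(english_distance * ratio)


def tightening(ratio: float) -> float:
    """Fraction by which an unadjusted English distance shrinks the effective window."""
    return 1 - 1 / ratio


print(scale_proximity(50, RATIO_TO_ENGLISH["ja"]))  # 300
print(scale_proximity(50, RATIO_TO_ENGLISH["zh"]))  # 150
print(round(tightening(5.0), 2))                    # 0.8
```

At a 5:1 ratio, an unadjusted W/50 covers roughly what W/10 would cover in English, which is where the 80% figure comes from.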

The real challenge lies in incorporating varying proximity distances into query logic. The rest of this article focuses specifically on that issue. It's important to note that common assumptions about how terms behave fall apart here. The complexity of search term translation in this area poses unique difficulties that are seldom addressed in typical workflows used by case teams. Let’s dive in and put the challenge on display.

Squaring a Circle: Re-Engineering Term Logic

On the one hand, best practice requires integrating terms so they’re bilingual-aware and match multilingual documents, ensuring comprehensive retrieval across languages. On the other hand, structural differences between languages complicate proximity thresholds. This raises a fundamental question: How does one reconcile the need for integrated bilingual terms with the differing proximity requirements each language imposes? Let’s look at a practical example, translating a simple three-“word” English term into Chinese.

Basic Monolingual to Bilingual Example

Separate monolingual terms:

  • English source term: ("Shipment") W/50 ("suspend" OR "stop")
  • Chinese translation with uplift: ("发货") W/150 ("暂停" OR "停止")

Integrated bilingual term:

  • ("Shipment" OR "发货") W/??? ("suspend" OR "stop" OR "暂停" OR "停止")

Integrating terms and making them bilingual is easy, but ultimately inadequate when a proximity connector is involved. The integrated term now forces a choice between proximity settings: W/50 or W/150, neither of which can serve both languages effectively at the same time. What works for the Chinese terms will cause the English to over-return. What suits the English will cause the Chinese to underperform. So how does this circle get squared? Making a string bilingual isn’t just linguistic - it’s technical.

The 'Technical Translation': Term Restructuring to Make it Bilingual

Proper translation necessitates a complete reorganization of the term. Rather than applying a single proximity rule across both languages, a much more sophisticated, language-sensitive approach is required. The needed “technical translation” applies customized proximity windows that reflect the linguistic characteristics of each language involved. This involves splitting the term into language-specific variants, each carrying the proximity distance its language requires.

English source term:

("Shipment") W/50 ("suspend" OR "stop")

Bilingual consultant-led translation:

(("发货") W/150 ("suspend" OR "stop" OR "暂停" OR "停止")) OR (("Shipment") W/150 ("暂停" OR "停止")) OR (("Shipment") W/50 ("suspend" OR "stop"))

On a logic level, this particular configuration allows:

  • The Chinese elements to interact with themselves at W/150.
  • The Chinese elements to interact with English elements at W/150.
  • The English elements to interact with themselves at W/50.
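
The three rules above can be sketched as a small query builder. This is one possible way to mechanize the restructuring, not a description of any particular tool: the function names and the rule of using the larger of the two languages’ distances for cross-language pairs are assumptions drawn from the example in this article.

```python
def or_group(terms: list[str]) -> str:
    """Render terms as a quoted, parenthesized OR group."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_bilingual_query(anchors: dict, targets: dict, distances: dict) -> str:
    """Build one clause per (anchor language, distance) pair and OR them together.

    Assumption: when two languages meet, the larger of their distances applies.
    """
    clauses = []
    for a_lang, a_terms in anchors.items():
        buckets: dict[int, list[str]] = {}  # distance -> target terms at that distance
        for t_lang, t_terms in targets.items():
            d = max(distances[a_lang], distances[t_lang])
            buckets.setdefault(d, []).extend(t_terms)
        for d in sorted(buckets, reverse=True):
            clauses.append(f"({or_group(a_terms)} W/{d} {or_group(buckets[d])})")
    return " OR ".join(clauses)


query = build_bilingual_query(
    anchors={"zh": ["发货"], "en": ["Shipment"]},
    targets={"en": ["suspend", "stop"], "zh": ["暂停", "停止"]},
    distances={"en": 50, "zh": 150},
)
print(query)
```

For the inputs shown, the builder produces the same three clauses as the consultant-led translation: Chinese against everything at W/150, English against Chinese at W/150, and English against English at W/50.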

The complexity increases exponentially in large litigations where custodians span ten or more countries. Imagine what’s required for a London-Zurich-Milan-Tokyo spread of custodians, where Japanese, German, French, Italian, and English may all be in play. Each language introduces its own linguistic structure, tokenization behavior, and proximity sensitivity, rendering a one-size-fits-all approach deeply flawed and untenable.

There’s a related, if tangential, headache to contend with in bilingual proximity terms: growth in overall character length.

Headache-Inducing Technical Limitation Imposed by Relativity

Relativity’s Search Terms Report (STR) imposes a strict limit of 450 characters per search string. Imagine this scenario: You start with 30 term strings, each comfortably under the limit at around 400 characters. But once those terms are translated, accounting for verb tenses, gender, plurality, construct forms, and all the other complexities of multilingual expansion, each string balloons to between 1,200 and 2,000 characters, which Relativity promptly rejects.

Each string then demands manual dissection, breaking it down into smaller segments while preserving the original search logic. What started as 30 strings quickly explodes into 250+ painstakingly carved fragments, each of which has to stay under the 450-character constraint. The effort required here can consume days - it’s slow, manual, and brutally detail-oriented.
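
For the simplest case, a single anchor group with one long OR list on the other side of the connector, the carving can be sketched in Python. The sketch relies on the fact that A W/n (x OR y) returns the same documents as (A W/n (x)) OR (A W/n (y)), so the OR list can be packed greedily into several shorter strings. The function names are hypothetical, and counting length with Python’s len() is an assumption; how Relativity itself counts characters should be verified in the platform.

```python
def render(anchor: str, distance: int, terms: list[str]) -> str:
    """Render one search string: (anchor) W/n ("t1" OR "t2" ...)."""
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f"({anchor}) W/{distance} ({quoted})"


def split_term(anchor: str, distance: int, terms: list[str], limit: int = 450) -> list[str]:
    """Greedily pack OR'd terms into strings no longer than `limit` characters."""
    strings: list[str] = []
    current: list[str] = []
    for term in terms:
        if len(render(anchor, distance, [term])) > limit:
            raise ValueError(f"term too long to fit on its own: {term!r}")
        if current and len(render(anchor, distance, current + [term])) > limit:
            strings.append(render(anchor, distance, current))  # close the full string
            current = [term]
        else:
            current.append(term)
    if current:
        strings.append(render(anchor, distance, current))
    return strings


# Hypothetical expansion: 100 inflected variants on the right-hand side.
variants = [f"term{i:03d}" for i in range(100)]
parts = split_term('"Shipment" OR "发货"', 150, variants)
print(len(parts), max(len(p) for p in parts))
```

Real terms with nested groups, multiple connectors, or exclusions need far more care than this, which is why the work tends to stay manual.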

The difficulty multiplies in non-European writing systems, where unfamiliar character systems heighten the challenge. And if the scenario involves Arabic or Hebrew, it’s honestly less effort to resign and take a new position at a different law firm. (We’re only half joking - we’ll cover why in a future article.)

Why Linguist-Led Term Translations Fall Short

This re-engineering requirement is one of the most pronounced examples of the inherent complexity in term translation. It underscores how even seemingly simple operators like proximity can quickly complicate the technical translation running in parallel with the linguistic translation.

Put simply, linguists lack the technical expertise required in this arena. This extends to legal colleagues in European and Asian offices often co-opted into translating search terms for a matter. Their background is law - not the intricacies of language morphology, tokenization, and index proximity logic. In fact, even most litigation support teams find themselves on uncertain ground where foreign language intersects with these deep notational complexities.

In Conclusion

You’ve probably noticed we’ve been putting quotes around the word “word” throughout this article. That’s because what we legal mortals typically think of as a “word” is quite different from how search indexes define it, especially across different languages. And that difference carries serious technical consequences, with real implications for what evidence sees the light of day.

Without a strong grasp of this technical nuance, translated terms miss evidence - with critical implications for your case and outcomes for your client.

In the fourth instalment of this series, we’re heading to the Middle East to explore right-to-left languages like Hebrew and Arabic, which have done more to shorten my life than all the Second Requests and Rocket Docket litigation I’ve worked on over my career. Get ready to revisit my linguistic PTSD, unpacking the unique issues that arise with these languages.

Multilingual proximity logic isn’t just translation - it’s technical reengineering. If your team is facing multilingual data, proximity term headaches, or global custodian complexity, we can help. Speak to an expert to ensure your search terms deliver the results your case demands.

By Robert Wagner, Global Director of Multilingual eDiscovery