Search Term Translation for eDiscovery: Bidirectional Text and Code-Switching—Where Translation Meets Chaos

December 22, 2025

This article is the fourth in a series of “Search Term Translation for eDiscovery” blogs by Robert Wagner.

Understanding Code-Switching and Bidirectional Text

Decoding how two script directions collide in keyword searches

eDiscovery presents no shortage of complex and time-consuming challenges. This post covers the hardest of them all: working with Hebrew and Arabic search terms. Simply put, it's a nightmare. The combination of code-switched bidirectional text paired with hidden control characters turns seemingly simple keyword searches into a minefield of frustration and errors. I won’t use the word dread, but Hebrew and Arabic terms are right at the very top of my list of technical and personal nemeses. Let’s dive into why. 

When I talk about code-switching and bidirectional text, I mean very particular things. In linguistics, code-switching refers to alternating between two or more languages within a single conversation or sentence. Bidirectional text refers to strings that contain both left-to-right (LTR) and right-to-left (RTL) elements within the same sequence.

Taken together, what I’m zeroing in on here is the natural tension between keywords pulling RTL and the search operators pulling LTR. Put simply, they don’t sit naturally together, and all sorts of behind-the-scenes tech is attempting to fake harmony in this unhappy marriage of scripts. 

There are a few things going on. For one, as linguists shift back and forth between Arabic keywords and English search operators, they often insert hundreds or thousands of invisible Unicode control characters (LTR and RTL marks) to maintain the correct positioning. But those same characters can scramble searches, break tokenization, or render text in unpredictable ways across platforms (e.g., MS Office, text editors, and Chrome may all interpret these characters differently). What looks visually correct in Excel may copy-paste in a completely different semantic order, and that difference won’t be obvious to an English speaker.

EXAMPLE: Two Arabic words. Which is the correct order? ترانسبيرفكت دبي /// دبي ترانسبيرفكت

Control characters are where the chaos begins, but certainly not where it ends.
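
To make the invisible visible, here is a minimal Python sketch that lists the directional control characters hiding in a string. The control set is taken from the Unicode standard; the function names and the sample term are my own illustrations, not part of any particular review platform.

```python
import unicodedata

# Invisible directional controls defined by Unicode:
# marks, embeddings, overrides, and isolates.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(text):
    """Return (index, code point, Unicode name) for each hidden control."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

def strip_bidi_controls(text):
    """Return a copy of the text with every directional control removed."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)

# Hypothetical term: three Arabic letters, a hidden RLM, then a wildcard.
term = "\u0632\u0648\u062c\u200f*"
for hit in find_bidi_controls(term):
    print(hit)
print(len(term), len(strip_bidi_controls(term)))
```

Treat the stripped version as a diagnostic for comparison rather than a fix. Removing the controls changes how the string displays, so a native speaker still needs to confirm the result.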

The BiDi Algorithm

Exposing how Unicode's logic engine misinterprets your search intent

Another seriously complicating dimension with Arabic and Hebrew search is the Unicode Bidirectional (BiDi) text algorithm. This algorithm, which is integrated into Microsoft Office and many other software applications, classifies the direction of bidirectional text and specifies its visual layout on screen.

Text classification is as follows:

  • Strong characters: letters with inherent directionality (e.g., English [LTR], Arabic and Hebrew [RTL]).
  • Weak characters: digits and certain numeric punctuation (e.g., + and -); directionality depends on context.
  • Neutral characters: whitespace, line breaks, and most other symbols, including * and brackets; direction is borrowed entirely from the surrounding text.

The weak and neutral characters are one of the sources of the menace. They have no fixed direction of their own, so the BiDi algorithm takes cues from the surrounding words and reorders them based on nearby strong characters. Because the algorithm makes assumptions about what the user intends, it often misinterprets structure, especially when numbers and search operators like wildcards are involved.

The issue is further amplified in search contexts because strong Arabic characters (keywords) and strong English characters (operators) tend to appear in runs of similar length. This parity makes it even harder for the BiDi algorithm to form accurate assumptions.

Example: (*زوج OR *قرين OR *مرافق OR *عائل) contains 6 neutral and 22 strong characters, not counting spaces. Any guesses how many hidden control characters are lurking?

Quick tip: Paste strings into Notepad++ with “Show All Characters” enabled to see the hidden LRM/RLM/ALM characters.
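
If you prefer a script to an editor, the classification above can be checked character by character. This is a rough sketch using Python's standard unicodedata module, which reports the formal Unicode bidi class of each character. The grouping sets follow the Unicode Bidirectional Algorithm (UAX #9); the function names are my own. Note that Unicode files * and brackets under "other neutral" (ON): like digits, they have no direction of their own and take it from context.

```python
import unicodedata
from collections import Counter

# Bidi classes grouped as in the Unicode Bidirectional Algorithm (UAX #9).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}

def bidi_group(ch):
    """Map one character to 'strong', 'weak', 'neutral', or 'explicit'."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "explicit"  # embeddings, overrides, isolates (or unassigned)

def count_groups(text, skip_spaces=True):
    """Tally characters per group, optionally ignoring whitespace."""
    return Counter(
        bidi_group(ch) for ch in text if not (skip_spaces and ch.isspace())
    )

query = "(*زوج OR *قرين OR *مرافق OR *عائل)"
print(count_groups(query))

# Inspect a few operator-style characters individually.
for ch in "*(7+":
    print(repr(ch), unicodedata.bidirectional(ch), bidi_group(ch))
```

One caveat: the invisible marks LRM, RLM, and ALM carry strong classes themselves (L, R, and AL), so any that are hiding in a string will inflate the strong count.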

The Specific Failures

Identifying the four ways BiDi corrupts your search logic

So how does the BiDi algorithm specifically break Hebrew and Arabic terms?

  • Wildcards shift to the front (right end) of Arabic and Hebrew words
  • Parts of multi-word Arabic and Hebrew terms leapfrog the OR/AND search operators and join other keywords (good luck spotting that)
  • Parentheses can flip direction, becoming )like this( instead of (this)
  • Numbers visually reverse order 
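
Several of these failures can be caught by inspecting the string in its stored order rather than trusting the screen. The sketch below is one possible sanity check, under a few assumptions of mine: operators are the uppercase words AND, OR, and NOT, tokens are separated by whitespace, and * is the wildcard. Your platform's syntax may differ.

```python
import unicodedata

OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set
RTL_CLASSES = {"R", "AL"}

def describe_tokens(query):
    """Split on whitespace in stored order and label each token.

    Returns (token, kind, leading_wildcard, trailing_wildcard) tuples.
    """
    rows = []
    for tok in query.split():
        core = tok.strip("()")
        classes = {unicodedata.bidirectional(ch) for ch in core}
        if core in OPERATORS:
            kind = "operator"
        elif classes & RTL_CLASSES and "L" in classes:
            # Latin letters or a hidden LRM inside an RTL word.
            kind = "mixed-direction"
        elif classes & RTL_CLASSES:
            kind = "rtl-term"
        else:
            kind = "ltr-term"
        rows.append((tok, kind, core.startswith("*"), core.endswith("*")))
    return rows

def parens_balanced(query):
    """Check that parentheses open and close properly in stored order."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# Hypothetical query with trailing wildcards on two Arabic terms.
query = "(\u0632\u0648\u062c* OR \u0642\u0631\u064a\u0646*)"
for row in describe_tokens(query):
    print(row)
print(parens_balanced(query), parens_balanced(")term("))
```

A wildcard flagged at the wrong end, a token labeled mixed-direction, or unbalanced parentheses are all prompts to hand the string to a native speaker. They are not proof of an error on their own.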

We’ve completed hundreds of Arabic and Hebrew term translations over the years, and these issues have been present to some degree in every single one. For added fun, an English speaker armed with a left-to-right keyboard and good intentions will invariably make things worse. They won’t see the invisible control characters and will end up deleting or moving them around. Once one is missing or shifted, everything else in the string can break. At that point, the poor hero is left staring helplessly at the ruins of the string, invariably in great frustration. In search, these two languages are pure joy.  

The Technical Reality Beneath the Surface

Uncovering why on-screen rendering diverges from what copy-pastes

Summarizing all the things conspiring to break right-to-left terms:

  1. Text is stored in logical order inside files—that is, characters are saved based on the order they were typed, not how they visually appear.   
  2. Invisible control characters, such as U+200E (LEFT-TO-RIGHT MARK, LRM) and U+200F (RIGHT-TO-LEFT MARK, RLM), are sometimes embedded by linguists. They influence how text is displayed, not the order in which it is stored, and they travel with the text when it is copy-pasted.
  3. The BiDi algorithm interprets the logical text and applies layout rules to render it visually. This rendering is purely for display and can differ from the logical order stored in the file.
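
The gap between stored order and rendered order is easiest to appreciate by dumping a string one code point at a time. A minimal Python sketch, with a made-up sample string and a function name of my own choosing:

```python
import unicodedata

def logical_order(text):
    """List (index, code point, bidi class, name) in stored order."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch),
         unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]

# Hypothetical stored string: two Hebrew letters in parentheses,
# with a hidden RLM before the closing parenthesis.
stored = "(\u05d0\u05d1\u200f)"
for row in logical_order(stored):
    print(row)
```

The dump always shows the opening parenthesis at index 0, however a given application chooses to draw the string on screen.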

THE KEY TAKEAWAY: On-screen renderings are not necessarily what copy-pastes. You might see parentheses hugging a Hebrew phrase correctly on screen in MS Word, but under the hood, the characters may be stored in a completely different sequence, held in check only by invisible RTL/LTR control characters. Conversely, a file might store text correctly, but without the right directional control characters, Office displays it as backwards and jumbled. Worse, it may look correct until a native speaker copy-pastes it and the illusion breaks.

With bidirectional strings, you’re trusting a silent, invisible choreography of directional rules and rendering engines to get it right. And in many instances, it does not.

If all of that sounds like you're living in The Matrix, well, with Arabic and Hebrew search terms, you are living in a matrix of BiDi display bugs.  

Working with Names and Transliteration

Recognizing why multiple spellings mask the same identity

I called it a nightmare earlier and stand by the characterization. As someone who specializes in non-English search, I can say this with confidence: with these two languages, it doesn't get easier the more times you encounter them. Every project brings an invisible slew of code-switch-related problems, and the worst part is that fixing one issue often breaks another elsewhere in the string. 

But forewarned is forearmed. At least now you know what to watch for, even if that knowledge doesn't make fixing the problems any less painful. It’s also worth noting that if opposing counsel is contending with Hebrew or Arabic data, they are virtually assured of making significant errors unless a native speaker is managing the process. 

I’ll cover translation deficiency analysis reports in a future post. Let’s pivot and talk about names and entities. 

PROPER NAMES: REVERSE TRANSLITERATION (NOT TRANSLATION). 

Working with Arabic names as search terms in eDiscovery isn’t really a translation task; it’s transliteration. More specifically, back-transliterating English characters into Arabic is essentially an exercise in reconstructing identity. When you see a name written in English, you’re often looking at one of many possible spellings that all trace back to a single Arabic original. 

Example: “Mohamed,” “Muhammad,” “Mohamad,” “Mehmet,” and 20 others all transliterate back to محمد. 

The tricky part isn’t back-transliterating English to Arabic; it’s realizing how many different ways that same identity can surface in a dataset, and how many instances you might be missing. This brings us to a subtler point for review teams working with English data. While many reviewers can easily associate “Mohamed” with “Muhammad,” the connection to “Mehmet” may be far less obvious.

The challenge deepens with less common Arabic names like “Isma’il,” “Ismaeel,” and “Esmaeel,” where transliteration differences can obscure the fact that all three refer to the same person. This is a high-risk area for search that requires careful attention on the English side, particularly for Western review teams relying primarily on machine translation. In many cases, it may not be obvious that the same individual is being referred to across documents.
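
One practical way to keep those spellings tied together is a variant table keyed on the Arabic original. The sketch below is only an illustration: the table holds the handful of spellings mentioned above (a real one would be far longer), and the quoted OR syntax is an assumption about the target platform.

```python
# Arabic original for "Muhammad", written with escapes so the stored
# order is unambiguous: meem, hah, meem, dal.
MUHAMMAD = "\u0645\u062d\u0645\u062f"

# Hypothetical variant table: Arabic original -> English spellings
# observed in the data.
NAME_VARIANTS = {
    MUHAMMAD: ["Mohamed", "Muhammad", "Mohamad", "Mehmet"],
}

def build_name_group(original, table):
    """Return a quoted OR group of the original plus its known spellings."""
    terms, seen = [], set()
    for term in [original, *table.get(original, [])]:
        key = term.casefold()
        if key not in seen:  # drop case-insensitive duplicates
            seen.add(key)
            terms.append(term)
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

print(build_name_group(MUHAMMAD, NAME_VARIANTS))
```

Keeping the table as data rather than hand-editing the query means reviewers can add a newly spotted spelling without touching a bidirectional string.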

Construct Forms: The Overlooked Grammatical Challenge

Understanding how grammatical possession creates search blind spots

Let’s talk about one final, not-so-obvious nuance of Hebrew and Arabic that trips up many translated term lists: construct forms. Specifically, they are often overlooked and omitted by linguists.

CONSTRUCT FORMS: Background.

Construct forms are a bit of an oddity, with no direct parallel in the English language. They are how possession and noun relationships are expressed. In English, possession is typically marked using the word of or apostrophes (’s). 

ENGLISH EXAMPLES: 

  • the color of the car
  • the car’s color
  • Peter’s bonus

Even though the ending changes with the apostrophe s, most search indexes treat the apostrophe as a word break, so “car” will capture both forms. With construct forms, however, the actual ending of the noun itself changes.

This distinction has major implications for how nouns need to be translated for search terms. As I’ve mentioned before, most translators—even those highly skilled in language—don’t typically think in terms of search indexes or an eDiscovery mindset. Their focus is linguistic accuracy, not the mechanical behavior of how language elements are tokenized, indexed, and retrieved in a database.

This is where many linguists unfamiliar with eDiscovery stumble: they don’t adjust for how Arabic and Hebrew grammatical dependencies interact with the indexing logic. In traditional translation work, these relationships are understood conceptually; in eDiscovery term translation, they must be handled operationally through additional search operators and restructured terms.

Superficially, translating “stock option*” into Hebrew or Arabic seems straightforward. The base translation is the obvious starting point, but effective multilingual searching requires more. You need to include both the root term and its construct forms to capture the Hebrew or Arabic equivalent of possessive text such as “John’s stock options vest in 2028.”

Failing to make these adjustments means searches won’t capture all relevant variations, creating blind spots in review and missing key documents. For Hebrew and Arabic datasets, this isn’t a minor oversight—it’s a systemic risk to the matter.
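
As a rough illustration of handling this operationally, the sketch below expands one noun into an OR group covering its base and construct forms. The Hebrew forms shown are for the head noun "option" only and are my own illustrative picks, not a vetted translation of "stock option"; a native-speaking linguist should supply and confirm the list for any real term. The helper name and the trailing-wildcard syntax are also assumptions.

```python
def expand_term(forms, wildcard="*"):
    """Build one OR group from every supplied form of a noun.

    The wildcard is appended after the last stored letter of each form,
    which is the left end of a Hebrew or Arabic word when displayed.
    """
    unique = list(dict.fromkeys(f.strip() for f in forms if f.strip()))
    return "(" + " OR ".join(f + wildcard for f in unique) + ")"

# Illustrative Hebrew forms, written with escapes so the stored order
# is unambiguous. Confirm with a linguist before use.
option_forms = [
    "\u05d0\u05d5\u05e4\u05e6\u05d9\u05d4",        # optsiya: base singular
    "\u05d0\u05d5\u05e4\u05e6\u05d9\u05d9\u05ea",  # optsiyat: construct singular
    "\u05d0\u05d5\u05e4\u05e6\u05d9\u05d5\u05ea",  # optsiyot: plural
]

print(expand_term(option_forms))
```

The point of the helper is less the string-joining than the discipline: every noun in the term list gets an explicit slot for its construct forms, so an omission is visible rather than silent.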

Conclusion

Accepting why native speakers are non-negotiable for accuracy

…nigeb I sa gnidnE (just kidding—but that’s the reality of what can transpire).
Ending as I began (because it’s therapeutic), I want to reiterate how much these two languages drive my team nuts due to all the technical complexities noted above. The bottom line is this: unless Arabic and Hebrew terms are imported into a database and corrected by a native speaker, there's an extremely high likelihood that results will be at least partially (and potentially wholly) inaccurate.

But even getting to the point where terms are imported into a database is itself a fraught, painful process, littered with visible and invisible challenges that trip up the majority of people navigating the BiDi path.  

Next article: The strategic dimension—and hidden leverage—multilingual keyword strategy confers. Most teams treat translation as a checkbox. The sharp ones use it as a weapon. I’ll cover how actively managing this point can massively amplify wider case strategy—or, by equal measure, the damage it can inflict when mishandled.

Transform your multilingual eDiscovery strategy. Let TransPerfect's experts guide your multilingual keyword research, from initial translation through database import and validation. Learn about TransPerfect's multilingual eDiscovery solutions.
 

By Robert Wagner, Global Director of Multilingual eDiscovery, TransPerfect Legal