Avoiding Regulatory Reckoning: The #1 Problem in Surveillance with Translated Due Diligence Lexicons

This translation trap sits squarely in the blind spot of most banks.
Somewhere between “approved in English” and “translated into 27 languages,” many banks’ surveillance infrastructures collapse. And yet, the institutions themselves don’t realize anything has failed—or how vulnerable it has left them.
At the center of this vulnerability sits multilingual lexicon translation, a function most banks assume is routine but which directly determines the integrity of their surveillance lexicons.
Virtually all compliance operations start with English lexicons that form the backbone of nearly every surveillance and due diligence program. The global rollout process varies little from bank to bank and typically follows some version of this sequence:
- Lexicons are drafted in English
- Above-the-line and below-the-line testing is conducted
- Review and sign-off by compliance, data governance, and legal follows (often after rounds of relentless argument)
- Finally, the approved English lexicons are sent out for translation into 10–30 languages
And it’s at that last step that years of institutional fastidiousness, procedural discipline, and stakeholder scrutiny disintegrate.
This article (and the next three) dissects the “how” and “why” of translated lexicons acting as a Trojan horse inside a bank’s monitoring apparatus.
Let’s dive in.
Language Service Providers (LSPs) Live in the Wrong Frame of Reference
If you’re reading this, you are probably in the orbit of contending with translated lexicons. Let’s start with a question: When you last had your lexicons translated, how in-depth was the conversation with the translation vendor about (A) which system(s) your bank uses and (B) the technical parameters of those platforms?
If the answer is “not very” or “it never came up,” then underperformance of the translated lexicons you’re running today isn’t a possibility; it’s a virtual certainty.
With lexicons, the conversation inevitably centers on language (finding the right term, capturing the right nuance, etc.), but that framing fundamentally misses what lexicons actually are: technical instruments intended to trigger responses within the complex indexes of surveillance systems.
In other words, these are not just translated word lists—they’re operational surveillance lexicons that must function precisely inside RegTech infrastructure.
Examples of points requiring discussion pre-translation:
- How does the platform tokenize Asian character languages?
- What are the particulars around diacritic handling in European languages as those characters index?
- What are the system's stop words, and what does escaping them look like?
If you’ve worked at more than one bank, you’ll have insight into how vastly different such systems function on those points. One institution might run on MS Purview, another on Smarsh + Global Relay, and another may have an entirely in-house-built system. Each has considerable differences in index structure and, by extension, corresponding differences in the critical technical notation that allows lexicons to properly engage the index. Such considerations fall well outside the translation industry’s prevailing frame of reference for lexicons.
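To make that platform dependence concrete, here is a toy Python sketch of two hypothetical tokenizers: one that preserves diacritics in the index, and one that folds them and drops stop words. Neither reflects any specific vendor’s actual logic; the names, stop-word list, and folding rules are illustrative assumptions only.

```python
import re
import unicodedata

SAMPLE = "naïve off-book transfer to the café"

def tokenize_preserving_diacritics(text):
    # Hypothetical platform A: split on whitespace/punctuation and
    # keep accented characters intact in the index.
    return re.findall(r"\w+", text)

def tokenize_folding_diacritics(text, stop_words=frozenset({"to", "the"})):
    # Hypothetical platform B: strip diacritics (so "café" indexes as
    # "cafe") and drop stop words from the index entirely.
    folded = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in folded if not unicodedata.combining(c))
    return [t for t in re.findall(r"\w+", folded) if t not in stop_words]

print(tokenize_preserving_diacritics(SAMPLE))
print(tokenize_folding_diacritics(SAMPLE))
```

A lexicon term written as “café” hits on platform A but silently misses on platform B unless it is folded or escaped to match the index — exactly the kind of parameter that must be settled before translation starts.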
We want to pull back the curtain on how “the sausage gets made” with lexicon translation to help organizations conceptualize the institutional risk. When you multiply 10,000 lexicons across 25 languages, a bank is actually engaging 50+ linguists (potentially as many as 100). At every language service provider (even at a top-tier provider like TransPerfect), those ~50 linguists will be vastly out of their depth on this critical technical dimension. Solving this bedeviling problem requires an approach to lexicon translation that is entirely unlike common industry practice.
This article focuses exclusively on the technical translation layer. Language-oriented challenges with lexicons will follow in our second installment.
The Missing Parallel Technical Translation
Lexicon translation is much more than simply converting words from one language into another. It demands fluency in two languages: human language and platform language. And therein lies the problem: these dual-fluency professionals don’t really exist. And you now need ~50 of them across 25 languages.
These complexities are often misunderstood (or overlooked entirely) in the translation marketplace. What that means in practice is that translation companies treat the whole thing as a single-track linguistic exercise, which it most certainly is not.
TransPerfect Vantage Point: Our background as both the world’s largest language service provider and one of the largest eDiscovery providers means we are acutely aware of the dual linguistic and technical dimensions involved. We built a dedicated team and custom workflows specialized in bridging this linguistic-technical gap.
To underline this gap, puzzle over these two lexicons from recent translations (Global Relay notation):
- "(before OR b4) ((<it is> OR it*s) (too OR to OR 2) late)"
- "keep (confidential OR <for yourself> OR hush OR private OR qt OR quiet OR secret OR silent OR <to yourself> OR <to urself> OR <under wraps> OR <under hat> OR <tight lipped> OR tight*lipped OR zipped )"~2
Even in English, they are rather impenetrable, aren’t they? There were 7,800 more just like these. Dense, technical, and challenging—even for experienced compliance experts who know the terrain. Would you notice a single unpaired bracket that breaks the whole string? This type of content is alien to linguists who may well have last been translating a cookbook or travel content for Expedia.
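That unpaired-bracket risk is mechanically checkable. Below is a minimal Python sketch of a balance checker for strings like the two above, assuming (purely for illustration) that `(` `)` and the `<` `>` phrase markers are the only paired delimiters in play:

```python
def find_unpaired(lexicon: str):
    """Return the index of the first unpaired delimiter, or None."""
    closers = {")": "(", ">": "<"}
    openers = {"(", "<"}
    stack = []
    for i, ch in enumerate(lexicon):
        if ch in openers:
            stack.append((ch, i))
        elif ch in closers:
            if not stack or stack[-1][0] != closers[ch]:
                return i          # closer with no matching opener
            stack.pop()
    return stack[-1][1] if stack else None  # leftover opener, if any

ok = "(before OR b4) ((<it is> OR it*s) (too OR to OR 2) late)"
broken = ok.replace(") late)", " late)")  # drop one closing paren

print(find_unpaired(ok))      # balanced string
print(find_unpaired(broken))  # position of the orphaned opener
```

Run across 7,800 strings, a check like this takes seconds; spotting the same defect by eye is another matter entirely.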
Cyrillic, Arabic, Greek, and similar scripts add an extra layer of visual impenetrability in that sea of brackets and parentheses, significantly reducing the chance that compliance teams will detect syntax issues. Regulators usually identify these translation voids only once a major surveillance failure occurs. By then, it’s already splashed across the WSJ or FT, and an eight-, nine-, or ten-figure fine is taking shape.
Let’s switch gears and look at something known as segmentation and its huge implications.
Text Segmentation: Alphabetic Boundaries vs. Logographic Absence of Boundaries—The Implications
Text segmentation is the process of dividing written text into units such as words, sentences, or paragraphs. In the regulatory technology (RegTech) world, indexes are typically word-oriented, yet “word” is a linguistically relative term. The definition of a word varies significantly depending on whether we're discussing alphabetic or logographic languages. This technical nuance is foundational to effective RegTech compliance, yet it rarely enters conversations about translation.
In alphabetic languages like English, French, or Russian, text segmentation is relatively straightforward. That’s because languages of this variety rely on whitespace between words and clear punctuation to mark boundaries. Tokenization follows the principle of separating words based on spaces and punctuation marks. While challenges still exist, such as compound words in German or contractions in English, “word” boundaries are generally clear from an indexing perspective.
In contrast, logographic languages (such as Chinese and Japanese) don’t use spaces to separate words.
Example:
"TransPerfect" – in English and Japanese Katakana script:
- ENGLISH: TransPerfect = The index counts 1 “word”
- JAPANESE: トランスパーフェクト = Most indexes count 10 “words”
"Robert Wagner’s" – in English and Japanese:
- ENGLISH: Robert Wagner’s = The index counts 3 “words”
- JAPANESE: ロバート・ワグナーの = The index counts 9 “words”
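The counts above can be reproduced with a few lines of Python. The English tokenizer here is a stand-in, not any specific platform’s logic, and the assumption that the Katakana middle dot (・) is dropped from the index is ours, made only to illustrate how a 10-character string can index as 9 “words”:

```python
import re

def english_tokens(text):
    # Stand-in word tokenizer: "Wagner's" indexes as "Wagner" + "s",
    # which is how many platforms arrive at 3 "words".
    return re.findall(r"[A-Za-z0-9]+", text)

def character_units(text, dropped=frozenset({"・"})):
    # Character-level indexing for Japanese; the middle dot is
    # assumed (illustratively) not to be indexed.
    return [c for c in text if c not in dropped]

print(len(english_tokens("TransPerfect")))          # one "word"
print(len(character_units("トランスパーフェクト")))    # ten "words"
print(len(english_tokens("Robert Wagner's")))
print(len(character_units("ロバート・ワグナーの")))
```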
Further deepening the challenge, each individual character carries meaning and functions as a word in its own right. Take the Japanese Kanji name 東芝, which is composed of two characters. On their own, they have distinct meanings: 東 means "east," and 芝 means "lawn." When combined, they form the proper noun Toshiba, the well-known electronics company. There are no spaces or markers to signal that these characters belong together or that their combined meaning differs from their individual parts.
Sentence-Level Example:
When looking at the sentence level, it becomes obvious why this is critical:
JAPANESE: 東芝側が上告 不正会計巡る旧経営陣への賠償請求訴訟
1 whitespace // 23 “words”
ENGLISH: Toshiba appeals lawsuit seeking damages from former management over accounting fraud.
10 whitespaces // 11 “words”
As a result of these multi-layered complexities, RegTech systems working with Chinese or Japanese text are forced to index at the character level rather than at the whitespace-defined word level we’re familiar with. This drastically alters how search terms need to be structured. Everything from wildcards and brackets to proximity rules gets instantly turned on its head. It’s impossible to overstate the significance this distinction carries, and yet it is virtually universally overlooked in the translation world.
For a bank, lexicon translations that ignore this higher-order technical layer are inevitably incomplete or overinclusive (often both at the same time), which is a genuine compliance nightmare. Failing to use technically aware foreign-language lexicons undermines the entire surveillance framework. When segmentation logic is ignored, RegTech compliance is compromised at the indexing layer itself. The difference between hitting and missing the target lies in whether segmentation rules are understood and reflected in the translations.
The anatomy of this problem stems from the fact that, while compliance is the ultimate gatekeeper, it is equally a data governance and surveillance technology issue. True ownership sits with the cross-functional teams that design, maintain, and audit monitoring frameworks. Since no one team holds the dual fluency, it creates the perfect conditions for translation-driven breakdown.
Text Tokenization & Proximity Distance
Here’s how the segmentation point plays out in practice: Nearly all banks utilize proximity operators in their lexicons. Proximity distances in translated terms should rarely match the English originals, and identical distances are a massive red flag.
Moving between European languages, word counts grow or shrink purely for linguistic reasons. Spanish illustrates this well, typically using approximately 20% more words than English to express the same content. In English-to-European language pairs, proximity distances usually need to be adjusted by anywhere from −25% to +25% relative to the English original, depending on the language, and those adjustments need to be reflected in the translations.
Example:
- English lexicon: “TransPerfect” NEAR/50 “translation”
- Spanish translation: “TransPerfect” NEAR/65 “traducción”
Now let’s look at Japanese versus English, which provides the most extreme contrast between languages. With Asian languages, proximity distances grow purely for technical reasons: every character is counted as a “word.” This means the effective proximity distance for Chinese should be approximately two to three times that of English (a 2:1 or 3:1 ratio) to yield comparable semantic coverage. For Japanese, the ratio is even greater, falling somewhere between 5:1 and 7:1 (or even higher in certain circumstances when Katakana is involved).
Example:
- English lexicon: “TransPerfect” NEAR/50 “translation”
- Japanese translation: “トランスパーフェクト” NEAR/300 “翻訳”
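The adjustments in both examples can be expressed as a simple scaling rule. The multipliers below merely restate the ratios discussed above; the correct values for any given platform and corpus are an assumption that must be validated against that platform’s index.

```python
# Illustrative ratios only, mirroring the rules of thumb above.
PROXIMITY_RATIO = {
    "es": 1.3,  # Spanish: ~20-30% more words for the same content
    "zh": 2.5,  # Chinese: character-level indexing, ~2:1 to 3:1
    "ja": 6.0,  # Japanese: ~5:1 to 7:1, higher with Katakana
}

def scale_proximity(english_distance: int, lang: str) -> int:
    """Scale an English NEAR/n distance for a target language."""
    return round(english_distance * PROXIMITY_RATIO.get(lang, 1.0))

print(scale_proximity(50, "es"))  # NEAR/50 becomes NEAR/65
print(scale_proximity(50, "ja"))  # NEAR/50 becomes NEAR/300
```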
Quantifying the impact of applying English proximity rules to Japanese lexicons: Post-incident, one multinational bank discovered that its Japanese lexicons were running an 80% “undercatch” rate for flagged phrases versus the English-language baseline.
Reality Check & Path Forward
We only covered a small sliver of the overall picture, but hopefully this crash course on parallel technical translation was interesting—and clarified how much this dimension matters.
When technical translation is not given the attention it’s due, it creates a severe problem banks can’t simply audit their way out of retroactively. By the time it’s discovered that Mandarin or Portuguese lexicons are underperforming, multiple years of unmonitored communications and regulatory risk have already zipped right past compliance teams.
The uncomfortable reality is that traditional language service providers are fundamentally ill-equipped to handle the technical translation dimension. Unless your translation partner explicitly understands the tokenization logic of your specific surveillance instance and is fluent in its syntax, it’s impossible to deliver valid, accurate lexicon translations.
In the next article, we’ll move from the technical syntax wrapper to the lexicons themselves.
Curious about your bank’s exposure? Start with a quick proximity distance sanity check across your lexicons. If proximity distances look the same in Tokyo as they do in New York, the bank has a problem.
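As a sketch of what that sanity check could look like in Python — assuming a hypothetical file layout where English and translated lexicons are kept one string per line in parallel order — flag every translated line whose NEAR/ distances exactly match the English original:

```python
import re

NEAR = re.compile(r"NEAR/(\d+)")

def flag_unadjusted(english_lines, translated_lines):
    """Yield 1-based line numbers where the translated proximity
    distances exactly match the English original — a red flag,
    especially for Asian-language lexicons."""
    for n, (en, tr) in enumerate(zip(english_lines, translated_lines), 1):
        if NEAR.findall(en) and NEAR.findall(en) == NEAR.findall(tr):
            yield n

en      = ['"TransPerfect" NEAR/50 "translation"']
ja_bad  = ['"トランスパーフェクト" NEAR/50 "翻訳"']
ja_good = ['"トランスパーフェクト" NEAR/300 "翻訳"']

print(list(flag_unadjusted(en, ja_bad)))   # flags line 1
print(list(flag_unadjusted(en, ja_good)))  # nothing flagged
```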
Unsure whether your multilingual lexicons are functioning correctly? TransPerfect experts conduct platform-specific lexicon audits that test technical compatibility against your surveillance system. We assess current performance, identify gaps, and provide actionable recommendations. Contact our team to discuss your specific needs.