June 2, 2020

Better localization by understanding the code called “language”

What’s the purpose of language? You may think that “to communicate facts and ideas from one person to another” would be the slam-dunk answer. But that’s only partially correct.

In addition to communication, the other purpose of language is secrecy. And this segment from the TV sitcom Big Bang Theory is a perfect example.

Of course the guys are speaking Klingon, which is a “language” virtually unknown outside of Star Trekdom, and the ladies are using Ubbi Dubbi which is a language game based on English. But please consider this: Any language is made up of random sounds or symbols that make sense only if the sender and the receiver share the same set of rules. So in this sense, language is code.

The code we call “language” is what enables data to be transferred from person to person or handed down from generation to generation. It also keeps those who don’t know the code from understanding the data.

Language is code and that’s nothing new

Throughout Europe can be found cave art dating back some 30,000 years to the Ice Age. These depictions of bison, deer and other wildlife are breathtaking, but they do not pique the interest of paleoarchaeologist Genevieve von Petzinger nearly as much as the random symbols scribbled or carved near those paintings. In fact, her meticulous research showed that early humans repeatedly used 32 symbols to say… something.

What exactly was meant by those symbols including Circle, Half-Circle, Claviform, Crosshatch, Spiral, Triangle, Hand, Zigzag and others, is still being studied. It’s up to scientists of the future to break the code developed tens of thousands of years ago. But according to Dr. Petzinger, those symbols may provide an opportunity to trace the origins of humanity’s earliest writing systems.

Perhaps these symbols were inscribed by hunters cataloging their activities, or maybe even a ruling class keeping tabs on their subjects. Either way, their purpose was probably for a certain group to share information amongst themselves and hide information from others. Someday we may know the answer, but meanwhile, it would be interesting to see prehistoric man’s reaction to seeing Crosshatches (#, a.k.a. hashtags) everywhere today.

For every language there’s a language game

Penny and Amy were having a secret conversation using Ubbi Dubbi. But perhaps the most famous language game for English speakers is Pig Latin. Both use different coding and decoding rules, namely Ubbi Dubbi works by adding “ub” before each vowel sound in a syllable, while in Pig Latin, for words that begin with consonant sounds (including consonant clusters), all letters before the initial vowel are placed at the end of the word sequence and “ay” is added. So, “Welcome to CRESTEC” becomes “Wubelcubome tubo CRUBESTUBEC” in Ubbi Dubbi, and “Elcomeway otay ESTECCRAY” in Pig Latin.

Remember this one? “Ooklay in the Agbay.”

But what’s the situation in other languages? The Spanish-speaking world has Jerigonza, which adds the letter p after each vowel of a word and repeats the vowel, so “Bienvenido a CRESTEC” becomes “Bipiepenvepenipidopo apa CREPESTEPEC.” Interestingly, this is similar to a Japanese language game but more about that later.

German speaking countries have Löffelsprache, which doubles each vowel or diphthong and inserts “lew,” so that “Willkommen bei CRESTEC” transforms into “Wilewillkolewommelewen beilewei CRELEWESTELEWEC.”

The Swedes have Rövarspråket, where consonants are doubled up and “o” is inserted, so “Välkommen till CRESTEC” becomes “Vovälolkokomommomenon totilollol COCRORESOSTOTECOC.” Meanwhile, across the bron/broen (the bridge), the Danes have Røversprog that operates on the same rules, changing “Velkommen til CRESTEC” into “Vovelolkokomommomenon totilol COCRORESOSTOTECOC.” Quite a mouth full.

Even languages as non-European as Japanese have similar language games. In Japan, バビ語 (Babigo) works by inserting the characters ば/バ (ba), び/ビ (bi), ぶ/ブ (bu), べ/ベ (be), or ぼ/ボ (bo) in the appropriate places. Since the rules become overcomplicated if you’re not familiar with Japanese kana characters, suffice it to say Babigo is similar to Jerigonza except instead of doubling the vowels and inserting “p,” you insert “b.” So ようこそクレステックへ (Youkoso Kuresutekku e) becomes よック(Yoboubukobosobo Kuburebesubutebekkubu ebe).

The Bulgarians have Pileshki, the Finns have Sananmuunnos, the French have Verlan, the Koreans have Gwisin Mal, the Turks have Kuş dili, and the list goes obon and obon. But no matter which one’s being used, the unstated presumption is that the speaker and listener both share the same understanding of the code.

Languages have plenty built-in complexity to begin with

English is the collective result of Germanic Saxon roots, overwritten by Norman French, and infused with a world’s worth of loanwords. As a result, English spelling is so irregular that GHOTI could be pronounced FISH (taking the “gh” from “enough,” the “o” from “women,” and the “ti” from “nation). Beginners of English wonder why it’s “busy” and not “bizzy,” or “sugar” and not “shugger.” Remember your own frustration when first starting to learn how to spell in grade school? Well, that’s what the rest of the world has to deal with every day.

Of course, French has its quirks too. Its numbering system is unique to say the least. 60 is “soixante.” So far so good. But 70 is “soixante-dix” (60 + 10), 80 is “quatre-vingt” (4 × 20), 90 is “quatre-vingt-dix” (4 × 20 + 10), and 99 is “quatre-vingt-dix-neuf” (4 × 20 + 10 + 9). This is a relic of Celtic Gaul, which used a Base-20 counting system instead of the usual Base-10. Whether the Base-20 system suggests that the ancient Gauls used all their fingers and toes to count is up for debate.

The Swiss and Belgians, on the other hand, have adopted a less complicated system for their French, where 60 is “soixante,” 70 is “septante,” 80 is “huitante” (Switzerland only), 90 is “nonante,” and 99 is “nonante-neuf.” Very logical and simple.

As they say, where there’s a will there’s a way. When there is a willingness to simplify a language, it gets simplified. But when people are somehow satisfied with the language they use, it’s nearly impossible to change it, no matter how logical the change may be.

Language as code: Windtalkers

A prime example of a language being used to communicate amongst your own and to keep others in the dark was the so-called Windtalkers of World War II. As made famous in the Hollywood movie of 2002, bilingual Navajo speakers served with the US Marine Corps for “coded” radio communication that could not be easily broken. Although less known, the US Army also deployed Lakota, Meskwaki, Mohawk, Comanche, Tlingit, Hopi, Cree and Crow soldiers in the Pacific, North Africa and Europe during that time in history.

We should also note that Native American code talkers first appeared during World War I, when Cherokee and Choctaw soldiers used their respective languages in key battles to recover France.

The challenges of dealing with Japanese

The Japanese language presents a unique mixture of complications. Written Japanese consists of three different character sets — the phonetic hiragana (46 characters) and katakana (another 46 characters), and the logographic kanji (thousands of characters). But dealing with lots of characters isn’t the only challenge in dealing with Japanese; the plethora of alternative and irregular readings for the same character adds an extra dimension to the frustration.

Take the character 生 for instance. There are at least eight different ways to pronounce it. In 生きる (ikiru, to live) it’s “i.” In 生む (umu, to give birth) it’s “u.” In 生える (haeru, grow) it’s “ha.” In 生命 (seimei, life) it’s “sei.” In 生涯 (shougai, a lifetime) it’s “shou.” In 生糸 (kiito, raw silk) it’s “ki.” In 生物 (namamono, raw food) it’s “nama.” And in 生方 (ubukata, a proper name) it’s “ubu.” All this from one character, and this isn’t even counting the countless irregular readings that are possible.

Irregular readings are those that are not in the dictionary. Some irregular kanji and readings for Japanese family names have existed for centuries, while others were introduced when the government began officially logging family registries after the Meiji Restoration (1868). Kind of like spelling irregularities introduced at Ellis Island becoming the official spelling for immigrants to the USA.

In surnames and place names alike,  向坂 could be sakizaka, sakisaka, kouzaka, kousaka, sagisaka, sagizaka, mukouzaka, mukousaka, mukaizaka, mukaisaka, mukosaka, mukauzaka, or mukisaka. Take your pick.

Or, good luck deciding whether 丹生 is read tan’u, tan’o, tan’oi, tanshou, tanjou, tansei, tannyuu, tanmi, nii, niu, nio, nioi, niki, niku, nishou, nibi, nifu, nibu, nyuu, niwa, niwao, noo, hani, hanyuu, funyuu, marumi, or mibu in any particular instance. All of these are correct readings, so in a JA-EN translation job it really helps to have the source text annotated as to which reading should be used.

A major reason for this diversity stems from Japan’s feudal era, where cross communication between domains was limited. As a result, each region developed their own distinct dialect, some of which were undecipherable to non-natives. In fact, in some cases the details of a dialect were intentionally kept secret due to security reasons — making it easy to spot spies from outside the province. For example, travelling to a particular area in present-day Shiga Prefecture and calling it Kōga, instead of Kōka, would have easily gotten you killed. That’s the code of the ninja.

Given names are slightly different because there are strict rules on what kanji characters can be used, as stated in Article 60 of the Ordinance for Enforcement of the Family Register Act (戸籍法施行規則). But here’s the catch: There is a lot of freedom in how those characters can be read.

So you can name your daughter 七音 (nanato or shichion or shichiin, meaning “seven sounds”) and officially register the reading as “Doremi” (as some have actually done). No one outside of your family might be able to read the name correctly, but as long as it gets registered, it would be the correct reading. Like taking S-M-I-T-H and pronouncing it “Jones.”

Every industry has its own “language”

Just about any profession is accompanied by a set of jargon and lingo. The butcher, the baker and the candlestick maker all have their own terminologies.

We in the localization industry are just as guilty as everybody else, as we regularly talk about TM (Translation Memory), MT (Machine Translation), TAT (Turn-Around Time), CAT (Computer Assisted Translation) tools, and ICE (In Context Exact) matches. Some of the most code-like terms on the face of the earth are the G11N (Globalization), I18N (Internationalization), L10N (Localization) and T9N (Translation) that some collectively call GILT. People outside of our industry looking in might find it difficult to understand what on earth we’re talking about, but for the purposes of sharing information amongst ourselves, it works. One might argue, however, that this type of codified jargon only makes it more difficult for people (i.e. clients, customers) on the outside to understand and appreciate what the localization industry does. How ironic, considering our business is about bridging gaps and sharing information.

What to keep in mind when localizing

Before blindly sending a source text into the pipeline and expecting a stellar target output, just remember that language is inherently a blade that cuts both ways — it can communicate and it can hide. Details that are missing from the source, but necessary to provide a coherent translation, might need to be inferred through context. Cultural issues that go without mention in the source may need to be explained to a reader with a differing background. What’s the “hidden” piece that needs to be exposed in order to make the target text really work?

Filling in those gaps in a way that makes sense is what sets the experienced translator or transcreator apart from a machine. Understanding the true nature of the source and target languages, including the culture behind them and the pitfalls that may exist in between, is the secret to success.

Douglass McGowan

Related Posts

June 2, 2020

Better localization by understanding the code called “language”

Douglass McGowan

June 2, 2020

Better localization by understanding the code called “language”

Cameron Rasmussen

June 2, 2020

Better localization by understanding the code called “language”

Douglass McGowan