Sunday, January 17, 2010

4. Translation: Digital Lingua Franca

“Any sufficiently advanced technology is indistinguishable from magic.”
-Arthur C. Clarke [RIP]


As this device becomes a reality, it will solve one of our oldest problems: automated, real-time, spoken-language translation is possible today. [UPDATE]

From an essay I wrote in December 2008: “Using only free software and a steady internet connection, in half an hour I set up something which, while on the surface merely a clever exercise in futility, actually holds more social implications than most people seem to be aware of. The process is as follows: when I spoke into a makeshift "mic" (made from the cheapest possible dollar-store headphones) feeding the freeware program "Wav To Text", what I said (carefully pronounced) would appear in real time as on-screen text. This was then copied and pasted into Google Translate, which converted the English message into Spanish. The Spanish message was then copied and pasted into a simple online text-to-speech converter and turned into sound. The results can actually be heard at this link: http://tts.imtranslator.net/2cwi
The process can be summarized as follows: English speech → English text → (translation) → Spanish text → Spanish speech (albeit robotized). Observing it in action, automated by a (free) Windows macro program, is actually not very impressive: it takes about 40 seconds and sometimes doesn't work. The initial results are impressive, however, because my test phrase, "What are you doing here in New York?", was successfully converted into a HAL-like voiced version of "¿Qué estás haciendo, aquí en Nueva York?" (the equivalent Spanish phrase) without any human guidance. What is more impressive is that this was done in perhaps the most "ghetto" way imaginable: all software was free, all websites are public, and the hardware was some of the cheapest and most readily available possible. This could be repeated by almost anyone with access to a working computer and the internet. Imagine what could be accomplished if the research and effort that went into Google Translate itself were applied to such a project.”
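The relay above can be sketched in a few lines of Python. The three stage functions are hypothetical stand-ins for the actual tools (Wav To Text, Google Translate, and the online TTS converter), since those can't be called from a script; only the chaining logic is real.

```python
# A minimal sketch of the speech-translation relay described above.
# Each stage is a hypothetical stand-in for a real tool from the essay.

def speech_to_text(audio):
    # Stand-in for "Wav To Text": pretend recognition already happened.
    return audio["recognized_text"]

def translate(text, source="en", target="es"):
    # Stand-in for Google Translate: a one-entry phrase table for the demo.
    phrases = {"What are you doing here in New York?":
               "¿Qué estás haciendo aquí en Nueva York?"}
    return phrases.get(text, text)

def text_to_speech(text):
    # Stand-in for the online TTS converter: tag the text as synthesized audio.
    return {"synthesized_audio_for": text}

def relay(audio):
    """English speech -> English text -> Spanish text -> Spanish speech."""
    return text_to_speech(translate(speech_to_text(audio)))

spoken = {"recognized_text": "What are you doing here in New York?"}
print(relay(spoken))
# -> {'synthesized_audio_for': '¿Qué estás haciendo aquí en Nueva York?'}
```

The point of the sketch is how little glue the relay needs: each stage is a plain function of the previous stage's output, which is why a free macro program could automate it.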


So we've proven that two people without a common tongue can hold a spoken conversation in almost real time (the process could probably be shaved down to a one-second delay). To make this a little less impersonal, however, it would be nice not to have to rely on an artificial, robotic voice. Enter this technology:

"When Prodigy's next album drops, it could debut in nearly 1,500 different languages without the rapper having to so much as crack a translation dictionary. The lyrics to "H.N.I.C. Part 2" will be translated using proprietary speech-conversion software developed by Voxonic...
Here's how the Voxonic translation process works. After translating the lyrics by hand, the text is rerecorded by a professional speaker in the selected language. Proprietary software is used to extract phonemes, or basic sounds, from Prodigy's original recording to create a voice model. The model is then applied to the spoken translation to produce the new lyrics in Prodigy's voice. 'A 10-minute sample is all we need to imprint his voice in Spanish, Italian or any language,' said Deutsch... "
Now see it in action, keeping in mind that he doesn't know a word of Spanish and he isn't singing. This is a machine simulation of his voice: (sorry, the video keeps getting deleted...)


So a more complex version of my makeshift translation center would use similar technology, taking the text version of what it wants to say and applying the voice profile that my device has built up and stored through hours of previous use, in order to produce the translated speech (as opposed to the robot voice). This works by analyzing the exact overtones that make my voice unique, through Fourier analysis (again, it all comes down to a formula). The listener would actually hear "Entonces, ¿qué estás haciendo aquí, en Nueva York?" in what sounds like my own voice, even though I've never been able to speak Spanish! I speak in English, he hears Spanish; he responds in Spanish, and I hear English, ad infinitum. What social effects would this have if plugged into a video chat service like Skype? "Google Video" was integrated into all Gmail accounts as of 12/2008, and besides Google Translate, they've been working on speech recognition for some time:
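The Fourier-analysis step can be illustrated with a toy example. The "voice" below is a synthetic tone with a fundamental at 100 Hz plus two overtones, and a plain discrete Fourier transform (pure Python, no libraries) recovers their relative strengths — the kind of overtone fingerprint a voice profile would be built from. The signal and frequencies are invented for the demo.

```python
import math

SAMPLE_RATE = 1000  # samples per second
N = 1000            # one second of signal

# A stand-in "voice": fundamental at 100 Hz plus two weaker overtones.
signal = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE)           # fundamental
          + 0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)   # 1st overtone
          + 0.25 * math.sin(2 * math.pi * 300 * t / SAMPLE_RATE)  # 2nd overtone
          for t in range(N)]

def dft_magnitude(signal, freq_hz):
    """Magnitude of the DFT bin corresponding to freq_hz."""
    k = freq_hz * len(signal) // SAMPLE_RATE
    re = sum(x * math.cos(2 * math.pi * k * n / len(signal))
             for n, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * k * n / len(signal))
             for n, x in enumerate(signal))
    return math.hypot(re, im)

# The relative overtone strengths form a crude "voice profile":
# strong at 100 Hz, half as strong at 200 Hz, and so on.
profile = {f: round(dft_magnitude(signal, f)) for f in (100, 200, 300, 400)}
print(profile)
# -> {100: 500, 200: 250, 300: 125, 400: 0}
```

A real system would do this on short windows of actual speech and capture far more structure, but the principle is the same: the overtone amplitudes are what make a voice recognizable, and they fall out of a formula.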

"Today, the Google speech team (part of Google Research) is launching the Google Elections Video Search gadget, our modest contribution to the electoral process. With the help of our speech recognition technologies, videos from YouTube's Politicians channels are automatically transcribed from speech to text and indexed. Using the gadget you can search not only the titles and descriptions of the videos, but also their spoken content. Additionally, since speech recognition tells us exactly when words are spoken in the video, you can jump right to the most relevant parts of the videos you find. In addition to providing voters with election information, we also hope to find out more about how people use speech technology to search and consume videos, and to learn what works and what doesn't, to help us improve our products."

There are some skeptics: "H. Samy Alim, a professor of anthropology at the University of California at Los Angeles who specializes in global hip-hop culture and sociolinguistics, also doubted the newly minted songs would retain the clever wordplay and innovative rhyme schemes inherent in popular music. Besides, he laughed, 'How do you translate "fo shizzle" in a way that retains its creativity and humor for a global audience?'"


While correct, this is an oversimplification. "Fo shizzle" wouldn't be translated, because certain things are better left alone and learned. I don't need "Habibi" translated into "baby" in order to understand the lyrics of Amr Diab…
Like I said, though, real-time translation is possible today; it's just a matter of a little funding and cooperation.

So what would it mean if everyone on earth could understand one another?

Relevant links and updates:

http://research.microsoft.com/en-us/groups/speech/

A Trainable Text-to-Speech Synthesis

We developed a new, statistically trained, automatic text-to-speech (TTS) system. Unlike our previous, concatenation-based TTS, the new one includes these distinctive features: 1) a universal, maximum-likelihood criterion for model training and speech generation; 2) a relatively small training database, needing just about 500 sentences to train a decent voice font; 3) a small-footprint (less than 2 megabytes) hidden Markov model (HMM); 4) flexible, easy modification of spectrum, gain, speaking rate, pitch range of synthesized speech, and other relevant parameters; 5) fast adaptation to a new speaker; and, 6) more predictable synthesis for pronouncing named entities. With its easy training and compact size, the new HMM is ideal for quick prototyping of a personalized TTS.
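For a sense of the HMM formalism the quoted system is built on (this is a generic textbook illustration, not Microsoft's code), the forward algorithm computes the probability of an observation sequence under a model — the same machinery, at toy scale, that lets a trained voice model score and generate speech parameters. The two-state model and its numbers below are made up.

```python
# Forward algorithm for a tiny hand-made two-state HMM (illustration only).
# Observations are drawn from a two-symbol alphabet {0, 1}.

STATES = [0, 1]
initial = [0.6, 0.4]                # P(state at t=0)
trans = [[0.7, 0.3], [0.4, 0.6]]    # P(next state | current state)
emit = [[0.9, 0.1], [0.2, 0.8]]     # P(observation | state)

def forward(observations):
    """Total probability of the observation sequence under the model."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = [initial[s] * emit[s][observations[0]] for s in STATES]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in STATES) * emit[s][obs]
                 for s in STATES]
    return sum(alpha)

print(round(forward([0, 1, 0]), 5))
```

A real TTS model has many more states and emits continuous spectral parameters rather than symbols, but training it is, at bottom, fitting these same tables to data — which is why only about 500 sentences can yield a usable voice.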



Update: Google is on it. With Google Voice they're offering free transcription of voicemail (read: free labor for debugging), with a rating system to let them know whether or not the transcription was accurate. Obviously they're feeding this back into the system to tell it where it fails. Soon they'll enable a feature like "translate this message". Then you'll be able to translate your instant messages. Then you'll be able to transcribe video conversations. Then you'll be able to translate those transcriptions. Then you'll be able to do it in real time, and we'll have opened Pandora's box.

http://bigthink.com/zacharyshtogren/your-next-translator-may-be-a-robot

One outstanding task on the global conversation to-do list is how to communicate across languages on all our various new media. Now, a linguistic brain trust at MIT has stepped in to develop a real-time solution to not understanding each other.

The approach, pioneered by Pedro Torres-Carrasquillo of MIT’s Lincoln Laboratory, requires audio mapping a speaker’s low-level acoustics—the intonation of vowel and consonant groupings. Pedro Torres-Carrasquillo found that by focusing on these tiny parts of spoken language he could arrive at a much more accurate identification of a particular dialect than analyzing phonemes—a language’s word and phrasal groups.

Though a few years away, the real-world applications of the work are sweeping. In a surveillance context, low-level acoustics mapping could let a wiretapper narrow down a criminal's location by dialect. From just one utterance, a phone system like Skype could identify a speaker's language and regional dialect for the common user.
