
Microsoft turns spoken English into spoken Mandarin – in the same voice - evo_9
http://thenextweb.com/microsoft/2012/11/08/microsoft-demos-amazing-english-to-mandarin-translation-allowing-for-real-time-audible-translations/?fromcat=all
======
tokenadult
To someone who spent years learning Chinese as a second language, and then
made my living for years as a Chinese-English interpreter, that was pretty
impressive.

The economics of the issue is that a machine interpreter just has to be as
good as a human interpreter, at lower cost. That's a reachable target with
today's computer technology. EVERY time I've heard someone else interpreting
English or Chinese into the other language, I have heard mistakes, and I am
chagrined to remember mistakes that I made over the years. We can't count on
error-free machine interpretation between any pair of languages (human
language is too ambiguous in many daily life cases for that), but if companies
develop tested, validated software solutions for consecutive interpreting
(what I usually did, and what is shown in the video) or simultaneous
interpreting (the harder kind of interpreting in demand at the United Nations,
where even in the best case it is not always done well), then those companies
will be able to displace a lot of human professionals who rely on their
language ability to make a living.

Right now a lot of interpreters in the United States make a lot of part-time
income from gigs that involve suddenly getting telephone calls and joining in
to interpret a telephone conversation in two languages. This is often
necessary, for example, for physician interviews of patients in emergency
rooms or pharmacist consultations with patients buying prescribed drugs (where
I last saw a posted notice on how to access such an interpretation service).
The IBM Watson project is already targeted at becoming an expert system for
medical diagnosis, and patient care markets will surely provide a lot of
income for further development of software interpretation between human
languages.

It's still good for human beings to spend the time and effort to learn another
human language (as so many HN participants have by learning English as a
second language). That's a broadening experience and an intellectual delight.
But just as riding horses is more a form of recreation these days than a basis
for being employed, so too speaking another language will be a declining
factor in seeking employment in the next decade.

~~~
qq66
I don't think there will be much of an impact on the interpreter industry
until the machine translations are significantly better than human
translations.

Human translators are so expensive today that they are only used in situations
where the translation has to be correct -- diplomacy, courtrooms, books, etc.
Until a machine is much better than a human, these use cases won't switch to
machine translation (similarly, self-driving cars won't be allowed until they
are proven to be much safer than human drivers).

On the other hand, there's a large casual market for machine translations
today for situations like reading foreign Web sites, chatting with people in
different countries, reading Tweets in a different language, etc.

~~~
ColinDabritz
Luis von Ahn, one of the creators of reCAPTCHA, has a fascinating project
going:

<http://duolingo.com/>

The idea is to teach people language at the same time as providing a real time
translation service. Apparently if you multi-plex novices (and not at a bad
rate) you get expert translation at a similar accuracy. The translators
benefit by learning language, and the service is self supporting by proving
translation.

He did an excellent TED talk on the subject:

[http://www.ted.com/talks/luis_von_ahn_massive_scale_online_c...](http://www.ted.com/talks/luis_von_ahn_massive_scale_online_collaboration.html)

~~~
Tipzntrix
Sounds like taking on interns or co-ops. Pretty nice idea.

------
paulgb
This is the second time Deep Neural Network research from the University of
Toronto has made the front page, the first being when it won first place in a
Kaggle competition <http://news.ycombinator.com/item?id=4733335>

~~~
FrojoS
Here is a GREAT talk by Geoffrey Hinton (the Prof running said lab)
[http://www.youtube.com/watch?v=DleXA5ADG78&hd=1](http://www.youtube.com/watch?v=DleXA5ADG78&hd=1)
where he explains the method.

Unfortunately, even though it was posted three times to HN
[http://www.hnsearch.com/search#request/all&q=sex+machine...](http://www.hnsearch.com/search#request/all&q=sex+machine+learning)
it never made the front page.

Here is my summary and comment: "Great talk. I don't know much about
artificial neural networks (ANNs) and even less about natural ones, but I have
the feeling that I learnt a lot from this video.

If I understand correctly, Hinton uses so many artificial neurons relative to
the amount of training data that you would usually see an overfitting effect.
However, his ANNs randomly shut off a substantial fraction (~50%) of the
neurons during each learning iteration. He calls this "dropout". A single ANN
therefore represents many different models. Most of those models never get
trained directly, but they exist in the ANN because they share their weights
with the trained models. This learning method avoids overspecializing and
therefore improves robustness with respect to new data, but it also allows
arbitrary combinations of different models, which tremendously enlarges the
pool of testable models.

When using or testing these ANNs you also "drop out" neurons during every
prediction. In practice, every rerun predicts a different result by using a
different model. Afterwards, these results are averaged. The more results, the
higher the chance that the classification is correct.
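
For intuition, here's a minimal numerical sketch of that test-time averaging
(a toy one-layer network with made-up weights, not Hinton's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W, drop=0.5):
    """One stochastic pass: randomly zero ~50% of the hidden units."""
    h = np.maximum(W @ x, 0)            # hidden layer (ReLU)
    mask = rng.random(h.shape) >= drop  # keep each unit with prob 1 - drop
    return (h * mask).sum()             # toy scalar "prediction"

x = np.ones(4)                   # dummy input
W = rng.standard_normal((8, 4))  # made-up weights
preds = [forward(x, W) for _ in range(1000)]
avg = np.mean(preds)             # each rerun samples a different sub-model
```

Each call samples a different sub-network; the average of many such runs
settles near the expected value of the whole ensemble, which is the effect
described above.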

Hinton argues that our brains work in a similar way. Among other things, this
explains: a) Why do neurons fire in a random manner? It's an implementation
equivalent to his "dropout", where only part of the neurons are used at any
given time. b) Why does spending more time on a decision improve the
likelihood of success? Even though there might be more at work, his theory
alone is able to explain the effect. The longer you think, the more models you
test, simply by rerunning the prediction. The more such predictions, the
higher the chance that the average prediction is correct.

To me, the latter also explains in an intuitive way why the "wisdom of the
crowds" works well when predicting events that many people have a halfway
sophisticated understanding of. Examples are betting on sports events or on
movies' box office success. As far as I know, no single expert beats the
"wisdom of the crowd" in such cases.

What I would like to know is: how many random, model-based predictions do you
need before the improvement rate becomes insignificant? In other words, would
humans act much smarter if they could afford more time to think about
decisions? Put another way, does the "wisdom of the crowd" effect stem from
the larger number of combined neurons and the resulting diversity of available
models, or from the larger number of predictions that are used to compute the
average? How much less effective would the crowd be if fewer people made more
(e.g. top 5) predictions, or if the crowd were made up of a few cloned
individuals?
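
As a rough, toy answer to the first question (made-up numbers and a simple
noise model rather than a real ANN): averaging n noisy predictions shrinks the
error roughly like 1/sqrt(n), so the big gains come early and then diminish.

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 1.0  # the quantity the "models" are trying to predict

def avg_error(n_predictions, trials=2000):
    """Mean absolute error of the average of n noisy predictions."""
    preds = true_value + rng.standard_normal((trials, n_predictions))
    return float(np.mean(np.abs(preds.mean(axis=1) - true_value)))

# Error keeps falling as n grows, but with rapidly diminishing returns.
errors = {n: avg_error(n) for n in [1, 10, 100, 1000]}
```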

If the limiting factor for humans is the time to predict based on many
different models, and not the number of neurons we have, this would have
interesting implications. Once a single computer had sufficient complexity to
compete with the human brain, you could merely build more of these computers
and average their opinions to arrive at better conclusions than any human
could [1]. Computers wouldn't just be faster than humans, they would be much
smarter, too.

[1] I'm talking about brain-like ANN implementations here. Obviously, we
already use specialized software to predict complex events like the weather
better than any single human could. But these are not general-purpose
machines."

~~~
chewxy
We're too busy debating the merits of the new iDevice, or bashing people who
debate the merits of the new iDevice, or wasting our upvotes on personality
cults, upvoting every little thing said personality has written.

Did I about cover all the problems with groupthink and hivemind?

It's a shame, but I hang out in /newest a lot, and I urge other HN'ers to
hang out there too.

~~~
mistermann
Where might this new place you hang out be?

~~~
chewxy
<http://news.ycombinator.com/newest>

Upvote only interesting things

------
scrrr
My current client is specialising in speech recognition, speech synthesis and
automatic translation. They have something similar, focused on enterprise
customers. I find this subject very interesting.

I am a Ruby guy and only marginally come into contact with their C++ code,
but from what I've learned so far, this stuff is extremely memory- and
CPU-hungry. It also depends on having been fed the right amount of input.
That's why Google
Translate is so good. They have tons and tons of data from all the websites
they parse, and in many cases the content can be obtained in different
languages. Corporate pages are often translated paragraph by paragraph by
humans which results in perfect raw data to train these algorithms. Also for
example all documents that the European Parliament produces are translated
into the languages of all member states.

Everything that has to do with translation has to do with context. I think the
software right now is as smart as a six year old kid, except that it has a
much bigger vocabulary. But if you say "The process has stalled. Let's kill
it." it probably only makes sense if you know you are talking about computers.

It's hard to imagine that computers one day might really understand everything
we say. But just by using Google Translate I think they really might. Это
является удивительным. (I don't speak Russian. I hope I didn't insult anyone
now. ;))

~~~
evoxed
> Corporate pages are often translated paragraph by paragraph by humans which
> results in perfect raw data to train these algorithms.

Actually this may be one of the reasons why Google's Japanese translations
_are so terrible_. The why isn't really relevant here[0] (perhaps you already
know anyway), but there are times when the raw data is the most misleading
part.

[0] Obviously I still mean those actually translating by hand, not the
companies which just throw all of their material into Google Translate and
consider it a finely proofed document. There are plenty of the latter, which
makes for an amusing loop in the system.

~~~
scrrr
I speak bad Japanese after having worked there for a while. But Japanese is
extremely context dependent. They leave out a lot of words and you just deduce
the meaning. (Don't hit me if that's wrong, but that is what I remember
thinking back when I was learning it.)

~~~
jpatokal
The way you phrase that is a bit misleading. It's not that Japanese speech
carries less information than most languages; it's just that you can set
topics (push state onto a stack, to put it in geek terms) that carry over into
subsequent phrases. The correct translation of an individual sentence may thus
depend on that previous context.

A simple but famous case is "Watashi wa hamburger desu". The sentence has no
subject, so with no context that would get translated as "I am a hamburger",
but if you fill in a previously defined subject, it could be "I [order] a
hamburger", "My [favorite food] is a hamburger", etc.
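
That stack metaphor can be made concrete with a toy sketch (the sentence,
rules, and translations here are invented purely for illustration):

```python
# Toy illustration: an earlier sentence pushes a topic that later
# subject-less sentences inherit when being translated.
def translate(sentence, topic_stack):
    if sentence == "watashi wa hamburger desu":
        # With an "order" topic in scope, the same words mean "I'll have...".
        if topic_stack and topic_stack[-1] == "order":
            return "I'll have a hamburger"
        # With no prior topic, the bare reading is an identity statement.
        return "I am a hamburger"
    return sentence

stack = []
no_context = translate("watashi wa hamburger desu", stack)
stack.append("order")  # e.g. set by a waiter asking "What will you have?"
with_context = translate("watashi wa hamburger desu", stack)
```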

~~~
thejsjunky
To clarify further for those unfamiliar with the language, a super-literal
translation of "watashi wa hanbaga desu" is something like:

    (concerning/as for) myself, (it's) hamburger.

On its own, if Bob says this, it basically comes out as "I (am) (a)
hamburger".

However, if Sally has just said something like "I'll have a salad...what about
you Bob?" then it makes sense as Bob's order is the implied subject and it
becomes "My order is hamburger." or "I'll have hamburger."

I know very little about linguistics but I think there are a bunch of other
things that make Japanese-English difficult to translate via software as well.

There is the whole aspect of culture embedded in it. あなた could mean "you" or
something like "dear/sweetie" depending on the context. There's also the
question of how to translate "you" (etc.) from English text to Japanese, as
you have to consider politeness and so on. If you are just translating a
business web page it's probably safe to stick with polite forms, but if you
are translating, say, the dialogue in a TV show, you want to preserve the tone
of the characters.

In terms of voice recognition, Japanese seems to me to have a lot of
homophones compared to English. It may just be my imagination, but here are
some I ran into recently:

舶, 錘, 頭, 摘む, 積む, 詰む, and 紡錘 are all pronounced つむ and mean completely
different things. Or 六, 碌, and 録 are pronounced ろく. 上, 神, 紙, 髪, and 加味 are all
pronounced かみ.

I seem to run into things like that regularly; when you just hear it spoken
you need the context to figure out what is meant.

~~~
nandemo
> 上, 神, 紙, 髪, and 加味 are all pronounced かみ.

This can be mostly solved by context. There are very few situations in normal
speech where you'd hear "kami" and not know if they're talking about 神 (god)
or 髪 (hair). Also, it's not particularly hard to code that knowledge. E.g. try
かみにいのる (pray to god) and かみのけをきった (cut hair) on Google Translate. It will
suggest the correct kanji in both cases.
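
A toy sketch of how that knowledge could be coded (the romanized context-word
lists below are invented for illustration): score each candidate kanji by how
many nearby words are known to co-occur with it.

```python
# Made-up co-occurrence lists for three kanji all read as "kami".
context_words = {
    "神": {"inoru", "matsuru"},  # pray, enshrine -> god
    "髪": {"kitta", "nobasu"},   # cut, grow out  -> hair
    "紙": {"oru", "kaku"},       # fold, write    -> paper
}

def disambiguate(candidates, context):
    """Pick the kanji whose known context words best match the utterance."""
    return max(candidates, key=lambda k: len(context_words[k] & set(context)))

god = disambiguate(["神", "髪", "紙"], ["kami", "ni", "inoru"])
hair = disambiguate(["神", "髪", "紙"], ["kami", "no", "ke", "o", "kitta"])
```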

Anyway, I'm not a native Japanese speaker, but I find the whole homophone
thing a bit overrated. As far as I can recall, the only pairs of homophones
that cause trouble in normal speech are 科学/化学 (both pronounced _kagaku_,
meaning science/chemistry) and 私立/市立 (_shiritsu_, private/municipal).

~~~
thejsjunky
Thanks for the reply.

> This can be mostly solved by context.

Right, as I said. It's not too bad, but it's easier when you can just
translate word for word.

かみにいのる gives me "pray to bite" on Google Translate; as you say, it suggests
the right kanji...but that's precisely my point. It needs you to disambiguate
for it to be sure.

I'm not saying this is an insurmountable problem, I'm contrasting the
difficulty.

> There are very few situations in normal speech where you'd hear "kami" and
> not know if they're talking about 神 (god) or 髪 (hair).

I ran into it recently in music. Babymetal has a song that starts:

伝説の黒髪を華麗に乱し

When you listen to the song, it'd be easy to momentarily think she might be
saying "black god" or "black paper", since while the pronunciation wouldn't be
identical, it's pretty close. Since I'm human, I figured out pretty quickly
what she was saying...but in the equivalent English phrase there's no such
issue: "hair", "god", and "paper" sound nothing alike.

This is admittedly not "normal speech", but I could see it popping up there
too.

I've seen confusion over 神/髪 in other situations too, though those were
deliberate puns so they probably don't count; still, they demonstrate that
situations exist where it's at least somewhat ambiguous.

> I find the whole homophone thing a bit overrated

I'm sure it's exaggerated for me because my Japanese is pretty atrocious, but
I think my point is valid: any time a language has homophones, it is more
difficult to set up a system that listens to speech and translates it.
Japanese seems to have more homophones than English, and if that's true it is
proportionally harder to translate in that regard.

~~~
enqk
Also, from what I understand, certain homophones are differentiated in
practice by different accenting (raising/lowering pitch) in speech. This is,
however, region-specific.

~~~
w1ntermute
This is known as pitch accent:
<http://en.wikipedia.org/wiki/Japanese_pitch_accent>

------
dbul
Translation is as much of an art as it is a science, so I wonder where this
project is headed. _Le Ton beau de Marot_ is a great book for illustrating
this point.

In college I studied Japanese, and a friend introduced me to the anime
_Initial D_. His copy had the original Japanese with English subtitles, so I
could assess the translation to some degree -- it was very good. On Netflix
you can watch Initial D, but after 2 minutes I had to turn it off because the
English dubbing really failed to capture the characters.

As someone noted in this thread, the presenter's synthesized voice in the
linked video doesn't seem to reflect his own. If he could have said something
like "Wo hui shuo putonghua" and had the machine output the same, it might
have been more convincing.

------
pbhjpbhj
I was just pondering today why PCs have adopted spell checking as a standard
feature but don't yet appear to use contextual techniques for word checking or
grammar checking. Perhaps I'm just using the wrong apps?

The speaker says "to take in much more data" but it gets parsed by the
speech-to-text as "to take it much more data", which is such an unlikely
phrase that I can't really work out why it's not auto-corrected.

The phrase provided doesn't appear in either Google's or Bing's web index.
Typing "to take i" into either Google's or Bing's search box produces "to take
in" as the most likely match, within milliseconds.

Similarly (and ironically) with "about one error out of" being parsed as
"about one air out of".

That he goes on to say they use statistical techniques and phrase analysis
for the translation makes this sort of error all the more intriguing: why
isn't that same statistical approach weeding out these errors?
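
The kind of statistical weeding being asked about can be sketched in a few
lines (the phrase counts below are invented; a real system would use web-scale
n-gram statistics): score candidate transcriptions by the frequency of their
trigrams and keep the likelier one.

```python
# Toy trigram counts; a real table would come from a large corpus.
counts = {
    ("to", "take", "in"): 900,
    ("take", "in", "much"): 120,
    ("to", "take", "it"): 400,
    ("take", "it", "much"): 1,  # "take it much more data" is vanishingly rare
}

def score(phrase):
    """Sum the corpus counts of every trigram in the phrase."""
    words = phrase.split()
    trigrams = zip(words, words[1:], words[2:])
    return sum(counts.get(t, 0) for t in trigrams)

candidates = ["to take it much more data", "to take in much more data"]
best = max(candidates, key=score)  # the statistically likelier transcription
```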

Nonetheless an impressive demonstration.

~~~
evoxed
Green squiggly excluded, I can think of two reasons off the top of my head
why even the most advanced general-purpose grammar checker would be a bit of a
controversial feature:

\- Because grammar is typically more expressive, and depends on concepts that
may not otherwise exist in the words themselves. Statistical grammar models
and context checkers would thus be much more prone to generating nonsense from
user input (along the lines of the Sokal hoax) or to restricting output to a
range of acceptable models (giving the machine its own voice, in a sense).
That leads to the second thing...

\- It kills freedom and creativity (or at least, how we receive it). Imagine
comedy routines in stoic deadpan, or perfunctory exchanges in formal
constructions (and vice versa). Obviously you could avoid all of these
situations if you wanted to, but in that case the feature should probably be
saved for special occasions. It could help a lot of businesspeople wanting to
write their statements and messages in shorthand without spewing boilerplate
text, but it's potentially damaging to every child or student who is still
finding out how they want to express _themselves_ in a given context.

Note: I think it is fair to assume that grammar checking would include the
ability to reformulate or generate text that obeys the relevant models. Spell
checkers suggest spellings; grammar checkers have to suggest fixes and changes
as well, and if we want to get any further than Win98-era Word, they will
probably need a plain old fix-it generator too.

~~~
koide
FWIW: There's a Chilean comedian whose weapon of choice is to deliver mostly
black humor in stoic deadpan. The result is hilarious. Probably because you
know he's a comedian.

------
zyb09
Fun thing to do: you can turn on "transcribe audio" on YouTube and directly
compare how Google's speech recognition tech stacks up against Microsoft's.

~~~
zmmmmm
I bet the audio directly from his mic is enormously better quality than
whatever YouTube has recorded. Plus, Google can hardly afford to dedicate
gigantic amounts of CPU to transcription; they'll be going for a crude but
useful job, whereas for this demo he probably had a whole lot of CPU grunt
dedicated to it.

~~~
rplnt
I really don't think YouTube transcribes audio for every single video. You
can see it's not available on many videos. I'd guess they run some test on the
audio to see if it's worth transcribing, and only then run a background task
to do the job...it doesn't really matter how fast.

You are right about the source quality though.

------
dchuk
The implications of this kind of technology reaching consumers in the next
decade or so are really interesting.

If we can get to the point of having handheld devices that can accomplish live
translation of spoken word, what exactly is the point of different languages
anymore?

~~~
Breakthrough
I don't follow your logic... Without those different languages, you wouldn't
have anything to translate in the first place. If anything, this type of
technology _promotes_ independent and different languages, as it makes it so
much easier to communicate with others _regardless_ of your native tongue.

Also, bravo to Microsoft; I'll remove my jaw from the floor after I watch your
video a second time.

~~~
dchuk
Different languages emerged because of physical separation. As barriers have
been reduced by technology (both physical and communication barriers), there
really aren't "borders" anymore. I can call someone in China right now if I
want to, something impossible even just 100 years ago.

So if we can all talk to each other across the world in real time, and we can
all understand each other because of this technology, what exactly is the
point of different languages anymore?

~~~
jarek
> what exactly is the point of different languages anymore?

True. This is why I will be teaching my children Basque and Basque only.

New languages might have originally emerged due to separation, but that
doesn't mean lack of separation will cause already existing languages to go
away.

~~~
DanBC
Not Basque and Spanish? Or Basque an English? Or Basque and some Chinese
language?

~~~
jarek
Nope! If we don't need different languages anymore, Basque only.

------
zalew
I have no doubt a few more long years will have to pass before these
solutions reach the mass market, but this is extraordinary, especially for
someone like me who is passionate about travelling. Our generation witnessed
the shift towards cheaper flights and easier accommodation booking, with
web/mobile tools growing year by year and becoming more helpful in organizing
our visits and in seeking information about places and cultures we don't know.
We, or the next generation, will probably witness the fall of the language
barrier; it's truly amazing and one of the most important shifts in our global
experience.

------
Groxx
Skip to 8 minutes to hear the actual translation.

I'd love some comparison - that doesn't sound like the same voice to me
(awfully close to the 'standard' computer voice, IMO), but some of it is
crummy recording quality, and showing the flexibility would go a long way
toward convincing me.

~~~
mpdaugherty
I agree that it doesn't really sound like him, but the voice is far better
than most Chinese computer voices that I've heard and is totally
understandable.

Seems like my years of learning Chinese and living in China are about to
become useless...

~~~
boyter
That was my opinion as well.

Actually, I stopped learning Chinese and living in China because I discovered
the following: they were learning English faster than I could learn Chinese,
and I only needed to know enough to show them I wasn't culturally insensitive.

Not sure where the quote came from, but it went along the lines of "Don't try
to talk in their language, because you will make a hash of it and they will
have the advantage."

------
ctingom
Now imagine this on Skype as a premium feature.

------
polshaw
Near-real-time speech-to-speech translation is awesome[1], but the voice
sounded more like how I would picture ASIMO speaking (i.e. 1980s speech
synthesis) than 'his' voice.

1\. Is anyone here fluent in Mandarin who can assess the quality of the
output?

~~~
ebzlo
Mandarin speaker here. I'm more impressed that it was able to reconstruct the
sentences properly (where traditional translation tools typically fail). The
output was fine. Obviously doesn't sound like a human speaker, but the tones
are correct.

------
joering2
Mark my words; some good changes are happening at MSFT. There are indicators
suggesting this may be a comeback. Surface seems to be gaining momentum, while
the future of Windows will be freemium plus ads (as you can imagine, with
hundreds of millions of "screens" plugged in).

------
orjan
Original post:
[http://blogs.technet.com/b/next/archive/2012/11/08/microsoft...](http://blogs.technet.com/b/next/archive/2012/11/08/microsoft-
research-shows-a-promising-new-breakthrough-in-speech-translation-
technology.aspx)

------
mmuro
It's pretty incredible how far language-to-language technologies have come and
how far they still have to go.

Very cool stuff.

------
hammock
Putting on my tinfoil hat here: if all it takes to build a speech model that
impersonates someone's voice is an hour's worth of them talking...what happens
when the wrong person gets it? For example, a government or a corporation (an
internet phone service, maybe) could use it to fabricate evidence of
conversations that never happened; it could also be used to aid identity
theft.

~~~
evoxed
To indulge you just a little bit, I think it would most likely result in a
rapid expansion of the forensic industries. While I have no real experience in
signal processing, I imagine there would be ways to determine, to some degree,
whether or not such impersonations were credible. Whether or not that would
stop tech-savvy marketers and con artists from scamming grandma, I don't know.
We'll have to wait and see what the 21st century holds for future firewalls.
Of course, if someone with any knowledge of the subject would like to step in
and point out how stupid my response sounds to them, I'd be glad to become
more informed!

------
aw3c2
Aggregator spam, direct link is
[http://blogs.technet.com/b/next/archive/2012/11/08/microsoft...](http://blogs.technet.com/b/next/archive/2012/11/08/microsoft-
research-shows-a-promising-new-breakthrough-in-speech-translation-
technology.aspx)

------
scep12
My Android phone already does voice-to-text better than the system demoed in
that video. It looks like Microsoft's research needs a bit more tuning before
it can be declared 'amazing'.

------
bobwaycott
Damnit. This is pretty much the very idea I had in college around 12 years
ago. At the time, there was nowhere near the required technology to pull this
off. Over the last few months, I'd begun rethinking through the idea again,
feeling the time was right to pull this off as a killer idea. Even began
trying to investigate how to pitch this to create a startup focused solely on
this problem.

Now it seems the time may be too late. Rats.

~~~
hnewser1
You are a bit naive, eh? Also:

<http://en.wikipedia.org/wiki/Universal_translator>

~~~
bobwaycott
Naive in what regard exactly?

EDIT: Oh, I see. You think I think I had this idea first and out of nowhere.
Wrong.

~~~
epaga
In the sense that this is an age-old dream of mankind (see Tower of Babel),
not an idea someone grabbed away from you before you could create a startup.

~~~
bobwaycott
You erroneously conclude that I think I am the first person to think of this.
I do not. I know my sci-fi _and_ my biblical and human history (20 years of
studying intellectual and cultural history, thank you). I've read about the
Tower of Babel and grew up wanting a universal translator.

My comment was a bit of bittersweet admiration at a very _specific_ success
in implementation. Can we just assume that the average person on HN knows wtf
a universal translator is, and that not everyone who says "I had this very
idea" is accusing company/person X of stealing it? Christ. Not a single word
was said about this idea being "stolen" or "grabbed away". Nor were words
employed signifying that "I thought of this first and completely
independently".

I was saying it looks like it's too late to be the person who gets to be first
to demonstrate it. And by "very idea" I meant, _quite specifically_ , how cool
it would be to be able to translate my voice into another language in near-
real-time. And by "Rats", I meant "Oh snap, looks like others have been
working on doing that same thing and impressively pulled the damn thing off."

If you spent a moment looking at my comment history, you'd see I am usually
quite clear in stating what I mean directly. Had I meant to imply an idea was
"grabbed away from [me] before I could create a startup", I'd have used those
_exact_ words. Instead, I said that I'd had this very idea years ago and only
recently thought about how the tech might be there today to make it happen,
and maybe make for an interesting startup. I didn't even try to imply it would
be _my_ startup.

Nonetheless, point taken. I'll ensure I more specifically couch any future
statements about having ideas with the qualifier that I am, in fact, _not_
implying that I had it first or independently in the whole of human history.

~~~
epaga
Sorry to offend; I was just explaining why the original commenter said you
sounded naive - clarifying what specific implementation you had in mind would
have helped in your original comment. I can see how it sounded as if you meant
the idea of a "universal translator", can't you?

~~~
bobwaycott
Honestly, it'd never even cross my mind that anyone on HN would possibly
comment that _s/he_ had come up with the idea of a universal translator _12
years ago_.

Anyway, my apologies either way. Your comment read a tad snarky with the
"grabbed away before you could create a startup" bit. I was left thinking, "Oh
great. Relegated to the company of some kid who thinks his startup idea was
stolen." My reply could have been tempered a bit more.

------
tsahyt
This is really impressive, especially the speech recognition part. I can't
really judge anything else, since I don't speak a word of Mandarin. The speech
recognition though is easily the best I've ever seen. This is almost the kind
of recognition rate needed for voice controlled interfaces to finally work.
Exciting stuff.

------
ffk
It looks like a translation we can hear occurs around 8:10. Is anyone able to
verify the correctness of the speech? (Also, remembering it's a demo and it
has probably been tested multiple times for that phrase).

------
telecuda
It's amazing that they can do this, yet there is still no high-quality
solution for changing a male voice into a believable female voice. (For
example, making a 40 year old man sound like a 16 year old girl)

------
ronyeh
Pretty cool stuff! Jump here to listen to the demo:
[http://www.youtube.com/watch?v=Nu-
nlQqFCKg&t=7m55s](http://www.youtube.com/watch?v=Nu-nlQqFCKg&t=7m55s)

------
pedalpete
I'd like to see them demo this using a few different voices. The voice still
sounded very computerized, but maybe I'm just not used to hearing this
speaker's voice.

------
leke
I think the Linux hacker MetalX1000 on YouTube did this a while back using
various [ speech-to-text -> Google Translate API -> text-to-speech ] tools.

------
sown
I seem to remember Alan Alda demonstrating something similar, but not as
performant, on _Scientific American Frontiers_ back in the mid-to-late 1990s.
Neat.

------
zmmmmm
As good as this is, it doesn't seem too much of a quantum leap from where
Google Translate is, with conversation mode, installable on every Android
phone.

I didn't hear the inflections of his voice superimposed on the Chinese voice,
so it is just modeling his voice characteristics and reflecting them in the
output voice. From what I understand, the voice modeling happened offline, not
in real time, so it is not nearly as sexy as if it were happening as he spoke.
In other words, a nice touch, but I don't see it as revolutionary (unless I'm
missing something).

~~~
nahname
I would gladly have an hour long conversation with my computer once to be able
to speak in another language.

------
edwinyzh
Very impressed! The translated Chinese is far better than the output of any
other translation tool I've seen in my entire life!

------
neotek
Every day, in every way, we get better and better.

------
andrewcooke
you can jump to 6 minutes and read the text to save time. chinese voice starts
at about 7:30.

it is pretty neat.

------
tummybug
Couldn't this be easily hacked together using Google Translate and a
text-to-speech program?

~~~
bfung
I had the same idea, but it looks like Google Translate needs a TON of work.
For example:

English: "I hope you enjoy the rest of the presentations today"
[http://translate.google.com/#en/zh-
TW/I%20hope%20you%20enjoy...](http://translate.google.com/#en/zh-
TW/I%20hope%20you%20enjoy%20the%20rest%20of%20the%20presentations%20today)
Translated into Chinese: "I hope you enjoy rest (as in take a break)/resting
introduction/presentation"

Bing translator is much better using the same English input:
<http://www.bing.com/translator>, the Chinese that comes out is grammatically
correct in Chinese. The reverse translation comes out correct as well.

But yes, it seems that if the translator is good enough, it would be simple
to reproduce the process from the video.

------
lifeisstillgood
Of course, they cannot do Mandarin to English yet - that's 2.0 :-)

~~~
swalsh
It's really unfortunate, because these languages are some of the hardest for
someone like me to navigate. For instance, I can look at "poulet et riz" and
figure out it's roughly something with chicken. But "雞肉和米飯" means nothing to
me.

~~~
jarek
But does "kurczak z kaszą" mean anything?

~~~
tjr
"kaszą" looks like "kasha", but in the context of a whole arbitrary sentence
(rather than in the context of talking about guessing at "chicken and rice" in
other languages) I'm not sure I would have thought of that.

------
yaix
Looking forward to the Android app. Five years maybe?

------
barista
The big question is when we will see this in a real product. Many interesting
innovations have come out of Microsoft Research. It would be nice if they
reached end users as well.

~~~
lifeisstillgood
Got a Kinect for your XBox yet?

~~~
jorts
True, but the Kinect is poorly implemented. I find the Kinect incredibly
frustrating and usually have it unplugged. When you're watching a show that
has a lot of dialogue it will interpret the talking from the show as commands.
These commands include things like "Fast forward 40x" or "Rewind 20x".

~~~
daeken
Interesting -- I've never had that happen. You may want to recalibrate the
microphones on your Kinect; it does some crazy analysis to filter audio from
the speakers out from the input stream.

