
KTH and Wikipedia develop first crowdsourced speech engine - conductor
https://www.kth.se/en/forskning/artiklar/kth-hjalper-wikipedia-borja-prata-1.631897
======
hacker_9
Interesting stuff. As a side note, I've always thought that if speech
synthesis got good enough we could make much more immersive gaming worlds.
Playing games such as Fallout, Witcher, GTA, etc., you keep hearing the same 10
sentences uttered no matter where you go, and it really breaks the immersion
when it becomes so obvious how wooden the NPCs are.

So if the game instead contained thousands of lines of written statements,
then a speech synthesizer could choose a line at random and play it, and just
by sheer probability you'd be unlikely to hear the same thing twice.
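
To make the idea concrete, here's a minimal sketch of that pick-and-play loop; the browser's standard Web Speech API stands in for whatever TTS a game engine would actually use, and the line pool is made up for illustration:

    // Hypothetical pool of NPC lines; a real game would load thousands from its data files.
    const npcLines: string[] = [
      "Nice weather for this time of year.",
      "I hear the caravans stopped coming.",
      "Keep your blade sheathed in town.",
    ];

    // Pick a line at random and hand it to the speech synthesizer.
    function sayRandomNpcLine(): void {
      const line = npcLines[Math.floor(Math.random() * npcLines.length)];
      window.speechSynthesis.speak(new SpeechSynthesisUtterance(line));
    }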

With all these recent advances in neural nets, could speech synthesis be made
to sound more human by giving it a huge set of training examples? You could
have 10-20 different voice actors, each training a different NN by reading
through a dictionary out loud, and then at the end you could synthesize any
sentence with one of the NNs and it would sound like the voice actor actually
said it.

~~~
blixt
By the time we have generative neural networks capable of replicating human
voice with emotion and nuance (way more difficult than neutrally reading a
text on Wikipedia), I think it's fair to assume we'll also have decent
"thought vector" networks that – much like how neural networks can turn words
or even sentences into vectors and back (translation) – can turn the meaning a
character wants to convey into multiple sentences arranged in unique ways.
Basically taking your example a bit further.

~~~
aphelion
An algorithm understanding enough about the text to infer the correct
emotional inflection to give a speech may edge into the category of strong AI,
but I would have guessed that it would be easier to create a neural network
that, given a text spoken in one voice, could transform it into another with
the correct stress, intonation, etc. Perhaps even that's a more difficult task
than I assumed, although speech generation seems to receive a lot less
academic and industrial attention than speech recognition and understanding.

~~~
michael_h
Yeah, it's a bit more difficult of a task than you've assumed.

Speech synthesis receives a lot of attention, but it's _hard_, so you rarely
hear any news about it. People are throwing DNNs at it at the moment, but
nothing earth-shattering has come of it (yet). I have a couple of
'naturalness' filters that use DNNs, and about 30% of the time they drop all
of their tones and I end up with an angry whisper as output. I don't work late
too often.

~~~
TuringTest
For people interested in how hard it is, I recently read this [1] NYT article
providing a comparison of synthetic speech that IBM experts tested for Watson
in the Jeopardy competition.

[1] [http://www.nytimes.com/2016/02/15/technology/creating-a-comp...](http://www.nytimes.com/2016/02/15/technology/creating-a-computer-voice-that-people-like.html?_r=0)

------
takno
I've started looking at using the pretty impressive speech synthesis in Chrome
on Android. Hopefully this work will feed into having other platforms and
browsers get to the same state of usability, as well as directly benefiting
Wikipedia.
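
For anyone who wants to poke at it, the browser side is just the standard Web Speech API; a rough sketch of what that looks like (nothing here is specific to the KTH/Wikipedia work):

    // List the voices the browser ships with; on Chrome for Android these are Google's
    // TTS voices. Note getVoices() can return empty until the 'voiceschanged' event fires.
    window.speechSynthesis.getVoices().forEach(v => console.log(v.name, v.lang));

    // Speak a sentence with the default voice for a given language.
    const utterance = new SpeechSynthesisUtterance("Welcome to Wikipedia.");
    utterance.lang = "en-GB";
    window.speechSynthesis.speak(utterance);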

I'm really interested in the higher level - how to manage navigation, and
represent things like tabular and graphical info, and how to get articles to
be written with the spoken alternative in mind.

I guess the ultimate for me on Wikipedia is a more multimedia presentation
format, where articles are text for people who want that, and more like the
Hitchhiker's Guide to the Galaxy where that works better.

------
bobajeff
I'd like to imagine a future where Morgan Freeman is reading me the recipe off
of Allrecipes.

~~~
mdorazio
I think this wouldn't actually be terribly difficult if a company hired Morgan
Freeman for a few hours of studio time, similar to how you can get Mr. T
directions on TomTom. There aren't terribly many phrases or ingredients used
in most recipes, so the total set of required recordings is manageably small.

~~~
vmorgulis
In _The Congress_, Robin Wright sells a digital version of herself to a studio.

[https://en.wikipedia.org/wiki/The_Congress_%282013_film%29](https://en.wikipedia.org/wiki/The_Congress_%282013_film%29)

------
andrey_utkin
BTW, some of the recent voices of the Festival speech synthesis engine are awesome.

Try "Nick - 2 (English RP)" and "Peter (English RP male)" at
[http://www.cstr.ed.ac.uk/projects/festival/morevoices.html](http://www.cstr.ed.ac.uk/projects/festival/morevoices.html)

They are not publicly available, though.

------
aaron695
It'll be pretty amazing if they pull this off. It could mean education is
revolutionised. A very time-consuming part of educational videos is dialog and,
more importantly, correcting errors. This is no small thing.

------
Animats
Speech out, not in.

------
th-ai
Anyone know of an open TTS engine that provides delay (duration) timings per
word spoken? Thanks

~~~
nshm
Any of them have this information available in some form - openmary, festival.
For example in Festival, you can access synthesized utt markup with
utt.something functions like this: (utt.save.segs (utt.synth (Utterance Text
"Hello world")) "out.seg")

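Not an open engine, but since Chrome's synthesis came up elsewhere in the thread: the Web Speech API also reports word boundaries as it speaks, which gives you rough per-word timings. A sketch, assuming a browser that actually fires word boundary events:

    // Log the character offset and elapsed time of each word as it is spoken.
    const utt = new SpeechSynthesisUtterance("Hello world from the speech synthesizer.");
    utt.onboundary = (event: SpeechSynthesisEvent) => {
      if (event.name === "word") {
        // elapsedTime is in seconds per the spec (older Chrome builds reported milliseconds).
        console.log(`word at char ${event.charIndex}, t=${event.elapsedTime}`);
      }
    };
    window.speechSynthesis.speak(utt);
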
------
m00dy
I love <3 my university.

