
It doesn't sound too different from a voice coming over a walkie-talkie or some kind of intercom.

The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.

The main problem is that the algorithms don't yet know what to stress in a sentence. The problem is semantic, and not so much about the sound of the voice itself.

You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.

Then again, that's probably easier to work around by having humans annotate the sentences to be read.

> Then again, that's probably easier to work around by having humans annotate the sentences to be read.

Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)

Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:


pragmatic*, placing stress is less a problem of word meaning than it is of speaker adaptation for listener comprehension, emphasis, and prosodic tendencies.

Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they are using very few samples, "less than a minute" they claim (and they advertise this because it makes their results impressive in some sense). Speech systems are usually trained on triphones. The number of triphones (3-phoneme-grams) needed to cover a language's phonemic inventory is huge (50 phonemes = 50^3 = 125,000 possible triphones, which could mean a few hours of audio, although many will not occur within the language given its phonotactics).
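
To put rough numbers on the triphone point, here's a quick sketch. It assumes a 50-phoneme inventory and counts ordered triples with repetition (n^3), ignoring phonotactics, so it's only an upper bound. The phoneme sequence is a made-up ARPAbet-style transcription for illustration, and the function names are hypothetical, not from any speech toolkit.

```python
def possible_triphones(n_phonemes: int) -> int:
    # Upper bound: every ordered triple of phonemes, repetition allowed.
    return n_phonemes ** 3


def observed_triphones(phonemes):
    # Distinct sliding windows of three consecutive phonemes.
    return {tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)}


inventory_size = 50  # assumed inventory size, as in the comment above
total = possible_triphones(inventory_size)

# Toy transcription (hypothetical), roughly "hello world".
sample = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
seen = observed_triphones(sample)

print(f"possible triphones: {total}")
print(f"distinct triphones in sample: {len(seen)}")
print(f"coverage: {len(seen) / total:.6%}")
```

Even if phonotactics rule out most of those triples, a minute of speech only contains a few hundred phonemes, so it can't come close to covering what's left.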
