
While all of these vec2speech-type models are impressive, I get the feeling that most of the commenters didn't listen to any of the samples. It's still distinctly robotic sounding, probably has quite a bit of garbage output that needs to be filtered manually (as many of these nets do), and is a far cry from fooling a human.



I've been editing audio professionally for ~20 years now. This is short of perfect, but it's already good enough to be used in some kinds of editing emergencies - coincidentally exactly the ones where you're likely to have audio problems in the first place.

For example, in a movie with a busy action scene involving gunfire/helicopters/storms the production audio is usually useless because there were fans and other loud machinery on the film set to create visual illusions of powerful winds and so on. In this situation the audio is either not recorded or used only as a guide track to match timing on a high-quality track recorded later in the studio. Actors hate re-recording dialog by lip-syncing in front of a screen and producers hate paying for it. This solution is already good enough to use for incidental characters in scenes that are going to be noisy anyway. I give it about 5 years before it's good enough to use in ordinary dialog scenes - not for the intimate conversation between the two Famous Actors flirting over a romantic dinner, but fine for replacing the waiter or other background characters.


In the first clip, I'd say 80% of the soundbites were obviously robot-like, but one or two of the "Obama" quotes were startlingly clear - "The good news is, that they will offer the technology to anyone" - I can't hear anything wrong with that in the first clip at all. If they were all that quality I'd say we'd be easily fooled. As a proof of concept this is pretty big.


I can definitely hear issues with that phrase. It has quite robotic drop-offs.

Though coming soon: Neural networks to determine whether speech is NN-generated? :P


> Neural networks to determine whether speech is NN-generated?

I guess this would be an ideal use case for a generative adversarial network based approach.


This is likely part of how the speech-generating NNs are trained (ie. there's a generated-speech-detector and the network is trained to fool it, while it is also trained): https://arxiv.org/abs/1406.2661
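
Roughly, the adversarial setup looks like this. A toy sketch of my own, not the paper's or Lyrebird's actual model - the frame size, network sizes, and the sine-wave stand-in for "real" audio are all invented:

    import torch
    import torch.nn as nn

    FRAME, NOISE = 128, 16   # hypothetical audio-frame and latent sizes

    G = nn.Sequential(nn.Linear(NOISE, 64), nn.ReLU(), nn.Linear(64, FRAME))  # generator
    D = nn.Sequential(nn.Linear(FRAME, 64), nn.ReLU(), nn.Linear(64, 1))      # detector

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    def real_batch(n=32):
        # Stand-in for real speech frames; a real system would feed
        # spectrogram slices or raw waveform windows here.
        t = torch.linspace(0, 6.28, FRAME)
        return torch.sin(t * (1 + 4 * torch.rand(n, 1)))

    for step in range(1000):
        real = real_batch()
        fake = G(torch.randn(real.size(0), NOISE))

        # Train the detector: label real frames 1, generated frames 0.
        d_loss = bce(D(real), torch.ones(real.size(0), 1)) + \
                 bce(D(fake.detach()), torch.zeros(real.size(0), 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Train the generator to make the detector say "real".
        g_loss = bce(D(fake), torch.ones(real.size(0), 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()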


And then the generators train their own generation NN against that :P


And you have invented Generative Adversarial Networks. They are the basis of all new ML findings, like pix2pix.


In a sort of Turing Test where I don't know who's a robot, or where I'm not even expecting a robot, it would probably be a bit harder.


The "Obama" material sounded quite good, but reverb can cover up a multitude of sins...


Huh.

That means that recordings of speeches/performances from concert halls are potentially suspicious.

Know of any other instances of reverb covering things up? This is interesting!


> "potentially suspicious"

History is now doomed. Crackly recordings are obviously fakeable. Children will listen to JFK's "We shall not go to the moon" speech, proof that the moon landings are a liberal conspiracy and all that grainy footage is just CGI with a noise filter.


This idea hit me harder than I expected -- I am reminded of the scene in Interstellar, where the main character's daughter's teacher asserts that we never went to the moon. I don't recall whether she genuinely didn't believe it, or whether she felt it was better to lie to the kids to keep them motivated in the present, but apparently we are getting much closer to having gatekeepers of knowledge be able to actively subvert the artifacts of history in ways which are indistinguishable to the audience. Scary stuff.


> I don't recall whether she genuinely didn't believe it, or whether she felt it was better to lie to the kids to keep them motivated in the present

She stated it as a simple fact and seemed to believe it. There wasn't any indication that she might think it was a fact of what one might call "political convenience". It made the scene just that much more chilling to me: https://www.youtube.com/watch?v=MpKUBHz6MB4


This is nothing new... "History is written by the winners" comes to mind.


Wasn't it the main character's teacher?


There have always been gatekeepers to knowledge. Technology has in many ways blasted down these gates for many.


Yeah, that's a real issue. You can already fake audio very, very easily. People don't realize how constructed movie soundtracks are, and I don't mean in terms of special effects but just people talking in everyday situations.


I haven't worked extensively in voice adaptation, but I learned working in text-to-speech that adding a bit of reverb is quite effective at covering up artifacts.

Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better. Not sure what's going on — whether the reverb jams the critical listening faculties in one's brain or something like that.
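
If anyone wants to hear the effect for themselves, here's a rough sketch of the trick (assumptions on my part: a mono 16-bit WAV, an invented filename, a synthetic decaying-noise impulse response standing in for a real room, and an arbitrary 0.15 wet level):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    rate, dry = wavfile.read("tts_output.wav")   # hypothetical file, mono 16-bit assumed
    dry = dry.astype(np.float64) / 32768.0

    # ~0.3 s of exponentially decaying noise as a crude room impulse response.
    n = int(0.3 * rate)
    ir = np.random.randn(n) * np.exp(-6.0 * np.linspace(0.0, 1.0, n))

    wet = fftconvolve(dry, ir)[: len(dry)]
    wet /= np.max(np.abs(wet)) + 1e-9

    mix = dry + 0.15 * wet                        # keep it just above subliminal
    mix /= np.max(np.abs(mix)) + 1e-9
    wavfile.write("tts_output_reverb.wav", rate, (mix * 32767).astype(np.int16))

It's not a lot of processing, but the tail seems to soften exactly the fine edges where the synthesis artifacts sit.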


> Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better.

a.k.a. why your singing sounds way better in the shower.


Wow. TIL x 2!

Now I understand why I liked adding a bit of reverb when listening to old MOD/IT/S3M audio files - it covered up the "digitalness" of the song structure a bit.

Thanks for the live vocals tidbit too, that's definitely something to file away.

I wonder how far you could push that in a presentational context (ie, when giving speeches), or whether "who left the speakers in 'dramatic cathedral' mode" would happen before "I dunno what they did to the audio but it sounds great". Maybe if the presentation area was fairly open/large it could work; the question is whether it would have a constructive effect.


In my (admittedly limited) experience mixing for both recording and live settings, I can say that the "sounds great" comes a long way before the "dramatic cathedral mode". If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong. What you want is a bit of fullness, slightly softer edges at the end of words/sentences.

It's similar to the difference between 24fps cinema and 60fps home video. The video/clean signal retains more of the original information and is "more correct", but 24fps/touch of reverb adds a nuance that keeps things from getting too clinical. As to why we interpret clean signal == clinical == bad.... I can't really speculate.


Brains are incredible differential engines not evolved to handle current technology.

A standard-quality video is just a projection; a high-quality video stream on a 4K set running a full 120 Hz is a weird window from which we don't get stereo depth cues. The brain constantly has to remind itself it's not real as we shift our heads and the POV doesn't adjust.


I tried an Oculus a while ago and found it to be quite unrealistic, digital and "fake." (And I only had it on for about 30 seconds, but my eyes felt a bit sore afterwards!)

Once LCD density allows for VR with 4K (or, if needed, 8K) per eye... yeah :) we'll firmly be in the virtual reality revolution.

Obviously we'll also need tracking and rendering that can keep up but display density is one of the trickier problems right now.


Interesting.

> If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong.

I was thinking precisely that; I figured it'd need to be subtle and just-above-subliminal to have the most effect.

Completely agree about the 24fps-vs-60fps thing. I think this is a combination of both the fact that the lower framerate is less visual stimulation, and that I'm used to both the decreased visual stress and the overall more jittery aesthetic of 24fps.

Regarding >24fps, I think how it's used is critical.

I remember noticing https://imgur.com/gallery/2j98Y4e/comment/994755017/1 (yes, a random imgur gif - discovering that imgur doesn't have an FPS limit was nice though). I think this particular example pushes the aesthetics ever so slightly, but still looks pretty good.

I don't know where I found it but I remember watching a 48fps example clip of The Hobbit some time back. That looked really nice; I completely agree 48fps is a great target that still retains the almost-imperceptible jitter associated with 24fps playback.

To me the "nope"/sad end of the spectrum is motion smoothing. I happened to notice a TV running some or other animated movie with motion smoothing on while in an electronics store a few months ago... eughhh. It made an already artificial-enough video (I think it was Monsters Inc University) look eye-numbingly fake (particularly because the algorithm couldn't make its mind up about how much to smooth the video as it played, so some of it was jittery and some of it was butter-smooth). I honestly hope the idea doesn't catch on; it'll ruin kids and doom us to having to put up with utterly unrealistic games.

But I can see that's the direction we're headed in: 144Hz LCD panels are already a thing, and VR has a ton of backing behind it, so it makes a lot of sense that VR will go >200Hz over (if not within) the next 5 or so years.

The utterly annoying thing is that raising the framerates this high almost completely removes the render latency margins devs can currently play with. Rock-steady 60fps (with few drops below ~40fps) is hard enough but manageable on reasonable settings in most games nowadays (I think?), but when everyone seriously starts pining for 144fps+ at 4K, it's going to get a lot harder to keep the framerate consistent - now that we've hit ~4GHz, Moore's law won't allow breathing room for architectural overhead as has been the case for the past ~decade, and with current system designs (looking holistically at CPU, memory, GPU, system bus, game engine) we're already pushing everything pretty hard to get what we have.

So that problem will need to be solved before 144fps+ becomes a reality. A friend who has a 144Hz LCD says that going back to 60Hz just for desktop usage is really hard because the mouse is more responsive and everything just "feels" faster and more fluid. I'm not quite sure whether the games he plays keep up with 144fps though :P

On a separate note, I've never been able to make the current crop of 3D games "work" for my brain - everyone's pushing for more realism, more fluidity, etc etc, and it just drives things further and further into the uncanny valley for me, because realtime-rendered graphics still look terribly fake. Give me something glitchy and unrealistic in some way any day.


I think what it does is mask/smear the fine detail - the texture of the sound - but in a way that we are used to, so still sounds natural.


Adding a bit of autotune has made whole careers...


Also - lowering the bit-rate can cover up other defects (e.g. a phone call).

The cadence was a bit off/unnatural, but I'm sure that is not too hard to fix. Phone-in TV/Radio/web shows are about to get very interesting.
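
Something like this would do it - a quick sketch of faking "phone quality" (the 8 kHz narrowband target and the filenames are my own assumptions): resample down and back up so the high frequencies, where synthesis artifacts tend to stick out, are simply gone.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import resample_poly

    rate, x = wavfile.read("generated_voice.wav")   # hypothetical file, mono 16-bit assumed
    x = x.astype(np.float64) / 32768.0

    narrow = resample_poly(x, 8000, rate)     # down to an 8 kHz "telephone" band
    back = resample_poly(narrow, rate, 8000)  # back up: the lost highs stay lost
    back = back[: len(x)]

    wavfile.write("generated_voice_phone.wav", rate, (back * 32767).astype(np.int16))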


Most of them are extremely mechanical, to the point where it's almost impossible to understand, but others are actually quite convincing.

I think it primarily needs to learn to respect punctuation, and to translate it into breathing pauses that match the target voice ("President giving a speech"-style long pauses vs. "politician having their ass handed to them by a journalist on TV" no-air-needed pauses).
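
A crude first pass at the punctuation-to-pause idea might look like this (the pause lengths below are invented for illustration; a real system would learn them per speaker):

    import re

    PAUSE_MS = {",": 200, ";": 300, ":": 300, ".": 500, "!": 500, "?": 500}

    def insert_pauses(text):
        # Split on punctuation and pair each chunk with a breathing pause,
        # for a hypothetical synthesizer that accepts (text, silence_ms) pairs.
        chunks = []
        for piece in re.findall(r"[^,;:.!?]+[,;:.!?]?", text):
            piece = piece.strip()
            if piece:
                chunks.append((piece, PAUSE_MS.get(piece[-1], 0)))
        return chunks

    print(insert_pauses("The good news is, that they will offer the technology to anyone."))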


Absolutely - listening through the multiple samples with different intonation from both Obama and Trump, some of the samples are much more realistic, while others come off as robotic.

Maybe it would be possible to train the system to prefer certain intonations in certain cases by rating the realism of the speech in context. It would be interesting to analyze pauses around words grouped by word2vec! Or to choose a "style" of intonation based on punctuation, parameters like words/minute, etc.


I personally like this sample the most [0]. Note that these samples are not cherry-picked - having worked with very related algorithms [1], once it is trained well it pretty much "just works".

There is a lot of room for DSP/hacks/tricks to improve audio quality - just the same as in concatenative systems, but the point of this demo is to show what is possible with raw data + deep learning. Also note that this is (as far as I am aware) learned directly on real data such as youtube, or recordings + transcripts. That is quite a bit different than approaches which require commercial grade TTS databases, which are generally professional speakers with more than 10 hours of speech each, and cost a lot of money.

[0] https://soundcloud.com/user-535691776/special-guest-at-iclr

[1] http://josesotelo.com/speechsynthesis/


Very impressive, I felt -- though there were robotic artifacts here and there (well, every several seconds), much of it was fairly natural and reasonably convincing at lo-fi quality. There was an HN post a few weeks ago from Google about their newest TTS algorithm, which was the best I've heard, I think - quite, quite human (and whose techniques were published and could be integrated into Lyrebird).


you got a link anywhere?




I missed that one, and it's the best I've heard so far. A bit of noise in the background, but the voice itself is very believable. I can imagine that in a year or two radio and TV will start using this tech instead of actual speakers. Also, translation and synchronization of any content will be so much cheaper.


Try listening to the NOAA weather radio. They've been using very good TTS for years.


Wow, they are. Very surprised.

This is so good I'm wondering whether this is actually a massive (massive) sound bank. I think it might be a sound bank.

A random radio receiver site I found that doesn't require Flash, tuned to NOAA for Akron OH: http://tunein.com/radio/NOAA-Weather-Radio-1624-s88289/

The list I got the above link from (^F "noaa"): http://tunein.com/radio/Weather-c100001531/

I suspect the warbling I'm hearing is not due to TTS imperfections but 64kbps artifacting.


The one I listen to will occasionally mispronounce something in a way that a human never would, or say the name of a punctuation mark.


Huh. I see. Do you happen to know what station you use? I'd kind of like to hear this for myself (for the sole reason that I'd like to get an idea of what it sounds like, since that does definitely sound like a TTS).


KZZ40,162.45 MHz, Deerfield NH. Note that the stations have several different voices they use for different reports. Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.

BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)


> KZZ40,162.45 MHz, Deerfield NH.

I see. I can't seem to find an online receiver for that frequency, although I did find that WZ2500 uses or seems to have used that frequency (for Wytheville VA).

I had a look at SDR.hu (a site I may or may not have just dug out of Google for the first time), but unfortunately the RTL-SDR receivers I can find seem to focus entirely on 0-30MHz. There are a couple ~400MHz receivers but nothing for ~160MHz.

(I may have fired up the receiver I found in NH and fiddled with it, puzzled, for 10 minutes before realizing the scale is in kHz, not MHz... yay)

> Note that the stations have several different voices they use for different reports.

Right.

> Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.

That's entirely possible. (But hopefully not. I kind of want to hear. :P)

> BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)

I just learned about this service, I have to admit (I'm in Australia). It sounds really nice to be able to have a computer continuously read out the weather conditions to you as they change. And I can completely relate to the idea of preferring the voice that sounds unimpressed when the weather's bad :D


It really is a useful service. Many people have battery powered radios that include AM, FM and weather radio.

I'm not surprised that they re-use the frequencies. These are local weather stations, only intended to serve a radius of a hundred miles or so (at least here on the east coast). In addition to my local station I can receive the one in Boston, about 50 miles south of me.


Do we know what they're using? I found a few references to AT&T Natural Voices.


According to Wikipedia, "In 2002, the National Weather Service contracted with Siemens Information and Communication and SpeechWorks to introduce improved, more natural voices. The Voice Improvement Plan (VIP) was implemented, involving a separate computer processor linked into CRS that fed digitized sound files to the broadcast suite.... Additional upgrades in 2003 produced an improved male voice nicknamed "Tom", which could change intonation based on the urgency of a product"

Also, here are some audio clips http://www.nws.noaa.gov/nwr/info/newvoice.html


It doesn't sound too different from a voice coming over a walkie talkie or some kind of intercom.

The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.


The main problem is that the algorithms don't yet know what to stress in a sentence. The problem is semantic, and not so much about the sound of the voice itself.

You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.

Then again, that's probably easier to work around by having humans annotate the sentences to be read.
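
There's already a standard convention for exactly that kind of human annotation: SSML, the W3C Speech Synthesis Markup Language, which a number of TTS engines accept. A hand-annotated sketch, wrapped in a Python string here for illustration (the pause length and the choice of what to emphasize are mine):

    annotated = """
    <speak>
      The good news is,
      <break time="300ms"/>
      that they will offer the technology to
      <emphasis level="strong">anyone</emphasis>.
    </speak>
    """
    print(annotated)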


> Then again, that's probably easier to work around by having humans annotate the sentences to be read.

Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)

Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:

http://vc-challenge.org/summary.html


pragmatic*, placing stress is less a problem of word meaning than it is of speaker adaptation for listener comprehension, emphasis, and prosodic tendencies.

Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they are using very few samples - "less than a minute", they claim - which they also admit, since it makes their results impressive in some sense. Triphones are usually what speech systems are trained on, and the number of triphones (3-phoneme sequences) needed to cover a language's phonemic inventory is huge: 50 phonemes gives 50^3 = 125,000 possible triphones, which could mean a few hours of audio, although many of them will never occur given the language's phonotactics.
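
Back-of-the-envelope, since the counting matters here (50 phonemes is an illustrative round number, not a claim about any particular language):

    phonemes = 50
    possible_triphones = phonemes ** 3   # ordered 3-phoneme sequences
    print(possible_triphones)            # 125000, before phonotactic pruning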


Sure, it's not clean and crisp but if you ignore the distortions and the somewhat arbitrary intonation it's pretty close to the real thing. And I would argue that it's already good enough to fool grandma for a phone scam (but then again scammers currently get away with practically no resemblance at all).


Scammers have much more success phoning Granny and claiming that they are Windows Support or "The Internet" and simply asking for what they want. There are far easier ways to con than this, and for very sensitive scams where you need spear phishing of this order, this will fall flat on its face. "Good enough" doesn't quite cut it when it comes to this.

And fwiw, I think the intonations are actually impressively learned and not random. Trump's odd yet distinct intonations are captured quite well in their samples.


I think manual editing is something people with resources and motives will do anyway.

I can see something like this being used for propaganda, if it isn't already.

Tin Foil hat is on


I imagine that even at the point where it had been improved to such an extent that a human could not tell the difference, it would still be possible to train a simple AI to tell which ones were non-human, never mind identifying the original speaker.


It might just be about the quality of the source. It seems they used public speeches. Obama sounds like he's standing in a stadium. Trump is closer, like a TV debate. Hillary sounds off to me, though.


Hillary sounded very off, but what is new? Joking aside, Obama didn't sound too bad, but the Hillary voice was bad enough I wouldn't have included it in a demo.


Depends if you want to fool a person or a computer.



