You are right that we'll have to worry about this soon though. Likewise, verifying the identity of people we think we're talking to over video calls, for example.
...Is it normal to feel bad for the Johnnycab "driver" when Arnie destroys it?
Books like "Trust Me, I'm Lying" reveal the lengths to which deception can go. Though the book discusses deception that starts at the textual level (e.g. blogs), it is inevitable that these tactics will be translated to video once the technology catches up.
Also, "at face value" - Ha! ;)
I suggest you rethink your assessment of what is happening here.
She had been a singer and strongly identified herself with her voice, so she wanted to be able to use a speech synthesis system that had her own voice pattern.
Apologies if this was already mentioned, but it seems to be a use others here hadn't considered.
I could see him doing such a switch for personal interactions.
> "The voice I use is a very old hardware speech synthesizer made in 1986," he said. "I keep it because I have not heard a voice I like better and because I have identified with it."
I'm not sure how sentimental he is but he does seem quite tied to that voice since there have been lots of advancements in voice synthesis since he originally got this and yet he's chosen to keep this one.
Should be just as possible in the present or near future.
For example, in a movie with a busy action scene involving gunfire/helicopters/storms the production audio is usually useless because there were fans and other loud machinery on the film set to create visual illusions of powerful winds and so on. In this situation the audio is either not recorded or used only as a guide track to match timing on a high-quality track recorded later in the studio. Actors hate re-recording dialog by lip-syncing in front of a screen and producers hate paying for it. This solution is already good enough to use for incidental characters in scenes that are going to be noisy anyway. I give it about 5 years before it's good enough to use in ordinary dialog scenes - not for the intimate conversation between the two Famous Actors flirting over a romantic dinner, but fine for replacing the waiter or other background characters.
Though coming soon: Neural networks to determine whether speech is NN-generated? :P
I guess this would be an ideal use case for a generative adversarial network based approach.
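To make that concrete: the "adversarial" framing just means a detector (discriminator) trained against generated samples. Here's a toy, self-contained sketch of the detector half alone, on simulated features; the assumption that synthetic speech has damped high-band energy is purely illustrative, not a fact about any real system.

```python
import numpy as np

# Toy sketch of the "discriminator" half of a GAN-style detector:
# a logistic classifier trained to separate "real" speech features from
# "synthetic" ones. Feature vectors here are simulated, not real audio.
rng = np.random.default_rng(0)

def make_features(n, synthetic):
    # Pretend synthetic speech has slightly damped high-frequency energy.
    base = rng.normal(0.0, 1.0, (n, 8))
    if synthetic:
        base[:, 4:] *= 0.6  # damped upper bands
    return base

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.vstack([make_features(500, False), make_features(500, True)])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 1 = synthetic

# Square the per-band values so the variance difference becomes a mean
# difference that a linear model can pick up.
F = X ** 2
w = np.zeros(F.shape[1])
b = 0.0
for _ in range(2000):            # plain gradient descent
    p = sigmoid(F @ w + b)
    grad = p - y
    w -= 0.1 * (F.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

acc = ((sigmoid(F @ w + b) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

In a real GAN setup this detector would also be used as the training signal for the generator, which is exactly why a detector arms race is tricky: every improvement on one side trains the other.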
That means that recordings of speeches/performances from concert halls are potentially suspicious.
Know of any other instances of reverb covering things up? This is interesting!
History is now doomed. Crackly recordings are obviously fakeable. Children will listen to JFK's "We shall not go to the moon" speech, proof that the moon landings are a liberal conspiracy and all that grainy footage is just CGI with a noise filter.
She stated it as a simple fact and seemed to believe it. There wasn't any indication that she might think it was a fact of what one might call "political convenience". It made the scene just that much more chilling to me: https://www.youtube.com/watch?v=MpKUBHz6MB4
Something similar seems to be going on in live vocals. If you lack confidence in your own voice, adding a bit of reverb can make it sound much better. Not sure what's going on; maybe the reverb jams the critical-listening faculties in one's brain, or something like that.
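For anyone curious, "a bit of reverb" at its simplest is just convolution with a short decaying impulse response. A minimal toy sketch (the sample rate, tap times, and decay values are arbitrary illustrative choices):

```python
import numpy as np

# Minimal sketch: reverb as convolution with a short, exponentially
# decaying impulse response (a crude room model).
sr = 8000                      # sample rate in Hz, toy value
t = np.arange(sr) / sr
dry = np.sin(2 * np.pi * 220 * t)          # 1 s of a 220 Hz tone

ir = np.zeros(int(0.3 * sr))               # 300 ms impulse response
taps = (np.array([0.0, 0.07, 0.13, 0.21, 0.29]) * sr).astype(int)
ir[taps] = 0.7 ** np.arange(len(taps))     # decaying early reflections

wet = np.convolve(dry, ir)[: len(dry)]
mix = 0.8 * dry + 0.2 * wet                # subtle, just-above-subliminal blend
print(mix.shape)  # → (8000,)
```

A real room reverb has a dense, noise-like tail rather than five discrete taps, but the masking effect people describe comes from the same smearing of the signal in time.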
a.k.a. why your singing sounds way better in the shower.
Now I understand why I liked adding a bit of reverb when listening to old MOD/IT/S3M audio files - it covered up the "digitalness" of the song structure a bit.
Thanks for the live vocals tidbit too, that's definitely something to file away.
I wonder how far you could push that in a presentational context (ie, when giving speeches), or whether "who left the speakers in 'dramatic cathedral' mode" would happen before "I dunno what they did to the audio but it sounds great". Maybe if the presentation area was fairly open/large it could work; the question is whether it would have a constructive effect.
It's similar to the difference between 24fps cinema and 60fps home video. The video/clean signal retains more of the original information and is "more correct", but 24fps/touch of reverb adds a nuance that keeps things from getting too clinical. As to why we interpret clean signal == clinical == bad.... I can't really speculate.
A standard-quality video is just a projection; a high-quality video stream on a 4K set running at a full 120Hz is a weird window from which we don't get stereo depth cues. The brain constantly has to remind itself it's not real as we shift our heads and the POV doesn't adjust.
Once LCD density allows for VR with 4K (or, if needed, 8K) per eye... yeah :) we'll firmly be in the virtual reality revolution.
Obviously we'll also need tracking and rendering that can keep up but display density is one of the trickier problems right now.
> If you can listen to it and hear reverb (unless you're going for that effect) you're doing it wrong.
I was thinking precisely that; I figured it'd need to be subtle and just-above-subliminal to have the most effect.
Completely agree about the 24fps-vs-60fps thing. I think it's a combination of the lower framerate being less visual stimulation and my being used to both the decreased visual stress and the overall more jittery aesthetic of 24fps.
Regarding >24fps, I think how it's used is critical.
I remember noticing https://imgur.com/gallery/2j98Y4e/comment/994755017/1 (yes, a random imgur gif - discovering that imgur doesn't have an FPS limit was nice though). I think this particular example pushes the aesthetics ever so slightly, but still looks pretty good.
I don't know where I found it but I remember watching a 48fps example clip of The Hobbit some time back. That looked really nice; I completely agree 48fps is a great target that still retains the almost-imperceptible jitter associated with 24fps playback.
To me the "nope"/sad end of the spectrum is motion smoothing. I happened to notice a TV running some animated movie or other with motion smoothing on while in an electronics store a few months ago... eughhh. It made an already artificial-enough video (I think it was Monsters University) look eye-numbingly fake (particularly because the algorithm couldn't make up its mind about how much to smooth the video as it played, so some of it was jittery and some of it was butter-smooth). I honestly hope the idea doesn't catch on; it'll ruin kids and doom us to having to put up with utterly unrealistic games.
But I can see that's the direction we're headed in: 144Hz LCD panels are already a thing, and VR has a ton of backing behind it, so it makes a lot of sense that VR will go >200Hz over (if not within) the next 5 or so years.
The utterly annoying thing is that raising the framerates this high almost completely removes the render latency margins devs can currently play with. Rock-steady 60fps (with few drops below ~40fps) is hard enough but manageable on reasonable settings in most games nowadays (I think?), but when everyone seriously starts pining for 144fps+ at 4K, it's going to get a lot harder to keep the framerate consistent. Now that we've hit ~4GHz, Moore's law won't allow the breathing room for architectural overhead that it has for the past ~decade, and with current system designs (looking holistically at CPU, memory, GPU, system bus, game engine) we're already pushing everything pretty hard to get what we have.
So that problem will need to be solved before 144fps+ becomes a reality. A friend who has a 144Hz LCD says that going back to 60Hz just for desktop usage is really hard because the mouse is more responsive and everything just "feels" faster and more fluid. I'm not quite sure whether the games he plays keep up with 144fps though :P
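The shrinking render margin is just 1000 / fps, which makes the cliff obvious when you write it out:

```python
# Frame-time budgets: the render margin shrinks fast as target
# refresh rates climb. Each frame must finish within 1000 / fps ms.
for fps in (24, 60, 144, 200):
    budget_ms = 1000.0 / fps
    print(f"{fps:>3} fps -> {budget_ms:5.2f} ms per frame")
#  24 fps -> 41.67 ms per frame
#  60 fps -> 16.67 ms per frame
# 144 fps ->  6.94 ms per frame
# 200 fps ->  5.00 ms per frame
```

Going from 60Hz to 144Hz cuts the per-frame budget from ~16.7ms to under 7ms, and every dropped frame is proportionally more visible at the higher rate.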
On a separate note, I've never been able to make the current crop of 3D games "work" for my brain - everyone's pushing for more realism, more fluidity, etc etc, and it just drives things further and further into the uncanny valley for me, because realtime-rendered graphics still look terribly fake. Give me something glitchy and unrealistic in some way any day.
The cadence was a bit off/unnatural, but I'm sure that is not too hard to fix. Phone-in TV/Radio/web shows are about to get very interesting.
I think it primarily needs to learn to respect punctuation, and to translate it into a breathing pause that matches the target voice ("President giving a speech"-style long pauses vs. "politician having their ass handed to them by a journalist on TV" no-air-needed pauses).
Maybe it would be possible to train the system to prefer certain intonations in certain cases by rating the realism of the speech in context. It would be interesting to analyze pauses around words grouped by word2vec! Or to choose a "style" of intonation based on punctuation, parameters like words/minute, etc.
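A crude rule-based version of the punctuation-to-pause idea could look like this; the pause lengths and the "style" scaling factor are made-up illustrative values, not from any real TTS system:

```python
import re

# Hypothetical sketch: map punctuation to pause lengths, scaled by a
# per-speaker "style" factor (1.0 = measured presidential delivery,
# lower = rushed no-air-needed pacing). All numbers are invented.
PAUSE_MS = {",": 150, ";": 250, ":": 250, ".": 400, "?": 400, "!": 400}

def annotate_pauses(text, style=1.0):
    """Return (token, pause_ms) pairs for a hypothetical TTS front end."""
    out = []
    for token in re.findall(r"[\w']+|[,;:.?!]", text):
        if token in PAUSE_MS:
            if out:  # attach the pause to the preceding word
                word, _ = out[-1]
                out[-1] = (word, int(PAUSE_MS[token] * style))
        else:
            out.append((token, 0))
    return out

print(annotate_pauses("Well, we tried. It failed!", style=0.5))
# → [('Well', 75), ('we', 0), ('tried', 200), ('It', 0), ('failed', 200)]
```

A learned system would presumably absorb this from data rather than from a lookup table, but a rule layer like this is roughly what classic TTS front ends do.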
There is a lot of room for DSP/hacks/tricks to improve audio quality - just the same as in concatenative systems, but the point of this demo is to show what is possible with raw data + deep learning. Also note that this is (as far as I am aware) learned directly on real data such as youtube, or recordings + transcripts. That is quite a bit different than approaches which require commercial grade TTS databases, which are generally professional speakers with more than 10 hours of speech each, and cost a lot of money.
This is so good I'm wondering whether this is actually a massive (massive) sound bank. I think it might be a sound bank.
A random radio receiver site I found that doesn't require Flash, tuned to NOAA for Akron OH: http://tunein.com/radio/NOAA-Weather-Radio-1624-s88289/
The list I got the above link from (^F "noaa"): http://tunein.com/radio/Weather-c100001531/
I suspect the warbling I'm hearing is due not to TTS imperfections but to 64kbps artifacting.
BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)
I see. I can't seem to find an online receiver for that frequency, although I did find that WZ2500 uses or seems to have used that frequency (for Wytheville VA).
I had a look at SDR.hu (a site I may or may not have just dug out of Google for the first time), but unfortunately the RTL-SDR receivers I can find seem to focus entirely on 0-30MHz. There are a couple ~400MHz receivers but nothing for ~160MHz.
(I may have fired up the receiver I found in NH and fiddled with it, puzzled, for 10 minutes before realizing the scale is in kHz, not MHz... yay)
> Note that the stations have several different voices they use for different reports.
> Now that I think about it, I'm not sure which one I heard the mistakes on - it might have been one of the older ones.
That's entirely possible. (But hopefully not. I kind of want to hear. :P)
> BTW, I like the Tom voice more than the newer Paul. Paul is more realistic, but is also more soft-spoken and monotonic. Tom has more inflection and sounds more...forceful. I know it's just my imagination, but sometimes Tom sounds annoyed at bad weather :)
I just learned about this service, I have to admit (I'm in Australia). It sounds really nice to be able to have a computer continuously read out the weather conditions to you as they change. And I can completely relate to the idea of preferring the voice that sounds unimpressed when the weather's bad :D
I'm not surprised that they re-use the frequencies. These are local weather stations, only intended to serve a radius of a hundred miles or so (at least here on the east coast). In addition to my local station I can receive the one in Boston, about 50 miles south of me.
Also, here are some audio clips http://www.nws.noaa.gov/nwr/info/newvoice.html
The problem might be that high frequencies, especially overtones, aren't properly constructed, but I'm certain that can be improved.
You can synthesize someone's voice perfectly, but if it's stressing words incorrectly or not at all, it's not going to fool anyone.
Then again, that's probably easier to work around by having humans annotate the sentences to be read.
Or by starting with a recording of someone else reading the sentence. Then you get the research problem known as "voice conversion", which has been studied a fair amount, but mostly prior to the deep learning era - and mostly without the constraint of limited access to the target person's voice. (On the other hand, research often goes after 'hard' conversions like male-to-female, whereas if your goal is forgery, you can probably find someone with a similar voice to record the input.)
Anyway, here's an interesting thing from 2016, a contest to produce the best voice conversion algorithm, with 17 entrants:
Even then, I don't believe the issue is with stress. I believe the voices sound robotic because they are using very few samples ("less than a minute", they claim), and they admit it because it makes their results impressive in some sense. Triphones are usually what speech systems are trained on. The number of triphones (3-phoneme-grams) needed to cover a language's phonemic inventory is huge (50 phonemes = 50^3 = 125,000 possible triphones, which could mean a few hours of audio, although many will not occur within the language given its phonotactics).
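The combinatorics, spelled out (the 10-phonemes-per-second rate and the 10% coverage figure below are back-of-envelope guesses, just to get an order of magnitude):

```python
# Triphone coverage: with a 50-phoneme inventory there are 50^3
# possible 3-phoneme-grams, though phonotactics rules many out.
phonemes = 50
possible_triphones = phonemes ** 3
print(possible_triphones)  # → 125000

# Rough data requirement: hearing even one example each of ~10% of
# those triphones, at ~10 fresh triphones per second of speech
# (an optimistic guess), takes on the order of tens of minutes.
triphones_needed = possible_triphones * 0.10
seconds = triphones_needed / 10
print(f"~{seconds / 60:.0f} minutes of speech, best case")  # → ~21 minutes
```

Which is why a system that sounds this good from under a minute of audio can't be covering the triphone space directly; it has to be generalizing across speakers somehow.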
And fwiw, I think the intonations are actually impressively learned and not random. Trump's odd yet distinct intonations are captured quite well in their samples.
I can see something like this being used for propaganda, if it isn't already.
Tin Foil hat is on
Good job, team Lyrebird. My feedback is that while the inclusion of ethics page is great, it could do with more content on your vision and what you will not let your tech be used for. I know others can develop similar tech, but it will be good to read about YOUR ethics.
[Edited for clarity]
Judging by the samples from the homepage there are audible artifacts in the recordings resulting from synthesis. I doubt these would pass scrutiny if presented as evidence in court. In some ways forging a voice is like forging a signature, truth can be exposed with enough effort.
Not just that, but ethical expectations on the users, backed up by legal policy, would seem important for this.
1. Open source voice-copying software
2. At worst, create entire market of voice-fraudsters, at best, very few voice-fraudsters but very high and very real perception of fear of such
3. Become leading security experts in voice fraud detection
4. Sell software / time / services to intelligence agencies, governments, law enforcement, news networks
Ethically I'm a bit concerned with (2), but realistically the team is right --- this technology exists, it will certainly be used for good and for bad, and they're positioning themselves as the leading experts.
I'm interested to see which VCs and acquirers line up here. Applying a voice to any phrase seems useful for voice assistants (Amazon Alexa, Google Home) but I don't think that's the $B model.
I do not think the technology involves artificially generating a voice, though, but simply morphing someone's voice into sounding like the target voice.
I believe this technology would make a judge's life really hard.
This is something Google has been working a lot on, and Baidu also recently posted about their results too. We're definitely pretty close to passing the human-detectable level.
The human mind seems to be better at this than most creators credit it for.
Speech models usually use triphones, which requires a huge amount of audio to cover. That's what makes it particularly impressive how little data they need.
Google used their own datasets, which are most likely massive.
To me, at least, the voices I heard in Lyrebird's demo actually sounded more 'real' than Microsoft Sam for example.
The sooner we can get to a point where everybody knows stuff like this (voice impersonation) is possible, the sooner we can avoid real damages (of courts mis-judging with an impersonated voice recording as accepted evidence).
Yes, we lose an entire area of evidence that can be used in court (all voice recordings, possibly), but the tech was going to get here sooner or later and it was going to be a problem we'd have to deal with. I'd rather be at a place where everyone knows voice recordings are unreliable, than actually having harm done because of impersonated voices because people didn't think it was possible.
How we're going to fight against people believing whatever sound bites from fake news they want to believe is a harder question...
Speech synthesis is one of those 90% problems - when you're 90% done, you find you only have 90% left to do.
This level of synthesis is relatively easy. Getting to the 'Can reliably pass for the real thing" level is going to take a huge amount of extra work.
It's not even about computational power - it's about the sophistication of the models, and their ability to parse words into phonemes correctly with some knowledge of social and linguistic context.
"Good enough for some applications" - like phone switchboard systems - is a simpler problem. Virtual impersonation is very much harder.
For example, as an individual with hearing problems, I may not be so easily able to determine a synthesized recording from an actual recording - for a short period of time. With longer recordings it may become more obvious.
For the examples given for various intonations from Obama/Trump, some intonations are much more natural than others. It would be interesting to decide how to parametrize a sentence for the intended intonation. (based on word2vec analysis of the words in the sentence, punctuation cues in the sentence, and perhaps a specified category of "emotional delivery").
It would be interesting at the sentence-level, but also at the macro speech-level to include the right "mix" of intonations for a specific context. On a related note, it would be interesting to study the patterns of intonations in successful vs unsuccessful outbound sales calls, for example, to learn how to best simulate a good human sales voice.
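A sketch of what such a sentence-level parametrization might look like; every field name and delivery value here is invented for illustration, not taken from any real system:

```python
# Hypothetical sentence-level intonation parameter vector, combining
# punctuation cues with a coarse "emotional delivery" category.
DELIVERY = {"neutral": 0.0, "excited": 0.8, "somber": -0.5}  # invented values

def intonation_params(sentence, delivery="neutral"):
    words = sentence.rstrip(".?!").split()
    return {
        "is_question": sentence.rstrip().endswith("?"),   # rising contour cue
        "n_words": len(words),                            # pacing proxy
        "emphasis_bias": DELIVERY[delivery],              # shifts pitch range
        "comma_pauses": sentence.count(","),              # mid-sentence breaks
    }

print(intonation_params("Are we going to the moon?", delivery="excited"))
# → {'is_question': True, 'n_words': 6, 'emphasis_bias': 0.8, 'comma_pauses': 0}
```

The word2vec idea from above would slot in as extra fields here, e.g. an embedding average that hints at the topic's emotional register, with the mix of parameters across a whole speech chosen per context.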
There's a transcript here:
One of the ads:
EDIT: See the NPR Planet Money podcast linked elsewhere in the comments. Apparently there are fairly specific laws to protect someone's likeness -- including their voice -- thanks to the entertainment industry and Frank Sinatra.
I guess he had a strong case because in all three cases he was (or so he claimed) approached to do a commercial and the advertisers went with a Waits-like surrogate after he declined.
Trademark/Copyright can't be disclaimed in this way but Passing Off requires active deception AIUI?
You're right that being clear to avoid confusion is a good preventative measure, but you're wrong that intent to defraud is required. It's enough that the public is (or is likely to be) confused.
See Reckitt v Borden (1990) judgement referring to the three-part test for claims of passing off (my emphasis):
"Second, he must demonstrate a misrepresentation by the defendant to the public (whether or not intentional) leading or likely to lead the public to believe that goods or services offered by him are the goods or services of the plaintiff"
It might be worth noting Mickey Mouse is a registered trademark and so doesn't need to use the weaker Passing Off law?
But also enabling the next gen of "Mom, I'm in Mexican jail. Quickly wire me $2,000 so I can get out." scams.
Want a video of a politician saying "Hitler was right" to cheering masses? Want a video about a president saying it's time to start Nuclear War One?
You can make that.
No it hasn't, not at all.
What do you mean? it most certainly has.
It is much harder to look at a photograph now and say for sure it's real. Before photoshop, there were ways of manipulating photos, but they were much harder and did not yield nearly as good results. You can photoshop people into photos they were not originally in, doing things they have never done. Even Trump is having his hands enlarged in photos.
Maybe you're speaking specifically about "evidence" in the legal sense. I can't speak to that. But there are countless examples where you can't (and shouldn't) believe what you see, because it's not real.
Most doctored images/videos that go viral get debunked by the media very quickly.
I have a friend who believes in aliens. They showed me a video that someone had put together of "footage". They were seriously showing it to me as evidence, to convince me that they are real, based on this video. I was shocked they were so serious - and having a hard time containing my laughter.
I later found a number of other videos debunking the videos and clips that video was based off of - but my friend still believes in aliens, based on those videos. And this is someone I regularly have "intelligent" conversations with.
This change, the ability to put words in someone's mouth, is the next photoshop. And it WILL have consequences, good and bad. And we are not prepared for that, as a society - we haven't gotten over photoshop yet.
"Voice recordings are currently considered as strong pieces of evidence in our societies and in particular in jurisdictions of many countries. Our technology questions the validity of such evidence as it allows to easily manipulate audio recordings. This could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.
By releasing our technology publicly and making it available to anyone, we want to ensure that there will be no such risks. We hope that everyone will soon be aware that such technology exists and that copying the voice of someone else is possible. More generally, we want to raise attention about the lack of evidence that audio recordings may represent in the near future."
I'm glad the authors addressed this issue pretty forthrightly, but part of me wishes they'd written a bit more about exactly your point. Whether or not recorded speech will continue to be legally binding evidence, I think it's just as important to point out that many people are normally quite happy to take what they hear as solid evidence, especially when it aligns with their prejudices.
This is your first and only post, you have no submissions or favorites, and your account is 193 days old.
I'm very curious (and perplexed) as to why you have linked a video from elsewhere in this thread with no supporting context regarding its relevance other than "hf".
Like we can trust anything a politician says/has said today anyway!
Which would be great for people like Trump. "Trust me, that audio is fake news. Fake!"
Responding to multiple sibling comments.
A random relative; not so much.
If they really need a voice sample from a relative, they could still pose, with voice or not, as an insurance agent, HR, or the police, and acquire one by simple means if they have any idea of their victim's social graph.
If you thought fake news was bad before wait until these 'secret' recordings start getting released and reported on.
Alice broke up with Bob. Bob grabs the YouTube videos from Alice and makes video and voice profiles. Bob then posts a video of Alice saying how breaking up was her biggest mistake and how she misses <list of every sexual thing you can think of> because Bob does it all best.
That could end badly really easily.
Because the market appears to have spoken on that one and it said "meh, I don't care" with a solid shrug of indifference.
By the same logic, one can see artificially produced vocal performance combined with artificial overlaying of photorealistic 3D reproduction as a way to cost-effectively minimise performer and crew expenses and ensure the consistency of a performance. The results may even be better than what they could have done with the real performer, in the case of some attractive actors who are not very good at the acting part of being an actor/modern celebrity.
That said, I'll definitely miss the days when people had to actually be able to act, but then again I also miss the days when people had to actually be able to play an instrument and/or sing well if they wanted to be a famous musician.
Also note that NPSS has some amount of post-processing, at least reverb and perhaps other common musical mixing - we don't really know how these samples are generated, and from the paper alone I have a hard time deciphering exactly what inputs are required and what is generated. However, I really, really, really like NPSS - I just don't think the comparison you are making is valid here.
These features (f0, duration, pronunciation) are some of the most difficult things to learn to model from datasets of speech and text directly, and I am not sure how they got the subset used (I think only f0 and pronunciation/phoneme) for this NPSS model. Giving creators fine-grained control of the performance (as in NPSS) is quite cool, and if these systems can get fast enough I think the possibilities are really exciting. The same things could likely be done with lyrebird as well - there is no real "tech reason" you couldn't add more conditional inputs, with finer grained information/control.
The key part in my mind is deciding what amount of complexity to show to a user, and what amount to try and capture inside the model - some people may want to control (for example) duration and f0 directly for a performance, while others may want to just upload clips to an API and get reasonable results back, with less ability to control each sample (they can still curate themselves for the "best" samples). Lyrebird.ai is handling the latter case, while the former case would require quite a bit more intervention from the average user, almost becoming like an instrument ala the original voder . However, you could potentially have both approaches as a kind of beginner/advanced mode, but advanced mode needs a user interface, and probably near-realtime feedback.
I used to really strongly believe that the audio model was going to be the hard part of "neural" TTS (blame my background in DSP perhaps), but post-WaveNet the game has really changed a lot - conditional audio models are something we are starting to know how to do pretty well.
The text pipeline of most TTS systems is still the craziest part in my mind, check out a "normal" feature extraction of 416 hand-specified features! These extractions can be upwards of 1k features per timestep/frame, and generally require a lot of linguistic knowledge to specify for new languages. It seems (given Alex Graves' demo, char2wav, tacotron) that we are making progress on learning this information directly from text, which in my mind is a key breakthrough for TTS in languages besides English, where lots of work on English pronunciation has been done already and is generally available.
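For a flavor of what "hand-specified features" means in practice, here's a tiny invented subset; real front ends stack hundreds of such fields per frame, with much richer syllable/word/phrase context than shown here:

```python
# A made-up subset of classic TTS front-end linguistic features.
# Real systems specify hundreds of these per frame, each requiring
# language-specific linguistic knowledge.
def frontend_features(phoneme, prev, nxt, pos_in_word, word_len, stressed):
    return {
        "phoneme": phoneme,
        "prev_phoneme": prev,                      # triphone context
        "next_phoneme": nxt,
        "pos_in_word": pos_in_word / max(word_len, 1),  # relative position
        "lexical_stress": int(stressed),
        "is_vowel": phoneme in set("aeiou"),       # crude class feature
    }

feats = frontend_features("e", "h", "l", 1, 5, True)
print(feats)
# → {'phoneme': 'e', 'prev_phoneme': 'h', 'next_phoneme': 'l',
#    'pos_in_word': 0.2, 'lexical_stress': 1, 'is_vowel': True}
```

The appeal of char2wav/tacotron-style models is precisely that a learned encoder replaces this whole hand-built feature table, which is what makes porting to new languages so much cheaper.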
Reading changelogs before deployment never sounded better!
I'll tell you what will get complicated: copyright holders complaining that their product was used as input to the training algorithm and demanding a slice of any profits, because they made the famous individual more famous by casting them.
2. Is this better than what Google or Baidu are doing?
3. I remember reading Adobe has something similar.
4. Why is it (what happened) that all of a sudden we have four companies making voice-breakthrough tech like this?
5. What happens to voice acting? Consider places like Japan, where voice actors are highly valued. Is a voice even patentable?
2) Possibly. Google and Baidu have compute resources far beyond a university. This method might be better, and do well with more resources.
3) Adobe's method required far more input data. This apparently requires only 1 minute of your audio to start sounding like you.
4) Deep learning has revolutionized vision, and language processing for a while. It was just a matter of time before people started applying those methods on speech data, with similar surprising results.
5) It will be hard to capture human elements like "emotion" and tone via generative models. Maybe in the future the work will become sophisticated enough to be indistinguishable from human speech, but right now there are some telltale signs that it is artificially generated.
WaveNet certainly changed the game in many ways, but approaches to TTS using RNNs have different roots. WaveNet and friends (incl. DeepVoice and NPSS linked elsewhere in this thread) are largely focused on audio modeling, and generally use something closely related to the "classic" TTS pipeline for text in the frontend. The audio modeling results are stellar, and really blew me away personally - basically changing my perspective on what is possible in audio modeling overnight.
RNN models try to tackle the whole problem (text + audio modeling) at once, though currently (all?) RNN and attention-style models need intermediate/high-level hints or pretraining from things like vocoder representations or spectrograms, versus WaveNet's approach of using the waveform directly. So they are complementary in many ways, and I am sure we will see people trying to combine them soon - char2wav has this flavor by using SampleRNN, our lab's take on raw waveform generation; though we are still working on fully end-to-end, from-scratch training, the inference path is truly end-to-end. Though there are still many details to work out as far as output quality, it seems possible that this will be a productive approach (though I am quite biased).
We see similar directions in neural machine translation (NMT) moving from word level representations to word parts or characters directly - one of the big reasons deep learning has come so far, so fast is that a lot of techniques from other subfields can be utilized for new domains, and I think there is a lot more fertile ground for crossover in both directions.
Heiga Zen has a great overview talk about how speech synthesis, as a field, overlaps between different approaches and factorizations. His work on parametric synthesis and TTS generally has laid the foundation for a lot of recent advances, and he was also a co-author on WaveNet!
I don't think this tech knows how to act. However, it could be used to increase the range of a voice actor.
As if human voice imitators have not existed and could not be paid for prior to this. For $5 you can get Stewie Griffin or Barack Obama to say whatever you want them to say. Any audio-only message from a well-known figure should already be considered "compromised" and untrustworthy, even without the technology to impersonate them.
This should be more concerning for "normal people". It isn't that you can no longer trust an audio-only recording of Obama, but that you may no longer be certain an audio recording is from your best friend. (E: Once the technology improves a bit more, of course.)
How difficult is it to create/tune voices from parameters rather than training from an audio clip? I build software where people create fictional characters for writing, and having an author "create" voices for each character would be an amazing way to autogenerate audiobooks with their voices, or interact with those characters by voice, or just hear things written from their point of view in their voice for that extra immersion. Having an author upload voice clips of themselves mimicking what they think that character should sound like could work, but it would probably keep traces of their original voice (and feel "fake" to them because they can recognize their own voice), no?
Can't wait to see how this pans out. Signed up for the beta and will definitely be pushing it to its limits when it's ready. :)
I built a toy concatenative Donald Trump speech system , but I don't have an ML background. I've been taking Andrew Ng's online course in addition to Udacity's deep learning program in an attempt to learn the basics. I'm hoping I can use my dataset to build something backed by ML that sounds better.
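For anyone wondering what "concatenative" means at its simplest: look each unit up in a bank of prerecorded clips and splice them with a crossfade. A toy sketch using placeholder tones instead of real recordings (the word bank and clip lengths are invented):

```python
import numpy as np

# Toy concatenative "speech": splice prerecorded clips with a short
# crossfade. The clips here are placeholder sine tones, not recordings.
sr = 8000

def clip(freq, dur=0.3):
    t = np.arange(int(dur * sr)) / sr
    return np.sin(2 * np.pi * freq * t)

bank = {"big": clip(200), "league": clip(300), "tremendous": clip(400, 0.5)}

def concatenate(words, fade=0.02):
    n = int(fade * sr)                     # crossfade length in samples
    ramp = np.linspace(0, 1, n)
    out = bank[words[0]].copy()
    for w in words[1:]:
        nxt = bank[w].copy()
        # fade out the tail of `out` while fading in the head of `nxt`
        out[-n:] = out[-n:] * ramp[::-1] + nxt[:n] * ramp
        out = np.concatenate([out, nxt[n:]])
    return out

audio = concatenate(["big", "league", "tremendous"])
print(len(audio) / sr, "seconds")  # → 1.06 seconds
```

Real concatenative systems do this at the diphone/triphone level with pitch and duration adjustment, which is why the audible seams at unit boundaries are their signature artifact.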
Is anyone in the Atlanta area interested in ML? I'd love to chat over coffee or join local ML interest groups.
It's impressive for what it is, but a lot of people here seem way too excited. This isn't any kind of breakthrough, and only the shortest hand-picked snippet would fool anyone.
But, you can certainly see where this is going and that's the worrisome part.
However, the day some shill tries to sell me travel insurance in departed nana's voice will be the day I start signing my voice convos with a PGP key.
"Wife" sounds exactly the same in both places. All they did was copy the exact waveform from one point to another, like an automated cut and paste in Audacity. Nothing is being synthesized.
The word "Jordan" is not being synthesized. The speaker was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though it was synthesized on the fly. That's incredibly dishonest. This is a scripted performance and Jordan is feigning surprise.
Again, the phrase "three times" here was prerecorded.
This was a phony demonstration of a nonexistent product. Reporters parroted the claims and none questioned what they witnessed. Adobe falsely took credit and received endless free publicity for a breakthrough they had no hand in by staging this fake demo right on the heels of the genuine interest generated by Google WaveNet. They were hoping they'd have a real product ready before anyone else.
If Adobe had a real product then they'd have proven it with a demo as alarming, undeniable, and straightforward as Lyrebird's. Instead they relied on aesthetics and flashy, polished, deceptive performance art with famous actors and de facto applause tracks to cover up the fact that they have nothing.
But you may have a point, and the ethics section makes it clear that they are indeed very aware that this may be misused.
Plus, it doesn't even matter. You can write an article with fake quotes and people will believe it without even caring whether there is an accompanying sound bite or not.