> "It seems that Adobe's programmers were swept along with the excitement of creating something as innovative as a voice manipulator, and ignored the ethical dilemmas brought up by its potential misuse," he told the BBC.
> "Inadvertently, in its quest to create software to manipulate digital media, Adobe has [already] drastically changed the way we engage with evidential material such as photographs.
> "This makes it hard for lawyers, journalists, and other professionals who use digital media as evidence.
I find this /reaction/ horrifying - Adobe isn't invalidating evidence so much as they are making the public infinitely more aware (as a sibling commenter points out) that evidence may already be compromised.
If Adobe doesn't do this, actors who care enough to forge evidence surely will, so better that public trust in voice evidence collapse now than that people mistakenly continue to believe in it.
This is, um, not my favorite journalistic device.
"Wife" sounds exactly the same in both places, so all this did was copy the exact waveform from one point to another. Nothing is being synthesized. If this is all this app can do, it would be quicker and easier to do this manually.
That little "guh" noise at the beginning of the first "wife" could also be manually cleaned up and pitch/formant shifted to sound more natural with respect to its position in the sentence.
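For illustration, the "copy the waveform from one point to another" edit described above can be sketched in a few lines (pure Python, toy sample data; the `fade` length and all values here are made up). A short linear crossfade at each paste boundary is what keeps the splice from clicking:

```python
def splice(samples, src_start, src_end, dst, fade=32):
    """Copy samples[src_start:src_end] over the audio at dst,
    linearly crossfading `fade` samples at each boundary to avoid clicks."""
    clip = samples[src_start:src_end]
    out = list(samples)
    for i, s in enumerate(clip):
        j = dst + i
        if j >= len(out):
            break
        if i < fade:                      # fade in from the old audio
            t = i / fade
            out[j] = out[j] * (1 - t) + s * t
        elif i >= len(clip) - fade:       # fade out back to the old audio
            t = (len(clip) - 1 - i) / fade
            out[j] = out[j] * (1 - t) + s * t
        else:
            out[j] = s
    return out

# Toy example: copy the "word" occupying samples 100..200 over position 400.
audio = [0.0] * 1000
audio[100:200] = [1.0] * 100
edited = splice(audio, 100, 200, 400)
```

This is exactly the kind of edit any audio editor can do by hand, which is the commenter's point: nothing here is synthesis.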
The word "Jordan" is not being synthesized. He was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though this was synthesized on the fly. This is all a scripted performance. Jordan is phoning in his task of feigning surprise.
Here they double down on their lie. The phrase "three times" was clearly prerecorded.
If Adobe wanted to put the bare minimum effort into trying to convince anyone this was a real product that exists and is capable of synthesizing speech on the fly, then they'd toss a beachball around the audience and have them shout out words to type.
This is a fraudulent demonstration of a nonexistent product and an audacious insult to everyone's intelligence. Adobe is falsely taking credit and getting endless free publicity for a breakthrough they had no hand in. They are stealing the hype recently generated by Google WaveNet and praying they'll have a real product ready by whatever deadline they've set for themselves.
It is getting more tempting to join an industry where there are no customers (prop trading).
People say that a programmer should focus on meeting business needs, not on typing out code. But if business needs are this level of bullshit, it's really hard to focus on them and retain even a modicum of self-respect as an engineer.
Makes you think. What's been faked in the last year or two that we didn't know could be faked?
Years later, I edited dialog from a movie (for my own use) to shorten or rearrange sentences in ways very few people can detect, and to remove profanity from clips I wanted to show to people who can't handle it.
Also, AIUI Hollywood has been splicing dialog for decades in ADR. Star Trek TNG had an episode in which a vocal resynthesis device played a role, so the idea is old.
And the DoD's R&D budget for 2016 was $71.9 billion, plus whatever chunk of the classified intelligence budget of $80 billion went to R&D.
So roughly an order of magnitude more than Apple, each and every year.
I can think of a few ways of faking a timecode like that, but they can be counteracted to an extent.
On the other hand, if there were some way to add a watermark that was also a cryptographic signature, you could at least detect editing.
I have no idea how you would implement such a signature system though.
Shouldn't be too noticeable.
A similar approach should work for audio too.
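One naive way to sketch the integrity-check half of this idea (not a true watermark, which would embed the data in the audio itself; the key name and chunking here are hypothetical) is to MAC fixed-size chunks of the recording, chained together with their indices, so that editing, dropping, or reordering any chunk invalidates everything downstream:

```python
import hashlib
import hmac

KEY = b"device-secret"  # hypothetical per-device signing key

def sign_recording(chunks, key=KEY):
    """Return one MAC per chunk, chained so chunks can't be
    edited, reordered, or dropped without detection."""
    macs, prev = [], b""
    for i, chunk in enumerate(chunks):
        msg = prev + i.to_bytes(4, "big") + chunk
        prev = hmac.new(key, msg, hashlib.sha256).digest()
        macs.append(prev)
    return macs

def verify_recording(chunks, macs, key=KEY):
    return macs == sign_recording(chunks, key)

audio = [b"chunk0", b"chunk1", b"chunk2"]
tags = sign_recording(audio)
ok = verify_recording(audio, tags)                                  # True
tampered = verify_recording([b"chunk0", b"EDIT", b"chunk2"], tags)  # False
```

An HMAC requires the verifier to hold the secret key; a real system would use asymmetric signatures (e.g. Ed25519) so anyone can verify without being able to forge, and would sign a timestamp along with each chunk.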
That's pretty cool. I wonder if this has ever been used to track down someone.
Is there a name for the fallacy of giving the first person to market all the credit for an idea whose time has, essentially, come, with or without them?
Surely this is an idea obvious enough that, had Adobe not done (and publicized!) this, someone else would?
I mean, seriously. It's been obvious for like 20 years now that stuff like this is going to be possible pretty soon. That someone finally packaged this capability in a nice form-factor doesn't change much. But then again, I guess people still didn't internalize the fact that photographs are like 15 years past being a reliable source, and videos probably 10 years past it too.
It's interesting how society will have to adapt to function with such technological capabilities, but - like others here pointed out - it's nothing really new, so I don't get the surprised concern.
He mentioned a project based on predicting marketing/political campaign reactions in things like social media. My hand immediately went up and I asked him what they thought about the ethical implications, and how it could be protected from abuse. "We aren't really thinking about those kinds of things." I'm not surprised to see this line up with the quote in tekacs reply.
I don't trust Adobe, and while this is certainly really neat-o tech, I just don't see its benefits outweighing the huge impact its abuse could precipitate.
I think we should definitely be thinking about those things. We have some pretty sharp technologists to point to who thought about the ethics of what they were doing as well. Those working on the Manhattan Project and its German equivalent come easily to mind.
Specifically, this is relevant for the ZRTP protocol. There is a voice-based verification step which relies on knowing and recognizing the other party's voice and then verbally verifying a short authentication string they see on their screen. Being able to mimic a voice opens the door to a man-in-the-middle attack.
To counteract this, the strategy is to use dictionary words instead of just numbers for verification, say "pink salad elephant" instead of "1934". The parties might then make a joke or say something referring to the ridiculous word combination, which would be harder to mimic.
This is a bit different though. The idea is that both parties should see the same authentication string and then verify verbally what that is.
Say they see "123", so Alice says "I have 123 as my SAS code, what do you have?" and Bob says "Yep, I have 123 as well." If Eve is in the middle, she would show "123" to Alice but maybe "456" to Bob. She would have to fake Alice's voice and tell Bob the code is "456" and it's fine, then tell Alice in Bob's voice that the code is "123" and it's fine, all in such a way that Alice and Bob don't suspect anything (and it has to happen in real time, too).
So this kind of software might make that easier.
The way to bypass that is to use dictionary words instead of numbers, and then also refer to them later in the conversation (see the silly example in a sister comment).
Alice sees "pink elephant salad" and Bob sees "piano lamps". They say the words; Eve has to reply to Bob in Alice's voice with "piano lamps" and then respond to Alice in Bob's voice with "pink elephant salad".
However, Alice then makes a joke wondering whether elephants like to eat pink salad. Bob will be confused, since he might have wanted to remark on how few people play pianos these days. So Eve has a harder job than if it were just "1234" versus "5678" being said.
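The word-based SAS described above can be sketched by mapping bytes of the session's shared secret hash to entries in a small dictionary. The word list below is a made-up stand-in (ZRTP actually specifies the PGP word lists), and the key material is illustrative:

```python
import hashlib

# Hypothetical 256-entry word list; ZRTP uses the PGP word lists instead.
WORDS = [f"word{i:03d}" for i in range(256)]
WORDS[17], WORDS[42], WORDS[99] = "pink", "elephant", "salad"

def sas_words(shared_secret: bytes, n=3):
    """Derive n human-readable SAS words from the session's shared secret."""
    digest = hashlib.sha256(shared_secret).digest()
    return [WORDS[b] for b in digest[:n]]

alice = sas_words(b"session-key-material")
bob = sas_words(b"session-key-material")
# Same secret -> same words. A MITM terminates two separate sessions with
# different keys, so Eve's two sides almost certainly show different words.
eve_side = sas_words(b"different-key-material")
```

The security argument is that Eve cannot force both of her sessions to derive the same words, so she is stuck voicing the "wrong" words to each side, which is exactly where voice mimicry would come in.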
Also, I'm pretty sure similar tech is already being used in TV production. For example, I can't be the only one who sees all the CGI in the latest BBC Planet Earth, can I?
I threw together a really poor Donald Trump text-to-speech web app written in Rust, but had no idea he'd win. Now I'm interested in learning more academic approaches to concatenative synthesis, or even the new neural-network-based approaches such as Google's WaveNet.
I've already bought and read large parts of Taylor's Text-to-Speech Synthesis, but I'd be interested in review papers on concatenative synthesis (normalizing and smoothing!), methods of building and automatically curating a large database of n-phones, methods of changing pitch and intonation, annotating text with intonation cues, neural network techniques for speech, etc. Does anybody know of good sources?
Taylor, Paul. Text-to-Speech Synthesis. Cambridge University Press, ISBN 978-0521899277.
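On the "changing pitch" point: the crudest method is resampling by linear interpolation, which shifts pitch but also changes duration (real systems use something like PSOLA or a phase vocoder to decouple the two). A minimal sketch with a synthetic tone, all parameters illustrative:

```python
import math

def resample(samples, factor):
    """Resample by linear interpolation. factor > 1 raises pitch
    (and shortens the clip); factor < 1 lowers it."""
    n = int(len(samples) / factor)
    out = []
    for i in range(n):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Linear blend between the two nearest input samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# One second of a 440 Hz sine at 16 kHz; resampling by 2 plays it
# back at ~880 Hz (an octave up) in half the duration.
sr = 16000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
octave_up = resample(tone, 2.0)
```

This duration side effect is precisely why concatenative systems need the smarter pitch-modification techniques the comment asks about.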