Adobe Voco 'Photoshop-for-voice' causes concern (bbc.com)
48 points by ThomPete 330 days ago | 47 comments



> Dr Eddy Borges Rey - a lecturer in media and technology at the University of Stirling - was horrified by the development.

> "It seems that Adobe's programmers were swept along with the excitement of creating something as innovative as a voice manipulator, and ignored the ethical dilemmas brought up by its potential misuse," he told the BBC.

> "Inadvertently, in its quest to create software to manipulate digital media, Adobe has [already] drastically changed the way we engage with evidential material such as photographs.

> "This makes it hard for lawyers, journalists, and other professionals who use digital media as evidence.

I find this /reaction/ horrifying - Adobe isn't invalidating evidence so much as they are making the public infinitely more aware (as a sibling commenter points out) that evidence may already be compromised.

If Adobe doesn't do this, actors who care enough to forge evidence surely will, so better that the public trust in voice evidence collapse than people mistakenly continue to believe in it. :@


The article would be pretty boring as "cool new tech does stuff". But someone in the world can always be found to attack the short-sighted programmers of ${technology} for not seeing the moral implications! So great! Let's get that guy, and make the article about the conflict over this new _unreleased demo technology_ - which is really pretty benign, and perhaps beneficial in giving everyone a good sense of where the state of the art stands, so we can avoid being blindsided by malicious actors. And let's make the article about outrage instead.

This is, um, not my favorite journalistic device.

Edited


It's such a tired argument. Should we be horrified that humans decided to invent and evolve technologies that lead to anything possibly harmful?


Especially since any technology that's even remotely useful for something can be (and is) used for harmful things.


Exactly! This is one of those cases where "Well someone is gonna do it" applies. And they may as well also try to cement themselves into the workflow. It's a very strategic move.


https://www.youtube.com/watch?v=I3l4XLZ59iw&t=2m34s

"Wife" sounds exactly the same in both places, so all this did was copy the exact waveform from one point to another. Nothing is being synthesized. If this is all this app can do, it would be quicker and easier to do this manually.

That little "guh" noise at the beginning of the first "wife" could also be manually cleaned up and pitch/formant shifted to sound more natural with respect to its position in the sentence.

https://www.youtube.com/watch?v=I3l4XLZ59iw&t=3m54s

The word "Jordan" is not being synthesized. He was recorded saying "Jordan" beforehand for this insertion demo and they're trying to play it off as though this was synthesized on the fly. This is all a scripted performance. Jordan is phoning in his task of feigning surprise.

https://www.youtube.com/watch?v=I3l4XLZ59iw&t=4m40s

Here they double down on their lie. The phrase "three times" was clearly prerecorded.

If Adobe wanted to put the bare minimum effort into trying to convince anyone this was a real product that exists and is capable of synthesizing speech on the fly, then they'd toss a beachball around the audience and have them shout out words to type.

This is a fraudulent demonstration of a nonexistent product and an audacious insult to everyone's intelligence. Adobe is falsely taking credit and getting endless free publicity for a breakthrough they had no hand in. They are stealing the hype recently generated by Google WaveNet and praying they'll have a real product ready by whatever deadline they've set for themselves.


Additionally, they are likely trying to capitalise on the fact that everyone is calling this product (real or not) "Photoshop for voice", in order to keep the spotlight on their brand.


Is this common in the software industry...?

It is getting more tempting to join an industry where there are no customers (prop trading).


Releasing half-assed trivial tech demos with a shit-ton of hype-generating marketing? I'd say it's the bread and butter of the industry - startups and big companies alike. It's one reason I'm really starting to hate programming as a job.

People say that a programmer should focus on meeting business needs, not on typing out code. But if business needs are this level of bullshit, it's really hard to focus on them and retain even a modicum of self-respect as an engineer.


If this technology is being made available to consumers now, that means it's existed before now, probably for years, outside of the public's awareness.

Makes you think. What's been faked in the last year or two that we didn't know could be faked?


What makes you think it existed? I don't buy this assumption that the military is 10+ years ahead on everything. They just don't have the budget, nor can they sequester everything of interest that companies produce. Consider that Apple spends $10bn on R&D yearly, and they've been averaging one distinctly new product every ~7 years, plus refinements. And while I don't doubt the military has some amazing/terrifying tech hidden away, it seems unrealistic to assume it has that many commercial-grade products developed ahead of the private sector.


Growing up, the first thing my friends did when they got audio software on a PC was edit recordings of people to make them sound stupid. It was basic stuff: swapping answers from different questions.

Years later, I edited dialog from a movie (for my own use) to shorten or rearrange sentences in a way that very few people can detect, to remove profanity from clips I wanted to show to people who can't handle it.

Also, AIUI Hollywood has been splicing dialog for decades in ADR. Star Trek TNG had an episode in which a vocal resynthesis device played a role, so the idea is old.


Because it's blindingly obvious this has been within the technological capability of mildly sophisticated people for the last 10+ years? It may not have had an obvious business use, which is why it's late to the market compared to image and video manipulation techniques. But it's not as if people discovered signal processing last year.


> Think Apple spends $10bn on R&D yearly

And the DoD's R&D budget for 2016 was $71.9 billion, plus whatever chunk of the classified intelligence budget of $80 billion went to R&D.

So roughly an order of magnitude more than Apple, each and every year.


We have had Photoshop for decades, and audio tools for decades. This really isn't news... there is nothing really new here other than that Adobe is building audio manipulation tools.


Maybe people concerned about having their voice manipulated can carry a device that generates an acoustic timecode whenever they speak, or at least some kind of non-repeating tone generator. Supposedly the 60Hz or 50Hz hum in recordings can be correlated with variations in the electrical grid frequency. It would make sense to go further and deliberately tag live audio with a timestamp to ensure continuity.

I can think of a few ways of faking a timecode like that, but they can be counteracted to an extent.
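
For flavor, a toy version of that timestamp beacon might look like this in Python - the bit rate, carrier frequencies, and lack of error correction are all hand-waving on my part, not any real scheme:

    import time
    import numpy as np

    RATE = 44100           # sample rate, Hz
    BIT_DUR = 0.05         # seconds per encoded bit
    F0, F1 = 17000, 18000  # tones for bits 0 and 1 (near the top of hearing)

    def timestamp_tones(ts, bits=32):
        """Render the low `bits` bits of a Unix timestamp as an FSK tone burst."""
        t = np.arange(int(RATE * BIT_DUR)) / RATE
        chunks = []
        for i in range(bits - 1, -1, -1):
            f = F1 if (ts >> i) & 1 else F0
            chunks.append(0.1 * np.sin(2 * np.pi * f * t))  # quiet, constant tone
        return np.concatenate(chunks)

    beacon = timestamp_tones(int(time.time()))  # mix this under the live audio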


It would be trivial to filter out any kind of watermarking. Watermarks get photoshopped out of images all the time.

On the other hand, if there were some way to add a watermark that was also a cryptographic signature, you could actually detect any editing.

I have no idea how you would implement such a signature system though.


Cryptographically sign the image sans the least significant bits, then store the signature in those bits:

http://www.lia.deis.unibo.it/Courses/RetiDiCalcolatori/Proge...

Shouldn't be too noticeable.

A similar approach should work for audio too.
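
Here's roughly what that looks like for 16-bit audio samples, using Ed25519 from Python's `cryptography` package (the layout of the signature in the LSBs is improvised; a real container would be more careful):

    import numpy as np
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def sign_into_lsbs(samples, key):
        """Embed an Ed25519 signature of the sample MSBs into the sample LSBs."""
        cleared = samples & ~1                  # zero every least significant bit
        sig = key.sign(cleared.tobytes())       # 64-byte signature over the rest
        bits = np.unpackbits(np.frombuffer(sig, dtype=np.uint8))
        out = cleared.copy()
        out[:bits.size] |= bits.astype(samples.dtype)  # signature -> first LSBs
        return out

    def verify_lsbs(samples, pub):
        sig = np.packbits((samples[:512] & 1).astype(np.uint8)).tobytes()
        try:
            pub.verify(sig, (samples & ~1).tobytes())
            return True
        except InvalidSignature:
            return False

    key = Ed25519PrivateKey.generate()
    audio = (np.random.randn(44100) * 3000).astype(np.int16)  # fake recording
    tagged = sign_into_lsbs(audio, key)
    assert verify_lsbs(tagged, key.public_key())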


Interesting. While I think this would make it harder to edit someone's voice, if you're clever enough to design software to read the timecode, you're presumably also able to remove it from the spectrum and insert your own, with similar room ambience.


Not that it's a panacea, but it could be signed with a private key to make it easy to validate whether a timecode is fake or not. You could include crude features - say, cepstral coefficients - in with the timecode as well, to prevent the timecode from being copy-pasted elsewhere.
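
Something like this, maybe (Python sketch; the frame size, coefficient count, and token format are all arbitrary choices of mine):

    import time
    import numpy as np
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def crude_cepstrum(frame, n_coeffs=12):
        """Real cepstrum: inverse FFT of the log magnitude spectrum, first bins."""
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-9   # avoid log(0)
        return np.fft.irfft(np.log(spectrum))[:n_coeffs]

    def signed_timecode(frame, key):
        """Token binding wall-clock time to this frame's rough spectral shape."""
        payload = np.concatenate(([time.time()], crude_cepstrum(frame)))
        blob = payload.astype(np.float32).tobytes()
        return blob + key.sign(blob)   # payload followed by 64-byte signature

    key = Ed25519PrivateKey.generate()
    frame = np.random.randn(2048)        # stand-in for one short frame of audio
    token = signed_timecode(frame, key)  # embed/transmit alongside the audio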


>Supposedly the 60Hz or 50Hz hum in recordings can be correlated with variations in the electrical grid frequency.

That's pretty cool. I wonder if this has ever been used to track down someone.


Do this by autotuning everything I say and I'm sold.


"It seems that Adobe's programmers were swept along with the excitement of creating something as innovative as a voice manipulator, and ignored the ethical dilemmas brought up by its potential misuse," he told the BBC."

Is there a name for the fallacy of giving the first person to market all the credit for an idea whose time has, essentially, come, with or without them?

Surely this is an idea obvious enough that, had Adobe not done (and publicized!) this, someone else would?

Edit: sense


I don't really understand the concern. Digital media has always been edited and manipulated. I was actually surprised to see so much news about it - professional video/audio productions edit speech all the time. I'm sure with a lot of dedication any amateur can already collect enough unique sounds and snippets from prominent people to fake sentences. It's just easier now.


> I'm sure with a lot of dedication any amateur can already collect enough unique sounds and snippets from prominent people to fake sentences

https://www.youtube.com/watch?v=hX1YVzdnpEc




About 17 years ago I saw a short piece on the old Headline News featuring a professor who had developed something like this. They played a recording of Whoopi Goldberg saying something made up. I never heard about it again and assumed it was sucked into some black program.


:D :D :D :D. That's about the only thing I can say about this.

I mean, seriously. It's been obvious for like 20 years now that stuff like this was going to be possible pretty soon. That someone finally packaged this capability in a nice form factor doesn't change much. But then again, I guess people still haven't internalized the fact that photographs are like 15 years past being a reliable source, and video is probably 10 years past it too.

It's interesting how society will have to adapt to function with such technological capabilities, but - like others here pointed out - it's nothing really new, so I don't get the surprise and concern.


When I was in grad school an Adobe representative presented at a colloquium, showing us their new-at-the-time content aware photo editing (think removing a person from a photo and filling in the trees/buildings behind them).

He mentioned a project based on predicting marketing/political campaign reactions in things like social media. My hand immediately went up and I asked him what they thought about the ethical implications, and how it could be protected from abuse. "We aren't really thinking about those kinds of things." I'm not surprised to see this line up with the quote in tekacs' reply.

I don't trust Adobe, and while this is certainly really neat-o tech, I just don't see its benefits outweighing the huge impact its abuse could precipitate.


First-hand anecdotes like that are some of my favorite HN things - thanks! But do you think we as technologists should be thinking about things like that? That feels icky to me. Maybe I'm wrong.


Technologists are people, too!

I think we should definitely be thinking about those things. We have some pretty sharp technologists to point to who thought about the ethics of what they were doing as well. Those working on the Manhattan Project and the German equivalent[1] come easily to mind.

[1]: http://germanhistorydocs.ghi-dc.org/pdf/eng/English101.pdf


Intelligence agencies would be interested in this but they probably have something like it already.

Specifically, this is relevant to the ZRTP protocol. There is a voice-based verification step which relies on knowing and recognizing the other party's voice and then verifying a short authentication string they see on their screen. Being able to mimic a voice can lead to the ability to conduct a man-in-the-middle attack.

To counteract that, the strategy is to use dictionary words instead of just numbers for verification - say "pink salad elephant" instead of "1934" - so the parties might then joke or say something referring to the ridiculous word combination. That would be harder to mimic.
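
For illustration, rendering the SAS bytes as words is just a table lookup (toy Python; my word list is a stand-in - actual ZRTP implementations use standardized renderings such as base-32 or the PGP word list):

    WORDS = ["pink", "salad", "elephant", "piano", "lamp", "kettle", "mango",
             "tunnel", "violet", "walrus", "comet", "harbor", "nickel", "otter",
             "pillow", "quartz"]  # 16 entries -> one word per 4 bits

    def sas_words(sas, n_words=3):
        """Map the leading nibbles of the SAS bytes onto spoken words."""
        nibbles = []
        for b in sas:
            nibbles += [b >> 4, b & 0xF]
        return " ".join(WORDS[n] for n in nibbles[:n_words])

    print(sas_words(b"\x1a\x2f"))  # -> "salad comet elephant"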


This is literally the plot of Sneakers, the 1992 hacking movie. I would hope voice signature systems are beyond "Hi, my name is Werner Brandes. My voice is my passport. Verify Me."


Ha, good point. I forgot about Sneakers.

This is a bit different though. The idea is that both parties should see the same authentication string and then verify verbally what that is.

Say they see "123", so Alice says "I have 123 as my SAS code, what do you have?" and Bob says "Yep, I have 123 as well". If Eve is in the middle, she would show "123" to Alice but maybe "456" to Bob. She would have to fake Alice's voice and tell Bob the code is "456" and it's OK, then tell Alice in Bob's voice that the code is "123" and it's OK - all in such a way that Alice and Bob don't suspect anything (and it has to be realtime, too).

So this kind of software might make that easier.

The way to bypass that is to use dictionary words instead of numbers, but then also refer to them later in the conversation (see the silly example in a sister comment).


...and now I realize that the voice print part of Uplink was a reference to a movie.


This software is specifically mentioned to work at the phoneme level, not the word level. So it shouldn't be any harder to mimic words that weren't in the original corpus, as long as the phonemes are known.
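
A phoneme-level splicer really is about this simple in outline (toy Python; the phoneme labels and random "recordings" are placeholders for units cut from the target's actual speech):

    import numpy as np

    RATE = 16000
    XFADE = int(0.01 * RATE)   # 10 ms crossfade to hide the joins

    def splice(units):
        """Concatenate waveform units with a linear crossfade at each join."""
        out = units[0]
        ramp = np.linspace(0.0, 1.0, XFADE)
        for u in units[1:]:
            out = np.concatenate([
                out[:-XFADE],
                out[-XFADE:] * (1 - ramp) + u[:XFADE] * ramp,  # overlap the join
                u[XFADE:],
            ])
        return out

    # placeholder "recordings"; in reality these come from the target's speech
    corpus = {p: np.random.randn(RATE // 10) for p in ["JH", "AO", "R", "D", "AH", "N"]}
    jordan = splice([corpus[p] for p in ["JH", "AO", "R", "D", "AH", "N"]])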


The idea with the man-in-the-middle attack there is that the words would be different. Say Alice and Bob want to talk, but Eve has MitM-ed them, as she always does.

Alice sees "pink elephant salad" and Bob sees "piano lamps". They say the words; Eve has to reply in Alice's voice to Bob with "piano lamps" and then respond to Alice with "pink elephant salad".

However, Alice then makes a joke wondering if elephants like to eat pink salad. Bob will be confused, since perhaps he wanted to remark on how few people play pianos these days. So Eve has a somewhat harder job than if it were just "1234" vs "5678" being said.


I assume that the words "three times" must have appeared at some point during the long speech, correct? So it's not quite generating new sounds, but intelligently rearranging them?


I know nothing about this product, but based on its name and what it's doing, it's probably a fine-grained vocoder tuned with machine learning. It's not quite trivial, but it's an obvious idea. We've been building analog and digital vocoders for decades (e.g. the Cylons' voices in the original Battlestar Galactica from the 70s). It takes some hefty processing to do it with lots of bands, as they're probably doing here, but you could probably do it with a desktop machine and a good GPU.
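
For the curious, a bare-bones channel vocoder along those lines fits in a page of Python - the band count, filter orders, and cutoffs below are my guesses, not anything Adobe has described:

    import numpy as np
    from scipy.signal import butter, sosfilt

    RATE = 16000
    EDGES = np.geomspace(100, 7000, 33)   # 32 log-spaced analysis bands

    def band(sig, lo, hi):
        sos = butter(4, [lo, hi], btype="bandpass", fs=RATE, output="sos")
        return sosfilt(sos, sig)

    def envelope(sig):
        sos = butter(2, 50, btype="lowpass", fs=RATE, output="sos")
        return np.maximum(sosfilt(sos, np.abs(sig)), 0)   # smoothed |signal|

    def vocode(modulator, carrier):
        """Impose the modulator's per-band envelopes on the carrier's bands."""
        out = np.zeros_like(carrier)
        for lo, hi in zip(EDGES[:-1], EDGES[1:]):
            out += band(carrier, lo, hi) * envelope(band(modulator, lo, hi))
        return out / np.max(np.abs(out))

    voice = np.random.randn(RATE)   # stand-in for one second of speech
    buzz = np.sign(np.sin(2 * np.pi * 110 * np.arange(RATE) / RATE))  # carrier
    robot = vocode(voice, buzz)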


So does this mean Disney will be able to keep making movies with Darth Vader's "real" voice until the heat death of the universe?


I hear Disney's lawyers are currently working on extending copyright beyond that. Steamboat Willie is expected to initiate the next Big Bang.


Welcome to the Bin Laden tapes of years ago. Remember, the NSA/CIA have tech 10 years (they used to say 20) beyond the public.

Also, I'm pretty sure similar tech is already being used in TV production. For example, I can't be the only one who sees all the CGI in the latest BBC Planet Earth, can I?


This is amazing! Audio stuff has always scared me. I want to know more details though. Does it transcribe automatically? Is there some learning involved? Will it fill in background noises?


Can't wait to noodle around with this, amazing creative potential.


I'm curious about the applicability this has for fandubs.


[flagged]


I'm so excited for this technology!

I threw together a really poor Donald Trump text to speech web app written in Rust [1], but had no idea he'd win. Now I'm interested in learning much more academic means of concatenative synthesis, or even the new Neural Network based approaches such as Google's WaveNet [2].

I've already bought and read large parts of Taylor's Text to Speech Synthesis [3], but I'd be interested in review papers on concatenative synthesis (normalizing and smoothing!), methods of building and automatically curating a large database of n-phones, methods of changing pitch and intonation, annotating text with intonation cues, neural network techniques for speech, etc. Does anybody have knowledge of good sources?

[1] http://jungle.horse

[2] https://deepmind.com/blog/wavenet-generative-model-raw-audio...

[3] Taylor, Text-to-Speech Synthesis, Cambridge University Press, ISBN 978-0521899277
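
To make "normalizing and smoothing" concrete, the simplest version is loudness-matching the units and crossfading the joins - a toy Python sketch with an arbitrary overlap length; real systems also align pitch marks and match spectra at the joins:

    import numpy as np

    def rms_normalize(unit, target_rms=0.1):
        """Scale a unit so all units sit at roughly the same loudness."""
        rms = np.sqrt(np.mean(unit ** 2)) + 1e-12
        return unit * (target_rms / rms)

    def join(a, b, overlap=256):
        """Loudness-match two units, then crossfade them over `overlap` samples."""
        a, b = rms_normalize(a), rms_normalize(b)
        win = np.hanning(2 * overlap)          # raised-cosine fade out / fade in
        mixed = a[-overlap:] * win[overlap:] + b[:overlap] * win[:overlap]
        return np.concatenate([a[:-overlap], mixed, b[overlap:]])

    a, b = np.random.randn(4000), 5 * np.random.randn(4000)  # mismatched units
    out = join(a, b)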



