
Faux Rogan - mgdo
http://fakejoerogan.com/
======
sho_hn
My reliable tells for the Faux Rogan were a certain amount of noise/distortion
to the audio, and the real Rogan doing more complex things with
intonation/prosody. The latter requires a direct comparison, and the former
must be fixable with filtering? That makes it pretty close to good enough for
an original reading.

~~~
codetrotter
Same here. Have only watched one or two episodes of his podcast in the past,
so was basing my judgement on what sounded natural in general rather than any
specific recollection of what he is supposed to sound like.

I voted on each clip without listening to other clips first. I misidentified
two that were real as being fake. Wonder if those might have been clips where
he was reading from a script with a flatter voice than usual like another top-
level comment ITT talks about.

If audio of this quality was being presented in a context where I didn’t know
it might be fake, I probably would have chalked it up to issues with the
recording mainly, and would have believed that the audio was real.

~~~
sho_hn
An interesting question (ignoring how you need Real Rogan to train Fake one
for a moment) is whether the Fake Rogan could be as popular, or if the extra
personality the real voice has is vital to making him a popular speaker.

I suspect that question gets at things like gaining the tools to transpose
speaker affect independent from voice characteristics etc. Imagine a remix
culture where a future popular podcast voice is a mashup of different older
popular speakers and things like that ...

~~~
soulofmischief
Reminds me of Aldous Huxley's _Brave New World_ when the characters attend a
synthetic orchestra. The book goes into vivid detail about the level of
craftsmanship achievable by a machine composer and orchestra, yet somehow it
all seems so shallow, like the soul of it has long gone.

------
petrochukm
For context, it's important to know that these are probably cherry picked
samples. The authors make no mention of attempting to randomly select these
samples. For as long as text-to-speech has existed, there have been impressive
demos backed by cherry picking.

The 3 Dessa team members probably did not create anything innovative in 3
months of work. Rayhane Mamah, one of the Dessa team members, had previously
published a Tacotron 2 (Google's 2017 research) implementation
([https://github.com/Rayhane-mamah/Tacotron-2](https://github.com/Rayhane-mamah/Tacotron-2))
that has noise/distortion and intonation/prosody issues similar to those of
their "RealTalk model".

Following on the above, Google's TTS research already demonstrated human-
parity as measured by MOS score in early 2018. That research was deployed as
Google Duplex in mid 2018.

Google's TTS research also showed the deficiencies of this technology. Without
the invention of AGI, the TTS models do not understand the underlying text;
therefore, they'll be unable to do more "complex things with
intonation/prosody". Furthermore, the models suffer from overfitting. The
model performance degrades significantly when performing TTS on text not
typically seen in the training data.

------
nwallin
I went 6/8.

However, I call shenanigans. Some of the real clips were him reading from a
script for the advertisements at the beginning of his show. Joe has a _really_
flat/unnatural delivery when he's reading from a piece of paper.

In addition, if they were including the advertisement reading in the training
material, that would definitely mess up the final model. Advertisement Joe
doesn't sound like Joe.

~~~
wallace_f
I got them all correct easily. You can hear the emotion in the real voice.

Someone should test this on criminal psychopaths and see if they are
significantly better or worse at it than the general pop. I would bet there
will be two major populations of psychopaths, one solidly in each category.

~~~
hn_throwaway_99
Uh oh, I got nearly all of them wrong. I hope I'm not a psychopath :(

Somewhat in my defense, although I've seen a few clips with Rogan, I'm not
very familiar with him. I agree with the GP's comment, I thought the real Joe
Rogan reading ads sounded very staccato and robotic. Near the end I got a
little better with the "OK, robotic sounding Joe is the real one."

That said, I'm an introvert and I don't enjoy social situations with people I
don't know, so while perhaps not a psychopath it's plausible that my poor
ability to infer emotional intonation is at play both here and in social
situations.

~~~
ydb
I only missed 1 in 8, and I attribute that to a bit too much wine on a
Saturday night. I heard "farm fresh" as "farm fish" and thought I was a genius
for catching an algorithm error...

However, as I mentioned in this comment[1] I don't think you have much to
worry about. The correct criteria for diagnosing fake vs real is actually _not
being robotic_. I knew immediately what was real because there was emphasis
and affectation, as opposed to the monotone counterpart.

Finally, a minor quibble: "psychopathy" is not about an ability to infer
intonation (quite the opposite). I'm an introvert like you (Hell, I'm probably
considered a hermit) and I was easily able to identify the doppelganger.

[1]
[https://news.ycombinator.com/item?id=19951679](https://news.ycombinator.com/item?id=19951679)

~~~
ramblerman
> Finally, a minor quibble: "psychopathy" is not about an ability to infer
> intonation (quite the opposite). I'm an introvert like you (Hell, I'm
> probably considered a hermit) and I was easily able to identify the
> doppelganger.

What does introversion and being a hermit have to do with psychopathy?

~~~
ydb
Not sure, ask the parent commenter. I was merely attempting a rebuttal to
their remark.

------
pippin
Hi HN, I'm one of the creators from Dessa of this project.

If you haven't listened to it, we just released a longer clip of the RealTalk
model[1]. In my opinion it's even more compelling than these clips.

One of the fascinating parts of building this has been the questions we received
while showing it to people. I'll note a few anecdotes specifically:

"What is the difference between this and a real voice?"

"Can I learn to discern fakes over time?"

"Would we relate differently to a generative voice model posthumously,
compared to current media forms like videos?"

These aren't questions that we necessarily have answers to yet, but they're
important discussions to have.

[1]
[https://www.youtube.com/watch?v=DWK_iYBl8cA&feature=youtu.be](https://www.youtube.com/watch?v=DWK_iYBl8cA&feature=youtu.be)

~~~
failrate
All of these generated works are very impressive.

I think it is completely irresponsible to advance the state of the art in this
field without simultaneously developing techniques to demonstrate that the
generated work is artificial.

Please develop validation tests while developing your generative techniques.

~~~
gridlockd
Haven't read what they're doing, but chances are they are using an adversarial
neural network.

The job of the adversarial network is to tell apart real from fake. The job of
the neural network is to fool the adversarial network. Both are trained in
tandem.

One could imagine training another adversarial network that isn't used to
train the network itself, and so will pick up on nuances that the original
adversary doesn't pick up on. Anyone could do that; I don't think it's the
authors' responsibility.

Somewhat related:

[https://keenlab.tencent.com/en/2019/03/29/Tencent-Keen-
Secur...](https://keenlab.tencent.com/en/2019/03/29/Tencent-Keen-Security-Lab-
Experimental-Security-Research-of-Tesla-Autopilot/)

~~~
petrochukm
Doubt it.

Generative-adversarial models have had a lot of success in image generation;
however, the same cannot be said for speech synthesis.

Unless they have figured out a new technique, they are probably using Tacotron
2 ([https://ai.googleblog.com/2017/12/tacotron-2-generating-
huma...](https://ai.googleblog.com/2017/12/tacotron-2-generating-human-like-
speech.html)). Google's Tacotron 2 already achieved human-parity TTS without
adversarial training as measured by MOS.

------
WestCoastJustin
How can I do this with my voice? Any suggestions? I run a site called
[https://sysadmincasts.com/](https://sysadmincasts.com/) with tons of
voiceovers and would love to automate this (without it sounding like crap).
There are always places where I need to update the audio with small tweaks. I
imagine it is still pretty computerized today but I think we'll eventually get
there. The Google WaveNet stuff is pretty good but still not there yet [1].

I imagine eventually, you could have some type of transcript that's annotated
with speech synthesis markup language (SSML). Then, you have a CI/CD pipeline
that would run this text-to-speech engine and regenerate the audio. I could
then pair this up with the video. I honestly wonder if we are a year or two
away from this being possible.

[1] [https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-
to-speech/)
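
A minimal sketch of what the annotated-transcript step of such a pipeline could look like, assuming a hypothetical list of `(kind, value)` segments as the transcript format. `build_ssml` and the segment kinds are made up for illustration; `<speak>`, `<break>` and `<emphasis>` are standard SSML elements. Sending the result to a text-to-speech engine and pairing it with video is left out.

```python
import xml.etree.ElementTree as ET


def build_ssml(segments):
    """Turn (kind, value) segments into an SSML string.

    Hypothetical segment kinds (assumptions for this sketch):
      ("text", "plain words")  -> plain text
      ("pause", "400ms")       -> <break time="400ms"/>
      ("emphasis", "word")     -> <emphasis level="strong">word</emphasis>
    """
    speak = ET.Element("speak")
    last = None  # most recently added child element, if any
    for kind, value in segments:
        if kind == "text":
            # Text before any child goes in .text, text after a child in .tail.
            if last is None:
                speak.text = (speak.text or "") + value
            else:
                last.tail = (last.tail or "") + value
        elif kind == "pause":
            last = ET.SubElement(speak, "break", {"time": value})
        elif kind == "emphasis":
            last = ET.SubElement(speak, "emphasis", {"level": "strong"})
            last.text = value
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    transcript = [
        ("text", "Welcome back."),
        ("pause", "400ms"),
        ("text", " Today we cover "),
        ("emphasis", "backups"),
        ("text", "."),
    ]
    print(build_ssml(transcript))
```

With the transcript kept in version control, a CI job could regenerate only the clips whose SSML changed, which is the "small tweaks" case described above.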

~~~
blhack
[https://lyrebird.ai/](https://lyrebird.ai/)

~~~
CharlesW
Wow, the samples on their front page are absolutely awful. Do they have good
ones?

~~~
james_s_tayler
Yeah, I feel like this has to be a joke compared to the level of quality Faux
Rogan is pumping out.

------
zuminator
I was able to get them all correct, but only because I knew to listen for an
artificial voice. The faux Rogans clipped their consonants in a just-barely
inauthentic manner, and real Joe's voice slooows down mid-phrase for dramatic
emphasis. By the way I am not a follower of his, and was initially surprised
as I was expecting the voice of actor Seth Rogen.

~~~
ryanjshaw
Same, 8/8, voted after listening for at most 1 second to each clip and never
having listened to the guy before. The clipping makes the rate of speech sound
jarringly unnatural to me.

------
daenz
So when are we going to get an audiobook service that reads books in your own
voice (or a celebrity narrator of your choosing)?

~~~
puranjay
Audiobooks in your own voice would be terrible because your voice in your own
head is always deeper and richer than your actual voice

~~~
soulofmischief
Alright, audio books where the cast is my high school circle of friends.

------
aembleton
I got them all correct! It felt like the cadence of the Faux Rogan was just
'off' and I could tell from that. Everything else seemed spot on.

------
remir
Impressive stuff, and this is only 2019 tech. Now, imagine this kind of
technology in 10, 20 or 30 years. Things will get weird for sure.

~~~
OvidNaso
People keep saying this, but we've had Photoshop in the general public's hands
for decades now and I can't remember even a single fake photo causing any sort
of ruckus. Everyone seemed to realize that photos can be perfectly faked at
about the same time photos could be perfectly faked.

~~~
remir
Because photos aren't the same thing as feeding text to a machine that can
spit out dialogue in the voice of someone else. Same thing with "deep fake".

------
rangibaby
It’s easy to tell, the computer generated voices have what sounds like MP3
compression artifacts. I noticed the same thing on the bus in Kyoto

~~~
cs02rm0
I thought the same, though the second one along seemed to have dodgy audio too
yet that was real.

------
starpilot
Got 1 wrong, and I've never listened to Rogan. Seemed the fake one talked too
monotone.

~~~
daenz
The "warbling" factor stood out to me. It's like you could hear the
interpolation between audio samples.

~~~
mrmondo
Yeah, there was some strange timing between sentences and words at times.

------
kepler
Solid work; they both sound exactly the same to me. The Faux Rogan is a
slightly faster speaker, so it was pretty easy to differentiate between the
two.

------
peterwwillis
You could do a pretty good update of 1984 with this tech. A person goes to a
rally where the President gives a speech decrying the evil of global warming,
and then later online all the reviews have clips of him saying global warming
doesn't exist, and the viewer just changes their mind to believe what they saw
in person wasn't real. An occasional hacker-led SEO attack will push up the
real video from the original speech, but because it's so rarely published,
everyone just assumes it's a fake designed as propaganda. Virtually everything
we fear could be turned into something good for us, and media generated by AI
becomes an engine that transforms people's thoughts. A bug in the AI software
accidentally makes everyone believe we should all destroy our computers to
prevent getting some virus, only later to discover that the computers were
distorting our thoughts and that nothing was as it seemed.

------
larsiusprime
I got 8/8. Intonation was an easy giveaway for me; the fake recordings had
obvious tells because they didn't pause naturally or emphasized words in a
weird way.

EDIT: Still impressive tho. I imagine the typical use case for this sort of
tech is not just trying to fool people who have been explicitly forewarned.

------
scanr
This suggests that we are pretty close to not being able to rely on voice
biometrics.

Does anyone in the field know more?

------
craftinator
I didn't do so well. I was 60% correct. The non-word noises were really good
with FR, pauses and breaths did a great job of convincing me. The one thing
that I picked up on quickly though, was that FR speech was a bit more slurry
and fast than JR speech. Interesting...

------
kapilkaisare
I felt the Faux Rogan spoke just a mite quicker than the Real Rogan.

~~~
is0metry
Agreed. That was how I was able to tell.

------
agrinman
By submitting our guesses are we helping them train a fake joe rogan?

~~~
throw20102010
Not likely helping with the training- unless I'm underestimating the scale of
people visiting the site. Also unlikely because they had to have a well
trained model to launch the site and make the YouTube video that everyone has
seen. It's possible that they will do fine tuning in certain areas depending
on if some generated clips get consistently answered the same way.

More likely is that you're helping them write the results section of their
research paper, with a sentence like this, "In a blind trial of 5,000 internet
users, over 93% of people were unable to tell the difference between generated
audio and real audio at a statistically significant level (P<=0.035)."

Note: if we assume a binomial distribution, you need to get 7 or 8 correct to
reach the magical P<=.05 barrier. If you assume that there are a fixed 4
generated clips and 4 real clips, then you need to get them all right
([https://en.wikipedia.org/wiki/Fisher%27s_exact_test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)).
I think it's fair to use the binomial distribution because the website does
not tell you the number of real and generated clips before you take the test.
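
The arithmetic behind that note can be checked with a short sketch. It assumes 8 clips, a 50% chance per guess in the binomial case, and a guesser who knows the 4/4 split and labels exactly 4 clips as fake in the fixed-split case. The function names are made up for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(at least k of n correct) when each answer is an independent coin flip."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fixed_split_tail(n_fake: int, n_real: int, min_fake_hits: int) -> float:
    """P(at least min_fake_hits true fakes among the clips labelled fake).

    Assumes the guesser labels exactly n_fake clips as fake. This is the
    one-sided hypergeometric tail used by Fisher's exact test.
    """
    total = comb(n_fake + n_real, n_fake)
    ways = sum(
        comb(n_fake, h) * comb(n_real, n_fake - h)
        for h in range(min_fake_hits, n_fake + 1)
    )
    return ways / total


if __name__ == "__main__":
    print(binomial_tail(8, 7))        # 9/256, about 0.035
    print(binomial_tail(8, 6))        # 37/256, about 0.145
    print(fixed_split_tail(4, 4, 4))  # 1/70, about 0.014 (8/8 correct)
    print(fixed_split_tail(4, 4, 3))  # 17/70, about 0.243 (6/8 or better)
```

In the fixed-split case, getting 3 of the 4 fakes right means 6/8 overall, so 7/8 cannot happen and only a perfect score clears 0.05.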

------
mrmondo
Good job,

However - I guessed correctly on all but one - “You are much less
likely to injure yourself if you do it correctly” - where I can now hear the
difference (or perhaps that's confirmation bias).

There’s something artificial about the timing with the pauses between
sentences on the ML based ones that was the main giveaway for me, also while
his voice is damn near spot on - there are some vocal inflections or perhaps
“emotion” in a few words that also hinted to me.

Hope that helps.

~~~
wisty
I got all but that one correct too.

I think the real tell is that Real Rogan has a lot of variance, Faux Rogan
sounds too much like the average Rogan.

------
siedes
It was easy to figure out which are the fakes, but I am very impressed and
spooked at the idea of this technology improving, then getting into the wrong
hands.

~~~
sho_hn
I'm not _too_ spooked. I think as a society we're just going to learn to
distrust recorded+reproduced media by default. Already we don't trust photos
much for their potential to be a manipulation. You could call this a loss,
sure - but maybe we should just never have in the first place. Photo
manipulation is a very old practice.

There are already businesses that sell camera apps (or cameras? my memory is a
bit dim) that save a photo along with a cryptographic hash to prove
authenticity. Their customers are for example insurance companies, which
require their clients to take pictures of damaged property etc. for claim
filings that way.
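
I don't know how those products actually work, so here is only a rough sketch of the general idea using Python's standard library. A bare hash shows that a file hasn't changed since the hash was recorded. A keyed tag also ties it to whoever holds the key. HMAC is used here as a stand-in for a real public-key signature, and `device_key` and the function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(photo_bytes: bytes) -> str:
    """SHA-256 digest of the photo; changes if any byte of the photo changes."""
    return hashlib.sha256(photo_bytes).hexdigest()


def tag(photo_bytes: bytes, device_key: bytes) -> str:
    """Keyed tag over the photo (HMAC-SHA256), standing in for a signature."""
    return hmac.new(device_key, photo_bytes, hashlib.sha256).hexdigest()


def verify(photo_bytes: bytes, device_key: bytes, claimed_tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(tag(photo_bytes, device_key), claimed_tag)


if __name__ == "__main__":
    key = b"hypothetical-device-secret"
    original = b"raw photo bytes straight from the sensor"
    t = tag(original, key)
    print(verify(original, key, t))                # True
    print(verify(original + b" edited", key, t))   # False
```

A real deployment would more likely sign with a private key held on the device, so that the verifier never needs the secret.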

~~~
cco
Do you not see the problem in societies being unable to share trusted and
accurate information about reality? I mean that is 100% required for any
interpersonal relationship between two people let alone a community.

~~~
sho_hn
I do see the problem. I'm saying that ship sailed long before the computer
age. Photos have been manipulated with analog means before. Currently we're in
a state of still placing _undue_ trust.

What I'm hoping is that we'll find means to actually build trust, e.g. with
signing as we discussed here. I'm not entirely confident in that, but wouldn't
it be _nice_ if we engineers had a hand in building some useful tools for
society? :)

------
rahimnathwani
This repo has an implementation of a similar model, and a very clear set of
usage instructions:

[https://github.com/syang1993/gst-tacotron](https://github.com/syang1993/gst-
tacotron)

I haven't tried it yet, but if you're looking to do something similar, this
repo (and the papers on which the implementation is based) might be a good
starting point.

------
j7ake
It'd be cool if the algorithm is being tweaked dynamically and it is
progressively getting harder and harder for people to get 8/8.

~~~
tomlu
This is probably the hardest setting, yet most people seemed to have no
trouble getting 8/8.

------
kyberias
Each of the real Rogan samples had some detail that didn't sound like AI could
reproduce it. The AI samples were much more monotonous.

------
Hoasi
This is really well done. I could tell which was Faux Rogan by listening to
the pacing of the sentences, not so much to the voice itself.

------
Fnoord
Not sure, but I suppose this person [1] is the Real Rogan. (I didn't know this
person so maybe it helps someone.)

[1]
[https://en.wikipedia.org/wiki/Joe_Rogan](https://en.wikipedia.org/wiki/Joe_Rogan)

------
drenvuk
8/8. There's distortion in every single Faux Rogan recording. The real ones
are smooth.

------
dogma1138
Got them all right. It seems that the faux Rogan talks a bit too fast, but anyone
who doesn’t listen to his podcast regularly likely won’t notice it.

I wonder how long it will take until Joe brings those guys onto his podcast.
That is something I’ll be waiting for.

------
milleramp
I went 5/8. Perhaps it’s because I listen to the podcast on 1.4 speed??

~~~
chrisfinne
Me too. Hearing his voice at 1x seems weird.

------
rtkwe
I actually had an oddly easy time of it... but not because of any audio
differences: all the Real Rogans were talking about food and the rest were
talking about something else.

------
smolder
I got them all right. There are some noticeable artifacts in each of the
generated clips that give them away.

If it weren't for that, they sound pretty natural, so it would be difficult.

------
arebours
Damn, it's good. I was able to spot most of the fakes, but had I not known there
were any I don't believe I would have. I'm curious about longer samples.

------
chriselles
I have listened to about 10 Joe Rogan podcasts.

I only got a 4 out of 8.

I’m assuming the longer the audio clips the easier it would be to detect the
fake/AI?

------
DavidSJ
I got 7/8 and I don't even listen to him.

~~~
markdown
Same here. Only got the "Healthy local ingredients, farm fresh ingredients"
one wrong. And I've only ever listened to this person once (the Musk interview).

------
djohnston
8/8. You can detect the faux Rogan because there are slight cracks in the
audio, almost like a record pop but much more subtle.

------
__s
Identity is a private key. Sign everything you wish attributed to your
identity

Unfortunately the UI/UX just isn't there yet

------
m0zg
8 out of 8. Look for intonation and you'll guess correctly every time.
Machines don't have it (yet).

------
chubot
I went 6 for 8 -- two times it convinced me that Faux Joe Rogan was real.

------
overgard
Gotta admit, I listen to his podcasts sometimes and I only hit 50%.

------
sebringj
there were slight audio anomalies in the fake ones, easy to spot.

------
mitchtbaum
XKCD: PGP

[https://xkcd.com/1181/](https://xkcd.com/1181/)

[https://www.explainxkcd.com/wiki/index.php/1181:_PGP](https://www.explainxkcd.com/wiki/index.php/1181:_PGP)

------
55555
> "It's important that we listen to what others have to say about this work."

Who are you, my father? Am I the only one who finds this trend really cringey?

