
Nvidia Vid2vid: High-resolution photorealistic video-to-video translation - davedx
https://github.com/NVIDIA/vid2vid
======
iforgotpassword
I'm probably being captain obvious here, but if this is what's being released
for free, I wonder how much better a polished commercial version does, and
when we reach the point where we can't trust anything we see anymore. It
doesn't even have to be super perfect, even reaching the point where it takes
experts about two weeks to determine if something's real or not might already
be long enough to do great damage.

From a technical standpoint I think this is very impressive, and I'm also
interested in creative/artsy uses of this. Their "replace trees with houses"
example is pretty dull but gives a good glimpse of what can be done.

~~~
satori99
It could also lead to crimes like blackmail becoming extinct. It would be hard
to hold incriminating recordings over anyone if near-perfect audio and video
synthesis were common.

Especially for public figures, with lots of training data available.

~~~
deytempo
Yeah... The problem is trying to explain deepfakes to your significant other
when they're randomly sent what looks like a video of you cheating on them.
Sure, it's possible, but not likely.

~~~
thanatropism
It's all a matter of cultural awareness. Everyone now thinks when seeing an
unlikely photo -- "Photoshop?"

This stuff even has the catchy name "deep fake".

------
huling0
From the paper: "we have to use all the GPUs in DGX1 (8 V100 GPUs, each with
16GB memory) for training. We distribute the generator computation task to 4
GPUs and the discriminator computation task to the other 4 GPUs. Training
takes ∼10 days for 2K resolution."

As I don't have a DGX1 here, training the 2K resolution net for 10 days on a
p3.16xlarge instance (also has 8 V100 GPUs) would cost USD 5875 on AWS.
(USD24.48 per hour on-demand pricing * 24 hours/day * 10 days)

~~~
sp332
The DGX-1 costs $129,000, so AWS is cheaper unless you need to do it 22 times.
And you can have multiple instances running at once and get all of your
results in ten days, instead of waiting ten days again for each run.
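The break-even arithmetic from these two comments, spelled out (all prices as quoted in the thread; AWS on-demand rates have changed since then):

```python
# Back-of-envelope numbers quoted in the thread.
HOURLY_USD = 24.48     # p3.16xlarge on-demand pricing (8x V100)
TRAINING_DAYS = 10     # ~10 days for 2K resolution, per the paper
DGX1_USD = 129_000     # DGX-1 list price

run_cost = HOURLY_USD * 24 * TRAINING_DAYS
breakeven_runs = DGX1_USD / run_cost

print(f"one training run on AWS: ${run_cost:,.0f}")
print(f"runs before buying a DGX-1 pays off: {breakeven_runs:.0f}")
```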

~~~
p1esk
Or you can build your own 8x2080Ti rig which will have 80% of performance for
1/10th of the cost.

~~~
twtw
DGX 1V has $7500 worth of CPU alone to feed the gpus. Throw in 8 TB of nvme
ssd for training data and you're looking at something more like 1/5th the
cost.

V100 has ~50% higher memory bandwidth than 2080Ti, so you probably wouldn't
get 80% of the performance. Also, only two 2080Ti can be connected via nvlink.

------
aethr
Seeing the example of one facial pose video transferred to three
different-looking women, I'm imagining a future where Netflix does A/B testing
on its shows, using similar tech to swap out different "actors" to find which
one resonates best with audiences.

They could even generate a new "cast" for each market, after only shooting the
show once.

~~~
iamgopal
Porn industry will benefit the most.

------
Findeton
Some applications for this kind of tech:

- Porn, yeah, the first application you can think of; there are already some
startups doing it.

- Doubling actors; and applied to sound, maybe you could translate from one
language to another while keeping the accent and tone.

- Propaganda and misinformation. Now you can get your enemy to say and do
whatever you want, on video.

- Photo-realistic games. Create a rough 3D model of a scenario and train the
AI on it. Instead of photo-realistic rendering with math, render it with the
AI based on a rough render, in real time.

~~~
magnat
> Photo-realistic games. Create a rough 3D model of a scenario and train the
> AI on it. Instead of photo-realistic rendering with math, render it with
> the AI based on a rough render, in real time.

According to last month's Nvidia RTX presentation/launch event [1], they are
going to do something similar quite soon. Games will ship with a DNN
pre-trained offline on extremely high-quality renderings. The game itself
renders at a lower resolution (limited by the performance needed for proper
ray tracing) and uses the DNN to upscale it.

[1]
[https://youtu.be/Mrixi27G9yM?t=51m15s](https://youtu.be/Mrixi27G9yM?t=51m15s)

~~~
antpls
I wonder: since the NN cores of the GPU are used for real-time ray tracing,
will they be able to run custom NNs, possibly not related to visual stuff, in
parallel with the ray-tracing work?

Edit: found the answer on the Internet; apparently the RT (ray-tracing) cores
are separate from the Tensor (NN) cores on the RTX.

~~~
21
I think this is about trading storage for computation - you replace terabytes
of model/texture data with a compute heavy NN.

------
lambdadmitry
I am a bit surprised how shallow the comments on this one are.

Look closely: while it does generate videos with a passing resemblance, they
aren't "photorealistic" in the slightest. They _are_ good locally across the
time and space domains, but globally they are as far from realism as Doom 2 was.

The only explanation I see for the attitude in this submission is that most IT
people trained themselves to spot CGI by looking at local artifacts, assuming
that global artifacts won't happen because the stuff in the scene is
_reasonable_. There is no "stuff in the scene" with these videos; it's just
mindless vector manipulation with no underlying world model. Cars wave around,
trees grow a foot from each other and behave in ways incompatible with 3D
perspective.

Relax, it'll require at least another AI/ML revolution (or even several) to
achieve photorealism.

------
pjeide
What media would someone collect now to be used in the future to reproduce the
likeness of loved ones? Video clips of them moving? Talking? Pictures of
different poses? Reading the dictionary out loud to capture vocal patterns?

Heck with impersonating the POTUS. What about a lost friend, sibling or
parent?

~~~
pbhjpbhj
Black Mirror did this, I think - if we can make video, then why not VR (down
the line, if processing catches up)? Second Life, iterated.

Heaven on Earth?

~~~
crtasm
Also see the film Marjorie Prime.

[https://en.m.wikipedia.org/wiki/Marjorie_Prime](https://en.m.wikipedia.org/wiki/Marjorie_Prime)

------
gggggggg
Am I reading that right? It's making videos that look real from the simplistic
input?

If so, that is amazing.

And if so, how do I turn a video I have into a simple/line version, to be able
to then put a different 'skin' on it?

~~~
Erlich_Bachman
The level of realism can be gauged from the examples they provide right there
on the page. Of course, your results may vary depending on the initial bulk of
realistic source images you use.

You have the code right there on GitHub; just install it on a PC with powerful
GPUs (or rent one), tune some parameters, train the network, and you can do
the same things.

~~~
marmaduke
The hardware they used costs tons of money, and so does doing on a cloud
provider. It’s not something to do in a weekend with your gaming card.

~~~
twtw
It is if you drop the resolution.

------
SeriousM
I'm scared. I can't trust anything I haven't taken myself. The problem is that
other people don't even know that this technology exists, and if you tell them
about it, you're a liar.

~~~
mcintyre1994
I don't think there's going to be a long period where this technology is being
used and isn't widely known about. It'll be used extremely quickly to abuse
people using their Facebook pictures, and once there's a Facebook angle the
media will be able to run with it and everyone will get it.

~~~
posterboy
It's already been in use for quite a while, and you talk as if you aren't
aware of that. Quite ironic.

------
ivanb
This is why some people believe that Assange has been dead for more than a
year.

------
dolzenko
Can't wait for someone to put The Simpsons through this.

------
dplgk
Is there a way a person could create a QR code from a private key? The QR code
would contain the date, time, and other metadata, verified with a public key
to prove the QR code in the video was made by, for example, the person in the
video. Would this "prove" the video was real? I guess the speech could still
be manipulated, so the whole transcript would have to be signed in this way
too...
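The provenance idea above can be sketched without QR codes at all: bind the recording's bytes to its metadata (date, time, device) and sign the combination. A minimal stand-in using only Python's standard library; the HMAC and all names here are illustrative placeholders, and a real scheme would use an asymmetric signature (e.g. Ed25519 or gpg) so anyone holding the public key can verify:

```python
import hashlib
import hmac
import json

def sign_clip(video_bytes: bytes, metadata: dict, key: bytes) -> str:
    """Toy provenance tag: HMAC over the video bytes plus canonical metadata.
    Stand-in for a real asymmetric signature."""
    payload = video_bytes + json.dumps(metadata, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_clip(video_bytes: bytes, metadata: dict, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_clip(video_bytes, metadata, key), tag)

clip = b"\x00\x01fake-mp4-bytes"
meta = {"recorded": "2018-08-21T12:00:00Z", "device": "cam-01"}
tag = sign_clip(clip, meta, key=b"secret")

print(verify_clip(clip, meta, b"secret", tag))              # untouched clip verifies
print(verify_clip(clip + b"tampered", meta, b"secret", tag))  # any edit breaks it
```

Note that this only proves the file is unmodified since signing; it says nothing about whether the content was synthetic to begin with.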

~~~
lucb1e
I don't know why you're talking about QR codes and transcribing/encrypting
the audio when you can just sign the file:

    gpg --sign video.mp4

~~~
misnome
I assume they are talking about signing/watermarking a video in a way that
survives video encoding/lossy transmission. A QR code probably wouldn’t work
for various reasons but is an easy mental analogy to think of.

~~~
lucb1e
For embedding, you should just put it in the metadata. Encoding it in the
video itself... I don't really see the point.

------
Grasshoppeh
It would be interesting to see the applications of this technology with the
use of thermal cameras. Extracting environments from thermal imaging would be
nice.

------
huling0
One of the examples translates a full human pose to a video of a dancer. If
the network were trained on the facial pose/features only, would that recreate
something like the facial reenactment in
[http://niessnerlab.org/projects/thies2016face.html](http://niessnerlab.org/projects/thies2016face.html)
(the source code for Face2Face is not public)?

~~~
TomMarius
Look into Deepfake, that's the tool 4chan is using for face swapping in their
fake porn

~~~
huling0
Yes, but it just cuts out the face and pastes it onto a different
person/background. It does not do full reenactment, where you keep the entire
target video environment.

------
jcims
If you have any doubts that the face synthesis one is faked (faked fake?),
watch the face of the woman in the bottom left as it loops.

[https://github.com/NVIDIA/vid2vid/blob/master/imgs/face.gif](https://github.com/NVIDIA/vid2vid/blob/master/imgs/face.gif)

~~~
bronxbomber92
I'm not following - can you explain?

~~~
jcims
Her(?) face morphs slightly in the first few frames.

------
tvdo
Does anybody know a) the performance (e.g. introduced latency) and processor
requirements on the client/input side (e.g. is real-time Canny edge detection
good enough, and how fast would it run)?

And b) the latency impact on the NN side to build the images (e.g. how many ms
are we talking about)?

Thanks a lot!
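Neither the thread nor the repo quotes latency numbers, but the edge-map half of question a) is cheap to gauge. A rough NumPy sketch (a crude gradient-magnitude detector, not real Canny, and far slower than an optimized OpenCV implementation; all names and the frame size are illustrative):

```python
import time
import numpy as np

def edge_map(frame: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Crude gradient-magnitude edge detector, a stand-in for Canny."""
    gx = np.abs(np.diff(frame, axis=1, prepend=frame[:, :1]))
    gy = np.abs(np.diff(frame, axis=0, prepend=frame[:1, :]))
    return (gx + gy) > thresh

# One 2K-ish grayscale frame of noise, values in [0, 1).
frame = np.random.rand(1080, 2048).astype(np.float32)

t0 = time.perf_counter()
edges = edge_map(frame)
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"edge map: {elapsed_ms:.1f} ms for {frame.shape}")
```

Even this naive version runs in milliseconds per frame on a CPU, so the input-side preprocessing is unlikely to be the bottleneck; the generator inference in b) is where the latency would come from.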

------
gok
Kind of surprised this stuff hasn't totally blown away traditional video
compression techniques yet.

~~~
make3
I suspect that a combination of the two is where it's at, i.e. store a lossy,
classically compressed version, then remove the artifacts / dream up details
with deep learning.

~~~
rasz
Or pretrain the network on the movie you want to compress, then ship the
cartoon-compressed version plus the trained network.

~~~
gok
Right, that's exactly what I was thinking.

------
ksec
As someone not into AI/machine learning:

Can we expect CUDA to be the x86 of PCs and servers? Literally all work
defaults to CUDA and Nvidia's libraries. I don't even see a contender trying
to compete; I don't see AMD's ROCm being used or even mentioned anywhere.

------
jokoon
I'm curious what would happen if somebody tried to impersonate the US
president using this.

He would say it's fake, but who would believe him?

How exactly can computer scientists explain deepfakes to laymen?

~~~
dan-robertson
Well, in this case the results are pretty good locally but have pretty obvious
artefacts too, especially in the synthesised road videos; look at the trees,
or even more at the lane change in the linked video.

~~~
puranjay
Can't you easily obfuscate that by making the video intentionally grainy and
low-res and passing it off as "caught by CCTV" or "found footage"?

~~~
IanCal
Probably much simpler to get a lookalike, and that's been possible for a long
time.

------
Improvotter
Are there any high resolution examples available? Am I just not finding them
in the README?

------
rawoke083600
Holodeck programming, step 1.

------
ai_ja_nai
ouch, this hurts

