
Deep image prior 'learns' on just one image - singularity2001
https://dmitryulyanov.github.io/deep_image_prior
======
cs702
Wow:

"In this work, we show that, contrary to expectations, a great deal of image
statistics are captured by the _structure_ of a convolutional image generator
rather than by any learned capability. This is particularly true for the
statistics required to solve various image restoration problems, where the
image prior is required to integrate information lost in the degradation
processes.

To show this, we apply _untrained_ ConvNets to the solution of several such
problems. Instead of following the common paradigm of training a ConvNet on a
large dataset of example images, we fit a generator network to a single
degraded image. In this scheme, the network weights serve as a parametrization
of the restored image. The weights are randomly initialized and fitted to
maximize their likelihood given a specific degraded image and a task-dependent
observation model.

We show that this very simple formulation is very competitive for standard
image processing problems such as denoising, inpainting and super-resolution.
This is particularly remarkable because _no aspect of the network is learned
from data_ ; instead, the weights of the network are always randomly
initialized, so that the only prior information is in the structure of the
network itself. To the best of our knowledge, this is the first study that
directly investigates the prior captured by deep convolutional generative
networks independently of learning the network parameters from images."

PS. This makes me wonder whether and to what degree the _structure_ of the
brain's connectome might be a necessary prior for AGI.

~~~
rullelito
Can someone explain this in a way so that an ordinary mortal computer
scientist can understand it?

~~~
vbarrielle
> Can someone explain this in a way so that an ordinary mortal computer
> scientist can understand it?

I'll try.

Instead of the common approach that tries to search for image pixels to
minimize e.g. a denoising objective, they realize that they can instead search
for the weights of an image generator network such that the generated image
matches the objective.

They argue that the structure of the network then constitutes some prior
knowledge over what a natural image should look like.

My (probably wrong) interpretation: since a convolutional neural network
essentially works by looking for spatial patterns at different resolutions,
their optimization process boils down to finding the high- and mid-resolution
patterns that best match the input image, and then re-using those patterns to
fill in the missing regions (or to replace "noise" that does not match the
extracted patterns).
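
Roughly, in PyTorch-flavored code (a minimal sketch; the toy generator, image
size, learning rate, and iteration count below are placeholders, not the
paper's actual architecture or hyperparameters):

    import torch
    import torch.nn as nn

    # Toy stand-in for the paper's generator: any randomly initialized
    # convolutional encoder-decoder will do for this sketch.
    net = nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
    )

    x0 = torch.rand(1, 3, 256, 256)    # the single degraded image (placeholder data)
    z = torch.randn(1, 32, 256, 256)   # fixed random input to the generator
    opt = torch.optim.Adam(net.parameters(), lr=0.01)

    for step in range(1800):           # stop early, before the net also fits the noise
        opt.zero_grad()
        x = net(z)                     # image produced by the current weights
        loss = ((x - x0) ** 2).mean()  # data term for denoising: match the degraded image
        loss.backward()
        opt.step()

    restored = net(z).detach()         # the network's output is the restored image

The claim is that with the right convolutional structure, the intermediate
outputs of this loop look like plausible restorations long before the loss
reaches zero.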

~~~
rdlecler1
What do they mean by ‘the structure of the network’? Is this the network
topology?

------
vbuwivbiu
Remember fractal image compression?

[https://youtu.be/AjdogjBxfco?t=260](https://youtu.be/AjdogjBxfco?t=260)

~~~
ska
This can also be looked at as the original source of patch based denoising,
etc. In the end it's about capturing the scaling properties and self
similarity of natural images. This is also, for example, why wavelets were so
effective as a basis.

David Mumford particularly did some great work on this sort of thing a couple
of decades ago, along with many others. I hope when people are rushing around
trying to apply convolutional nets to everything they aren't losing these
insights.

~~~
fjsolwmv
The benefit of CNNs is like the benefit of SVMs -- they generalize all the
great old techniques so you don't have to understand them all; you just throw
more CPU at the optimization problem.

~~~
ska
I don't think that's true particularly in this case.

This paper is pointing out that you can encode a structural prior in a CNN -
but knowing the "great old techniques" will help you design the right network
architecture to do that.

SVMs were a surprise when they came out, not so much a generalization as a
challenge (at least at first).

------
eximius
Huh. This seems to boil down to 'noise has higher information entropy than
realistic content; partial learning will learn realistic content before
learning noise', or something like that.

~~~
eref
I think that can be mainly attributed to the fact that the last few
deconvolutional features are overfitted to features in the image and are
somewhat robust to noise. The network does not even learn features to produce
e.g. white noise as output. This is probably much less magical than the paper
makes it seem to be.

------
cdumler
Red Dwarf - "Uncrop" \-
[http://www.dailymotion.com/video/x2qlmuy](http://www.dailymotion.com/video/x2qlmuy)

~~~
sarreph
Actually... How cool would it be to have an NN that could extend an image's
background with plausible scenery? Not just photoshop 'smart' fill, but for
example if it detected a building on the right side of an image, it could draw
the rest of it? :)

~~~
nl
This exists.

This approach takes other photos of the same scene to extend a cropped pic
(see also MS PhotoSynth):
[http://grail.cs.washington.edu/projects/sq_photo_uncrop/](http://grail.cs.washington.edu/projects/sq_photo_uncrop/)

This uses a GAN to fill in missing parts of a pic. Those parts could be on the
edge of the picture (although the paper doesn't explore that):
[http://hi.cs.waseda.ac.jp/~iizuka/projects/completion/data/c...](http://hi.cs.waseda.ac.jp/~iizuka/projects/completion/data/completion_sig2017.pdf)

~~~
dahart
Here's another earlier example of un-crop, that uses only the source image
(doesn't need gps or an internet database of photos), and does both out-
painting and in-painting.

[http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....](http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung.html)

------
bitL
Wow, so just the weight sharing architecture does so much already? I am
wondering if the same could be done with LSTMs on sequences or CNNs on
voice...

~~~
cs702
I'm wondering the same thing too.

Note also that this finding strongly suggests that _neural net architecture_
actually is quite important, possibly even more important than having more
data -- which contradicts the conventional wisdom!

~~~
jacquesm
There is some pretty strong evidence for this: all the toddlers in the world.
You only need to show them something _once_ and they'll immediately be able to
recognize more examples of the same thing from different angles and even when
it is partially hidden. All they have to guide them is the structure of their
brains, not the quantity of data they have been exposed to.

~~~
rahimnathwani
"All they have to guide them is the structure of their brains, not the
quantity of data they have been exposed."

A typical toddler (say 12 months old) has spent 4000-5000 hours with open
eyes. Even if you assume a low frame rate (10 fps), 1080p resolution, and a
1000:1 compression ratio, that's still 1TB of training data.
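
Rough arithmetic behind that figure (my own back-of-the-envelope, assuming
4000 hours and 3 bytes per pixel):

    hours = 4000
    frames = hours * 3600 * 10                         # 10 fps -> 144 million frames
    raw_bytes_per_frame = 1920 * 1080 * 3              # 1080p, 3 bytes per pixel
    total_bytes = frames * raw_bytes_per_frame / 1000  # 1000:1 compression
    print(total_bytes / 1e12)                          # ~0.9 TB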

~~~
fjsolwmv
Consider language acquisition:

[https://en.m.wikipedia.org/wiki/Poverty_of_the_stimulus](https://en.m.wikipedia.org/wiki/Poverty_of_the_stimulus)

------
andbberger
This shouldn't really be surprising. Machine learning is specifically not
magic. The reason CNNs have seen so much success is precisely because they
build in translation-invariance, which massively cuts down on parameters while
forcing the final function to have the desired structure regardless of where
gradient descent takes the weights.
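
For a sense of the savings (a made-up 64x64 RGB example, not a figure from the
paper):

    # 3x3 conv, 3 input channels, 64 filters: weights are shared across
    # positions, so the count is independent of image size.
    conv_params = 3 * 3 * 3 * 64 + 64                          # 1,792 parameters
    # Fully connected layer mapping a 64x64 RGB image to 64 same-sized feature maps.
    fc_params = (64 * 64 * 3) * (64 * 64 * 64) + 64 * 64 * 64  # ~3.2 billion parameters
    print(conv_params, fc_params)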

That's also why most papers in deep learning are network architecture
innovations.

~~~
andbberger
One more relevant note: (Olshausen and Field, 1997) showed that the filters
employed by V1 simple cells could be learned using some simple assumptions
about sparse coding and a single image. Translation invariance was built in by
way of the sampling scheme: small patches of the image.

The filters learned by the first layer of CNNs are usually of the same type,
Gabor filters. Not a coincidence.

That was twenty years ago. What's old is new?

~~~
eref
What do Gabor filters have to do with this?

~~~
ska
Parent is saying Gabor filters are typically effectively recapitulated by the
first layer of the network anyway, as they are a natural representation.

~~~
eref
But what does that have to do with smoothness and translation invariance,
which this paper is a demonstration of? You can even learn Gabor filters with
local connectivity, without spatial weight sharing.

------
JumpCrisscross
Can someone break this down for this layman?

~~~
flyingspork
Not an expert so take this with a grain of salt; I could be misinterpreting
the paper.

It seems that the current accepted method is to train a network with distorted
images as the input and the correct undistorted images as the targets. Then
after training you can feed a new distorted image into the trained network and
get the estimated "fixed" image.

However this team actually uses the distorted image as both the input and the
target to the net. So if they were to let the training go on for too long the
network will produce an exact copy of the distorted input image. But for some
reason, the structure of the network means that the estimated output learns
realistic features first, and then overfits to the noise afterwards. So if you
stop the training early, you get an image that incorporates realistic features
from the distorted image, but hasn't had time to "learn" the noisy features.

~~~
jboggan
This is fascinating because I've been running into something similar with
sequence-to-sequence models translating natural language into Python code. I
got better results stopping "early" when the perplexity was still quite high;
I thought it was a little crazy.

~~~
samgd
[https://en.wikipedia.org/wiki/Early_stopping](https://en.wikipedia.org/wiki/Early_stopping)

------
raz32dust
This is incredible! Can't help but wonder if the brain does something similar
to fill in "gaps" in reality, e.g. how we fill in our perception (not only
vision, but general mental intuition) based on just context.

~~~
randcraw
That's exactly what happens when the retina is damaged. The brain fills in the
void imperceptibly so you aren't distracted by the deficit. But I think
biology doesn't fill in the void synthetically. It's more likely that the
brain "turns a blind eye" toward the missing 'pixels' and directs your
attention elsewhere, perhaps to those remaining regions with high saliency
(useful detail).

------
blauditore
As I'm not an expert in the field: what exactly does the term

    min_x E(x; x0) + R(x)

mean?

I thought that E(x; x0) would denote the error/difference between original and
corrupted images, and R(x) be the (searched-for) correction. But this doesn't
seem to make sense with the next parts of their explanation.

~~~
urgoroger
x is the generated image being tested against x0. A lower E(x; x0) means an
image that better fits the objective derived from the original image (what
that objective is depends on the task). The paper gives some examples. For
image denoising, E(x; x0) is just the squared distance between the generated
(denoised-candidate) image x and the pixels of the original image x0. You
would want this to be low, since the generated version should still look close
to x0.

R(x) is a regularization term to avoid overfitting. In the denoising example,
it could be a measure of how much the colors of x vary. Clearly, just taking
x = x0 makes the squared error E(x; x0) zero, but R(x) will be high because of
all the noise. That's why they minimize both quantities combined, min_x E(x;
x0) + R(x): get close to the objective, but don't overfit.

[https://en.wikipedia.org/wiki/Regularization_(mathematics)](https://en.wikipedia.org/wiki/Regularization_\(mathematics\))
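
To make that concrete, here's one hypothetical instantiation for denoising,
with squared error as E and total variation as R (the weighting `lam` is my
own placeholder, not from the paper):

    import torch

    def data_term(x, x0):
        # E(x; x0): squared distance between the candidate image x and the noisy image x0
        return ((x - x0) ** 2).sum()

    def total_variation(x):
        # One common choice of R(x): penalize rapid pixel-to-pixel changes (i.e. noise)
        dh = (x[..., 1:, :] - x[..., :-1, :]).abs().sum()
        dw = (x[..., :, 1:] - x[..., :, :-1]).abs().sum()
        return dh + dw

    def objective(x, x0, lam=0.1):
        # min_x E(x; x0) + R(x): x = x0 zeroes the first term, but R(x) stays large if x0 is noisy
        return data_term(x, x0) + lam * total_variation(x)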

------
zellyn
Would a plausible explanation of this be that one part of an image tells you a
hell of a lot about another part of it? And that ConvNet structure captures
that really well?

~~~
psergeant
Yes

------
BatFastard
How can it possibly know what was in the white areas of the library? Is there
a residual image?

Seems impossible that it guesses correctly.

~~~
azeirah
It doesn't guess "correctly" at all. Zoom in on the image and focus on the
filled-in areas: they look _really_ blurry. It just doesn't look very bad from
a bird's-eye view.

~~~
knolan
Somewhat similar to content-aware fill in Photoshop [0]. The untrained network
can latch onto frequent patterns and match them to holes in the data.

Why doesn’t it paint everything white? Are these actually transparent images
or are they somehow tagged?

[0] [https://helpx.adobe.com/photoshop/using/content-aware-patch-...](https://helpx.adobe.com/photoshop/using/content-aware-patch-move.html)

~~~
jmmcd
Yes, for the inpainting, the parts to be painted (big white deleted areas) are
supplied as masks, so it doesn't try to match them.

~~~
BatFastard
But it's generating unique content in those areas...

~~~
jmmcd
Yes, every "run" of the network is generating pixels in those areas, but
they're not being compared against the white (deleted) pixels. On the final
run of the network, they're still not being compared against anything, except
by us, visually, when we look at those pixels.
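
A rough sketch of what that masked comparison might look like (my own
illustration, not the paper's exact code); `mask` is 1 on known pixels and 0
inside the deleted region:

    def inpainting_data_term(x, x0, mask):
        # Only known pixels enter the loss; inside the hole (mask == 0) the
        # network's output is never compared against the white pixels of x0.
        return (((x - x0) * mask) ** 2).sum() / mask.sum()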

------
frihani
Maybe it's the presentation of the restoration process, but I'm particularly
impressed with the inpainting sample.

The idea of not running the simulation `past` the realistic interpretation and
using that result makes sense, but the results are way beyond what I would
have expected!

Great work on the write up.

------
zo7
I feel like this is related to the information bottleneck idea that's been
floating around for some time [1]. I only really understand both of these at a
superficial level, but from what I understand, one thing they observed is that
there are two phases when training a deep learning model: a phase that
maximizes the mutual information (?) between the input and the output, and a
compression phase that compresses the learned representation. In that light
this work makes sense, since artifacts are essentially noise that the network
would filter out in the fitting process.

Very cool work though.

[1]: [https://youtu.be/bLqJHjXihK8](https://youtu.be/bLqJHjXihK8)

------
tgb
I don't see how any choice of a function g(theta) could have the property they
desire, i.e. could eliminate R(g(theta)). Can anyone explain?

~~~
sanxiyn
It's expressed somewhat awkwardly, but what's going on is that R(x) is zero if
x is in range of g, and infinite otherwise. Choice of g is such that natural
images are in range of g, and non-natural images aren't.
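
Written out (my paraphrase of the substitution being described):

    R(x) = 0     if x = g(theta) for some theta
    R(x) = +inf  otherwise

    so  min_x E(x; x0) + R(x)  reduces to  min_theta E(g(theta); x0)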

~~~
tgb
Sorry I still don't understand. They require g to be surjective. Edit: in
their paper they call it f_theta and it's explicitly not surjective. Dunno why
their writeup is so confused.

------
ac2u
Question to anyone that knows this area in depth.

I'm assuming that this impressive feat has a disadvantage (over the
traditional example intensive technique of training a CNN with something like
ImageNet) in the form of taking a long time to generate the corrected image.

With that assumption in mind, could this new technique be reversed? As in
feeding in sharp images, getting back out corrupted ones, for the purposes of
generating data sets where there isn't much data to begin with?

You could then take that data and use it to train a more traditional CNN to
sort of amortize the results of the technique in the paper and have the
process happening faster.

~~~
wonderous
Do you mean generative adversarial networks?

[https://en.m.wikipedia.org/wiki/Generative_adversarial_netwo...](https://en.m.wikipedia.org/wiki/Generative_adversarial_network)

~~~
ac2u
No, not really; this isn't generation for the sake of classification. It's
generation in order to produce a dataset for training a new network, one that
exhibits the characteristics of the network in the paper but can be used
quickly and in a more general fashion.

~~~
wonderous
Meaning that your assumption is that there is an optimal generalized network
structure for generating a neural network as described in the paper, and that,
as such, what is missing from the research is an optimal way to generate a
malformed source/target seed.

Is that correct?

~~~
ac2u
I'm not suggesting it's missing from the research as I don't have anywhere
near the area of expertise to make that call.

But you're correct in your summary: I'm wondering if that's a possible follow
up.

------
machiaweliczny
Isn't it simply hardcore overfitting of the network?

I haven't read the paper yet, but it seems to me that it shows you can use
overfitting for output interpolation with great results.

The same should be possible, for example, for interpolation of video frames
instead of pixels.

------
gbrown
So what kind of generative stuff does it "learn" from really really sparse
data, or from random noise? Trained generative models can produce some crazy
things; I wonder if something similar could work here.

Recursive geometric patterns perhaps?

------
loser777
An interesting question this raises is--what determines when the training has
overfitted to the noise/error in the input image? If that's another NN that
provides a perceptual model of quality, how was it trained? ;)

------
eutectic
This is as close to magic as humanly possible.

~~~
sarreph
Safe to say that Arthur C. Clarke would count this as 'sufficiently advanced
technology' then, I guess :)

------
scentoni
Enhance. [http://knowyourmeme.com/memes/zoom-and-enhance](http://knowyourmeme.com/memes/zoom-and-enhance)

------
giobox
Some of these image restoration techniques are extremely impressive, but it's
odd to think how much visual history will be faked in tiny ways if these "fix"
techniques become widespread in commonly used photo software.

I suppose this is no different really than the extent to which images are
edited anyway now, in that many images are never really a fastidiously
accurate representation of the original scene, but interesting to consider
none the less.

------
fpgaminer
I'm finding it hard to put into words what I find wrong with this paper, but
... here goes nothing.

So, the novel thing here is that an encoder-decoder network applied to an
image can learn enough from a source image to be useful. In some ways that's
obvious, but the effectiveness of it on reconstruction tasks is certainly
surprising.

I have two problems, though. One is that I would take the reconstruction
results with a grain of salt. The examples are clearly lab queens, where the
occluded regions are not particularly interesting/challenging.

Two is the conclusion the authors reach. Somehow the authors go from the novel
discovery I describe above, to saying that somehow the architecture of the
network is a prior.

Well ... I mean, yeah a network's architecture _is_ a prior. But it's not
actually significant.

See, in the dark ages of machine learning we had only fully connected
networks. They sucked. They'd always overfit and underperform or were
impossible to train. Then we finally got convolutional networks, and suddenly
a whole slew of machine learning problems became easier, and that hurtled us
into the current renaissance.

But, you see, convolutional networks weren't the _only_ reason for the dawn of
this new age. Rather it was three major things: 1) Convolutional layers, 2)
more data, 3) more computing power.

Some time after the "discovery" of convolutional layers we found out that,
hey, our old fully connected networks actually _do_ work. If you give them
enough data and enough computational power, you can get them to perform as
well as state of the art convolution networks. The great thing about fully
connected networks is that they assume nothing. That means A) you can
theoretically get better results and B) you don't have to spend time designing
an architecture.

So we already know that architecture isn't ultimately important. You can have
a giant, fully connected network, and it _will_ work, if you feed it enough
data and have the computational power necessary to train such a beast.

Convolutional layers are just simplifications which make training easier. They
are priors in the sense that we know a fully connected layer in image
applications would just devolve into a convolutional layer anyway, so we might
as well start with a convolution layer. That "design" is the prior. But it's
not mandatory; the network would still function without that "prior".

So ... I'm not sure how the authors are taking their research and using it to
come to the conclusion that their results are because of some magical property
imbued into the network by the "priors" of the architecture.

They apparently tried other architectures and got poor results, and so they
use that to claim that architecture is the only reason their technique works.

That's like if you started with ResNet for a classification problem, tried
other architectures, saw that they performed worse, and then published a paper
saying that Residual Networks somehow embody the fundamental forces of natural
images in their architecture, and that's why they work. When the truth is that
ResNets aren't special, they are just easier to train.

Another example from the annals of machine learning history: time and time
again when there is a breakthrough in architectures, it's usually followed a
few years later by a simplification of the architecture. For example we
started with networks like VGG which are these big, hand crafted
architectures. Slowly over time architectures have become _less_ exotic,
instead opting to simply define a basic building block repeated N times.

The reason for this is that in the intervening years we gather more training
data, develop better training techniques, and get more computational power. So
we can instead use a more homogeneous architecture which has _fewer_
assumptions (fewer priors), and at the end of the day we get _better_ results.

I'll repeat that. We put _fewer_ priors into our networks and we get better
results.

So on the one hand we have _all_ of machine learning history telling us that
priors in architectures are _bad_. On the other hand we have this paper which
makes a really weird logical leap: "we tried a few architectures, they
were worse, so architecture is _key_ to machine learning and it's important
because we need good priors built into the architecture."

Anyone remember hand crafted feature vectors? I do. Those were priors. Guess
what happened when we got rid of them and used generic networks feeding
directly from the raw data? Oh right, all of modern machine learning...

~~~
warsheep
> Convolutional layers are just simplifications which make training easier.
> They are priors in the sense that we know a fully connected layer in image
> applications would just devolve into a convolutional layer anyway, so we
> might as well start with a convolution layer. That "design" is the prior.
> But it's not mandatory; the network would still function without that
> "prior".

As far as I know this is incorrect. Can you point to a paper that shows this?
If by "easier to train" you mean that the models do not overfit training data,
then that's the whole point of using correct priors / hypothesis classes.

I'm not sure what bugs you in this paper, but the point is that they decouple
the prior architecture from the training/optimization mechanism, and that
seems interesting.

------
chestervonwinch
It's unbelievable that this works so well.

------
freeflight
That zebra one looks like a "before and after" image with really good MSAA. I
realize the techniques are quite different, but it's still interesting how the
result seems so similar on a superficial level.

------
zerostar07
This is interesting news to neuroscientists as well. Even without plasticity
cortical networks might be performing useful functions. Plasticity may then
perform some different function.

------
ReverseCold
I tried to run their ipynb but it wants something called skip. I don't think
it's the "skip" on PyPI (that doesn't work anyway), so what could they be
using?

~~~
sanxiyn
skip.py is in the repository, under the models directory.

------
albertTJames
OK... this is revolutionary. Using the architecture as a way to capture an
image prior hints at how network structure and captured invariance are
related. By analogy, it leads to thinking of brain areas as both hard-coded
prior knowledge, through their natural arrangement, and a plastic learning
structure. Turning the problem on its head shines a new light on how we could
conceive of network architectures.

------
CamperBob2
Is there a straightforward explanation of how to run this anywhere, for those
not steeped in the arcane ways of Jupyter?

------
justonepost
Cool, but I think a learned method is more interesting, especially for how it
can be used for image compression.

------
Quanttek
Can someone put their results and explanation in layman's terms, please?

------
therealmarv
How do I get the code running easily? I'm familiar with Python but not with
ipynb at all. Maybe somebody can give me some easy-to-follow instructions,
thx.

------
trhway
I find it impressive how it placed a lamp over the library window. After that
I was expecting a vase with flowers to be placed on the table in the next
(palace) shot :)

~~~
nametube
Could you highlight which part that is? I think it can only fill in patterns
that already exist in the image. I see a big artifact over the window but no
lamp.

~~~
trhway
Well, probably it's that deep (I hope) neural network inside my skull that
interpreted the artifact as a lamp.

------
foota
I wonder if you could do something similar with audio.

~~~
JoeCoder_
And turn crappy music into good music.

------
foota
This is amazing.

------
artur_makly
Can someone please summarize this for us Luddites? Also, wasn't there an HN
summary project out there?

------
m3kw9
Am I missing something here? How does this apply to DNNs that try to
recognize images? Does it mean we can train a net with just one image now?

------
alexasmyths
Truly congratulations! Amazing.

------
0x53-61-6C-74
CSI:Miami...I'm sorry, I take it all back.

------
killjoywashere
The intel and espionage communities are going to be all over this. This makes
the Soviet photo retouching look like child's play.

Let's say you want to start a war, and need some evidence of chemical weapons.
Now, you can drop in some images of chemical weapons and claim a GAN found
them. Sample press releases:

"We believe this photo was retouched to hide the chemical weapons. Using a
GAN, we recovered clear photographic proof of the chemical weapons."

"We believe they are using this tin-roofed building to hide the chemical
weapons. Using a GAN on our own satellite images, we recovered clear
photographic proof of the chemical weapons."

~~~
ac2u
Can't see "we used a GAN" as an explanation that would fly. To counter it you
can simply say "well, the GAN could be wrong."

The technique outlined above is useful because we as humans can eyeball the
results and say "the improvement on that blurry zebra photo is great!", but we
know that it's an improvement based on a networks hueristics that's good
enough for us, not an exact replica of the information lost through noise and
compression.

