
PixelNN – Example-Based Image Synthesis - pentestercrab
http://www.cs.cmu.edu/~aayushb/pixelNN/
======
otp124
I used to roll my eyes at crime television shows whenever they said "Enhance"
for a low-quality image.

Now it seems the possibility of that becoming realistic is increasing at a
steady clip, based on this paper and other enhancement techniques I've seen
posted here.

~~~
ACow_Adonis
Except, and this is really the fundamental catch, it's not so much "enhance"
as it is "project a believable substitute/interpretation".

You fundamentally can't get back information that was destroyed or never
captured in the first place.

What you can do is fill in the gaps/information with plausible values.

I don't know whether this sounds like splitting hairs, but it's really
important that the general public not think we're extracting information in
these procedures; we're interpolating or projecting information that is not
there.

Very useful for artificially generating skins for each shoe on a shoe rack in
a computer game or simulation, potentially disastrous if the general public
starts to think it's applicable to security camera footage or admissible as
evidence...

~~~
sweezyjeezy
> Except, and this is really the fundamental catch, it's not so much "enhance"
> as it is "project a believable substitute/interpretation".

I would argue that this _is_ a form of enhancement though, and in some cases
will be enough to completely reconstruct the original image. For example, if I
give you a scanned PDF, and you know for a fact that it was size 12 black
Arial text on a white background, this can feasibly let you reconstruct the
original image perfectly. The 'prior' that the model has encoded from the
large amount of other images increases the mutual information between the
grainy image and the high-res original. The catch is that uncertainty cannot be removed
entirely, and you need to know that the target image comes from roughly the
same distribution as the training set. But knowing this gives you information
that is not encoded in the pixels themselves, so you can't _necessarily_ argue
that some enhancement is impossible. For example with celebrity images, if the
model is able to figure out who is in the picture, this massively decreases
the set of plausible outputs.
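
One way to make this precise (my notation, not the paper's): let X be the
original image, Y the grainy observation, and K the side knowledge ("size 12
black Arial on white"). Conditioning on side information can only lower the
expected remaining uncertainty:

    H(X \mid Y, K) \le H(X \mid Y)

and in the scanned-PDF example H(X \mid Y, K) is roughly zero: once the font
and colours are fixed, the only unknowns left are the glyph identities, and Y
pins those down.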

~~~
trevyn
> The catch is that you need to know that the target image comes from roughly
> the same distribution as the training set.

When humans think about "enhance", they imagine extracting subtle details that
were not obvious from the original, which implies that they know very little
about what distribution the original image comes from. If they did, they
wouldn't have a need for "enhance" 99% of the time -- the remaining 1% is for
artistic purposes, which this is indeed suited for.

It'll be interesting to see how society copes with the removal of the
"photographs = evidence" prior.

> when enhancing celebrity images, if the model is able to figure out who is
> in the picture this massively decreases the set of plausible outputs.

This is an excellent insight.

~~~
ZenPsycho
Do you think knowing which state the license plate is from is enough prior
knowledge?

------
nl
To paraphrase Google Brain's Vincent Vanhoucke, this appears to be another
example where using context prediction from neighboring values outperforms an
autoencoder approach.

If 2017 was the year of GANs, 2018 will be the year of context prediction.

------
maho
I hope some day this will generalize to video. I don't care about the exact
shape of background trees in an action movie - with this approach, they could
be compressed to just a few bytes, regardless of resolution.

~~~
stepik777
Except that it can put trees where there were no trees, only something
similar to them. Or it can put the face of a more popular actor in place of
the actual, less popular one, because the popular actor appeared more often
in the training dataset. No, thanks.

~~~
TuringTest
Well, isn't that basically how Hollywood makes blockbusters?

------
laythea
I wonder if this could be applied to "incomplete" 3D models and the work
shifted to the GPU!?

------
joosters
I don't understand how the edges-to-faces can possibly work. The inputs seem
to be black & white, and yet the output pictures have light skin tones.

How can their algorithm work out the skin tone from a colourless image?
Perhaps their training data only had white people in it?

~~~
dahart
You never saw edges2cats I take it?
[https://affinelayer.com/pixsrv/](https://affinelayer.com/pixsrv/)

> I don't understand how the edges-to-faces can possibly work. The inputs seem
> to be black & white, and yet the output pictures have light skin tones.

The step you're missing is that an edge detector is run on the entire database
of training images to produce a database of edge images. The input edge image
is run against that corpus of edge images to find which ones match; the
corresponding original color images are then sampled to synthesize a new
color image.
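
A rough sketch of that kind of pipeline (my own pseudocode, not the paper's
implementation; the actual method composites per-pixel nearest neighbors
rather than blending whole images, and I assume all images are the same size):

    import numpy as np

    def edge_detect(img):
        # Stand-in edge detector; a real system would use e.g. HED or Canny.
        d0, d1 = np.gradient(img.mean(axis=2))
        return np.hypot(d0, d1)

    def synthesize(input_edges, training_images, k=5):
        # 1. Run the edge detector over the whole color training set.
        edge_db = [edge_detect(img) for img in training_images]
        # 2. Find the k training images whose edges best match the input.
        dists = [np.linalg.norm(input_edges - e) for e in edge_db]
        best = np.argsort(dists)[:k]
        # 3. Naive synthesis step: blend the corresponding color images.
        return np.mean([training_images[i] for i in best], axis=0)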

~~~
joosters
Thanks for that link, I'd never seen that before. In fact, the edges2shoes
sample on that page exactly summarises the issue I have: You start with what
effectively appears to be a rough line drawing sketch of a shoe, and the
algorithm 'fills in' a realistic shoe to fit the sketch. The sketch never had
any colour information and so the algorithm has to pick one for it. In their
example output, the algorithm has picked a black shoe, but it could just as
realistically have chosen a red one. The colouring all comes from their
training data (in their case, 50k shoe images from Zappos). So in short, the
algorithm _can't_ determine colour.

But shoes and cats are one thing; reconstructing people's faces is another. I
know the paper & the authors are demonstrating a technology here, rather than
directly saying "you can use this technology for purpose X", but the
discussion in these comments has jumped straight into enhancing images and
improving existing pictures/video. But there is a very big line between
'reconstituting' or 'reconstructing' an image and 'synthesising' or 'creating'
an image, and it appears many people are blurring the two together. Again, in
the authors' defence, they are clear that they talk about the 'synthesis' of
images, but the difference is critical.

~~~
dahart
> So in short, the algorithm _can't_ determine colour.

That's right. But with the caveat that a large training set can determine
plausible colors and rule out implausible ones. This is more true for faces
than for shoes! The point is that there is _some_ correlation between shape
and color in real life. The color comes from the context in the training set.
This is what @cbr meant nearby re: "skin color is relatively predictable from
facial features (ex: nose width), it should be able to do reasonably well."

There are CNNs trained to color images, and they do pretty well from training
context:
[http://richzhang.github.io/colorization/](http://richzhang.github.io/colorization/)

> there is a very big line between 'reconstituting' or 'reconstructing' an
> image and 'synthesising' or 'creating' an image, and it appears many people
> are blurring the two together.

Yep, exactly! Synthesis != enhance.

------
imron
Seems to have a thing for beards.

------
jokoon
I have a large collection of images, many of them accessible through Google
image search.

I wonder if there could be a way to "index" those images so I can find them
again without storing the whole image, using some type of clever image
histogram or hashing-kind function.

I wonder if that thing already exists. Since there are many images, and since
most images differ a lot in their data, could it be possible to create some
kind of function that describes an image in a way that entering such a
histogram leads back to the image it indexed (or the closest one)? I guess
I'm lacking the math, but it sounds like some "averaging" hashing function.

~~~
dannyw
That's perceptual hashing. Check out
[https://www.phash.org/](https://www.phash.org/)
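
For a feel of the idea, here is a minimal average-hash sketch (one of the
simplest perceptual hashes; pHash proper uses a DCT). It assumes Pillow is
installed, and the function names are mine:

    from PIL import Image

    def average_hash(path, hash_size=8):
        # Shrink and grayscale so the hash reflects coarse structure,
        # not fine detail.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        pixels = list(img.getdata())
        avg = sum(pixels) / len(pixels)
        # One bit per pixel: brighter than the mean or not.
        bits = "".join("1" if p > avg else "0" for p in pixels)
        return int(bits, 2)

    def hamming(a, b):
        # Perceptually similar images should give a small distance.
        return bin(a ^ b).count("1")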

~~~
mlevental
So will this do something like image recognition? I.e., does it work as well
as SURF/SIFT?

~~~
aub3bhat
Perceptual hashing is useful for copy detection. It's not robust to
changes/transformations, nor do the hashes encode any semantic information.

------
ChuckMcM
Is anyone in the FX business playing with this stuff? I'm thinking of
generating backdrops with groups of people/stuff/animals in them without a
lot of modelling input.

------
XorNot
So is there an analogous process that would apply to audio, I wonder?

~~~
jerrre
What would the lo-res starting point be? Low sample-rate, bit depth, ...?

~~~
eru
Look up compressed sensing for audio.

(Eg first result:
[http://sunbeam.ece.wisc.edu/csaudio/](http://sunbeam.ece.wisc.edu/csaudio/))

------
tinyrick2
This is amazing. I especially like how the result can somewhat be interpreted
by showing which image each part of the generated image is copied from (see
Figure 5).

------
deevolution
Apparently you grow a beard after using their NN model?

~~~
XnoiVeX
I noticed that too. I hope it is just a documentation error.

------
throwaway00100
No code available.

~~~
jszymborski
Which is sadly par for the course in this field, or at least in my experience.
You can always email the group...

~~~
sosuke
I spent too long trying to get RAISR to work when that paper came out. You can
try it out from some Github repos but no one has been able to recreate the
results Google presented. I would be hard-pressed to say my hi-res photos
looked any better than the originals when scaled up on my iPhone screen.

I do wish they would release the code AND any related training images they
used to get those results.

------
Wildgoose
Very clever. I wonder if something like this could be used for other forms of
sensor data as well?

~~~
dispo001
Ah like, what do I look like I want to eat?

------
verytrivial
A pair of the inputs in the edges-to-faces examples is swapped. I have nagged
an author.

~~~
verytrivial
... and I followed up with an annotated screenshot. I tried, I really did!

------
the8472
All those examples are fairly low-resolution. Does this approach scale or can
it be applied in some tiled fashion? Or would the artifacts get worse for
larger images?

------
tke248
Does anything like this exist on the web? I'd like to send a blurry license
plate picture through it and see what it comes up with.

------
kensai
OMG, now the "enhance" they say in investigative TV series and movies will
actually be reality! :p

------
nathan_f77
This is cool, but in the comparison with Pix-to-Pix, it seems like Pix-to-Pix
is the clear winner.

------
smrtinsert
"Enhance" is real. When will this stuff trickle into lower level law
enforcement?

~~~
asfdsfggtfd
Hopefully never. This does not enhance the image - it makes up a plausible
imaginary image.

EDIT: Furthermore, the range of plausible imaginary images that match a given
input is large (infinite?).

~~~
smrtinsert
Why not? A recreation that leads to an identification should be enough for a
warrant that could be used for a continued investigation.

~~~
asfdsfggtfd
We could also just pick a random person off the street and punish them - it
would be similarly accurate and fair (actually probably fairer - if this is
trained on pictures with a certain bias it will return pictures with that
bias).

This paper does not demonstrate an enhancement technique but a phenomenon
that those using inverse methods call "overfitting".

------
ScoutOrgo
Can we use this to identify the leprechaun and find where da gold at?

------
yazanator
Is there a GitHub repository link?

------
avian
I found the title somewhat misleading. I was expecting some clever application
of nearest-neighbor interpolation. But this seems to involve neural nets
and appears far from "simple" to me (I'm not in the image processing field
though).

~~~
jampekka
AFAIU it actually seems to be sort of "just" a clever application of
nearest-neighbor interpolation. The CNN is used to come up with the feature
space for the pixels (the weights of the CNN), and then each pixel is
"copy-pasted" from the training set based on the nearest match.

It seems that this could in theory be used with any feature descriptors, such
as local color histograms, although the results probably wouldn't be as good.

Edit: Being a nearest-neighbor method, this probably also carries the usual
computational complexity problems. If I understand it correctly, they ease
this by first finding just a subset of best-matching full images using the
CNN features, and then doing a local nearest-neighbor search only within
those images.
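
A sketch of that two-stage search as I understand it (my own stub code, not
the paper's; the CNN feature extraction is assumed to happen elsewhere):

    import numpy as np

    def two_stage_nn(query_feats, db_global, db_pixel_feats, db_pixels, m=10):
        # query_feats:   (H*W, d) per-pixel CNN features of the input image
        # db_global:     (N, d)  one global feature per training image
        # db_pixel_feats, db_pixels: per-image arrays of shape (H*W, d)
        #                and (H*W, 3) respectively
        # Stage 1: shortlist the m globally closest training images.
        q_global = query_feats.mean(axis=0)
        order = np.argsort(np.linalg.norm(db_global - q_global, axis=1))
        shortlist = order[:m]
        # Stage 2: for each query pixel, copy the color of its nearest
        # feature-space neighbor among the shortlisted images' pixels.
        cand_feats = np.concatenate([db_pixel_feats[i] for i in shortlist])
        cand_pixels = np.concatenate([db_pixels[i] for i in shortlist])
        out = np.empty((query_feats.shape[0], 3))
        for i, f in enumerate(query_feats):
            j = np.argmin(np.linalg.norm(cand_feats - f, axis=1))
            out[i] = cand_pixels[j]
        return out  # caller reshapes to (H, W, 3)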

~~~
tgb
I think the confusion is that the term "nearest neighbor approach" has a
different meaning in machine learning than in image interpolation.

[https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation](https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation)

versus

[https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)

~~~
gfredtech
+1, the exact conclusion I came to (k-nearest neighbors) when I saw this
post. Thanks for pointing this out.

------
imaginenore
It almost looks like they mixed training and testing data in some of the
examples. The bottom-left sample in the normals-to-faces is extremely
suspicious.

~~~
jj12345
I was looking at this as well, but I'm willing to suspend my disbelief because
the normal vaguely looks like it has a good deal of information (in a basic
fidelity sense).

~~~
jameshart
Seems astonishing that the normal information includes enough detail to tell
you which direction the eyes are pointing, though?

------
sgtAtom
Enhance.

------
debuggerpk
Hollywood got it right!!!

