Inside Google's Deep Dreams project (medium.com/backchannel)
121 points by steven on Dec 12, 2015 | 16 comments



Really enjoyed the humble approach to describing the engineering mindset.

Also a great look into how awesome Google management is for letting their software engineers explore spontaneous interests.


Also a great look into how awesome Google management is for letting their software engineers explore spontaneous interests+.

+ at 2am at home :)


My company lets me spend 66% of my time on whatever I want.


Try self employment, you will approach 100%.


Not if you want to stay afloat. 100% of your time will go to serving your customers, saving for a downturn, doing admin, customer acquisition, staying in touch with old customers who are not currently in the market but who may ping you for the occasional question and so on. The other 100% of your time you can spend on whatever you want.


Question for the HN audience: what exactly would Deep Dream do if it were trained on 'everything'?

When it came out, people asked why its output was full of puppyslugs, and the answer was "Because it was trained primarily on a corpus of dog pictures."

Well, suppose that it was trained on a corpus of pictures of 'everything'. What would its output look like then? Would it look more or less like the input image?


Translating this: if it were trained on a uniform corpus of pictures sampled across the space of "likely things people take pictures of" and then used on photos from that same distribution, what would you get?

The answer is: The same kind of things, but with a more balanced distribution across puppies and "other things."

Possibly the easiest way to think about how a deepdreamed image might look is to take one isolated patch of the image, examine its microstructure, and ask what that patch most resembles. Is a close-up, 50% crop of a button on a shirt like... a button? An eye? A donut? And then magnify the high-level concept evoked by that microstructure.

The key to getting the psychedelia of the deepdreamed images is to go up the DNN to the right level. If you go all the way to the top of an Inception-style classifier, you're left with a single output (or a small set of them): "surfer," "dog," "cat." Letting the network propagate back down from that just gives you the network's canonical input for that class -- see, for example, what Deepvis outputs: http://yosinski.com/deepvis
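
To make that concrete, here's a rough sketch of that top-of-the-network case -- purely illustrative, in PyTorch/torchvision (my choice of tools; the class index is an arbitrary assumption), not the article's actual code:

    # Sketch: ascend a single class logit, starting from noise, to see the
    # network's "canonical input" for that class (Deepvis-style).
    import torch
    from torchvision import models

    model = models.googlenet(weights="DEFAULT").eval()     # an Inception-style classifier
    img = torch.rand(1, 3, 224, 224, requires_grad=True)   # start from random noise
    target_class = 207                                      # arbitrary ImageNet class index

    opt = torch.optim.Adam([img], lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        logits = model(img)
        (-logits[0, target_class]).backward()   # maximize that one output
        opt.step()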

If you stop at the first level, you basically get a lot of fancy edge detectors and sharp/unsharp masks. (The first layer is a stack of convolutions, so think about the effects you get by applying a convolution filter in Photoshop.) It can look nice, but it's not exciting. See the examples in the Wikipedia article on kernels: https://en.wikipedia.org/wiki/Kernel_(image_processing)
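
For instance (a toy sketch with numpy/scipy, tools of my choosing), those first-layer-ish effects are just small convolution kernels like the ones in that article:

    # Sketch: classic 3x3 kernels applied to a grayscale image.
    import numpy as np
    from scipy.ndimage import convolve

    edge = np.array([[-1, -1, -1],
                     [-1,  8, -1],
                     [-1, -1, -1]])        # edge detector
    sharpen = np.array([[ 0, -1,  0],
                        [-1,  5, -1],
                        [ 0, -1,  0]])     # sharpen

    gray = np.random.rand(256, 256)        # stand-in for a grayscale photo
    edges = convolve(gray, edge)
    sharpened = convolve(gray, sharpen)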

But if you stop in the middle, you get things that are mid-way between "pure concept" and "concrete feature", and that's where the cool happens. It's still localized within the image, but it's able to propagate up to entirely different types of images within that local region. So eyes in hands that are still kinda shaped like hands.
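
In code, "stopping in the middle" just means picking a middle layer and doing gradient ascent on how strongly it fires. A rough sketch (again PyTorch/torchvision; the layer name is my assumption about torchvision's GoogLeNet, not anything from the article):

    # Sketch: maximize a middle layer's activations over a real photo.
    import torch
    from torchvision import models

    model = models.googlenet(weights="DEFAULT").eval()

    feats = {}
    model.inception4c.register_forward_hook(lambda m, i, o: feats.update(x=o))

    img = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in for a photo tensor
    opt = torch.optim.Adam([img], lr=0.02)
    for _ in range(100):
        opt.zero_grad()
        model(img)
        (-feats["x"].norm()).backward()   # make the chosen layer fire harder
        opt.step()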

DD trained on a uniform set of images would have much more diversity in what it "imagines" in images. Christmas trees in forests; UFOs in pancakes; and lots and lots of human faces showing up everywhere (because real photos are heavy in human faces). Faces in faces with eyes in more faces in more eyes in more faces. ahh!


Great explanation, but wouldn't it be hard (essentially impossible) to train on "likely things people take pictures of"? The only truly uniform alternative would be "pure static".

Though if you tried, my guess would be that it would associate various color-change gradients and frequency of hard edges etc with real photos.

If so, if you fed it a real photo and said "interpret", it would just give you the same photo back. If you fed it static, then you'd probably get some kind of abstract art back.

The question becomes interesting if you start with "things a baby would see" (probably mostly happy faces and eyes (wonderfully geometric) up close), which would train for certain aspects, and then moved on from there.


I suppose you could have people just walking around with GoPros on, capturing whatever falls into their field of 'vision'. That way, the corpus isn't the subjects that people find interesting, but something more like everyday things and background.


I'm not a CNN guy, but your question doesn't really make sense given my understanding of neural networks. In short, the output step of an NN has to include a classifier. That is, the NN generates a vector of probabilities; the higher a given probability, the more likely the input is to match the category associated with that value. For example, an NN trained to distinguish "light" from "dark" may output the vector [0.3, 0.7].

To train a CNN on "everything" means you have to have an arbitrarily large output layer. Can you list every possible category of everything? Probably not. Even if you could take a swing at it, it's hard to get enough data (and time!) to train the net on each category. Small datasets result in overfitting of the training data and poor overall performance. How big a dataset would you need to properly train a sufficiently sized CNN with an arbitrarily sized classifier?
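
To illustrate (a toy sketch with made-up sizes): the classifier head's width literally is the list of categories, so "everything" means enumerating them all up front:

    # Sketch: the final layer has one output per category.
    import torch.nn as nn

    num_classes = 1000                # ImageNet-sized; "everything" would need vastly more
    head = nn.Sequential(
        nn.Linear(2048, num_classes), # 2048 is an assumed feature size
        nn.Softmax(dim=1),            # produces the [0.3, 0.7]-style probabilities
    )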


I think I understand what you're getting at - that network architectures have a fixed-size output - but you're incorrect in saying that the final layer must be a classifier. In general, you can optimize any differentiable function with gradient descent, and the output does not have to be a probability distribution.
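
For example (a toy sketch, nothing to do with DeepDream specifically), a network trained with a plain regression loss has no classifier and no probabilities anywhere:

    # Sketch: gradient descent on a differentiable loss with a non-probabilistic output.
    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    x, y = torch.randn(64, 10), torch.randn(64, 1)   # made-up data

    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)     # any differentiable objective works
        loss.backward()
        opt.step()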

The original poster's question does make sense; he's asking what would happen if you trained the network on something like the ILSVRC dataset.


The question makes perfect sense.

Word2Vec does a pretty good job of encoding the entire English language.

Autoencoders are a thing, and back in 2012 Google trained one against YouTube and it learnt to recognise cats. The state of the art has progressed a lot since then. See https://googleblog.blogspot.com.au/2012/06/using-large-scale...
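
The relevant property is that an autoencoder needs no labels at all -- it just reconstructs its input -- so there's no category list to enumerate. A toy sketch of the idea (not Google's 2012 architecture):

    # Sketch: unsupervised training by reconstruction; no classifier, no labels.
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU())
    decoder = nn.Sequential(nn.Linear(256, 64 * 64), nn.Sigmoid())
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

    frames = torch.rand(32, 1, 64, 64)   # stand-in for unlabeled video frames

    for _ in range(100):
        opt.zero_grad()
        recon = decoder(encoder(frames)).view_as(frames)
        nn.functional.mse_loss(recon, frames).backward()
        opt.step()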


You'd get the input image back. An NN requires negatives. If you train it to accept "everything" as a positive, then you just get the original back. You need to define your negatives more precisely to get anything interesting back.


My guess is that the network wouldn't recognize as much stuff as effectively, and the output might not look as identifiable. Like an ADHD brain with too much stuff going on.

Might look really cool actually; I'd try it.


You'd still get puppyslug-style things, but they would be combinations of everything the network knows about. You might get car-eyes in one place and tree-rockets in another.

It tries to converge onto something it recognizes.


tl;dr: it perhaps helps us understand how we recognize the world around us, especially when we're dreaming or on psychedelics.



