Turning two-bit doodles into fine artworks with deep neural networks (github.com)
327 points by coolvoltage on Mar 10, 2016 | hide | past | web | favorite | 56 comments

These are really cool. Though if you were, like me, puzzled how such complex and coherent features could come from those simple drawings / masks, have a look at the original paintings that were used as sources and compare them with the generated images:

Original #1:


Generated #1:


Original #2:


Generated #2:


So those newly generated images are structurally very similar to the original sources. The neural net seems to be good at "reshuffling" the sources. That's probably how things like reflections on the water got there, even though they're not present in the doodles.

Thanks for clarifying, I'll update the README. The research paper does a better job of explaining this with its figures!

The algorithm can only reuse combinations of patterns that it knows about; it can extrapolate, but the result often ends up looking like a blend. However, you can give it multiple images and it'll borrow the best features from each, for example drawing from all of Monet's work. (This needs more optimization to work well though; it takes a lot of time and memory.)

As for the images, as long as the type of scene is roughly the same it'll work fine. The fact that it can copy things "semantically" by understanding the content of the image makes it work much more reliably, at the cost of extra annotations from somewhere. The original Deep Style Network is very fragile to input conditions, and the composition needs to match very well for it to work (or you pick an abstract style). That was part of the motivation for researching this over the past months.

So if I understood correctly, this GIF shows you, a human being, exploring the possibilities and limitations of your method, hand-tweaking it for one particular image?


That is, the final image, the one that looks the best, is the result of you tweaking the doodles to get something that the neural net can then fill in convincingly?

Or are these different runs of the same method on the same inputs, with some natural variability, from which you selected the one that looked best?

Or are these progression steps in one run of the automated algorithm?

The language in the blog post is kind of ambiguous; I'm not sure which steps were done by the algorithm and which by a human being.

Exactly, the doodling is done by humans and the machine paints the HD images based on Renoir's original. I've edited the blog post to clarify.

That part was clear :)

What still isn't clear to me is how exactly that "workflow" demo (and consequently the "money-shot" final generated images) happened.

There is a progression of generated images with increasing quality. Who did which steps in those iterations?

The blog post uses ambiguous language: "the N-th image tries / removes / fixes", etc.

It's not clear though if it was:

1) algorithm steps (keep computing till the generated image looks good), or

2) a human being tweaking inputs to a fixed algorithm (keep painting new input/output doodles till the generated image looks good), or

3) a human being tweaking the algorithm itself (change the code till the generated image looks good).

The algorithm does the same thing every time (it's triggered on request); only the input is changed by the human modifying the doodle, as shown in the video.

The output gets better because, through iteration, the glitches are removed incrementally, and it converges on a final painting that looks good!

Aha. I was wondering about that.

I've done some experimentation with neural network-based style transfer (this one: https://github.com/jcjohnson/neural-style), and the results I got pointed strongly to the same effect: it works well if the two images (the source for style and the source for content) are very similar in framing, composition and subject, and very badly if they're wildly different.

Having said that, this algorithm seems to be MUCH better than the one I tried at transferring style. I'd have expected those paintings to transfer to the doodles much worse than they did.

But don't expect to take a portrait doodle and a landscape source and have it come out well :)

The "semantic" tag is misleading, because human perception parses lighting and texture cues in 2D images as 3D hinting.

Representational art is all about modelling, highlighting and/or transforming the hinting, depending on the level of abstraction. E.g. if you look at portraits, the pen/brush strokes usually emphasise 3D structures.

This code does a little of that, but the model is extremely crude compared to the models the human brain uses.

For genuine semantic perception you'd have to duplicate - and maybe improve - the human model. I doubt you can do that in 2D, because the human model is trained by years of genuine 3D perception.

That's not to sound negative - I think this is very impressive visually. But it could be taken further.

> This code does a little of that.

Actually, the code does none of that ;-) All of the semantics are provided by the users: either as manual annotations or by plugging in an existing architecture for semantic segmentation / pixel labeling. It's designed to be independent of the source of the semantic maps, so we can continue to work on both problems separately.

It works for basic color segmentation already, and here are some of the papers we're integrating currently: http://gitxiv.com/search/?q=segmentation

(Author here.)

For details, the research paper is linked on the GitHub page: http://arxiv.org/abs/1603.01768

For a video and higher-level overview see my article from yesterday: http://nucl.ai/blog/neural-doodles/

Questions welcome!

You should make an app for that (seriously)!

I can't believe he posted it on GitHub before doing this - there's so much potential for this to go viral once it's packaged with a doodling app.

EDIT: Actually reading more closely I guess 10 minutes on a machine with a decent GPU is a lot of server load :|.

The research is based on work I did writing and improving @DeepForger (http://twitter.com/deepforger), an online service for "basic" style transfer. The GitHub is a standalone version for learning and education, which doesn't do HD rendering as well yet and uses a bit more memory. The positive side, however, is that opening up the source code makes these ideas progress faster!

We'll try to integrate the idea of semantic style transfer into @DeepForger in the future, but this will require quite a bit of work to get it to reliably understand portraits or landscapes without anyone's intervention. The fact that it requires semantic maps for all images makes it less straightforward to release as a service.

One question, is the semantic map created on the fly at the same time as the final image is composed, or are the maps pre-computed?

The semantic map remains static during the optimization, so it can be provided as a pre-computation (e.g. pixel labeling, semantic segmentation, etc.) or done by hand. The ones in the repository were done manually, but we're now experimenting with other algorithms. Anything that returns a bitfield or masks can be used!
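For anyone curious what "masks" means in practice, here's a minimal sketch of turning a color-coded annotation image into one binary mask per semantic class. This is not the repository's actual code, and the color-to-class mapping is an assumption for illustration:

```python
import numpy as np

def annotation_to_masks(annotation, colors):
    """annotation: (H, W, 3) uint8 array; colors: list of RGB tuples.
    Returns one boolean (H, W) mask per color."""
    return [np.all(annotation == np.array(c), axis=-1) for c in colors]

# Tiny 2x2 "annotation": red stands for sky, green for trees
# (hypothetical classes, not the repository's conventions).
ann = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 255, 0], [255, 0, 0]]], dtype=np.uint8)
sky, trees = annotation_to_masks(ann, [(255, 0, 0), (0, 255, 0)])
print(sky.sum(), trees.sum())  # 3 sky pixels, 1 tree pixel
```

Any source of masks, whether hand-painted or produced by a segmentation algorithm, can be reduced to this kind of per-class bitfield.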

Or an online generator!

Why CUDA instead of OpenCL? Asking because I'm about to start a GPU compute driven project.

Almost nobody in deep learning uses OpenCL. All the DL frameworks primarily focus on CUDA and that's where you get the best performance. OpenCL is off the beaten path and you pay for it in every way: support, performance, reliability.

AMD is going to support CUDA somehow too; I think that's a sign they've admitted defeat on OpenCL for this.

I'd love/dread to see this kind of work (neural nets run in reverse mode) applied to voices and accents.

You could credibly put any words in the mouth of anyone.

This basically already exists. Siri and similar TTS voices today are generated from a lot of recorded speech from one person. There's a lot to get right for it to sound natural, beyond just hitting the phonemes: you have to deal with the transitions between phonemes, declination, etc.

I've even seen a demo converting one person's voice to another's (without going through text) while trying to preserve the delivery (pauses, stresses, etc.). It was kind of cool, but you wouldn't genuinely mistake it for the other person.

Do you know of any projects on GitHub that do this?

like one person impersonating another

On the positive side, you might end up with a Culture-type situation where it's impossible to blackmail anyone, because the authenticity of any evidence can no longer be verified.

Unless an all-seeing, trusted Mind steps in and vouches for it.

The NSA as notary?

It already exists for faces: https://www.youtube.com/watch?v=ladqJQLR2bA

I'd like to see it done for linguistics... e.g. Subtly change the wording of your own speech so that it matches the heuristics of another's.

That's pretty cool. It might speed up asset creation for games by orders of magnitude: train with concept art, then generate the variations via these networks. That adds consistency to the output and helps loosen the asset bottleneck / content treadmill, especially for smaller studios and individuals.

No, that's a gross misunderstanding of what concept art is. A concept art piece is about the idea, not the style; if you take a famous protagonist, say Batman, you can have him drawn in a medieval, realistic, or sci-fi version, in a stylized, realistic, or cartoon way; in each case you will recognize him, because the idea and the shape language have nothing to do with the style of the drawing.

Even for illustrative work, where it can give you a good base, it still sucks, because for actual painters this step (thumbnailing) is actually the quickest; most of the time-consuming painting process is "finishing", or "detailing", the rough.

However, where it's great is in giving inexperienced people the ability to paint well. The hard part of painting is getting the lighting, color scheme, and perspective right, but the finishing process is quite mechanical. So it could ease the outsourcing of some art asset creation.

Yeah, you're right about concept art; I figured that unfortunately, more often than not, the lines between concept, mood, detailing, etc. tend to be blurred depending on who looks at it. Also, this is not going to replace individual character design or other specific assets, but it might remove scalability issues with projects that require a huge number of different backgrounds and texture variations.

Of course the original craft needed to produce high-quality output is still required - and just one image is not going to be sufficient anyway.

As you mentioned, I can also see it giving less experienced folks the ability to tinker with scene setup, dimensions, ratios, etc. and get faster "final" results, although the "old school" approach of getting those right before actually detailing something is quite important.

> ...might remove scalability issues w/ project that require a huge number of different backgrounds, texture variations.

You can already do that in 3D if the overall concepts have been decided and some basic assets are there. Just switching the textures, lighting conditions, and some predefined building blocks does basically the same thing as this algorithm, except it's already part of the pipeline, and it gives you much more since it's 3D.

The other thing is that these algorithms look indecently good in a thumbnail, but very bad if you look at them too closely. These cool "concept art" pieces that look good are 90% of what people see, but they don't represent even 10% of the concept art work; most of it is boring details of joints, how the blade is strapped to the costume, how windows open - stuff as unsexy as can be (and that you can't do with such algorithms).

Beginner artist/programmer combo here. I'm sure this type of tool can be very powerful for prototyping the mood of 2D games, if not parts of 3D games; for all its limitations, this method does capture the general scene mood rather well. It's certainly far better than those shoddy programmer-art placeholders. The process could let game devs be many times faster and better at finding the right mood for a game: just swapping out the images the program learns from to find a new mood theme has a level of intuitive logic to it that I'm sure could make it very accessible to non-technical people. A well-made tool of this type would be very good in the early prototyping pipeline for 2D games and more. It wouldn't make even really bad artists jobless, but for specific jobs it would be a powerful tool.

Not directly related, but here are some results from a similar algorithm[1] combining the Starcraft map Python with various Google Maps screenshots: http://i.imgur.com/EgFpqRA.jpg

[1] CNNMRF, Neural Style plus Markov Random Fields: https://github.com/chuanli11/CNNMRF

I played with something similar for a while, https://github.com/jcjohnson/neural-style

What I've found so far is that it takes a while to get good results, i.e. something that looks like its own creation instead of an overlap of pictures, and there's no exact way to do it. If you modify existing artwork it works well enough, since the source is already somewhat divorced from reality, but photos are difficult. When it works, it's amazing though.

From that perspective, this research is two steps further along than Neural Style; I wrote about it yesterday here: http://nucl.ai/blog/neural-doodles/

First, the paper I call "Neural Patches" (Li, January 2016) makes it possible to apply context-sensitive style, so you have more control over how things map from one image to another. Second, we added extra annotations (which you can specify by hand or obtain from a segmentation algorithm) that help you control exactly how you want the styles to map. We call that "semantic style transfer" (Champandard, March 2016).

You're right about it being hard otherwise; it was for many months, and that's what pushed this particular line of research! Try it and see ;-)

This reminds me of "If Edison didn't invent the light bulb, someone else would have: there were thousands of other engineers experimenting with the exact same thing, a natural next step after electricity came about" (-- paraphrased from Kevin Kelly)

One was called Swan; Edison tried to sue him for patent infringement, but Edison's lawyers warned him about prior art, so instead he negotiated a joint venture.

You may remember the "Mazda" brand of bulbs.


Oh, absolutely. It's an idea whose time had come ;-)

Looked at the images and honestly thought that someone had posted an April fools joke a few weeks early. Amazing.

Very interesting! The thing that amazes me most about these neural network projects is how small the source usually is compared to what they're doing. Your doodle.py is only 453 lines.

I imagine that between scikit-image[0], Theano[1] and Lasagne[2], the total size is a little north of 453 lines.

[0] http://scikit-image.org/

[1] https://github.com/Theano/Theano

[2] https://github.com/Lasagne/Lasagne.git

What data has been used to train the neural network?

It's a network pre-trained on an image classification dataset from 2014 called ImageNet. The network is called VGG; the paper is here: http://arxiv.org/abs/1409.1556

There's no additional training beyond that. The neural network is used to extract patterns (grain/texture/style), and a separate optimization tries to reproduce them as appropriate.

Interesting. If A is the input image, and B is the style image, then from which of those two images is the NN extracting patterns? And how is the other image used to get the desired effect?

Just trying to get a birds-eye view of the algorithm :)

Both images have their patterns extracted by the NN, and the optimization then tries to match the best patches from one image with the other, performing gradient descent to adjust the pixel values starting from a random image.
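That two-stage idea (nearest-patch matching, then gradient descent toward the matched targets) can be sketched in a toy form, with random arrays standing in for CNN activations. This is an illustration of the principle only, not the actual doodle.py implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def patches(a, k=3):
    """Extract all overlapping k x k patches of a 2D array, flattened."""
    h, w = a.shape
    return np.array([a[i:i+k, j:j+k].ravel()
                     for i in range(h - k + 1) for j in range(w - k + 1)])

# Stand-ins for the style and content feature maps.
style = rng.random((8, 8))
content = rng.random((8, 8))
sp, cp = patches(style), patches(content)

# Match each content patch to its nearest style patch (Euclidean distance).
idx = np.argmin(((cp[:, None, :] - sp[None, :, :]) ** 2).sum(-1), axis=1)
target = sp[idx]

# Gradient descent from random values toward the matched targets,
# minimizing ||x - target||^2.
x = rng.random(target.shape)
for _ in range(200):
    x -= 0.1 * 2 * (x - target)

print(np.abs(x - target).max() < 1e-3)  # True: converged to the match
```

In the real algorithm the "patches" come from deep CNN activations and the gradient is backpropagated all the way to image pixels, but the matching-then-descent structure is the same.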

In coming years this will create a very strange reality combined with improving VR tech...

Have you run children's paintings through this yet?

Another suggestion: try running a copy of Tolkien's Middle-earth map through it to transfer the style to a more detailed USGS-style map [1].

[1] e.g. https://www.google.com/search?q=usgs+map&safe=active&client=...

No, do you have any good ones? As long as entire sections are colored (not just lines), and those colors match the annotations of another image, it should work fine!

Exciting! Where can we find image databases for this?

You can use any image as a source, but currently you have to create the annotations yourself. Simple segmentation libraries (or clustering) can do a good job for certain images, or look at better solutions for semantic segmentation: http://gitxiv.com/search/?q=segmentation
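As a rough illustration of the "simple segmentation / clustering" route, here's a bare-bones k-means over pixel colors. It's a hypothetical sketch, nowhere near real semantic segmentation, but it shows how cluster labels could become annotation masks:

```python
import numpy as np

def kmeans_segment(img, k=2, iters=10):
    """img: (H, W, 3) float array. Returns (H, W) integer cluster labels."""
    pixels = img.reshape(-1, 3)
    # Initialize centers with k distinct pixel colors (simple heuristic).
    centers = np.unique(pixels, axis=0)[:k].astype(float).copy()
    for _ in range(iters):
        # Assign each pixel to its nearest center, then recompute centers.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = pixels[labels == c].mean(0)
    return labels.reshape(img.shape[:2])

# Two clearly separated colors should split into two regions.
img = np.zeros((4, 4, 3))
img[:2] = [1.0, 0.0, 0.0]   # top half "red"
img[2:] = [0.0, 0.0, 1.0]   # bottom half "blue"
labels = kmeans_segment(img)
print(len(np.unique(labels)))  # 2 clusters found
```

Each label value then yields a binary mask (`labels == c`), which is exactly the kind of input the semantic maps need.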

This project should be named Bob Ross.

Now please combine this with TiltBrush

Finally a way to draw things without learning how to draw. I'll be famous!

Would this work with photos?

You can specify two pairs of images (content + annotation) and it'll transfer the style from one to the other as consistently as possible. The downside is that you need to find an algorithm, neural network, or person to create the annotations. (We're working on training one for portraits only.)

These examples are in the paper above; direct link for convenience:



Yeah, here's a similar neural-network image-analogy project that used photos, with interesting results: https://github.com/awentzonline/image-analogies
