However, real black and white photos represent the actual intensity of the light waves. Black and white film might be sensitive to light humans can't even see, or weight different "colors" very differently. Even two colors that appear the same to the human eye might have different intensities and frequencies, and so produce different black and white photos.
I don't know how much of an effect this would have. But real black and white photos definitely look different from greyscaled color photos, and that might lead the NN to guess the wrong color information.
sqrt(0.299r² + 0.587g² + 0.114b²)
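A sketch of that formula in code (assuming gamma-encoded RGB values; squaring approximately linearizes them, the Rec. 601 weights do the channel mix, and the square root re-encodes the result):

```python
import numpy as np

def to_grey(rgb):
    # Gamma-aware luma approximation: square to roughly linearize,
    # mix with Rec. 601 weights, square-root to re-encode.
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.sqrt(0.299 * r**2 + 0.587 * g**2 + 0.114 * b**2)

print(to_grey(np.array([200.0, 100.0, 50.0])))  # ~134.6
```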
There's a big difference, like you said, but it's not an impossible problem; we should be able to create a greyscaling formula for any photographic process with a known response curve. (And honestly, I have no reason to think this hasn't already been done in the creation of an Instagram filter at some point.)
No. Black and white negatives represent the intensity of the light (possibly with some lens filters in between), multiplied by the film's frequency-sensitivity distribution, passed through a nonlinear time/intensity response function, which is then composed with another nonlinear function of time, chemical strength, and exposure from the chemical development of the negative.
Then to get from negative to print, black and white photos undergo another pair of multiparameter nonlinear functions (representing the exposure and chemical development process), possibly including intentional manipulation by a human operator.
Depending on the film, filters, development process, and printing techniques used, this can be relatively close or quite far from a naively converted 3-channel color picture.
For details see, for example, Ansel Adams's books The Negative and The Print.
More subjectively, I think most digital pictures converted to B&W look kind of dull, whereas actual film looks very exciting to me. I haven't done any detailed research into this, but I'm not 100% convinced that collecting luminance through red, green, and blue filters can capture all the data that panchromatic B&W film captures.
(Even more of a tangent, one of the joys of B&W photography is that you can outright lie about colors and the photo still works. Try a red filter and watch the blue sky become black!)
Even if you're given a perfect probability distribution over the space of images, the solution wasn't obvious to me, mostly because we're used to thinking in terms of a "best estimate".
The first thing you think of is the least-squares estimate (the average), but that MMSE estimate exhibits exactly the problem shown.
So you might instead try a maximum likelihood estimate; but this too has problems: imagine every car is a slightly different shade of blue (none are quite the same; maybe the manufacturing is unreliable), except 1 in a million cars is red, and the red is very consistent. The ML estimate will pick the red car, which of course is unrealistic.
The optimal solution is simply drawing from the underlying distribution, instead of relying on a deterministic "best estimate": an outside observer won't be able to distinguish your generated samples from the true distribution. That's why the "Adversarial discriminator" should work.
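A toy sketch of that car example (all numbers made up):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
n = 1_000_000

# Toy distribution: every blue car is a slightly different shade
# (continuous noise, so no two are exactly alike), but 10 cars in a
# million are the exact same red (hue 0.0).
blue = rng.normal(0.60, 0.02, n - 10)
hues = np.concatenate([blue, np.zeros(10)])

counts = Counter(hues)
ml_estimate = max(counts, key=counts.get)   # the mode: red (0.0) wins
mmse_estimate = hues.mean()                 # the average: a middling blue
sample = rng.choice(hues)                   # drawing from the distribution
```

The mode picks the rare-but-consistent red, the average picks a blue no real car has, and only the drawn sample is indistinguishable from a real car color.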
I wonder if there exists a cost function that directly promotes sampling from the underlying distribution without needing the adversarial approach...
Somewhere, the neural net needs to decide "this car is going to be blue" and then be consistent with that. Adversarial nets allow that, by having random inputs. One of the inputs to the NN is a random number, and that random number might determine if the car is going to be blue or red this time.
The cool thing about this is that it allows you to generate multiple samples. You can generate 10 different images and select the best one. And the adversarial nets should learn to approximate the true distribution as closely as possible. And I don't think there is any other method that can do that.
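As a cartoon of that mechanism (the `colorize` function below is a made-up stand-in for a trained generator, not the article's network):

```python
import numpy as np

rng = np.random.default_rng(42)

def colorize(grey_patch, z):
    # The random input z decides which plausible mode to commit to
    # (mostly blue, occasionally red); the whole patch is then
    # colored consistently with that one choice.
    hue = 0.6 if z < 0.9 else 0.0
    return np.full_like(grey_patch, hue)

grey = np.ones((4, 4))
samples = [colorize(grey, rng.random()) for _ in range(10)]
# Each sample is internally consistent; the color varies across samples.
```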
Another idea would be to have a loss function that doesn't punish it for getting the wrong color, but only rewards it when it gets very close to the right color. That way the algorithm doesn't hedge with muddy brown when it isn't sure; it just commits to a best guess.
You do need a source of entropy to perform the sampling. The amount needed has a lower bound set by how precisely you want to sample from the continuous source, related to the entropy of the distribution.
That's a smart idea. Master painters used to do this. Do the broad outlines, choose the colour and style then unleash the underlings (painters in training) to do the rest.
There's even a language mapping involved: you want it to recognize image parts and associate them with words, which also isn't simple, because you'd need a lot of labels (candles, hands, hair, cars, trees, licence plates, etc.).
It is a harder problem than the one in the article.
State-of-the-art scene labeling is still not good enough (close to 80% accuracy), but I believe that's due to lack of data, since the algorithms combine neural networks with joint-learning approaches such as conditional random fields to extract the regions.
The wolves are rendered well, the flowers not. Wolves are camo. Flowers are the opposite. They want to stand out from the background. So the machine doesn't handle them well. The green stripe on the truck also fits this.
To take this idea forward, look at the image of the puppies against the grass. They are not camo. Their colour is the product of breeding, and therefore they do not render as well as the wolves. There might be something useful here for measuring whether or not an animal is being viewed in its natural environment.
There is a simpler explanation. Wolves only come in a few different colors. Flowers come in a variety of colors. Therefore, there are only a couple of correct answers for coloring a wolf, but a wide variety of completely incompatible answers for coloring a flower.
You are right that flowers come in a variety of colors because they want to stand out. But I don't think the neural net understands that. It just knows that a gray flower could be any color while a wolf is confidently going to be some kind of brown.
This article is about colorization, which means taking shades of gray and selecting a hue and saturation for them. The brightness is effectively fixed because, guess what, a black and white image can convey brightness already.
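In HSV terms, that framing looks like this (a minimal sketch; the predicted hue and saturation values are made up):

```python
import colorsys

grey_value = 0.55                      # brightness from the B&W image: fixed
predicted_h, predicted_s = 0.08, 0.4   # hypothetical model outputs

r, g, b = colorsys.hsv_to_rgb(predicted_h, predicted_s, grey_value)
# Converting back shows the brightness channel is untouched:
_, _, v = colorsys.rgb_to_hsv(r, g, b)
assert abs(v - grey_value) < 1e-9
```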
Obviously, black and white coloration on wolves falls outside of this because those are more or less already correct in the black and white image.
Now look at that picture you linked. What do you see? White: doesn't need much coloration. Black: uh, also doesn't need much coloration. Slightly brownish gray: like I said, wolves are all some kind of brown.
Show me a blue wolf, or a green wolf, then we'll have something interesting to talk about. But most wolves, like almost all mammals, have coloration pretty much limited to dull warm colors and tints and shades of those. Here's a picture for you:
What do you see?
No, we all literally did not.
Shameless plug: I built an online tool to colorize photos using WebGL. It's all manual but it's easy to get started and doesn't require any additional software. http://www.colorizephoto.com
It's the right mix of "paper" and "blog post." It's an experiment that sort of flops and there will be a variety of "yay for tech!" and "I can't wait to see a movie where various body parts of people remain black and white!" in the comments here regardless.
Presenting it as an experiment, clearly explaining what you tried, then detailing some future thoughts and saying "it kind of works" was refreshing and honest. Thank you.
For example, the stripe on the truck should be bright and saturated, but the actual color doesn't matter.
The HSV colour space could work if the difference between colours is calculated with some kind of circular arithmetic.
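One simple way to do that circular arithmetic, with hue in degrees:

```python
def hue_distance(h1, h2):
    # Smallest angular difference between two hues on the color wheel.
    d = abs(h1 - h2) % 360
    return min(d, 360 - d)

print(hue_distance(350, 10))   # 20, not 340: reds near 0/360 stay close
print(hue_distance(120, 240))  # 120
```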
Brilliant post too with an excellent write up. Can't believe people hadn't already been thinking about this.
My reactions were totally all over the map. I thought this would be such a fantastic way to compare my knowledge of the world with a punk algorithm's.
I didn't read the key/legend/explanation. As soon as I saw the first set of three images, I knew the middle lamp was the true color and the right lamp was generated, because lampshades overwhelmingly look like the middle picture in my mind (https://www.google.com/search?q=lampshade), not that weird blue color.
"Ha, stupid algorithm," I thought. "Who has a blue lampshade?" This algorithm doesn't even come close.
Then I kept scrolling with that assumption, and it got worse and worse: wow, that photographer's color is like blood orange; the algorithm doesn't even know it's a person! This is terrible. Where does that truck get that green trim? Nobody would choose that; it's straight out of left field.
Until I got to the field with wolves. Why are the flowers' colors missing from the middle picture? This doesn't look right at all.
Then I read the caption. The middle images are the generated ones; the right-hand images are reality.
For 5 of 6 images, I thought that the generated image was "obviously" an actual photo, with much more plausible colors than the right-hand real photos. Continuing to scroll, for the park bench too I think the middle image is much closer to how I imagine it.
So we are at the stage where an algorithm generates a much more plausible view of reality, with rare exceptions, than actual reality. This is pretty impressive.
I think it's a good illustration of the limits of statistical approaches, and why "It's harder than it looks" applies.
This is about as good as it's going to get without genuine object recognition, knowledge of real-world lighting and colour, and awareness of photographic styles.
It might be possible for a system to learn all of the above, but it's going to need a bigger and probably pre-partitioned training set, and a much more complex model.
You'd want the training to be done on the server, not locally. Setting up a full Torch/Caffe/Theano stack is not easy: there are so many libraries and moving pieces, all interacting with Nvidia's proprietary blobs and libraries and ever-changing GPUs, that you can follow all the directions exactly and it will either work or fail with an utterly inscrutable error. (For example, I'm running on an old Ubuntu because the newer Ubuntu is not officially supported, and my usual OS, Debian, just does not work no matter what I try.)
Libraries we've started from in my lab:
But just the fact it guesses the right colors at all is really cool. Previous automatic colorizations I've seen were very very crappy or required lots of human input. Or both.
And while these colorized photos do look a bit dull, I like them better than black and white. Something about black and white photography makes it look fake to my brain. It doesn't register the same way. Even really bad colorizations make images feel more real. I once saw very badly colorized video of WWI, and it was really fascinating. I actually felt like I was watching a real event that had actually happened. The same is true for these images.
While I'm sure this is interesting to many people, the results here are extremely poor compared to modern techniques.
Could you link a demonstration of the much superior results?
Take a look at Levin 2004 to start.
Edit: Thank you.
When I selected from the validation images, most had these blue splotches that were not shown on the web page. Obviously a model is not expected to work 100% of the time, but I think the link misrepresents the results by not even showing a single instance of this common failure.
edit: further scrolling shows it's less common than I thought.
vegetation texture -> greenish
sky texture -> blueish
everything else -> brownish
Am I right that the "Convolution" part only refers to the speed by which the models can be trained, and not to any other quality of these models?
Convolutional kernels let you use far fewer variables to perform the forward layer operation, and CNNs tie these trainable variables across spatial positions. Training is not only faster but also more robust, because you have fewer parameters to learn.
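Roughly, for mapping a 224x224x3 image to 64 feature channels at the same resolution (sizes picked for illustration, not taken from the article):

```python
h, w, c_in, c_out = 224, 224, 3, 64

# Fully connected: every output unit sees every input value.
dense_params = (h * w * c_in) * (h * w * c_out)

# Convolutional: one 3x3 kernel per input/output channel pair,
# shared across all spatial positions, plus one bias per output channel.
conv_params = 3 * 3 * c_in * c_out + c_out

print(dense_params)  # ~4.8e11, about half a trillion
print(conv_params)   # 1792
```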
From the published results, I think a 50:50 mix of human and ML in this case would likely yield the most naturalistic result in 90% of cases.
Optimizing over joint loss (maximizing the probability of the full image colorization) would work extremely well.
Tools like vowpal wabbit can easily be adapted to learn a chain classifier over colors and it should work insanely fast.
I'd imagine the output would also have to be ordered, but this is easy: just use the EM spectrum as an imposed ordering for the palette, and cap it at, say, three colors, with potential null hues for things that really only have one color (or fewer than the max).
the model is not available!
On a related "deep learning" topic:
We do a lot of work with 3D modeling tools and scanning large objects, including people, at our facility.
One thing we realized that should be possible with "deep learning" is taking a standard human computer model, and configuring it to match the position and shape of a scanned human model, or a photo of a human being -- in order to add back missing bits. Imagine a website where you can upload a swimsuit image and get back a computer generated nude image with the obscured body parts replaced. This should be very doable today, and would make a very popular website!
I'd say it's highly unlikely, since these images are downsampled to 224x224 pixels. That would average out any residual Bayer pattern (which is pretty hard to detect in the first place).