It raises shades of the Xerox copiers which helpfully "compressed" images by deciding that 6s, 8s and 9s looked similar enough and using them interchangeably. (http://www.bbc.co.uk/news/technology-23588202)
I completely agree with this, and it's more and more dangerous as the resulting images appear more and more realistic. On a related tangent, this also showed up recently: http://fortune.com/2018/04/24/nvidia-artificial-intelligence...
A lot of people I know -- intelligent people who are familiar with machine learning and image manipulation -- were confused as to how this approach was "recovering" data.
It's not recovering data at all; it's guessing and filling in blanks, but doing so in such a realistic fashion that it exploits a blind spot: the result is so convincing that you think it's the "real" image. I suspect the same blind spot would be hit by "seeing in the dark" as well.
Realistic image inpainting & synthesis has been going on for decades, so I’d guess the main confusion comes from reading the title of Fortune’s article rather than the paper’s title, “Image Inpainting for Irregular Holes Using Partial Convolutions”. BTW, kudos to Fortune for actually linking to the paper. Just from watching the video it seemed pretty obviously to be inpainting, suggesting plausible content from the training data, so maybe the confusion came from reading the PR title only and not diving any deeper?
Here’s my favorite inpainting paper, partly because the author is a friend, but also because it’s able to hallucinate written text, which most inpainting algorithms since then haven’t been able to do. It’s not a neural network though, and the training data comes from the single input image itself. http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....
> it’s more and more dangerous... I feel like the same blind spot would be attacked with “seeing in the dark” as well.
It’s possible, yes, but it depends on what the authors did, how the network was trained, whether they allow reconstruction from pure noise, etc. I would agree that this paper's title is a bit provocative and invites the assumption that the output is faithful. The problem might be the title, not the technique.
While it is important to understand that NNs are hallucinating output from training data, it’s also a good idea to reflect on the history of analog & digital photography & Photoshop, and recall that this slippery slope toward convincing fakes has been warned about multiple times before. There are lots of legitimate uses for inpainting (movies, ads). As someone who’s worked in film, I’m excited about the possibilities that NNs bring in terms of new techniques and reduction of labor.
Reduction of labour is nice. Elimination of labour is problematic. To photoshop something, someone has to actually photoshop it. We don't set our cameras to automatically photoshop things as we take them. The techniques in the OP are dangerous because they could be employed in situations where the photographer doesn't realize it: a camera or robot that edits automatically, fabricating a false reality in the eyes of humans. Imagine a camera used to capture evidence of a crime. You don't want such a thing filling in details on its own.
Interesting link. But it's funny that when one eye in a person's face is masked out, the algorithm doesn't match the hallucinated eye to the unmasked one. I'd hope a smart ML network would learn the constraint that both eyes should look the same.
For example: http://web.engr.illinois.edu/~cchen156/SID/examples/book.htm...
What this net seems to be much better at is identifying which parts of the image are a result of noise and what the true intensity is. I'd love to see this used in astrophotography.
Remember that RAW data generally has more than 8 bits of dynamic range, and image sensors are quite sensitive. You might not be able to tell the difference between 0 and 5 counts in an 8-bit image, but at 12-bit that's an 80-count difference.
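A quick numpy sketch of the point (my own toy numbers, not from the paper): 12-bit values that carry distinct structure all collapse into the bottom few codes of an 8-bit image, since 4096/256 = 16 raw counts per 8-bit step.

```python
import numpy as np

# Hypothetical 12-bit sensor counts that are clearly distinct in RAW...
raw_12bit = np.array([3, 19, 40, 62, 75])

# ...but collapse to near-black after a naive reduction to 8 bits
# (each 8-bit code spans 4096 / 256 = 16 raw counts).
as_8bit = raw_12bit // 16
print(as_8bit)  # -> [0 1 2 3 4]
```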
I did some work during my PhD on laser spot detection in images (12-bit mono in my case). Even when I set the camera exposure down to several thousandths of a second, in a dark room, you could almost always recover some/all of the structure (tables, other objects, calibration patterns, etc).
EDIT: If I were a reviewer on this paper I would ask for histograms of the input/output images, or at the very least something like mean counts in the image. 1/30s at 0.1 lux - great, do you know how many counts that is on the arbitrary sensor that was used? I sure don't. You could check this in the supplementary data, but it's an annoying omission in the paper.
Your brain already does this in many ways. Last year I watched a fascinating series, The Brain with David Eagleman, and one of the main points I took from it is how the visual system works. Your eyes are not connected to your brain like a camera; instead they feed into a 'processor' which builds a model of what it thinks your eyes are seeing, and it's this model you perceive. Your brain also feeds back into this area.
1) You can't see your blind spot, that is filled in.
2) When you move your eyes they effectively shut down, but you continue to think you can see. A related effect, inattentional blindness, is demonstrated by the "count how many times the ball is passed" type of video, where there is also a gorilla walking about which you don't notice. This is also a cause of many road crashes: you think you are scanning, but your eyes 'miss' the cyclist.
3) Psychotic 'visions' are caused by the brain writing whole images into this processor, and the model is then fed back into the brain. To you they are as real as if your eyes had seen them.
4) Witnesses are incredibly unreliable. People not only see things differently but also remember things differently.
With your brain, though, you can usually slow down and study an image or scene to keep pulling more real detail out; with this digital processing you can't, since all you've got is the machine-constructed image.
I read Peter Watts's Echopraxia three years ago and still keep finding holes in his "hard sci-fi".
For example, the sentry zombies employed by the vampire Valerie rolled their eyes constantly, and the explanation given by one of them (or by the author) is that it extends their visual field and allows more visual information to be processed. Your remark let me see how that is, really, not quite realistic.
Thank you very much!
I consider Watts a weak author and his series weak ones (they're incredibly dull, and the heroes read like badly written puppets), but they allow me not to pass over information like yours and to have a more connected world-view. Maybe Watts is not as bad as I like to think after all.
Similarly for this approach, it looks like it does a great job, but some simple image processing would help smooth out the noise of the "traditional" pipeline and provide at least a more mathematically deterministic output. On the other hand, this points out that our brain only really cares about a few bits of information out of an image (e.g. the title of the book is readable in the processed image). So if it fills in the dark portions of the image with junk data, perhaps it isn't all that bad in practice.
One feature seems to be that low-level granular noise is smoothed away. One can imagine many instances of noise that would be appropriately represented by a smooth gradient. But there are also noisy patterns that average out to a gradient yet shouldn't be represented by one, because they form some other pattern.
Unfortunately, an objective, rigorous measure of whether this system is semantically injective would require a formal model of image interpretation... pretty much strong AI.
Ignoring optical aberrations and losses from the lens, there's always additive noise that does not depend on the number of photons incident on the sensor. That could be dark/thermal noise, noise in the readout electronics, noise coupled in from external RF sources, etc. You could expose the detector with exactly the same number of photons 1000 times and get 1000 different images.
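A minimal simulation of that point, assuming a simple noise model of my own choosing (Poisson shot noise plus Gaussian read noise; real sensors have more terms): two exposures with identical expected photon counts still come out as different images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same expected signal for every exposure of the "same" scene.
mean_photons = np.full((4, 4), 20.0)

def expose(rng, mean_photons, read_noise_sigma=2.0):
    shot = rng.poisson(mean_photons)  # photon-arrival (shot) noise
    read = rng.normal(0.0, read_noise_sigma, mean_photons.shape)  # electronics
    return shot + read

a = expose(rng, mean_photons)
b = expose(rng, mean_photons)
print(np.allclose(a, b))  # -> False: identical exposures, different images
```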
This isn't a criticism of your point per se, because it is just the way things are. But it's relevant here, because we can see this sort of enhancement as taking "a cat" (the low-light, low-information image) and presenting something to us that looks like a high-information image. It isn't spanning anywhere near that many bits in the process, though; this is a much gentler enhancement.
But it is in some sense a quantitative difference rather than a qualitative one. There's a whole lot of interpolation, compression, and re-expansion of compressed data going on all over the visual space. This algorithm doesn't strike me as a significant change versus what brains are already doing. But it is something to keep an eye on, as computer vision continues to improve in its ability to plausibly fill in data. I can see a world coming where you can feed an algorithm a 2x2 24-bit color picture and it'll "enhance" it into some perfectly plausible picture, and if you change one of the pixel inputs by one, you get a completely different plausible picture.
> The resulting images look nice but there's no guarantee that all the extra detail visible in those images is genuine detail and not just "believable" data filled in by the net
It depends on the error metric specified as the objective. If the network is minimizing squared error to ground truth, it will tend toward blurry averages rather than believable-but-untrue detail, since its objective doesn't reward hallucination. If the network is trained against adversarial distinguishability, however (with the noisy prior), then this becomes a more or less inevitable issue.
So both can be chosen depending on what you want: certainty that the content is semantically equivalent, or just good looking images.
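A toy illustration of the trade-off (my own construction, not from the paper): when a dark pixel's true value could plausibly be either 0.0 or 1.0, the squared-error-optimal prediction is their mean, a blurry 0.5, whereas an adversarial objective would push the output toward one of the two sharp, "believable" values.

```python
import numpy as np

# Two equally likely ground-truth values for an ambiguous dark pixel.
ground_truths = np.array([0.0, 1.0])

def mse(pred):
    # Expected squared error of a single predicted value.
    return np.mean((ground_truths - pred) ** 2)

# Search predictions in [0, 1] for the MSE minimizer.
candidates = np.linspace(0.0, 1.0, 101)
best = candidates[np.argmin([mse(c) for c in candidates])]
print(best)  # -> 0.5: MSE rewards averaging, not committing to a sharp guess
```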
Demo example (one of many examples - drag the middle slider from left-to-right): http://web.engr.illinois.edu/~cchen156/SID/examples/16.html
I think what would be almost more interesting is an algorithm like this trained specifically on video, one that takes the previous and next frames into account when recreating lost data.
Relevant excerpt:
“If you just apply the algorithm frame by frame, you don’t get a coherent video — you get flickering in the sequence,” says University of Freiburg postdoc Alexey Dosovitskiy. “What we do is introduce additional constraints, which make the video consistent.”
While I understand the choice of a downsampled input with 4 channels, I'm wondering why you went with a downsampled output instead of going directly to the original resolution, where the 3 color channels are separate.
Also, did you investigate "faking" the training data by taking a single well exposed image, making it darker using conventional methods and using the resulting image as an input to the workflow?
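A rough sketch of what that "fake darkening" might look like (entirely hypothetical; real sensor noise is more complicated than a Poisson-plus-Gaussian model): scale a well-exposed raw frame down by an exposure ratio, re-sample sensor-like noise, and use the (dark, bright) pair for training.

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_dark(bright_raw, exposure_ratio=1 / 100, read_noise_sigma=2.0):
    # Scale the signal down as if the exposure had been 100x shorter...
    scaled = bright_raw * exposure_ratio
    # ...then re-apply shot noise and read noise at the lower signal level.
    shot = rng.poisson(scaled).astype(np.float64)
    return shot + rng.normal(0.0, read_noise_sigma, bright_raw.shape)

# Synthetic stand-in for a well-exposed raw frame (counts, not 0-255).
bright = rng.uniform(500, 3000, size=(8, 8))
dark = fake_dark(bright)
print(dark.shape == bright.shape)  # -> True: a training pair, same geometry
```

One caveat the question itself hints at: a pair faked this way only matches real low-light captures as well as the assumed noise model does.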
The book cover details simply are not there in the dark image (if you scale up the brightness, there is only blocky noise).
So either this is not the right dark image or their network dreamed it up.
Yes, the net is "dreaming" the details, based on what it learns from those mappings. I'd say these nets are very specialised to the sensor, and maybe even to the lens choice. Simply put, what they did is compress a full processing pipeline into a deep net that consumes RAW files and spits out natural-looking images.
Could it not just be really good at recreating the images it was trained on, or is it generally doing this with novel images as well?