Learning to See in the Dark (illinois.edu)
184 points by isp 8 months ago | 34 comments

I'm always concerned when this type of deep learning image processing is presented. The resulting images look nice but there's no guarantee that all the extra detail visible in those images is genuine detail and not just "believable" data filled in by the net. Maybe fine for happy snaps but it's very important that users of the camera know that its output is just an "artist's impression."

It raises shades of the Xerox copiers which helpfully "compressed" images by deciding that 6s, 8s and 9s looked similar enough and using them interchangeably. (http://www.bbc.co.uk/news/technology-23588202)

> The resulting images look nice but there's no guarantee that all the extra detail visible in those images is genuine detail and not just "believable" data filled in by the net. Maybe fine for happy snaps but it's very important that users of the camera know that its output is just an "artist's impression."

I completely agree with this, and it's more and more dangerous as the resulting images appear more and more realistic. On a related tangent, this also showed up recently: http://fortune.com/2018/04/24/nvidia-artificial-intelligence...

A lot of people I know -- intelligent people who are familiar with machine learning and image manipulation -- were confused as to how this approach was "recovering" data.

It's not recovering data at all; it's guessing and filling in blanks, but doing so in such a realistic fashion that apparently it's poking around in some blind spots because the result is so convincing that you think it's the "real" image. I feel like the same blind spot would be attacked with "seeing in the dark" as well.

Reminds me of the old joke, back when people had to manually touch up photos. Man goes to the photographer with an old family photo. Says, "This is the only photo we have of Grandfather, but I don't like that he's wearing a hat. Can you remove it?" The photographer says, "Sure, what sort of hairstyle did he have?" And the guy says, "Won't you find out when you take off the hat?"

> On a related tangent, this also showed up recently: (NVIDIA’s inpainting). A lot of people I know -- intelligent people who are familiar with machine learning and image manipulation -- were confused as to how this approach was "recovering" data.

Realistic image inpainting & synthesis has been going on for decades, so I’d guess the main confusion is due to reading the title of Fortune’s article, rather than the paper’s title “Image Inpainting for Irregular Holes Using Partial Convolutions”. BTW, Kudos to Fortune for actually linking to the paper. I felt like it was pretty obviously inpainting, and suggesting arbitrary training data just from watching the video, so maybe the confusion was from reading the PR title only, and not diving any deeper?

Here’s my favorite inpainting paper, partly because the author is a friend, but also because it’s able to hallucinate written text, which most inpainting algorithms since then haven’t been able to do. It’s not a neural network though, and the training data comes from the single input image itself. http://graphics.cs.cmu.edu/people/efros/research/EfrosLeung....

> it’s more and more dangerous... I feel like the same blind spot would be attacked with “seeing in the dark” as well.

It’s possible, yes, but it does depend on what the authors did, how the network was trained, whether they allow reconstruction from pure noise, etc. I would agree that this paper title is a bit provocative, and suggests assuming the output is realistic. The problem might be the title, and not the technique.

While it is important to understand that NNs are hallucinating output with training data, it’s also a good idea to reflect on the history of analog & digital photography & photoshop, and recall that this slippery slope of danger against fake realism has been warned against multiple times before. There are lots of legitimate uses for inpainting (movies, ads). As someone who’s worked in film, I’m excited about the possibilities that NNs bring in terms of new techniques and reduction of labor.

>> I’m excited about the possibilities that NNs bring in terms of new techniques and reduction of labor.

Reduction of labour is nice. Elimination of labour is problematic. To photoshop something someone has to actually photoshop it. We don't set our cameras to automatically photoshop things as we take them. The techniques in the OP are dangerous because they could be employed in situations where the photographer doesn't realize. A camera/robot trying to edit something automatically fabricating a false reality in the eyes of humans. Imagine a camera used to capture evidence of crime. You don't want such a thing filling in details on its own.

Isn't Google Photos / Google Camera App doing that, automatically or semi-automatically? I recall they had a feature that could magically merge multiple photos into one that has a combination of details from the source images.

> On a related tangent, this also showed up recently: ...

Interesting link. But it's funny that when one eye in a person's face is masked out, the algorithm doesn't match the hallucinated eye with the unmasked eye. I'd hope that a smart ML network would understand the constraint that both eyes should look the same.

It's the attention economy where you compete for the attention and thus funding by showing pictures.

The images are actually not that degraded. There's a large amount of colour noise when amplified, but there's plenty of detail. Photographers can pull stuff out of shadows all the time, but properly fixing noise is a pain in the backside.

For example: http://web.engr.illinois.edu/~cchen156/SID/examples/book.htm...

What this net seems to be much better at is identifying which parts of the image are a result of noise and what the true intensity is. I'd love to see this used in astrophotography.

Remember that RAW data generally has more than 8 bits of dynamic range and image sensors are quite sensitive. You might not be able to tell the difference between 0-5 counts on an 8-bit image, but at 12-bit that's an 80-count difference.
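To make the bit-depth arithmetic concrete: each extra bit doubles the number of levels, so going from 8-bit to 12-bit multiplies counts by 2^4 = 16. A minimal sketch in Python, where `rescale_counts` is a hypothetical helper name and not something from the paper:

```python
def rescale_counts(counts: int, from_bits: int, to_bits: int) -> int:
    """Express a count difference measured at from_bits in units of to_bits.

    Assumes to_bits >= from_bits and a linear sensor response.
    """
    return counts << (to_bits - from_bits)


# A 5-count step in an 8-bit image, expressed in 12-bit counts.
step_12bit = rescale_counts(5, 8, 12)
print(step_12bit)
```

Under that assumption, shadow detail that collapses into a handful of 8-bit values still occupies dozens of distinct levels in the raw data.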

I did some work during my PhD on laser spot detection in images (12-bit mono in my case). Even when I set the camera exposure down to several thousandths of a second, in a dark room, you could almost always recover some/all of the structure (tables, other objects, calibration patterns, etc).

EDIT: If I were a reviewer on this paper I would ask for histograms of the input/output images, or at the very least something like mean counts in the image. 1/30s at 0.1 lux - great, do you know how many counts that is on the arbitrary sensor that was used? I sure don't. You could check this in the supplementary data, but it's an annoying omission in the paper.

> not just "believable" data filled in by the net

Your brain already does this in many ways. Last year I watched a fascinating series, The Brain with David Eagleman [1], and one of the main points I got from that is how the visual system works. Your eyes are not connected to your brain like a camera, instead they feed into a 'processor' which creates a model of what it thinks your eyes are seeing and it's this model you see. Your brain also feeds back into this area.

1) You can't see your blind spot, that is filled in.

2) When you move your eyes they effectively shut down but you continue to think you can see. This is demonstrated by the "count how many times the ball is passed" type video where there is also a gorilla walking about which you don't notice. This is also the cause of a lot of road crashes, you think you are scanning but your eyes 'miss' the cyclist.

3) Psychosis 'visions' are caused by the brain writing whole images into this processor; this model is then fed back into the brain. To you it's as real as if your eyes had seen them.

4) Witnesses are incredibly unreliable. People not only see things differently but also remember things differently.

With your brain, though, you can usually slow down and study an image or scene to keep pulling more real detail out. With this digital processing you can't, as all you've got is the machine-constructed image.

[1] https://www.bbc.co.uk/programmes/b06yjrdp

>When you move your eyes they effectively shut down but you continue to think you can see.


I read Peter Watts's Echopraxia three years ago and still keep finding holes in his "hard SciFi".

For example, the sentry zombies employed by the vampiress Valerie rolled their eyes constantly, and the explanation given by one of them (or by the author) is that it extends their visual area and allows more visual information to be processed. Your remark allowed me to see how that is, really, not quite realistic.

Thank you very much!

I consider Watts a bad author and his series a bad one (the books are incredibly dull and the heroes look like badly written puppets), but they allow me to not pass over information like yours and to have a more connected world-view. Maybe Watts is not as bad as I like to think after all.

Along the lines of adversarial inputs, how hard do you think it would be to construct a test page to determine if a machine is affected? Such as front loading most of the page with 8's and then putting some patterns of 6's and 9's. Confirmation is overlaying the copy on the original, and seeing where it differs.

Similarly for this approach, it looks like it does a great job, but some simple image processing would help smooth out the noise of the "traditional" pipeline and provide at least a more mathematically deterministic output. On the other hand, this points out that our brain only really cares about a few bits of information out of an image (e.g. the title of the book is readable in the processed image). So if it fills in the dark portions of the image with junk data, perhaps it isn't all that bad in practice.

Given a function from low-light to normal-light images, is it injective? That is, can more than one input produce the same output (in which case it is not injective)? If so, how many? And more practically, how different are they? Is it semantically injective?

One feature seems that low-level granular noise is smoothed away. One can imagine many instances of noise that would be appropriately represented by a smooth gradient. But there are also noisy patterns that average out to a gradient, that shouldn't be represented by a gradient, because they form some other pattern.

Unfortunately, an objective, rigorous measure of whether this system is semantically injective or not would require a formal model of image interpretation... pretty much strong AI.
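A toy illustration of the non-injectivity worry, assuming a plain block average as a stand-in for whatever smoothing the network effectively performs (the real network is far more complex, and `block_mean` is a made-up helper). A flat patch and a striped patch with the same local mean come out identical:

```python
def block_mean(signal, width=2):
    """Average non-overlapping blocks; a crude stand-in for noise smoothing."""
    return [sum(signal[i:i + width]) / width
            for i in range(0, len(signal), width)]


flat = [4, 4, 4, 4, 4, 4]
stripes = [2, 6, 2, 6, 2, 6]  # a genuine alternating pattern, same local mean

smoothed_flat = block_mean(flat)
smoothed_stripes = block_mean(stripes)

# Two different inputs, one output: the mapping is many-to-one.
print(flat != stripes, smoothed_flat == smoothed_stripes)
```

Whether the striped input "should" have been kept as stripes is exactly the semantic question: the smoother cannot tell noise from fine pattern without some model of what the image means.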

Yes, it's a many-to-one.

Ignoring optical aberrations and losses from the lens, there's always additive noise that does not depend on the number of photons incident on the sensor. That could be dark/thermal noise, noise in the readout electronics, noise coupled in from external RF sources, etc. You could expose the detector with exactly the same number of photons 1000 times and get 1000 different images.
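A small simulation of that last point, as a sketch only: the signal is held exactly constant and only signal-independent Gaussian read noise is added. The patch size, signal level and noise sigma are made-up numbers, and `expose` is a hypothetical helper, not a model of any particular sensor.

```python
import random


def expose(signal, read_noise_sigma, rng):
    """One simulated readout: fixed signal plus additive read noise per pixel."""
    return [round(s + rng.gauss(0.0, read_noise_sigma)) for s in signal]


rng = random.Random(0)          # seeded so the run is repeatable
signal = [10.0] * 64            # identical photon signal on a 64-pixel patch

frames = [tuple(expose(signal, 2.0, rng)) for _ in range(1000)]

distinct_frames = len(set(frames))
mean_level = sum(sum(f) / len(f) for f in frames) / len(frames)
print(distinct_frames, round(mean_level, 2))
```

Every frame is a different input image, yet they all describe the same scene, so any sensible low-light-to-normal-light mapping has to send many inputs to (roughly) one output.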

Your brain does the same, how do you know there is no "artist's impression" in your mind?

By definition of reality: several people agree on the same observation. This is also how science works (reproducibility).

Your agreement of the observation contains orders of magnitude fewer bits of information than the original object, or even the original image. If you look at an image and agree that you see "a cat" and someone next to you agrees that they see "a cat", you've compressed megabytes of data down into what is somewhere around 20-30 bits, tops. (I'm being generous with the bit count. "a cat" is realistically probably low teens, tops.)

This isn't a criticism of your point per se, because it is just the way things are. But it's relevant in this case, because we can see this sort of enhancement as something like taking "a cat" (the low light, low-information image), and presenting something to us that looks like a high-information image. Although it is not spanning anywhere near that many bits in the process; this is a much more gentle enhancement.

But it is in some sense a quantitative difference rather than a qualitative one. There's a whole lot of interpolation and compression and re-expansion of the compressed data going on all over the place in the visual space. This algorithm doesn't strike me as a significant change in that space versus what brains are already doing. But it is something to keep an eye on, as computer vision continues to improve in its ability to plausibly fill in data. I can see a world coming where you can feed an algorithm a 2x2 24-bit color picture and it'll "enhance" it into some perfectly plausible picture. And if you change one of the pixel inputs by one, some completely different plausible picture.
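For a rough sense of the numbers in this comment, here is a back-of-the-envelope sketch. The 20,000-word vocabulary and the 1 MB image size are assumptions chosen for illustration, and `bits_to_choose_one_of` is a made-up helper; only the 2x2 24-bit figure comes from the comment itself.

```python
import math


def bits_to_choose_one_of(n_options: int) -> float:
    """Bits needed to pick one item from n equally likely options."""
    return math.log2(n_options)


label_bits = bits_to_choose_one_of(20_000)   # assumed vocabulary size
tiny_image_bits = 2 * 2 * 24                 # the 2x2 24-bit picture
megabyte_image_bits = 1_000_000 * 8          # assumed 1 MB image

print(round(label_bits, 1), tiny_image_bits, megabyte_image_bits)
```

With those assumptions a one-word label lands in the low teens of bits, the 2x2 picture carries 96 bits, and the original image carries millions, which is the gap any "enhancement" has to fill from its prior.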

Well, I'm a scientist, I compare numbers. This makes science objective and unbiased. Of course this is not the full story, but it is what marks the cornerstone of the Enlightenment: knowledge vs. faith.

Rather a lot of philosophy of science disagrees with the confidence you are labeling your statement with. If numbers automatically gave us objective and unbiased information the world would be a very different place... and I mean that with a very generous interpretation of what putting numbers on things means, not a tight, pedantic one.

Well in this case they have the ground truth as comparison, so the error can be evaluated.

> The resulting images look nice but there's no guarantee that all the extra detail visible in those images is genuine detail and not just "believable" data filled in by the net

It depends on the error metric specified as the objective. If the network is minimizing squared error to ground truth, then it won't generate believable but untrue data, since its objective wouldn't reward this behavior. If the network is trained against adversarial distinguishability however (with the noisy prior), then this becomes a more or less inevitable issue.

So both can be chosen depending on what you want: certainty that the content is semantically equivalent, or just good looking images.
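A toy version of the squared-error argument, as a sketch under stated assumptions: a single ambiguous pixel whose ground truth, given the same dark input, is 0.0 half the time and 1.0 the other half. The setup and the `mse` helper are invented for illustration and are not taken from the paper.

```python
def mse(pred, targets):
    """Mean squared error of one constant prediction against many targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)


# Same dark input, two equally likely truths: dark stroke (0.0) or paper (1.0).
targets = [0.0, 1.0] * 50

candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda p: mse(p, targets))

print(best, mse(best, targets), mse(0.0, targets))
```

The squared-error optimum is the average of the plausible answers, so ambiguity shows up as a washed-out gray rather than a confident but possibly wrong detail. An adversarial objective would instead prefer committing to one of the sharp values, since the gray is easy to tell apart from real images.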

Remarkable machine learning result for "producing astoundingly sharp photos in very low light" (Cory Doctorow - https://boingboing.net/2018/05/09/enhance-enhance.html )

Demo example (one of many examples - drag the middle slider from left-to-right): http://web.engr.illinois.edu/~cchen156/SID/examples/16.html

GitHub: https://github.com/cchen156/Learning-to-See-in-the-Dark

Paper: https://arxiv.org/abs/1805.01934

Video: https://www.youtube.com/watch?v=qWKUFK7MWvg

I'd be very curious to see this applied to not just a single frame of video, but rather to the video as a whole. My assumption is that it would create a weird jitter to the parts of the image that have been recreated by the neural net.

I think what would be almost even more interesting is an algorithm like this that is specifically trained to video, and takes into account previous and next frames when recreating lost data.

There's been work (in a related field) to "stabilize" the jitter in consecutive frames. As you say, by taking into account the neighboring frames.

Relevant excerpt from [1]:

“If you just apply the algorithm frame by frame, you don’t get a coherent video — you get flickering in the sequence,” says University of Freiburg postdoc Alexey Dosovitskiy. “What we do is introduce additional constraints, which make the video consistent.”

[1] https://blogs.nvidia.com/blog/2016/05/25/deep-learning-paint...

I have a couple of questions if the authors are following along.

While I understand the choice of using a downsampled input with 4 channels I'm wondering why you went with a downsampled output instead of going to the original resolution directly where the 3 color channels are separate.

Also, did you investigate "faking" the training data by taking a single well exposed image, making it darker using conventional methods and using the resulting image as an input to the workflow?
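For reference, one conventional way of "faking" a dark capture might look like the sketch below. It is only a guess at what such a baseline could be: it scales raw counts by an exposure ratio, approximates shot noise as Gaussian with variance equal to the scaled signal, and adds Gaussian read noise. All parameter values and the `synthesize_dark` name are hypothetical, and a real sensor has noise sources this ignores.

```python
import math
import random


def synthesize_dark(pixels, exposure_ratio, read_noise_sigma, max_count, rng):
    """Darken well-exposed raw counts and add approximate sensor noise."""
    dark = []
    for p in pixels:
        signal = p * exposure_ratio
        # Gaussian approximation of shot noise (variance = signal).
        shot = rng.gauss(0.0, math.sqrt(signal))
        # Signal-independent read noise.
        read = rng.gauss(0.0, read_noise_sigma)
        # Quantize and clip to the sensor's valid range.
        dark.append(min(max_count, max(0, round(signal + shot + read))))
    return dark


rng = random.Random(1)
bright = [3000, 2000, 1000, 500] * 16   # hypothetical 12-bit raw counts
dark = synthesize_dark(bright, 1 / 100, 2.0, 4095, rng)
print(dark[:8])
```

Whether a net trained on such synthetic pairs would transfer to real low-light captures is the open part of the question, since real noise is unlikely to follow this simple model exactly.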

I wish they provided the RAW files. Looking at "traditional-pipeline" photos I am positive I can get a much better result just spending some time with Lightroom and coming up with some "super high ISO" preset. Perhaps it will not match their new pipeline but it will be better than what they have for the "traditional pipeline".

This might work wonders for webcam video if it works fast enough

I love playing with my mirrorless camera and lenses, but I'm becoming more and more convinced that it's a risky proposition "investing" in a bunch of expensive camera gear (which traditionally holds its value better than most gadgets) when computational methods will soon evaporate the advantages of bigger sensors / faster glass.

This does not sound right. The source image must have had more information (perhaps not compressed raw data) than in the example.

The book cover details simply are not there in the dark image (if you scale up the brightness then there is only blocky noise).

So either this is not the right dark image or their network dreamed it up.

The dataset is composed of D -> F sets (not pairs, because there are many underexposed images) of Dark to Fully-illuminated images.

Yes, the net is "dreaming" the details, based on what it learns from those mappings. I'd say these nets are very specialised to the sensor, and maybe even the lens choice. Simply put, what they did is compress a full processing pipeline into a deep net that consumes RAW files and spits out natural-looking images.

This isn't my field but I'm curious: are the results in the slider samples novel images or ones that were trained on?

Could it not just be really good at recreating the images it was trained on, or is it generally doing this with novel images in this case?

I can see Apple and Google rushing to secure a deal to include this tech on their cameras.

Doesn't look like their slider bar works with mobile chrome.

Astronomers spend entire careers trying to squeeze as much data as possible from low light shots of the stars. And their math skills are superb, and budgets almost unlimited. There is very little to add to their job really.

The technology and those pics are interesting. Though the content of the pictures is odd... mannequin heads and Metamucil... xD
