DifuzCam: Replacing Camera Lens with a Mask and a Diffusion Model (arxiv.org)
116 points by dataminer 3 months ago | 47 comments



For those interested in various approaches to lens-free imaging, Laura Waller at Berkeley has been pursuing this area for some time.

https://waller-lab.github.io/DiffuserCam/
https://waller-lab.github.io/DiffuserCam/tutorial.html (includes instructions and code to build your own)
https://ieeexplore.ieee.org/abstract/document/8747341
https://ieeexplore.ieee.org/document/7492880


Note the difference between a "diffuser" and a "diffusion model" here.


The article mentions the use of a diffractive mask, which broadly serves the same function as an optical diffuser.

And yes, Laura Waller at UC Berkeley has been one of the pioneers in this research for a few decades.


This is not a 'camera' per se. It's more like a human vision system that samples light and hallucinates an appropriate image based on context. The image is constructed from the data more than it is reconstructed. And like human vision, it can be correct often enough to be useful.


Thanks for the summary. I was looking for this.


It's also like some space telescopes: computational photography.


This would be impressive if the examples weren't taken from the same dataset (LAION-5B) that was used to train the Stable Diffusion model it's using.


They also show actual images they took of real scenes. See figure 6.


You're right, I should have read everything before commenting... sorry.


It's quite amazing that using a diffuser rather than a lens, and then applying a diffusion model, can reconstruct an image so well.

The downside is that it heavily relies on the model to construct the image. Much like those colorisation models applied to old monochrome photos, the results will probably always look a little off depending on the training data. I could imagine taking a photo of some weird art installation and the camera getting confused.

You can see examples of this where the model invented fabric texture in the fabric examples and turned solar panels into walls.


The model basically guesses and reinvents what these diffuse pixels might be. It's more like a painter producing a picture from memory.

It inevitably means that the "camera" visually parses the scene and then synthesizes its picture. The intermediate step is a great moment to semantically edit the scene. Recolor and retexture things. Remove some elements of the scene, or even add some. Use different rendering styles.

Imagine pointing such a "camera" at a person standing next to a wall, and getting a picture of the person with their skin lacking any wrinkles, clothes looking more lustrous as if they were silk rather than cotton, and the graffiti removed from the wall behind.

Or making a "photo" that turns a group of kids into a group of cartoon superheroes, while retaining their recognizable faces and postures.

(Of course, photo evidence made with digital cameras should long have been inadmissible in court, but this would hasten the transition.)



> photo evidence made with digital cameras should long have been inadmissible in courts

Sworn testimony is admissible in courts. I think the "you can just make evidence up" threshold was passed a few thousand years ago. The courts still, mostly, work.


Yes. One solution to the problem of false testimony was photo evidence...

But I'm less worried about the courts and more about media that might publish photos without realizing they are AI generated - or ordinary people using those cameras without understanding how they work and then not realizing there may be some details in the pictures that are plain fantasy.


Most people I know would understand "the camera comes with a built-in filter" to mean "what the camera photographs isn't what you'd see if you looked". The media publishing misleading (or misleadingly-captioned) photos is a problem as old as print photography.


Turned a guy right into a tree. This would have fascinating implications if deployed broadly.


I don't understand the use of a textual description. In which scenario do you not have enough space for a lens and yet have a textual description of the scene?


It's not as crazy as it seems: a pinhole camera doesn't have any lenses either and works just fine. The hole size is a tradeoff between brightness and detail. This one has many holes and uses software to puzzle the resulting images back together.
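For intuition, here's a toy sketch (not from the paper) of what a multi-hole aperture does to the raw readout: each hole projects a shifted copy of the scene onto the sensor, and the sensor just sums them. The pinhole positions below are made up for illustration.

    import numpy as np

    # Toy forward model for a multi-pinhole aperture (illustrative only).
    rng = np.random.default_rng(0)
    scene = rng.random((64, 64))              # stand-in for the real scene

    # Hypothetical pinhole positions, expressed as (dy, dx) shifts on the sensor.
    pinhole_shifts = [(0, 0), (5, -3), (-7, 2), (3, 9)]

    sensor = np.zeros_like(scene)
    for dy, dx in pinhole_shifts:
        sensor += np.roll(scene, shift=(dy, dx), axis=(0, 1))   # one shifted copy per hole

    # Each sensor pixel now mixes light from several scene points;
    # the reconstruction step has to undo this superposition.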


So this is like the use (in a different species) of light-sensitive patches of skin instead of the eyeballs (lenses) that most animals on Earth evolved?

Interesting.

Even if this does not immediately replace traditional cameras and lenses... I am wondering if it could add a complementary set of capabilities next to a phone's camera bump/island/cluster, so that we can drive some enhanced use cases.

Maybe store the wider context in raw format alongside the EXIF data, so that future photo-manipulation models can use that ambient data to do more realistic edits / inpainting / outpainting, etc.?

I am thinking this will benefit 3D photography and videography a lot if you can capture more of the ambient data, not strictly channeled through lenses.


Does a camera without a lens make any physics sense? I cannot see how the scene geometry could be recoverable. Rays of light travelling from the scene arrive at every point of the sensor from all directions.

Intuitively, imagine moving your eye at every point along some square inch. Each position of the eye is a different image. Now all those images overlap on the sensor.

If you look at the images in the paper, everything except the most macro geometry and colour palette is clearly generated -- since it changes depending on the prompt.

So at a guess, the lensless sensor gets this massive overlap of all possible photos at that location and so is able, at least, to capture minimal macro geometry and colour. This isn't going to be a useful amount of information for almost any application.
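For what it's worth, the standard lensless-imaging forward model treats the measurement as the scene convolved with the mask's point spread function, and classical pipelines (DiffuserCam-style) invert that with regularized deconvolution; this paper swaps the classical solver for a diffusion model. A minimal Tikhonov-style sketch with a made-up PSF, just to show the geometry isn't entirely lost:

    import numpy as np

    rng = np.random.default_rng(1)
    scene = rng.random((64, 64))                        # stand-in scene
    psf = rng.random((64, 64)); psf /= psf.sum()        # stand-in for the calibrated mask PSF

    # Forward model: measurement = scene circularly convolved with the PSF.
    H = np.fft.fft2(psf)
    measurement = np.real(np.fft.ifft2(np.fft.fft2(scene) * H))

    # Tikhonov-regularized inverse filter in the Fourier domain (Wiener-like).
    lam = 1e-3                                          # regularization strength (assumed)
    estimate = np.real(np.fft.ifft2(np.fft.fft2(measurement) * np.conj(H)
                                    / (np.abs(H) ** 2 + lam)))
    # Real systems use iterative solvers with stronger priors; here the prior is
    # just "keep the solution small", which is why results without a learned model look mushy.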


Yes, compressed sensing cameras do exactly that. They reconstruct a photometrically correct image without the need for focusing optics or pixel arrays. They have limitations (though not fundamental ones), but they're useful for special use cases like X-ray or LWIR single-pixel imaging, where focusing optics and pixel arrays are impossible or expensive. The technique was first used on X-ray telescopes in the 1970s in the form of coded apertures, before grazing-incidence mirrors.
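As a toy illustration of the single-pixel flavor (the pattern count, step size, and pixel-domain sparsity assumption below are mine, purely for illustration): the detector records inner products of the scene with random binary mask patterns, and a sparse scene is recovered by iterative soft-thresholding.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 32 * 32                                           # scene pixels
    scene = np.zeros(n)
    scene[rng.choice(n, 20, replace=False)] = 1.0         # sparse toy scene (e.g. point sources)

    m = 300                                               # far fewer measurements than pixels
    A = rng.choice([0.0, 1.0], size=(m, n)) / np.sqrt(m)  # random binary mask patterns
    y = A @ scene                                         # single-pixel detector readings

    # ISTA: solve  min 0.5*||y - Ax||^2 + lam*||x||_1
    lam = 0.01
    step = 1.0 / np.linalg.norm(A, 2) ** 2                # safe gradient step size
    x = np.zeros(n)
    for _ in range(500):
        x = x + step * A.T @ (y - A @ x)                  # gradient step on the data term
        x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)   # soft threshold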


It's lensless, but it's not just a naked sensor. It still has an optical element - read the paper.


Oh great, waiting for the first media piece where pictures from this "camera" are presented as evidence. (Or the inverse, where actual photographic evidence is disputed because who knows if the camera didn't have AI stuff built in)


I wonder how it "reacts" to optical illusions? The ones we're familiar with are optimized for probing the limits of the human visual system, but there might be some overlap.


Oh god, we are going to make lenses a premium feature now, aren't we?


It would be pretty great if cheap phones could get good cameras with this technology.


There is no camera. It is just a diffusion model trained on a big dataset that tries to reconstruct the picture. Essentially this is not much different from what Samsung did with their AI-enhanced camera that detected the moon and replaced it with a high-resolution picture.


It still has coded-aperture filters. It's been well established since pre-LLM days that you can reconstruct images from the shadows that strategically designed funky cutouts cast on an image sensor.


The pictures in the paper are pretty damn close, and this is just a prototype. Plus, as you said, phones already have AI filters.


The text on the Thor Labs "Lab Snacks" box is giant and still unreadable; the model interpolates total junk. It seems like there's nowhere near enough signal there.


And yet, we will probably get camera software on our phones with unlimited zoom and detail, turning grain into crisp, clear pictures. Inpainting, outpainting, etc. Five years from now everybody will use it. Everything becomes fake.


If/when cheap enough, even non-phone devices (POS terminals, vending machines, etc) will have cameras; will living in a camera-free environment become the premium feature?


Cameras are already cheap enough to put in everything.


Re: is this a camera or not, I recently realized that my fancy mirrorless camera is closer to this than I'd previously thought.

The sensor has a zillion pixels, but each one only measures one color. For example, the pixel at index (145, 2832) might only measure green, while its neighbor at (145, 2833) only measures red. So we use models to fill in the blanks: we didn't measure redness at (145, 2832), so we guess based on the redness nearby.

This kind of guessing is exactly what modern CV is so good at. So the line of what is a camera and what isn’t is a bit blurry to begin with.
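A minimal sketch of that fill-in step, assuming an RGGB Bayer layout and plain bilinear interpolation (real pipelines use smarter, edge-aware variants):

    import numpy as np
    from scipy.ndimage import convolve

    def demosaic_bilinear(raw):
        """Toy bilinear demosaic of an RGGB Bayer mosaic (raw: HxW float array)."""
        h, w = raw.shape
        r_mask = np.zeros((h, w)); r_mask[0::2, 0::2] = 1     # R at even rows, even cols
        b_mask = np.zeros((h, w)); b_mask[1::2, 1::2] = 1     # B at odd rows, odd cols
        g_mask = 1 - r_mask - b_mask                          # G everywhere else

        k_rb = np.array([[.25, .5, .25], [.5, 1, .5], [.25, .5, .25]])  # sparse R/B sites
        k_g = np.array([[0, .25, 0], [.25, 1, .25], [0, .25, 0]])       # denser G sites

        def interp(mask, kernel):
            # Normalized convolution: spread the known samples, divide by sample density.
            num = convolve(raw * mask, kernel, mode='mirror')
            den = convolve(mask, kernel, mode='mirror')
            return num / np.maximum(den, 1e-8)

        return np.stack([interp(r_mask, k_rb), interp(g_mask, k_g), interp(b_mask, k_rb)], axis=-1)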


The structure you are referring to is a Bayer array. Algorithms that do the guessing are called debayering algorithms.


I think that’s just a particular (very common) case. In general it’s called demosaicing, right?


Not sure - I've seen both used and assumed they were interchangeable. Is there a more general case of a Bayer array?


In general I think you just call it a color filter array.


They don't use models (although you certainly could). They usually use plain old closed-form solutions.


Eh, I came to ML from the stats side of things, so maybe I use “models” more expansively. They definitely use some things tuned to typical pictures sometimes (aka tuned to a natural dataset). On camera, it’s much more constrained, but in postprocessing, more sophisticated solutions pop up.

The Wikipedia article on demosaicing has an algorithms section with a nice part on tradeoffs: how making assumptions about the kinds of pictures that will be taken can increase accuracy in distribution but introduce artifacts out of distribution.

The types of models you see used on camera are pretty constrained (camera batteries are already a prime complaint), but there's a whole zoo of stuff used today in off-camera processing. And it's slowly moving on-camera as dedicated "AI processors" (I assume tiny TPU-like chips) make their way into cameras.


I get the feeling that lens-free cameras are the future. Obviously the results here are nowhere near good enough, but given the rapid improvement of diffusion models lately, the trajectory seems clear.

Would love to lose the camera bump on the back of my phone.



Not sure what this has to do with anything. The paper I was commenting on is using diffusion models to parse raw light hitting the sensor as an alternative to a glass lens. No one wants an image generation model hooked up to weather data - that's kind of ridiculous.


So you just get something based off GPS, time of day and rotation?

Or no photos anymore?


Not sure I understand the question. This paper is about using diffusion models to reconstruct usable images from raw sensor data. The diffusion model in essence replaces the lens.


There would be a bunch of holes and a CCD at the back, just no bulbous lens growth.


+1 for the Thor labs candy box



