Texture Enhancement for Video Super-Resolution (github.com/dachunkai)
179 points by smusamashah 4 months ago | 31 comments



So this is surprisingly bleeding edge, at least to me. I had to go learn about some hardware and physical imaging stuff I didn’t know to get my head around it.

Upshot: event cameras are a different sort of camera in that they have an array of sensor pixels, and each sensor only fires when there is a brightness change at that pixel. This has a bunch of benefits, including very high dynamic range, reduced ghosting, and high frame rates, and some downsides, like making conventional video reconstruction harder, and presumably others.
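
To make that concrete, here's a rough sketch (my own illustration, not the paper's or any real sensor's format) of the kind of data an event camera emits: a stream of per-pixel, timestamped brightness-change events rather than frames.

```python
# Illustrative sketch only: not the format used by this paper or any real sensor SDK.
from typing import Iterable, NamedTuple

class Event(NamedTuple):
    t: float        # timestamp in seconds (real sensors resolve microseconds)
    x: int          # pixel column
    y: int          # pixel row
    polarity: int   # +1 if brightness went up, -1 if it went down

def events_per_pixel(events: Iterable[Event], width: int, height: int) -> list[list[int]]:
    """Count how many events each pixel fired, ignoring time and polarity."""
    counts = [[0] * width for _ in range(height)]
    for e in events:
        counts[e.y][e.x] += 1
    return counts
```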

The paper seems to have started out with the idea that if you had event camera output, you’d be able to reconstruct more fine texture details. And, this works incredibly well, their baby model trained for 8 days significantly beats SOTA and looks a lot better in comparisons as well.

They then seem to have added a step where you simulate/infer event camera data from “normal” RGB video, using a different set of networks, and use that inferred event data to do the texture recovery, and … this also works.

Pretty surprising, and interesting. Their GitHub is full of people saying “I want to try this” and then realizing it’s a fairly deep stack to deploy. Even as is, it seems worth someone building a GUI around this in an app; it’s quite remarkable.


My first thought is, is this fake? (Edit: Okay maybe not?) It looks extremely fake. (Edit: A credit to how effective it is?) Like the blurriness seems applied after the fact. It can really generate a license plate number from that shadowy rectangle? I'm expecting to see giveaways for AI but I see none, except maybe the simple oval badge on the sedan (although that might be a real vehicle make; not sure).

If someone else manages to deploy and try this please share your result.


If I understand what they did correctly it won’t have the same failure mode of hallucinations that a diffusion model has - it’s not a model that has an understanding of the world, it’s a model that’s really good at turning async per pixel light event data plus blurry rgb into sharp rgb.

That said I don’t understand it very well, for instance there’s a voxel step in the pipeline and I have no idea why.
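
For what it's worth, a "voxel grid" is a common way the event-camera literature packs an asynchronous event stream into a fixed-size tensor a network can consume: events get binned into a handful of temporal slices per pixel. Whether EvTexture's voxel step works exactly like this is a guess on my part; a minimal sketch of the general idea:

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, width, height):
    """Bin an event stream of (t, x, y, polarity) tuples into a (num_bins, H, W) tensor.

    This is the generic event-voxel representation used in much of the
    event-camera literature; whether EvTexture builds its voxels exactly
    this way is an assumption on my part.
    """
    voxels = np.zeros((num_bins, height, width), dtype=np.float32)
    if not events:
        return voxels
    t0, t1 = events[0][0], events[-1][0]
    duration = max(t1 - t0, 1e-9)
    for t, x, y, p in events:
        b = min(int((t - t0) / duration * num_bins), num_bins - 1)
        voxels[b, y, x] += p
    return voxels
```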


The first thing I looked for was text, and saw they used cars in an example. I thought the same thing as you: 'this can actually get readable license plates?' I don't know enough about ML to understand the associated paper, though.


One thing that might explain it is this logic thread: a) this is a car, b) this is where a car's license plate would normally be, and c) license plates usually have numbers like this.

Upon close inspection the plate's digits look realistic, but there are some symbols that look unfamiliar to me. But I don't know what country the footage is from, so I don't know real from unreal when it comes to symbols.

If the car's badge turned out to correctly match the car model, that might be a bit of a red flag. Although it's not out of the question that a model could eventually recognize car models and get badges right. It just seems unlikely that I'd see such an advancement in a video before I ever saw it in a still image.


What's the display equivalent for an event camera?


Instead of writing out frames at a fixed FPS (e.g., 60 fps), the driver would send out updates to individual pixels at precise moments in time, and the unaddressed pixels would remain unchanged. I'm not sure whether or where this kind of technology is used in practice.
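
Purely hypothetical, but a driver for such a display might push timestamped per-pixel updates instead of whole frames; something like this toy sketch (none of these names correspond to a real display API):

```python
# Hypothetical sketch: no real display hardware or API is being described.
from dataclasses import dataclass

@dataclass
class PixelUpdate:
    t: float                      # when the change should appear, in seconds
    x: int
    y: int
    rgb: tuple[int, int, int]     # new value for just this pixel

def apply_updates(framebuffer: list[list[tuple[int, int, int]]],
                  updates: list[PixelUpdate]) -> list[list[tuple[int, int, int]]]:
    """Apply updates in time order; pixels without updates keep their last value."""
    for u in sorted(updates, key=lambda u: u.t):
        framebuffer[u.y][u.x] = u.rgb
    return framebuffer
```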


Yeah, I've been searching for this kind of display technology for the past couple of years and concluded I'm using the wrong search terms or it just doesn't exist. Technically possible with OLED / microLED.


If my interpretation of the paper is correct, they are using the high resolution event data in addition to the low resolution RGB data in order to do the reconstruction, so this technique won't enhance random videos on the internet. It's a new algorithm to take advantage of event-based cameras that usually record both high resolution event data and low resolution RGB.


So we finally have the magical "Enhance" button from sci-fi detective movies and shows, nice!


Except that it's useless for any kind of investigative work where truth matters.


No, that appears to be incorrect.

It is the case that with completely generative models, you will very likely get hallucinated details that are untruthful. But with this approach, you can see blurry input images of license plates whose characters we could not possibly decipher with the naked eye, and when they're put through this model, the output is very close to the actual ground truth.

https://dachunkai.github.io/evtexture.github.io/static/image...

Where does this information come from? It seems they are generating synthetic "event camera-like" events just from diffs between still frames? So maybe they trained a model based on real events from a real event camera? It's hard to tell from their write-up. But these results are very impressive.


You don't need these kinds of solutions for license plates, take a look at this: https://www.youtube.com/watch?v=19wgu5GZDhk


Normal upscaling means picking pixel values in the higher-resolution target image that, when run through an equivalent downscaling function, would fit the smaller image. There are lots of degrees of freedom to make things up.

Then there is undoing reversible transforms, such as some blurs. That makes information that was there all along more legible, as in the example you linked.

This paper is a case of both. It does upscaling, but it uses temporal information to find additional constraints that can be used to restrict the degrees of freedom of the "making values up" part. So it's part information recovery, part hallucination.
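
As a toy illustration of that degrees-of-freedom point (my sketch, not the paper's method): the only hard constraint on an upscaler is that its output should downscale back to the observed low-resolution image, and many different high-resolution candidates satisfy it.

```python
import numpy as np

def box_downscale(img: np.ndarray, factor: int) -> np.ndarray:
    """Assumed degradation model: simple box-filter averaging over factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def is_consistent(candidate_hr: np.ndarray, observed_lr: np.ndarray,
                  factor: int, tol: float = 1e-3) -> bool:
    """Many different HR candidates pass this check; the LR image alone can't
    choose between them, which is where hallucination (or extra constraints,
    like temporal/event information) comes in."""
    return np.allclose(box_downscale(candidate_hr, factor), observed_lr, atol=tol)
```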


In the GitHub comments they say what they are doing to infer events is significantly different than per frame diffs. I have no idea what they are actually doing though.


It's true, and the same goes for other, more common uses like watching for flaws. Imagine a system like this watching for small flaws in something with a repeated texture. The algorithm will assume the flaw is a mistake made by the camera because of poor resolution and cover up the flaw.


"A camera that produces a nice picture" and "a camera that tries reproduce reality as faithfully as possible" are going to become two different products. That's also a problem because the second one is not going to benefit from the economy of scales of the first one…


ESRGAN already comes very close to being that magical Enhance button.


I have an assortment of low quality original encodes from the 90s (thousands of mpeg and flv web videos, think divx and co) that I’ve refrained from re-encoding in hopes that some day AI would get there and having the originals would pay off. But looking at all the “originals” in the demo, they’re all super blurry (blurry upscaling, I know, but also trademark h264 low bitrate or heavy deblocking). It would be ironic if I had to use h264/h265 as a deblocking upscale intermediate step before using something like this someday.


Project page with a few different clips: https://dachunkai.github.io/evtexture.github.io/


Generating plate numbers is concerning.


How so?


Worst case scenario would result in a police raid on innocent people.


They don't explain what event-driven means, but AFAIK it's based on diffs between frames, which highlight motion and de-emphasise overall brightness/exposure:

https://github.com/uzh-rpg/rpg_vid2e?tab=readme-ov-file#read...
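
As a very rough illustration of the diff idea (vid2e's actual simulator interpolates frames in time and is considerably more sophisticated, and per the GitHub thread this paper's event inference is different again), a naive thresholded log-intensity diff looks like:

```python
import numpy as np

def naive_events_from_frames(prev: np.ndarray, curr: np.ndarray, threshold: float = 0.2):
    """Naive approximation only: emit (x, y, polarity) wherever the log-intensity
    change between two consecutive frames exceeds a contrast threshold."""
    eps = 1e-6
    diff = np.log(curr.astype(np.float32) + eps) - np.log(prev.astype(np.float32) + eps)
    ys, xs = np.nonzero(np.abs(diff) >= threshold)
    return [(int(x), int(y), 1 if diff[y, x] > 0 else -1) for y, x in zip(ys, xs)]
```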


I believe the important point about event cameras is the diffs are per-pixel and entirely asynchronous, so there is no concept of a frame.

The video data is simply a stream of events which encode the time and location of a brightness change. For an immediate full-scene change (like removing the lens cap), you’d get a stream that happens to update every pixel, but there’s no particular guarantee about the ordering.


That's incorrect. Event cameras are different hardware devices from traditional cameras.


All the sample clips have camera motion. Does it perform worse with a static camera or is there enough variation from frame to frame to still recover details?


I think if there’s absolutely no motion for part of a scene, ever (e.g. an anime backdrop), this would likely not work well. But if there’s even a pixel or two of motion (so, anything actually filmed), that should be enough to infer events. Just a guess though.


It would be great if there were an upfront metric for how long the process takes (say, per minute of video), as it's usually a lot.


I wonder what you get if you give it extremely low resolution pictures (say, 64x64).


Some historical and b&w footage might have sold the idea. We're hoarding low resolution half-scan VHS of our family super8.



