I was skeptical of the bottom line of the eye chart image. characters are 3x3 pixels. The o should translate to a dot. there is no hole.
Their explanation for it:
"The lines of letters have been recovered quite well due to the existence of cross-scale patch recurrence in those image areas. However, the small digits on the left margin of the image could not be recovered, since their patches recurrence occurs only within the same (input) scale. Thus their resulting uniﬁed SR constraints reduce to the “classical” SR constraints (imposed on multiple patches within the input image). The resulting resolution of the digits is better than the bicubic interpolation, but suffers from the inherent limits of classical SR [3, 14]."
So it's guessing based on the larger characters. Neat.
Part of their algorithm finds areas of the image that are close matches but have different sizes, and uses the larger areas to create high-resolution replacements for the matching smaller areas. Of course this produces especially good-looking output for images like the eye chart (and also the final image on the page) that have many similar elements repeated at different scales.
Essentially it is looking at the large letters (or pieces of them) to guess how the small letters should look.
And as JTxt notes above, sometimes it chooses the wrong large letter. Even on the third-to-last line it couldn't get one of the letters correct. The output looks like "HKO" while the actual chart has "HKG" -- but since there is no larger "G" for the algorithm to use as an example, it ended up with a different but similar shape. This could probably be improved by the other SR techniques they mention that use libraries of sample images.
Ah, that's incredibly clever, but that must be incredibly expensive. For every pixel window you're trying to resolve, you have to look at (and generate smaller versions) of every other possible pixel windows, right?
[...] there is plenty of patch redundancy within a single image L. Let p be a pixel in L, and P be its surrounding patch (e.g., 5 × 5), then there exist multiple similar patches P1,...Pk in L (inevitably, at sub-pixel shifts). These patches can be treated as if taken from k different low-resolution images of the same high resolution “scene”, thus inducing k times more linear constraints (Eq. (1)) on the high-resolution intensities of pixels within the neighborhood of q ∈ H (see Fig. 3b). For increased numerical stability, each equation induced by a patch Pi is globally scaled by the degree of similarity of Pi to its source patch P. [...]
[...] Assuming sufficient neighbors are found, this process results in a determined set of linear equations on the unknown pixel values in H. Globally scale each equation by its reliability (determined by its patch similarity score), and solve the linear set of equations to obtain H.
Then they also add patches at different scale by making a few more versions of the image which have been scaled down by various amounts, and doing comparisons between the target patch and patches in those images too. If good matches are found, they can then use the original higher-resolution original patch that yielded the match when it was shrunk.
Right. Another example is the only W (on the bottom line) mistaken as a M.
The M first shows on the third from the bottom, but not very clear, but tries to reconstruct it; then it uses that M to replace the M the second from the bottom, finally it guesses that W (only known as a 4x3 pixel blur on the bottom line)is a M.
It's also interesting that the bottom M is skewed out a little on it's top right corner to match the fuzzy W shape.
So where are the '?' coming from? I think I have to call shenanigans / database, there aren't any in the image, or even any shapes like it. But then, I'm not sure what they're implying with this paragraph (this is the whole paragraph, my additions not italicized):
>Fig. 6 [the eye chart] compares our uniﬁed SR result against ground truth. In Fig. 7 [baby with hat] we compare our method to results from [11,13] and . Note that our results are comparable, even though we do not use any external database of low-res/highres pairs of patches [11, 13], nor a parametric learned edge model . Other examples with more comparisons to other methods can be found in the paper’s website.
Dear Mr. Glasner, Bagon, and Irani (authors of the super-resolution paper):
Please apply this technique to resolve a historically important question about the assassination of President John F. Kennedy by identifying the license plate number of the car immediately behind the bus:
Better resolution photos should be available since the ones I've seen printed in books are better than these. And there are plenty of photos of Texas license plates from 1963 with which to seed your algorithm.
The reason this is interesting is because Roger Craig, a Deputy Sheriff in Dallas, after he witnessed the assassination said that he saw Lee Harvey Oswald run from the Texas School Book Depository and get into Nash Rambler station wagon driven by a another man.
These photos show a Nash Rambler station wagon that was indeed passing the Depository just after the assassination. If motor vehicle records still exist for 1963, then its owner can be identified.
The key point is that the existence of a person picking up Oswald after the assassination would strongly indicate the possibility that he was not acting alone.
I'll readily grant that this isn't likely to settle the issue, but I think it still would be amazing if the super-resolution approach was able to generate new information about a topic that no one expects to ever see new information about.
I'm not an expert on this algorithm, but I'm quite sure that the algorithm does not recover information which exists in the image; rather, it invents information which categorically does not exist. Since both the natural world and many built objects are quasi-fractal in nature (repeating windows in a building, leaves on a plant, etc.), the information which it invents often happens to be correct. For non-fractal items like license plates, however, there's no chance that it would produce a useful result.
I'm pretty sure that if you seeded the algorithm with images of many other Texas license plates, and then asked it to sharpen this image, what you'd get would simply be the super-resolution sum of all the other Texas license plates. It would show you something that looked intelligible, but actually had absolutely nothing to do with the original image. That would do more to confuse the issue than to shed light on it.
The other point is that I'm not sure how well this would actually work on blurry images, or whether (as a super-resolution technique) it's very specialised for working on pixelated images.
That said, if it can work on blurry images too, then it would be a fun and interesting exercise to see what other historical images could yield insight (or confusion) with this kind of enhancement.
8"what you'd get would simply be the super-resolution sum of all the other Texas license plates."*
I do not think so. That sum would not look like any other Texas license plate. Instead, it would locally pick the best fit parts to what data it can find in the image. Chances are their algorithm works at different scales; the better they can mix the information at different scales correctly, the smaller the risk that those local best fits are inconsistent with each other (for example, the left of a letter might look like a b, while the right looks like a q. If the algorithm does not look at scales of the width of a character, it could easily produce something that combines the two characters into one. Looking at that Snellen chart example, I think I see something like that in the top/bottom of some of the characters (last line should probably read D K N T W U L J S P X V M R A H C F O Y Z G; see http://guereros.info/img.php?fl=p5r4p4m4d4n2t24606s4t2m416d4...)
I think that, if there is sufficient data in the image, chances are good (but not necessarily 100%) that the result will be a sharp representation of the real license plate. Chances are also good that it would remove that barely visible dead fly from the plate. If there is insufficient information, it would pick a valid license plate that fit the data best. It would be nice if the algorithm also computed some validity estimates (a bit like an alpha mask 'my confidence in predicting this pixel is x%')
That’s not how this technique works. There’s no magic here, just some algorithms.
It doesn’t use time and space travel to reach back into the original scene and pull data that wasn’t captured by the camera. Instead, it guesses at what the blown up image should look like based on an examination of other features already in the image, with some expectation that shapes and textures will be similar in different parts of the image and at different scales – will have a sort of fractal nature – and by aiming to make shapes with edges instead of fuzzy boundaries.
Using these algorithms can’t reveal the precise content of a license plate that was too small in the original image to see the letters of.
> license plate that was too small in the original image to see the letters of
I did mention that better quality photos are available. And I did show that there are at least 2 photos of the same license plate -- there are probably more. In the photos printed in books, the license plate looks like the last couple lines of the Snellen eye chart in the original article -- i.e. it's quite blurry, but the super-resolution nethod was able to resolve it.
> it guesses at what the blown up image should look like based on an examination of other features already in the image, with some expectation that shapes and textures will be similar in different parts of the image and at different scales
It is true that there are no license plates at different scales in the same image. But are you quite certain that the technique cannot be applied across a range of photographs? That is, couldn't photos of license plates of the same type (and hence with exactly the same font) at different scales taken by the same photographic process on the same day in the same conditions have the same "fractal nature" that could be exploited to improve the resolution of the target photo?
Let me put it another way: Suppose I cut up the Snellen eye chart into 6 equal pieces. Now I have 6 photos. Clearly, 5 of those photos can be used to improve the resolution of the 6th. Is there some principle at work here that says that for super-resolution to work, all data must come from a single original photo?
If you had several pictures of the same license plate, each of which was big enough to get meaningful information from but just barely too small to make out the letters when you blew it up with a regular bicubic interpolation + gaussian sharpening step, then you might be able to figure out more than you previously could. It seems quite unlikely to me, but go ahead and try it.
If you look carefully, it’s actually possible to make meaningful guesses at the letters in the eye chart just as well in the bicubic version as in the “SR” version.
Did we look at the same chart? The characters in that last line are 3 or 4 pixels tall. I just did an experiment and 80% of guesses were wrong. There is just not enough information, but this method can still decode them.
The authors explained the chart, it only works because it looks across the whole chart for similar parts and recreates the smaller characters using the data it finds in the larger ones. So you would need a couple larger images of the licence plate too.
Information cannot be created from nothing, the algorithm sees that the chart seems to be repeating a similar image over and over, so it guesses the best fit for those images on the bottom lines. They might even be completely wrong, for example, is it really a D and not an O?
Yes. That's why the author mentions using other plate images from the time, having the same typeface, to seed the algorithm. He mentions that higher res photographs exist, which might yield a plate image of a similar resolution to that chart's smallest characters.
He is not asking for magic, and it's obviously not "creating information from nothing". No one here is stupid. It's "matching low-density patterns to correlated high-density ones", which you already know if you read the article. Maybe there isn't enough data at all to do it - I don't have access to these images to know better, but that doesn't make the idea any less reasonable.
Unfortunately, you don't know that the algorithm got the right answer for the letters on the bottom row. You'd have to try it on an image where you already knew the answers. Other comment threads note the there are lots of errors on that last line in the eye chart. Again, you can't recover more information than is in the image. When multiple characters look identical when scaled down, no algorithm can recover them reliably.
Well, with a 16/22 chance of getting a letter right, that gives a 15% chance of correctly reading a 6-letter license plate correctly. Of course that can be improved with more training, but I don't think it'll ever get high enough to make any dent in the conspiracy theories. Then again, maybe they'll all want to believe in the CSI-style zoom/enhance magic.
I highly recommend "Family of Secrets" by Russ Baker. It gives great insight into the history of the Bush family and their intelligence network. A large part of the book is devoted to their involvement in the JFK assassination.
I'm a little late to the party, but super-resolution is a topic of interest in computer vision, and this approach is one of many that have been recently proposed. At a high level, I should say that the field is still searching for the right approach, and there is by no means a consensus that this method (or more generally, this philosophy) is the right one.
As a general rule of thumb, the performance shown in example figures of a computer vision paper should be taken as best cases (and in some papers, as outliers), and even the "failures" shown are often not the worst or most typical. Similarly, quantitative results are generally as optimistic as possible, arrived at through "graduate student descent" of the parameter space.
So I think we are still quite far from a super-resolution method that is truly practical OR effective.
Some specific points about this paper:
- This research group has a long history of exploring methods based on exploiting self-similarity in images for various tasks (super-resolution, denoising, segmentation, etc.), and although they have shown remarkable progress on this front, it is generally agreed that using more than just a single image would improve results drastically for just about any task.
- The use of more-than-exhaustive self-similarity search is EXTREMELY expensive, however, and that's why runtime is often not mentioned in these papers. It's not uncommon for processing times to be on the order of DAYS, for a single tiny VGA image. (I don't remember whether they quote a number in this paper or not, but it's certainly not faster than a few hours per image, unless you start using some hacks.)
- As others have commented, this method is not actually extracting new information from the image, but rather "hallucinating" information based on similar parts of the image. There is certainly reasonable justification for its use as a prior, but it's not clear whether it's optimal, or even close to it.
- My gut feeling is that super resolution methods will find much more application in compression-related areas rather than CSI-esque forensics. For example, JPEG compression is roughly 30%. But downsample an image by 2x in each direction and your compression ratio is already 25%. So if you can downsample images a few times and then use super-resolution to "hallucinate" a high-resolution version, that's probably good enough for the vast majority of common images (e.g., facebook photos), where users aren't very sensitive to quantitative "correctness". And of course with mobile adoption happening much faster than mobile bandwidth is increasing, this becomes an ideal application domain.
I'd be curious to know at what magnification this breaks down, if any. For instance, most of the images have been enlarged by about 3x or 4x. What happens when you blow them up to, say, 20x? It seems like this algorithm does some sort of vectorization -- because look at how crisp the lines are -- which would suggest to me that it scale well. But maybe I'm wrong.
Not directly (but of course you were not entirely serious). Look at, for instance, the letter chart. The letters at the bottom are turned into very sharp forms, but not all look like letters, and those that do might not be turned into the correct ones...
For instance, if they used this to "enhance" a picture with a piece of paper in someone's pocket, these letters would be enhanced based on other forms somewhere else in the picture -- so some vague letter shapes might become 'Burger King' just because the photo also contains a Burger King logo somewhere :)
It may not add totally new information, but it does more than create an image that looks nicer. It's not just inventing plausible information, it's providing the most likely new information given the available data. For example, "given what we can see of that eye chart, it's most likely the little blob is a 'G'". So it's adding new interpretive statistical information, which could be pretty helpful in some situations.
The problem is that the algorithm is using a definition of "most likely" that doesn't help you answer questions about the real scene captured in the image.
For example, if what you're really trying to do is decode text, this algorithm will lose every time to one that knows the prior probabilities of natural human languages. ("ONLY" is much more probable than "QNLY", but this algorithm does't know that).
It's a shame that hamburglar is hellbanned because his comment is insightful. In some sense, isn't this algorithm "inventing" detail? If you're lucky the detail corresponds closely to what should be there, but sometimes it doesn't and thus it can't be relied upon.
For those without showdead one, hamburglar's comment:
Yeah, and in fact the fact that it looks crisper can be misleading. You can't rely on a detail that a method like this infers from the low resolution image. Look at the green tiles on the kitchen floor in the 6th example. The "Our SR Result" image looks fantastic, but it's obvious that it's false, because the tiles end up misshapen and unevenly spaced. You really wouldn't want to convict someone of murder based on this type of enhancement.
NO. OMG NO. This (artistic) class of algorithm makes things look pretty, but does not add information and ABSOLUTELY does not add evidence. Any serious mention of generative algorithms in the context of evidence is at best negligent.
Humans are easily fooled. They will look at a fuzzy picture and see possibilities, but they will look at the output of this algorithm and see fact. There are other algorithms that predict the possibility of objects (usually letters) producing an image that are more suited for investigative work.
"Adding information" in this context means providing accurate data that was not previously there. The guesses that the algorithm makes are approximations, but there are many original signals that could produce those same approximations, so the information is not accurate.
Could you use AI to find a near match for the image? Using the eye chart as the example: Since this now becomes a image matching problem once Super Resolution is used, search an image database and zoom enhance.
Some of these images are in uncanny-valley territory for me. The bicubic images just look resized to me, but super-resolution images look real but "wrong". Given how much more "real" they still look than many computer graphics (especially realtime), I guess we won't leave uncanny valley behind us anytime soon.
Techniques like this have been used in some high-end sensor applications since (at least) the 1990s. It was an option for increasing discrimination resolution via DSP that could not practically be accomplished via higher resolution sensors due to things like space constraints given the technology of the time. It is probably less useful today due to ongoing miniaturization of imaging sensors.
So especially with the ongoing miniaturization of imaging sensors, which will allow you to take even bigger and bigger pictures, there are issues these algorithms could solve, unless SD cards get really cheap really fast I guess.
I meant less useful for the purposes which such algorithms were originally designed for two decades ago. They original use cases were discrimination, not making prettier pictures. I am quite familiar with compressive sampling.
Depixelizing Pixel Art may work even better for some types of images: http://research.microsoft.com/en-us/um/people/kopf/pixelart/
"The algorithm extracts a smooth, resolution-independent vector representation from the image, which enables magnifying the results by an arbitrary amount without image degradation."
The first application that comes to my mind is improving the quality of photos in real estate listings. Half the time it looks like they were taken by a camera phone circa 1995 by someone who was skipping from room to room.
It is the principle of NL-means denoising: since an image is redundant, you use informations about shapes that are similar and are present at different scales in the image to guess the shape of the zoomed version of the smallest ones by using the (more thorough) content of the bigger ones
I claim bullshit. The theoretically optimal algorithm is, and has been known to be for 40+ years, sinc interpolation. Gimp does this (Lanczos approximation). Comparing to bicubic leaves aliasing relics and looks bad. I'd be very interested in seeing the algorithm compared to something that's not a pure known strawman.
(In CS terms, this is akin to comparing your algorithm to something using a bubble sort, and ignoring the invention of n log n sorting algorithms)
I don't know anything about this area of specialty, but I thought it important to point out that Huffman compression was also proven to be theoretically optimal, and then along came arithmetic compression. I only mean to say that you should be very precise about what the proof of optimality actually showed.
And, with that, I'm bracing myself to get schooled.
Edit: This reminds me of the Iterated Fractal Systems proprietary lossy image compression algorithm they tried to commercialize in the 90s. It was able to decompress to a larger scale image that introduced synthetic detail that was often convincing to the eye. Notice how this article talks about different scales.
That wasn't my point. You can do better than theoretically optimal -- indeed, their results, in many cases, do look better than sinc. Theoretically optimal makes assumptions that the source image is, in some ways, random, in a way that the real world doesn't conform to. If you know there are hard edges, you can do better.
The point is that they should be comparing to sinc. They're comparing to a known-stupid algorithm to make themselves look better than they are.
The problem with these: Every half a year we hear about new revolutionary image transformation.
Today it's super-resolution, last year it was low-res PC games graphics interpolation, year before that - filter that could resize images by one direction (e.g. width) while preserving proportions and throwing away less interesting parts. The image will become 2x narrower and you'll hardly notice.
Sadly, this is usually first and last time we see the technology in question. They do not seem to produce any impact that could increase our quality of life. They just sit on some dusty shelves somewhere.
The low-res texture stuff is widely used in emulators (both mobile and on PCs) - after all, that's exactly where it was meant to be used.
The seam carving (the "least interesting line removal" thing) is now a major feature of Photoshop (called content-aware scale) and has become really well integrated (I hear there are attempts to do in video as well).
The fact that you don't see this tech in your line of work doesn't mean it's not out there.