Problems with image resizing are a much deeper rabbit hole than this. Some important talking points:
1. The form of interpolation (this article).
2. The colorspace used for doing the arithmetic for interpolation. You most likely want a linear colorspace here.
3. Clipping. Resizing is typically done in two passes, first in the x direction and then in y (not necessarily in that order). If the kernel has values outside the range [0, 1] (like Lanczos) and the intermediate result only captures the range [0, 1], you can get clipping in the intermediate image, which can cause artifacts.
4. Quantization and dithering.
5. If you have an alpha channel, using pre-multiplied alpha for interpolation arithmetic.
I'm not trying to be exhaustive here. ImageWorsener's page has a nice reading list[1].
[1] https://entropymine.com/imageworsener/
I've definitely learned a lot about these problems from the viewpoint of art and graphic design. When using Pillow I convert to linear light with high dynamic range and work in that space.
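For what it's worth, a minimal sketch of that workflow with Pillow and numpy (assuming 8-bit sRGB input; the helper names are mine):

    import numpy as np
    from PIL import Image

    def srgb_to_linear(x):
        # sRGB decoding for values in [0, 1]
        return np.where(x <= 0.04045, x / 12.92, ((x + 0.055) / 1.055) ** 2.4)

    def linear_to_srgb(x):
        return np.where(x <= 0.0031308, x * 12.92, 1.055 * x ** (1 / 2.4) - 0.055)

    def resize_linear(path, size):
        srgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
        lin = srgb_to_linear(srgb)
        # Pillow's float mode is single-channel, so resize each channel separately
        planes = [Image.fromarray(lin[..., c].astype(np.float32), mode="F").resize(size, Image.LANCZOS)
                  for c in range(3)]
        out = linear_to_srgb(np.stack([np.asarray(p) for p in planes], axis=-1))
        return Image.fromarray((np.clip(out, 0.0, 1.0) * 255.0 + 0.5).astype(np.uint8))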
One pet peeve of mine is algorithms for making thumbnails. Most of the algorithms from the image processing books don't really apply: they interpolate between points based on a small neighborhood, whereas if you are downscaling by a large factor (say 10) the obvious thing to do is to sample all the input pixels that intersect with each pixel in the output image (100 of them in that case).
That box averaging is a pretty expensive convolution, so most libraries downscale images by powers of 2 and then interpolate from the closest such image, which I think is not quite right and could be done better.
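For an integer shrink factor, the box average described above is just a reshape and a mean; a minimal numpy sketch (assuming the dimensions divide evenly by the factor):

    import numpy as np

    def box_downscale(img, factor):
        # average every factor x factor block of input pixels into one output pixel
        h, w, c = img.shape
        assert h % factor == 0 and w % factor == 0
        return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

OpenCV's cv2.resize with interpolation=cv2.INTER_AREA does essentially this kind of area average when shrinking.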
If you downscale by a factor of 2 using bandlimited resampling every time, followed by a single final shrink, you'll theoretically get identical results to a single bandlimited shrinking operation. Of course, real-world image resampling kernels (Lanczos, cubic, magic kernel) are very much truncated compared to the actual sinc kernel (to avoid massive ringing, which looks unacceptable in images), so the results won't be mathematically perfect. And linear/area-based resampling is even less mathematically optimal, although it doesn't cause overshoot.
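A sketch of that iterated scheme with Pillow, purely for illustration (Pillow's Lanczos already widens its filter support when downscaling, so it doesn't actually need this trick):

    from PIL import Image

    def iterated_downscale(img, target_w, target_h):
        # repeatedly halve while we're still more than 2x larger than the target...
        while img.width >= 2 * target_w and img.height >= 2 * target_h:
            img = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
        # ...then a single final shrink to the exact size
        return img.resize((target_w, target_h), Image.LANCZOS)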
Isn't this generally addressed by applying a gaussian blur before downsizing? I know this introduces an extra processing step, but I always figured this was necessary.
I played a little with FFT Gaussian blur. Instead of averaging hundreds of points per output pixel, it transforms the image and the blur kernel into the frequency domain, performs a pointwise multiplication there, and transforms the result back. It's way faster than the direct convolution.
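A minimal numpy sketch of that trick (single channel, periodic boundaries assumed; the Fourier transform of a Gaussian is again a Gaussian, so the kernel can be built directly in the frequency domain):

    import numpy as np

    def fft_gaussian_blur(channel, sigma):
        # channel: 2D float array; blur by pointwise multiplication in the frequency domain
        h, w = channel.shape
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        kernel_ft = np.exp(-2 * (np.pi ** 2) * (sigma ** 2) * (fx ** 2 + fy ** 2))
        return np.real(np.fft.ifft2(np.fft.fft2(channel) * kernel_ft))

scipy.ndimage.fourier_gaussian builds the same frequency-domain kernel if you'd rather not roll your own.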
Having to process 100 source pixels per destination pixel to shrink 10x seems like an inefficient implementation. If you downsample each dimension individually you only need to process 20 pixels per pixel. This is the same optimization used for Gaussian blur.
> If you downsample each dimension individually you only need to process 20 pixels per pixel.
If you shrink 10x in one direction, then the other, then you first turn 100 pixels into 10, before turning 10 pixels into 1. You actually do more work for a non-smoothed shrink, sampling 110 pixels total.
To benefit from doing the dimensions separately, the width of your sample has to be bigger than the shrink factor. The best case is a blur where you're not shrinking at all, and that's where 20:1 actually happens.
If you sampled 10 pixels wide, then shrunk by a factor of 3, you'd have 100 samples per output if you do both dimensions at the same time, and 40 samples per output if you do one dimension at a time.
Two dimensions at the same time need width^2 samples per output pixel.
Two dimensions, one after the other, need width * (shrink_factor + 1) samples per output pixel.
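Restating those counts in code for the three cases discussed above:

    def samples_2d(width):                 # both dimensions at once
        return width ** 2

    def samples_separable(width, shrink):  # one dimension, then the other
        return width * (shrink + 1)

    print(samples_2d(10), samples_separable(10, 10))  # 100 vs 110: box filter matched to a 10x shrink
    print(samples_2d(10), samples_separable(10, 3))   # 100 vs 40:  wide filter, small shrink
    print(samples_2d(10), samples_separable(10, 1))   # 100 vs 20:  pure blur, no shrink (the 20:1 case)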
> 3. Clipping. Resizing is typically done in two passes, first in the x direction and then in y (not necessarily in that order). If the kernel has values outside the range [0, 1] (like Lanczos) and the intermediate result only captures the range [0, 1], you can get clipping in the intermediate image, which can cause artifacts.
Also, gamut clipping and interpolation[0]. That's a real rabbit hole.
Wow, points 2, 3 and 5 wouldn't have occurred to me even if I tried. Thanks. I now have a mental note to look stuff up if my resizing ever gives results I'm not happy with. :)
Point 2 is the most important one, and getting it wrong is the most egregious error. Even most browsers implement it incorrectly (at least as of the last time I checked; I confirmed it again with Edge).
Here is the most popular article about this problem [1].
Warning: once you start noticing incorrect color blending done in sRGB space, you will see it everywhere.
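A quick way to see the problem, assuming 8-bit sRGB values (the transfer function follows the sRGB spec):

    def linear_to_srgb(x):
        # encode a linear-light value in [0, 1] as sRGB
        return 12.92 * x if x <= 0.0031308 else 1.055 * x ** (1 / 2.4) - 0.055

    naive = (0 + 255) / 2                             # 127.5: averaging the encoded values; displays too dark
    correct = linear_to_srgb((0.0 + 1.0) / 2) * 255   # ~188: average in linear light, then re-encode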
I'm a little bit sympathetic to doing it wrong on gradients (having said that, the SVG spec has an opt-in to do the interpolation in a linear colorspace, and browsers don't implement it). But not for images.
I imagine that, beyond just using linearized sRGB, a perceptually uniform colorspace such as Oklab would bring further improvement. Although I suppose the effect might be somewhat subtle in most real-world images.
For downscaling, I doubt that. If you literally squint or unfocus your eyes, the colors you see are mixed in a linear colorspace. It makes sense for downscaling to follow that.
When image generating AIs first appeared, the color space interpolations were terribly wrong. One could see hue rainbows practically anywhere blending occurred.
I'd also add speed to that list. Resizing is an expensive operation, and correctness is often traded off for speed. I've written code that deliberately skipped the conversion to a linear color space and back in order to gain speed.
A connected rabbit hole is the decoding of lossy formats such as JPEG: in my experience, depending on the library used (OpenCV vs. TensorFlow vs. Pillow), the default decoders give you RGB values that vary by 1-2% from each other.
And also (for humans at least) the rabbit hole of actually displaying the resulting image: various forms of subpixel rendering for screens, various forms of printing... all of which are likely to have a big influence on what counts as "acceptable quality".
Another thing I've experienced: after downsizing a document photo to a mandatory upload size, a character/number was randomly changed (6 to b or d, I don't remember which exactly). I had to convert the document to a PDF, which handled it better.
It would. It would also avoid accumulating quantization errors in the intermediate result. Having said that, there is precedent for keeping the intermediate image's pixels as integer values.
If you're doing interpolation you probably don't want a linear colourspace. At least not linear in the way that light works. Interpolation minimizes deviations in the colourspace you're in, so you want it to be somewhat perceptual to get it right.
Of course, if you're not interpolating but downscaling the image (which isn't really interpolation: the value at a particular position in the image does not remain the same), then you do want a linear colourspace to avoid brightening/darkening details, but you need a perceptual colourspace to minimize ringing, etc. It's an interesting puzzle.
I'd argue that if your ML model is sensitive to the anti-aliasing filter used in image resizing, you've got bigger problems than that. Unless it's actually making a visible change that spoils whatever it is the model is supposed to be looking for. To use the standard cat/dog example, the filter or resampling choice is not going to change what you've got a picture of, and if your model is classifying based on features that change with resampling, it's not trustworthy.
If one is concerned about this, one could intentionally vary the resampling or deliberately add different blurring filters during training to make the model robust to these variations.
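For example, something like this as a training-time augmentation (a sketch with Pillow; the target size and filter set are arbitrary):

    import random
    from PIL import Image

    # resize each training image with a randomly chosen resampling filter
    FILTERS = [Image.NEAREST, Image.BILINEAR, Image.BICUBIC, Image.LANCZOS]

    def random_resize(img, size=(224, 224)):
        return img.resize(size, resample=random.choice(FILTERS))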
You say that “if your model is classifying based on features that change with resampling, it’s not trustworthy.”
I say that choice of resampling algorithm is what determines whether a model can learn the rule “zebras can be recognized by their uniform-width stripes” or not; as a bad resample will result in non-uniform-width stripes (or, at sufficiently small scales, loss of stripes!)
A zebra having stripes that alternate between 5 black pixels, and 4 black pixels + 1 dark-grey pixel, isn’t actually a visible change to the human eye. But it’s visible to the model.
I'm not saying your general argument is wrong, but... zebra stripes are not made out of pixels. A model that requires a photograph of a zebra to align with the camera's sensor grid also has bigger problems.
For those going down this rabbit hole, perceptual downscaling is state of the art, and the closest thing we have to a Python implementation is here (with a citation of the original paper): https://github.com/WolframRhodium/muvsfunc/blob/master/muvsf...
Other supposedly better CUDA/ML filters give me strange results.
I really wish there were better general-purpose imaging libraries that steadily implement/copy these useful filters, so that more people could use them out of the box.
Most of the languages I've worked with are surprisingly lacking in this regard, despite the huge potential use cases.
In the case of Python, for example, Pillow is fine but it has nothing fancy. You can't even fine-tune the parameters of bicubic, let alone use the billions of new algorithms from the video communities.
OpenCV and the ML tools like to reinvent the wheel themselves, but often only the most basic ones (and badly, as noted in this article).
I found https://dl.acm.org/doi/10.1145/2766891 but I don't like the comparisons. Any designer will tell you, after down-scaling you do a minimal sharpening pass. The "perceptual downscaling" looks slightly over-sharpened to me.
I'd love to compare something I sharpened in photoshop with these results.
That implementation is pretty easy to run! The whole Python block (along with some imports) is something like:
    import vapoursynth as vs
    import muvsfunc as muf
    core = vs.core
    clip = core.imwri.Read(img)               # read the source image via the imwri plugin
    clip = muf.ssim_downscale(clip, x, y)     # the SSIM-based perceptual downscale
    clip = core.imwri.Write(clip, imgoutput)  # see the imwri docs for the exact Write arguments
    clip.set_output()
> Any designer will tell you, after down-scaling you do a minimal sharpening pass
This is probably wisdom from bicubic scaling, but you usually don't need further sharpening if you use a "sharp" filter like Mitchell.
Anyway, I haven't run Butteraugli or SSIM metrics against other scalers; I just subjectively observed that ssim_downscale was preserving some edges in video frames that Spline36, Mitchell, and Bicubic were not preserving.
> The definition of scaling function is mathematical and should never be a function of the library being used.
Horseshit. Image resizing, or any other kind of resampling, is essentially always about filling in missing information. There is no mathematical model that will tell you for certain what the missing information is.
Not at all. He is correct that those functions are defined mathematically and that the results should therefore be the same using any libraries which claim to implement them.
Arguably, downscaling does not fill in missing information, it only throws information away. Still, implementations vary a lot here. There might not be consensus on a unique correct way to do downscaling, but there are certain things that you certainly don't want to do, like doing naive linear arithmetic on sRGB color values.
If some of the edges are infinitely sharp, and you know which ones they are by looking at them, as in my example, then the image is using more than all of its available bandwidth at any resolution (it isn't bandlimited).
That's true in the 1D case as well. That requires upsampling with information generation before downsampling. Using priors to guess missing information is an interesting task that will never be finished. It isn't necessary for a satisfactory downsampling result.
One interesting complication for a lot of photos is that the bandwidth of the green channel is twice as high as the red and blue channels due to the Bayer filter mosaic.
Aha, no! Downscaling *into a discrete space by an arbitrary amount* is absolutely filling in missing information.
Take the naive case where you downscale a line of four pixels to two pixels - you can simply discard two of them so you go from `0,1,2,3` to `0,2`. It looks okay.
But what happens if you want to scale four pixels to three? You could simply throw one away but then things will look wobbly and lumpy. So you need to take your four pixels, and fill in a missing value that lands slap bang between 1 and 2. Worse, you actually need to treat 0 and 3 as missing values too because they will be somewhat affected by spreading them into the middle pixel.
So yes, downscaling does have to compute missing values even in your naive linear interpolation!
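The parent's 4-to-3 case, written out with plain linear interpolation (a sketch; the coordinate mapping is the usual pixel-centre convention):

    import numpy as np

    src = np.array([0.0, 1.0, 2.0, 3.0])      # four source pixels
    # three output pixel centres, mapped back into source coordinates
    xs = (np.arange(3) + 0.5) * 4 / 3 - 0.5   # 0.167, 1.5, 2.833 -- none land exactly on an input
    out = np.interp(xs, np.arange(4), src)    # every output value has to be interpolated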
>Take the naive case where you downscale a line of four pixels to two pixels - you can simply discard two of them so you go from `0,1,2,3` to `0,2`. It looks okay.
This is already wrong unless the pixels are band-limited to Nyquist/4. Trivial example where this is not true: an alternating 0, 255, 0, 255 pattern, which becomes solid 0, 0 (or 255, 255) after discarding every other pixel, instead of the mid-grey you'd expect.
For downscaling, area averaging is simple and makes a lot of intuitive sense and gives good results. To me it's basically the definition of downscaling.
Like yeah, you can try to get clever and preserve the artistic intent or something with something like seam carving, but then I wouldn't call it downscaling anymore.
The article talks about downsampling, not upsampling, just so we are clear about that.
And besides, a ranty blog post pointing out pitfalls can still be useful for someone else coming from the same naïve (in a good/neutral way) place as the author.
Now that's an interesting topic for photographers who like to experiment with anamorphic lenses for panoramas.
An anamorphic lens (optically) "squeezes" the image onto the sensor, and afterwards the digital image has to be "desqueezed" (i.e. upscaled in one axis) to give you the "final" image. Which in turn is downscaled to be viewed on either a monitor or a printout.
But the resulting images I've seen so far nevertheless look good. I think that's because natural images don't have that many pixel-level details, and we mostly see downscaled images on the web or in YouTube videos anyway...
By that I mean, I know what bilinear/bicubic/lanczos resizing algorithms are, and I know they should at least have acceptable results (compared to NN).
But I didn't know famous libraries (especially OpenCV, which is a computer vision library!) could have such poor results.
Also, as a side note, IIRC bicubic has a free constant in its equation. So technically, when comparing different implementations, you need to make sure that parameter is the same. But that shouldn't excuse the extremely poor results in some of them.
At least bilinear and bicubic have a widely agreed upon specific definition. The poor results are the result of that definition. They work reasonably for upscaling, but downscaling more than a trivial amount causes them to weigh a few input pixels highly and outright ignore most of the rest.
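For reference, the cubic kernel in question (a sketch of the Keys kernel; as far as I know OpenCV uses a = -0.75 and Pillow uses a = -0.5):

    def cubic_kernel(x, a=-0.5):
        # Keys cubic convolution kernel; "a" is the free constant mentioned above
        x = abs(x)
        if x < 1:
            return (a + 2) * x**3 - (a + 3) * x**2 + 1
        if x < 2:
            return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
        return 0.0

Note it only spans 4 input samples: unless the implementation stretches the kernel by the shrink factor (as Pillow does), a large downscale ends up ignoring most of the input pixels, which is exactly the problem described above.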
I've seen more than one team find that reimplementing an OpenCV capability they use gains them both quality and performance.
This isn't necessarily a criticism of OpenCV: often the OpenCV implementation is, of necessity, quite general, and a specific use case can enable optimizations not available in the general case.
If their worry is differences between the algorithms in libraries across different execution environments, shouldn't they either find a library they like that can be called from all such environments, or, if no single library can be used everywhere, just write their own using their favorite algorithm? Why make all libraries do this the same way? Which one is undeniably correct?
That's basically what they did, which they mention in the last paragraph of the article. They released a wrapper library [0] for Pillow so that it can be called from C++:
> Since we noticed that the most correct behavior is given by the Pillow resize and we are interested in deploying our applications in C++, it could be useful to use it in C++. The Pillow image processing algorithms are almost all written in C, but they cannot be directly used because they are designed to be part of the Python wrapper.
> We, therefore, released a porting of the resize method in a new standalone library that works on cv::Mat so it would be compatible with all OpenCV algorithms.
> You can find the library here: pillow-resize.
Hmmm. With respect to feeding an ML system, are visual glitches and artifacts important? Wouldn't the most important thing be to use a transformation that preserves as much information as possible and captures the relevant structure? If the intermediate picture doesn't look great, who cares, as long as the result is good.
Ooops. Just thought about generative systems. Nevermind.
So, what are the dangers? (What's the point of the article?) That you'll get a different model from the same originals processed by different algorithms?
Comparing resizing algorithms is nothing new, the importance of adequate input data is obvious, and differences in the availability of image processing algorithms are also understandable. Clickbaity.
I was sort of expecting them to describe this danger to resizing: one can feed a piece of an image into one of these new massive ML models and get back the full image - with things that you didn't want to share. Like cropping out my ex.
Is ML sort of like a universal hologram in that respect?
If you upscale (with interpolation) some sensitive image (think security camera), could that be dismissed in court as it "creates" new information that wasn't there in the original image?
The bigger problem is that the pixel domain is not a very good domain to be operating in. How many hours of training and thousands of images are used to essentially learn Gabor filters?
This article throws a red flag on proving negative(s). This is impossible with maths. The void is filled by human subjectivity. In a graphical sense, "visual taste."
Zimg is a gold standard to me, but yeah, you can get better output depending on the nature of your content and hardware. I think ESRGAN is state-of-the-art above 2x scales, with the right community model from upscale.wiki, but it is slow and artifacty. And pixel art, for instance, may look better upscaled with xBRZ.
Image resizing is one of those things that most companies seem to build in-house over and over. There are several hosted services, but obviously sending your users' photos to a 3rd party is pretty weak. For those of us looking for a middle ground: I've had great success with imgproxy (https://github.com/imgproxy/imgproxy), which wraps libvips and is well maintained.
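If you'd rather call libvips directly, pyvips is a thin wrapper; a minimal sketch (filenames are placeholders):

    import pyvips

    # libvips' thumbnail pipeline: shrink-on-load where possible, then a high-quality reduce;
    # pass linear=True if you want the resampling done in linear light
    thumb = pyvips.Image.thumbnail("input.jpg", 320)
    thumb.write_to_file("thumb.jpg")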