
Pillow-SIMD – Fast, production-ready image resize for x86 - igordebatur
https://blog.uploadcare.com/the-fastest-production-ready-image-resize-out-there-part-0-7c974d520ad9
======
pedrocr
Is doing Lanczos really faster and/or higher quality for scaling down than
just doing simple pixel mixing:

[http://entropymine.com/imageworsener/pixelmixing/](http://entropymine.com/imageworsener/pixelmixing/)

I implemented this for my image pipeline:

[https://github.com/pedrocr/rawloader/blob/230432a403a9febb5e...](https://github.com/pedrocr/rawloader/blob/230432a403a9febb5e5c004780211f7e765130b9/src/imageops/demosaic.rs#L155-L190)

Makes for simple enough code, and even before any serious effort at
optimization or SIMD it can convert a 3680x2456x4 image in 32-bit float (the
source article uses 3x8-bit) to 320x200x4, also in 32-bit float, in about
60ms (across 4 threads on 2 cores of an i5-6200U).
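
For anyone who wants the flavor of it, here's a minimal NumPy sketch of the
simplest integer-factor case (plain block averaging); full pixel mixing also
weights source pixels by fractional coverage:

    import numpy as np

    def box_downscale(src: np.ndarray, factor: int) -> np.ndarray:
        # Integer-factor pixel mixing: each output pixel is the mean of a
        # factor x factor block of source pixels (src is an HxWxC array).
        h = src.shape[0] - src.shape[0] % factor   # crop to a multiple
        w = src.shape[1] - src.shape[1] % factor
        blocks = src[:h, :w].astype(np.float32).reshape(
            h // factor, factor, w // factor, factor, -1)
        return blocks.mean(axis=(1, 3))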

~~~
jacobolus
Probably not faster. Definitely better quality. As Alvy Ray Smith said, a
pixel is not a little square,
[http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf](http://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf)

Edit: If you want to do even better than the tensor product Lanczos filtering,
you can do a filter based on Euclidean distance. Make sure that you work in
linear (non-gamma-adjusted) color space.

A couple more resources to read:
[http://www.imagemagick.org/Usage/filter/#cylindrical](http://www.imagemagick.org/Usage/filter/#cylindrical)
[http://www.imagemagick.org/Usage/filter/nicolas/](http://www.imagemagick.org/Usage/filter/nicolas/)

Edit 2: really weird that this is getting downvoted. There’s not really
anything to dispute here. It is straightforward to show that the Lanczos
method has objectively better output. Moreover, Alvy Ray Smith’s paper is a
classic which anyone interested in image processing should read.

~~~
wtallis
That memo needs to die. There are better ways to make the few of its points
that are still valid, without making assumptions and assertions that are
frequently invalid with today's technology.

Pixels are often not representative of little squares (rectangles). However,
they are more likely to be representative of little rectangles now than they
were in 1995, now that most photographs are captured on Bayer arrays of square
sensors, most computer-generated images are rendered as an average of point
samples weighted according to estimated coverage of a square pixel, and
everything is displayed on LCD or OLED panels with square pixels composed of
rectangular subpixels. If you take a random JPEG or PNG off the Internet,
interpreting the pixel data as point samples will usually be less accurate
than interpreting them as integrated over a rectangle. Interpreting the data
as integrated over a Gaussian distribution is also less realistic than the
rectangle interpretation. Doing image processing in a point-sample or Gaussian
context is certainly useful, but it's definitely not more fundamentally right
than the little square model. Historically, that context was at best a neutral
choice that was equally unrealistic no matter what kind of hardware you were
working with, but mathematically convenient.

The paper's arguments about coordinate systems (whether the y-coordinate
should grow toward the top of the screen or the bottom, and whether pixels
should be centered on half-integer points) are also a waste of time for the
modern reader.

~~~
mark-r
Even if your physical sensor is made of little squares, they cease to be
squares once you've converted them to single readings - you've integrated the
square into a single point. This is the basis of sampling theory. Continuing
to think of them as little squares leads you to bad intuitions such as the
resampling algorithm that we're responding to. Maybe there's a better way to
make that point than the cited paper, but I haven't seen it yet.

Whether pixels are on the half-integer points is a completely arbitrary
decision, unless you're trying to mix raster and vector graphics. Then the
correct solution will become obvious.

~~~
pedrocr
From the point of view of sensors the square model seems correct. Sensor
pixels are photon counters. If you want to turn a 2x2 pixel square into a
single pixel, summing the four pixels is the same as if you had a sensor with
1/4 the pixels, each of them counting photons over those 2x2 areas[1].
What we may be saying is that since you have the extra pixels you can actually
do _better_ than if the sensor was already natively at your desired resolution
by using a more complex filter over the extra data (e.g., by avoiding the
Moiré artifacts that are common in lower resolution sensors).

[1] Most sensors use Bayer patterns, so some extra considerations apply to
color resolution.

------
Asooka
Let me insert my personal pet-peeve: have you thought of making it colour-
space-aware? Most (all?) images you'll encounter are stored in sRGB
colourspace, which isn't linear, so you can't do the convolution by just
multiplying and adding (the result will be slightly off). The easiest way
would be to convert it to 16-bit-per-channel linear colour space using a
lookup table, do the convolution in linear 16-bit space, then convert back to
8-bit sRGB.
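
A minimal sketch of those lookup tables, assuming the standard sRGB transfer
functions:

    import numpy as np

    # 8-bit sRGB -> 16-bit linear: a 256-entry LUT using the sRGB EOTF.
    s = np.arange(256) / 255.0
    lin = np.where(s <= 0.04045, s / 12.92, ((s + 0.055) / 1.055) ** 2.4)
    TO_LINEAR16 = np.round(lin * 65535).astype(np.uint16)

    # 16-bit linear -> 8-bit sRGB: a 65536-entry LUT for the way back.
    t = np.arange(65536) / 65535.0
    enc = np.where(t <= 0.0031308, t * 12.92, 1.055 * t ** (1 / 2.4) - 0.055)
    TO_SRGB8 = np.round(enc * 255).astype(np.uint8)

    # Usage: resize TO_LINEAR16[img8] in linear space, then map the
    # result through TO_SRGB8 to get back to 8-bit sRGB.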

~~~
homm
Color space conversion is a hard topic in terms of performance. First of all,
not all images are stored in sRGB. Many have other color profiles (such as P3
or ProPhoto), so sRGB conversion is not enough; you need full color
management.

Second, you'll only see a real benefit from color management on a few images.
Most of the time you'll only notice the difference when you see both images
side by side on the same screen.

For now, I've settled on resizing in the original non-linear color space and
saving the original color profile with the resulting image.

~~~
jacobolus
By far most images in the wild are sRGB, and those that aren’t should
typically be tagged with their color space.

Resizing in gamma-adjusted space sometimes causes nasty artifacts. If you can
afford the CPU time, always convert to an approximately linear space first,
then downsize, then convert back. If you get the gamma curve slightly wrong
(e.g. gamma = 2.0 vs. 2.2) it’s not too big a deal, the resulting artifacts
won’t really be noticeable, so feel free to use a square-root routine or
something if it has better performance.

~~~
jacobolus
But note if you go to a linear space, you need to bump up the precision of
your numbers. 8 bits per channel doesn’t cut it anymore, you’ll want 12 or 16
bits per channel (or even better, a floating point number). This might have a
big effect on performance.

------
dahart
I'd be very interested in an optional Pillow-SIMD downsampling resize that
produces 16 bit output internally and then uses a dither to convert from 16
bit to 8 bit. Photoshop does this by default and it produces superior
downsampling. Without keeping the color resolution higher, you can end up with
visible color banding in resized 8 bit images that wasn't visible in the
source image.

I am curious whether the reason Pillow-SIMD is more than 4x faster than IPP
is features IPP supports - like higher internal resolution - that Pillow-SIMD
doesn't. The reported speeds here are amazing, and I'm definitely going to
check this project out and probably use it, but I'd love a little clarity on
what the tradeoffs are against IPP or others. I assume there are some.

~~~
homm
Each resampling algorithm internally produces some high-precision result
before truncating it to 8 bits. For Pillow-SIMD it is 32-bit integers. So far
I haven't considered dithering, but it is a very interesting idea. Do you
have any links for further reading about downsampling banding and dithering?

About IPP's features: the comparison is pretty fair: the same input, the same
algorithm and filters, pretty much the same output. If IPP uses more
resources internally to produce the same output, then maybe it shouldn't.

Shame on me, I still haven't added the link to the IPP test file I used. Here
it is:
[https://gist.github.com/homm/9b35398e7e105a3c886ab1d60bf598d...](https://gist.github.com/homm/9b35398e7e105a3c886ab1d60bf598dd)
It is a modified ipp_resize_mt program from IPP's examples. If you have IPP
installed, you'll easily find and build it.

~~~
dahart
> Do you have any links for further reading about downsampling banding and
> dithering?

Sadly, no, I wish I did. I just made some expensive mistakes printing giclee
images from downsampled digital files, and whipped up my own dither for
converting 16bit to 8bit. It wasn't until it bit me that I noticed Photoshop
does it better than most apps because dithering is on by default. That's when
I went looking and found an option for it in Photoshop's settings.

The main banding problem when downsampling is with slow changing gradients.
Sky and interior walls, for example. I bump into it a lot with digital art
too, since the source images don't have any noise. But even when there's noise
in the source image, downsampling 2x or more with a good filter can eliminate
the noise and cause gradients to stabilize and show their edges in 8bit color.
In my experience, the problem is more common with print than on-screen resized
images, but it's still pretty easy to spot on a screen, especially in the
darks, and especially when jpeg compressing the results.

Implementation-wise, the 16-to-8 bit dither is nowhere near as sensitive as
the dithers we normally see when converting 8 bits to black & white or when
posterizing. Almost anything you come up with will do; you don't need any
fancy error diffusion or anything like that. Here's what I do: imagine the
filtered 16-bit result as an 8.8 fixed-point number in the [0-256) range, so
the least significant bits are in the [0-1) range. I add a random number
between -0.5 and +0.5 before rounding to the nearest integer. Voilà: drop the
low 8 bits and the result is a dithered 8-bit value.

What I just described will be way too slow in your world if you call a random
number function for every pixel, so don't do that. :P For Pillow-SIMD you'd
want a random number lookup table or something slightly smarter than a
random() function. I dither the color channels separately, but there might be
some way to make it blaze by dithering the brightness and rounding all three
channels up or down at the same time. I've just never tried to optimize it
the way you're doing, but if you find a way and release anything that
dithers, I would _LOVE_ to hear about it.
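
In NumPy terms the whole thing is just a few lines (a sketch; the vectorized
noise array stands in for a per-pixel random() call):

    import numpy as np

    def dither_16_to_8(img16: np.ndarray, rng=None) -> np.ndarray:
        # View the 16-bit value as 8.8 fixed point in [0, 256): the high
        # byte is the output pixel, the low byte is the fraction.
        rng = rng or np.random.default_rng()
        vals = img16.astype(np.float32) / 256.0
        # Rectangular dither: one output-LSB of noise before rounding.
        noise = rng.random(img16.shape, dtype=np.float32) - 0.5
        return np.clip(np.rint(vals + noise), 0, 255).astype(np.uint8)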

------
FrozenVoid
Obligatory:
[http://johncostella.webs.com/magic/](http://johncostella.webs.com/magic/)

------
gioele
Unrelated: since when did this picture of Bologna become the new lenna.jpg?

I think I have already seen it in a couple of recent posts about image
compression. (It fits the definition of the Baader-Meinhof phenomenon [1]
perfectly.)

[1]
[https://en.wikipedia.org/wiki/List_of_cognitive_biases#Frequ...](https://en.wikipedia.org/wiki/List_of_cognitive_biases#Frequency_illusion)

~~~
josteink
> Unrelated: since when did this picture of Bologna become the new lenna.jpg?

IMO, Lenna doesn't contain enough sharp edges and contrast to highlight the
differences between the different resize techniques.

With Bologna, you can clearly see the problems with a nearest-neighbour
approach. I'm not sure that would have been equally visible with Lenna.

------
stephencanon
FWIW the Accelerate framework[1] gives roughly comparable performance[2] for
Lanczos resizing. Apple platforms only, but all Apple platforms, not limited
to x86.

[1] vImageScale_ARGB8888( ).

[2] I don't have identical hardware available to time on, and it's doing an
alpha channel as well, so this is slightly hand-wavy.

~~~
arcticbull
Accelerate is a really amazing and highly underrated framework. Sure, the
function names are, uh, suboptimal (I'm looking at you, vDSP). That said,
having a framework guaranteed across all devices to implement algorithms and
primitives in the fastest way for each new device as it comes out is
amazingly valuable.

I've built production systems over the last few years that really wouldn't
have been possible without it.

~~~
izacus
What kind of (server-side) production system can afford to only run on Apple
consumer hardware? Or was it a mobile app?

~~~
nuschk
IMGIX (an image-scaling service) used to use Mac Pros [1]. Dunno if they
still do though.

[1] [http://photos.imgix.com/racking-mac-pros](http://photos.imgix.com/racking-mac-pros)

------
nostrademons
Curious how this would compare vs. running it on the GPU? This is literally
what GPUs are made for, and they often have levels of parallelism 500+ times
greater than SIMD.

~~~
dimatura
I tried to do this once with Theano, and found that the latency of the
roundtrip to the GPU and back made it not worthwhile for a single image.
Maybe a batch of images at once would make it worthwhile. And this isn't what
Theano is intended for, admittedly - custom CUDA might do a better job.

~~~
fulafel
I got curious about the numbers so I did a napkin calculation:

In a 2012-vintage NVIDIA article[1] they get 5-6 GB/s in both directions
(array size 4 MB), which would be around 1500 Mpix/s with 8-bit RGBA pixels.

15 Mpix image: transfers both ways would take 20 ms, and with the GPU kernel
going at ~5x the CPU speed (CPU 30, GPU 150 Mpix/s), you would spend 100 ms
doing the computation. So 120 ms on GPU vs 500 ms on CPU.

[1] [https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/](https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/)

edit: so I have no idea about the real GPU speedup, but this shows that the
transfers shouldn't hurt too much unless the speedup vs CPU is very small.
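
The same numbers as a few lines of Python (all rates are the assumptions
stated above):

    # Napkin math: PCIe transfers at ~1500 Mpix/s each way, CPU resize
    # at 30 Mpix/s, GPU kernel at 150 Mpix/s -- all assumed figures.
    mpix = 15.0
    transfer_ms = 2 * mpix / 1500 * 1000   # both directions: 20 ms
    gpu_ms = mpix / 150 * 1000             # kernel time: 100 ms
    cpu_ms = mpix / 30 * 1000              # CPU time: 500 ms
    print(f"GPU: {transfer_ms + gpu_ms:.0f} ms, CPU: {cpu_ms:.0f} ms")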

~~~
nostrademons
Interesting, thanks. So it seems like it'd still be a pretty heavy win for the
GPU.

Also, a common use-case on the web today is to have one input image and then a
large number of output images (usually smaller) for different screen
resolutions & thumbnails. Seems like you could save a lot of time by uploading
the input image once and then running a bunch of resize convolutions for
different output sizes while it's still in the GPU memory, then download the
output files as a batch.

------
mark-r
I'm really happy to see this. The one time I tried looking at the PIL sources
for resizing, I was appalled at what I saw. Simply seeing that you're
expanding the filter size as the output-to-input size ratio shrinks is a huge
deal.

When I wrote my own resizing code, I found it helpful to debug using a
nearest-neighbor kernel: 1 from -0.5 to 0.5 and 0 everywhere else. It shook
out some off-by-one errors.
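
In code, the debug kernel is just a box (a sketch of the idea):

    def nn_debug_kernel(x: float) -> float:
        # 1 inside [-0.5, 0.5), 0 elsewhere. Pushed through a convolution
        # resizer, this must reproduce nearest-neighbor output exactly,
        # which makes off-by-one and phase errors easy to spot.
        return 1.0 if -0.5 <= x < 0.5 else 0.0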

------
cvwright
> No tricks like decoding a smaller image from a JPEG

Given that most cameras produce JPEG now, I'm curious why you don't make use
of the compressed/frequency-domain representation. To a novice in this area
(read: me), it seems like a quick shortcut to an 8x, 4x, or 2x downsample.

Or is the required iDCT operation just that much more expensive than the
convolution approach?

~~~
jacobolus
That’s a shortcut if you only ever have to downsample by powers of two and you
don’t mind worse image quality, since your down-sampled picture won’t use any
data from across block boundaries.

~~~
mark-r
You can use it in a multi-step process. Use JPEG blocks to get slightly above
the target size, then Lanczos to finish it off.
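
Pillow can already express that two-step process: Image.draft() asks libjpeg
to decode at a reduced scale, and Lanczos finishes the job. A minimal sketch:

    from PIL import Image

    def fast_thumbnail(path, size):
        im = Image.open(path)
        # Ask the JPEG decoder for a cheap 1/2, 1/4, or 1/8 scale decode
        # that is still at least as large as the target size.
        im.draft("RGB", size)
        # Finish with a proper Lanczos resize to the exact target size.
        return im.resize(size, Image.LANCZOS)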

------
Veratyr
Looks really nice!

I'd love to see vips in the benchmark comparison, perhaps a Halide-based
resizer too as those are the fastest I've found so far. Perhaps GraphicsMagick
too, as I believe it's meant to be faster than ImageMagick in many cases.

~~~
dahart
GraphicsMagick (GM) is definitely loads faster than ImageMagick (IM) for some
resize operations, but I doubt it's in the same league as Pillow-SIMD, just
guessing though. I've had to do some large image (gigapixel) resizing, and IM
keeps the entire image in memory, which causes it to hit swap. GM streams the
image resize instead, so it doesn't have to swap, and it can finish in a few
minutes instead of many hours.

------
ashishuthama
Another data point: MATLAB, glnxa64 AVX2, 12 core

    >> maxNumCompThreads(1);
    >> im = randi(255, [2560, 1600, 3], 'uint8');
    >> timeit(@()imresize(im, [320,200], 'bilinear', 'Antialiasing', false))
    ans =
        0.0083
    >> timeit(@()imresize(im, [320,200], 'bilinear'))
    ans =
        0.0301
    >> maxNumCompThreads(6);
    >> timeit(@()imresize(im, [320,200], 'bilinear', 'Antialiasing', false))
    ans =
        0.0062
    >> timeit(@()imresize(im, [320,200], 'bilinear'))
    ans =
        0.0113

Oh, missed that lanczos2 part:

    >> maxNumCompThreads(1);
    >> timeit(@()imresize(im, [320,200], 'lanczos2', 'Antialiasing', false))
    ans =
        0.0146
    >> maxNumCompThreads(6);
    >> timeit(@()imresize(im, [320,200], 'lanczos2', 'Antialiasing', false))
    ans =
        0.0049

Since MATLAB tries to do most of the computation in double precision, it's
harder to extract much from SIMD.

------
ttoinou
Have you tried using a fast blur (like StackBlur, for example:
[http://www.quasimondo.com/BoxBlurForCanvas/FastBlur2Demo.htm...](http://www.quasimondo.com/BoxBlurForCanvas/FastBlur2Demo.html);
the radius should be computed from the ratio between the original and target
sizes) as a first step before classic nearest-neighbor sampling? Also, an
algorithm that resizes to multiple resolutions at the same time could improve
speed.
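
The first idea as a rough sketch, with SciPy's uniform_filter standing in for
StackBlur:

    import numpy as np
    from scipy.ndimage import uniform_filter

    def blur_then_nearest(src, new_h, new_w):
        # Box blur sized to the scale ratio, then nearest-neighbor picks.
        h, w = src.shape[:2]
        size = max(1, round(h / new_h))
        blurred = uniform_filter(src.astype(np.float32),
                                 size=(size, size, 1))
        ys = ((np.arange(new_h) + 0.5) * h / new_h).astype(int)
        xs = ((np.arange(new_w) + 0.5) * w / new_w).astype(int)
        return blurred[ys][:, xs]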

    
    
    I take an image of 2560x1600 pixels in size and resize it to the following resolutions: 320x200, 2048x1280, and 5478x3424

So you are also upscaling?

------
vadiml
Great work and a great article. One question though: did you consider
replacing the convolution-based filters with FFT-based ones?

~~~
stagger87
For small filter sizes, convolution is going to be faster than an FFT
approach. Plus, correct me if I'm wrong, but you need to perform a
convolution for every output pixel, where the filter kernel differs for each
convolution (sampling the Lanczos filter at different points depending on the
resample ratio), which would really slow down an FFT approach.

~~~
pishpash
Beyond the mismatched signal size, the FFT approach produces circular
convolutions, which are not what you want in images. You'd need windowing or
a larger effective signal size, and you'd pay the cost of conversions back
and forth.

------
gfody
> I wasn’t building it for fun: I work for Uploadcare and resizing images has
> always been a practical issue with on-the-fly image processing.

Have you ever considered pushing the work entirely to the client with a
resize implemented in JavaScript? That would cut down on bandwidth as well.

~~~
homm
That's exactly what my previous article, "Image resize in browsers is
broken", is about :)

[https://blog.uploadcare.com/image-resize-in-browsers-is-broken-e38eed08df01](https://blog.uploadcare.com/image-resize-in-browsers-is-broken-e38eed08df01)

It was originally written two years ago, so some things have changed. But in
general it is still correct: in most browsers you need to combine several
ugly techniques to get suitable results, and even then the quality is nothing
like what you can get with direct access to the hardware.

~~~
MichaelGG
Why not use asm.js instead of relying on canvas and hacking around browser
problems?

~~~
homm
1. asm.js is dead in favor of WebAssembly. 2. I need a solution for most
browsers on most platforms, not just for the most recent and popular ones.

------
techdragon
Any reason this had to be a fork?

I would much rather this feature be in Pillow so ALL of the Python ecosystem
could get 6x faster image resizing.

~~~
girvo
He addressed that in the article: Pillow is cross-platform and cross-
architecture, so these sorts of specific optimisations (x86-64 only, with some
pretty specific instruction requirements) meant that the author felt it
wouldn't be a good fit in the original library

------
vortico
Looks fantastic! Are there other bottlenecks, such as JPEG encoding and
decoding that can be ported to SIMD code in Pillow?

~~~
lcnmrn
No, that’s handled by libjpeg which can be replaced by libjpeg-turbo but
nothing else at the moment.

------
legulere
Can the speed be increased even further using GPGPU?

------
MuffinFlavored
> With optimizations, Uploadcare now needs six times fewer servers to handle
> its load than before.

This is devil's advocate, but did you guys have a concrete need for this
optimization? You now need six times fewer servers, but was the old count a
crippling problem, or is this a cool statistic for the future when you get
more users?

~~~
girvo
One can make an argument that the CO2 and energy cost of wasted server usage
is a decent reason for it! 6x fewer is not a small amount; that's a great
result.

~~~
dmitrymukhin
I thought this was in the article already :)

