

Optimizing Gaussian blurs on a mobile GPU - jcxplorer
http://sunsetlakesoftware.com/2013/10/21/optimizing-gaussian-blurs-mobile-gpu

======
rjdagost
The page isn't loading for me but I will chime in with an image smoothing
optimization trick I've used. If you don't need an exact Gaussian blur you can
approximate it very efficiently using integral images:
[http://en.wikipedia.org/wiki/Summed_area_table](http://en.wikipedia.org/wiki/Summed_area_table)

The more "boxes" you use in your sliding window (analogous to a convolution
kernel), the better you can approximate a Gaussian kernel. If you're using a
big Gaussian kernel on a big image then integral images can result in a large
reduction in the number of operations performed and thus potentially enable a
big speed-up.
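To make the idea concrete, here is a minimal CPU sketch in NumPy (names and the three-pass count are my choices, not from the comment): a summed-area table lets each box average cost four lookups regardless of radius, and repeated box blurs converge toward a Gaussian by the central limit theorem.

```python
import numpy as np

def integral_image(img):
    """Summed-area table: I[y, x] = sum of img[:y, :x]."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def box_blur(img, r):
    """Mean filter of radius r via 4 table lookups per pixel (edges clipped)."""
    h, w = img.shape
    I = integral_image(img.astype(np.float64))
    out = np.empty((h, w))
    for y in range(h):
        y0, y1 = max(y - r, 0), min(y + r + 1, h)
        for x in range(w):
            x0, x1 = max(x - r, 0), min(x + r + 1, w)
            s = I[y1, x1] - I[y0, x1] - I[y1, x0] + I[y0, x0]
            out[y, x] = s / ((y1 - y0) * (x1 - x0))
    return out

def approx_gaussian(img, r, passes=3):
    """Repeated box blurs approximate a Gaussian (central limit theorem)."""
    out = img.astype(np.float64)
    for _ in range(passes):
        out = box_blur(out, r)
    return out
```

The per-pixel cost is constant in the radius, which is where the win over a direct KxK convolution comes from.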

------
jerf
In addition to what was already there, I'd suggest trying to operate on the whole image at a time rather than pixel by pixel. To take a simple example, a blur that samples the local pixel at 50% and each of the top, bottom, left, and right neighbors at 12.5% is equivalent to drawing the entire image at half brightness, then drawing it shifted by one pixel left, right, up, and down at 12.5% of the original brightness, merging additively. I don't know if this would be faster or not - you'd have to profile it - but I'd consider it worth a try.
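The shift-and-blend idea can be sketched in NumPy (edge-clamped borders are my assumption; a GPU version would do this with additive draw calls):

```python
import numpy as np

def blend_blur(img):
    """Cross-shaped blur as weighted additive blending of shifted copies:
    50% center + 12.5% each of the four neighbours, edges clamped."""
    f = img.astype(np.float64)
    p = np.pad(f, 1, mode='edge')  # replicate border pixels
    return (0.5 * p[1:-1, 1:-1]
            + 0.125 * (p[:-2, 1:-1]    # shifted down  (top neighbour)
                       + p[2:, 1:-1]   # shifted up    (bottom neighbour)
                       + p[1:-1, :-2]  # shifted right (left neighbour)
                       + p[1:-1, 2:])) # shifted left  (right neighbour)
```

The weights sum to 1, so overall brightness is preserved.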

~~~
xenonite
On the contrary. Consider that memory is the bottleneck when performing the blur. As I understand it, you would create five instances of the image (50%, 12.5% left, right, top, bottom). This would worsen the bottleneck even more.

Additionally, the advantage of computing pixel by pixel is that the shader can operate in a massively parallel fashion.

~~~
sliverstorm
I would think it is going to depend on your cache size. Pixel-by-pixel will be better if the image can't all fit in the cache, but if the image is small enough, the whole-image approach can keep everything in the cache.

Or, do you mean that memory is the bottleneck as in, shipping the image to the
GPU's memory space?

~~~
xenonite
Yes, it really depends on your cache size.

And the problem with that is, you can't guess the cache size. You can help
yourself with profiling, but this leads to a local optimization for only some
GPUs.

If you wish to run your code optimized for any GPU, the pixel-by-pixel approach usually works best. Then the GPU scheduler can run as many neighboring threads as possible on its subprocessors. Note that every subprocessor has its own local cache, which is really quick.

------
tlarkworthy
FFTs were designed for this:
[http://en.m.wikipedia.org/wiki/Convolution_theorem](http://en.m.wikipedia.org/wiki/Convolution_theorem)

~~~
backprojection
I'm surprised this isn't even mentioned in the article. I guess there are
issues with boundary effects, but those can be overcome in a variety of ways.

Also, if your convolution kernel is short enough, you can beat an FFT. In 2D with an NxN image, an FFT filter has a complexity of about (N^2 + N^2 log N). If your kernel is KxK, then direct convolution is (N^2 K^2), so you have to compare log N with K^2. But keep in mind: 1) these are asymptotic complexities and the constant factors matter; 2) K may or may not depend on N; 3) FFTs may have larger memory requirements, although you may be able to get away with in-place FFTs, and you don't have to explicitly compute the FFT of the Gaussian - you can work that out analytically (you should get another Gaussian...).

Also, as someone pointed out, the gaussian is separable, so then the
complexity of the direct convolution is (K N^2).
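A minimal sketch of the convolution-theorem approach in NumPy, using the analytic transform of the Gaussian (the frequency-domain formula and the periodic boundary handling are my assumptions):

```python
import numpy as np

def fft_gaussian_blur(img, sigma):
    """Blur via the convolution theorem: one forward and one inverse 2-D FFT.
    The Fourier transform of a Gaussian is another Gaussian, so the kernel's
    spectrum is written down analytically instead of FFT'd. Boundaries wrap."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies (cycles/pixel)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    # Transform of exp(-x^2 / (2 sigma^2)) is proportional to
    # exp(-2 pi^2 sigma^2 f^2); normalised so DC gain is exactly 1.
    kernel_hat = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fy ** 2 + fx ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * kernel_hat))
```

Because the kernel's spectrum is 1 at zero frequency, the mean of the image is preserved exactly.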

Anyway, what are FFT libraries like for iOS? I suppose with Apple's policies, you can't compile FFTW for your app?

~~~
lcrs
Last I played with FFT convolution it didn't end up being a win for any sane
filter size, even compared to brute-force convolution with arbitrary non-
separable kernels. Admittedly I was using off-the-shelf tools in Nuke, but I
would imagine both their FFT and Convolve nodes are reasonably optimized.

There's an analysis of the 1D case here with some interesting cautions at the
end:
[http://www.engineeringproductivitytools.com/stuff/T0001/PT15...](http://www.engineeringproductivitytools.com/stuff/T0001/PT15.HTM)

------
monkey_slap
I've done some research into doing live blurs and created my own hack solution [0]. The biggest bottleneck I've always seen people run into is taking the snapshot. Your solution looks really great, but I'm wondering how you would continuously snapshot the screen without running into issues? Or have you found another solution?

Really great read!

[0]: [http://whoisryannystrom.com/2013/09/17/Live-blur-in-iOS7/](http://whoisryannystrom.com/2013/09/17/Live-blur-in-iOS7/)

------
thomasahle
There is also blur by FFT. This way you get n log n performance in the number of pixels, unrelated to sigma and the size of your blur kernel.

------
jffry
Another opportunity for efficient blur would be to rescale the image if the
blur radius is above a certain amount. This incurs overhead, so for smaller
radii it might be more efficient to just blur the full-size image, but for
larger radii you can first reduce the image, blur with a smaller radius, and
then upscale the image.

You might need an Arbitrary Hack Constant (oh, sorry, "tuning parameter") to
decide when to make that transition.
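A sketch of that transition logic in NumPy (the cutoff constant, the 2x downsampling factor, and the pluggable `blur` callback are all my assumptions for illustration):

```python
import numpy as np

RADIUS_CUTOFF = 8  # hypothetical tuning parameter: where downscaling starts to pay off

def downsample2(img):
    """Halve width and height by 2x2 box averaging (assumes even dimensions)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(img):
    """Double width and height by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def adaptive_blur(img, radius, blur):
    """`blur(img, radius)` is any blur routine. Below the cutoff, blur at
    full resolution; above it, blur a quarter-resolution copy at half the
    radius and upsample the result."""
    if radius < RADIUS_CUTOFF:
        return blur(img, radius)
    return upsample2(blur(downsample2(img), radius / 2))
```

Halving the resolution roughly quarters the pixel count while also halving the radius needed for the same apparent blur.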

~~~
crazygringo
That's already in the article:

> _Instead of filtering the image in its native resolution, I used OpenGL 's
> native support for linear interpolation to downsample the source image by 4X
> in width and height, blurred that downsampled image, then linearly upsampled
> via OpenGL afterward._

~~~
jffry
Indeed it is, my bad for not RTFM

------
FreeFull
Are there other blurs that are cheaper than the Gaussian but look good enough
to replace it for this sort of purpose?

~~~
lcrs
I've used a triangular ("conical"?) kernel when in a hurry and it looked
reasonable. Vastly better than a box blur, at least. They're actually not
separable but one can pretend and just do horizontal and vertical passes
without much damage.
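The pretend-separable tent blur can be sketched in NumPy like so (function names are mine; the effective 2-D kernel is the outer product of the 1-D tent with itself, which is close to but not exactly a 2-D triangle):

```python
import numpy as np

def triangle_kernel(r):
    """1-D triangular ("tent") kernel of radius r, normalised to sum 1."""
    k = r + 1 - np.abs(np.arange(-r, r + 1))
    return k / k.sum()

def tent_blur(img, r):
    """Horizontal then vertical pass with the same 1-D tent kernel."""
    k = triangle_kernel(r)
    horiz = np.apply_along_axis(np.convolve, 1, img.astype(float), k, mode='same')
    return np.apply_along_axis(np.convolve, 0, horiz, k, mode='same')
```

Two 1-D passes cost O(K) per pixel instead of O(K^2) for a true 2-D kernel, which is the same trick that makes the Gaussian cheap.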

Amusingly Autodesk ship a GLSL shader with Flame which uses a triangular
kernel under the hood but labels the button in the interface "Gaussian", the
cheeky swines...

The nicest option to imitate lens blur is a circular kernel but that's quite
aggressively non-separable. The other thing that really helps is blurring
linear-light values instead of the usual gamma-encoded images. The blurs in
Windows 7's Aero themes very clearly don't do this, to my eye - dark lines
spread out too much and look generally muddy.
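The linear-light trick amounts to decoding gamma before blurring and re-encoding after; a minimal sketch, assuming a simple power-law gamma of 2.2 rather than the exact piecewise sRGB curve:

```python
import numpy as np

GAMMA = 2.2  # simple power-law approximation of sRGB encoding

def blur_linear_light(img, blur):
    """Decode to linear light, blur there, re-encode. `blur` is any blur
    routine operating on values in [0, 1]. Blurring gamma-encoded values
    instead makes dark features bleed disproportionately and look muddy."""
    linear = np.power(img, GAMMA)
    return np.power(blur(linear), 1.0 / GAMMA)
```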

VFX guy who loves to blur things here :)

------
dharma1
You can also downscale before blurring and upscale after the blur - 4x without much noticeable difference in image quality. This gives a speed boost in two respects: fewer pixels and a smaller radius.

~~~
xenonite
This is already mentioned in the article.

------
kespindler
You can do Gaussian filter in 2 passes: first along X, then along Y. Then you
only grow linearly in size, rather than quadratically. This change should also
improve your cache hits.

~~~
widdershins
Mentioned in the article.

