 Hmm, aside from my note in https://news.ycombinator.com/item?id=26543937 about how the family of kernels Costella discovered are precisely the uniform cardinal B-splines, I noticed this slight error:> Computationally, however, nothing short of Fast Fourier Transforms or hardware acceleration can make a large blurring job efficient: for large values of blur, the support of any blur kernel is large, and not practical for some at-scale applications.The standard way to do Gaussian blur is to approximate it as a B-spline, specifically a cascade of simple moving average filters, also known as a Hogenauer filter or CIC filter in the time domain. Typically a cascade of three filters is used, precisely as Costella describes. This is both simpler and faster than doing the task with a Fourier transform, and moreover it is an online algorithm — it produces output samples at some processing lag behind the input samples.As you can easily verify, the following program uses this approach to convolve its input buffer with a Gaussian kernel with support of 766 samples, constructed in just the way Costella describes (as a B-spline), in a single pass over the input buffer, with only six additions and subtractions per sample, not counting the additions and subtractions used for addressing.`````` #include enum {lag = 256, image_size = 4096}; unsigned long buf[image_size] = {0}; int main(int argc, char **argv) { buf[image_size/2] = 1; for (size_t i = 0; i < image_size - 3 * lag - 3; i++) { buf[i + 3 * lag + 2] += buf[i + 3 * lag + 1]; buf[i + 3 * lag + 1] += buf[i + 3 * lag]; buf[i + 3 * lag] += buf[i + 3 * lag - 1]; buf[i + 2 * lag] -= buf[i + 3 * lag]; buf[i + lag] -= buf[i + 2 * lag]; buf[i] = buf[i + lag] - buf[i]; } for (size_t i = 0; i < image_size - 3 * lag - 3; i++) { if (i) printf((i%8) ? ", " : ",\n"); printf("%6ld", buf[i]); } printf("\n"); return 0; } `````` Removing the printf calls, this takes about 162'677 instruction executions to run, according to Valgrind, which is 39 instructions per pixel. But most of that is still startup and shutdown; increasing the image size by another 4096 increases the number to 232'309 instructions, only 69632 more instructions, or 17 instructions per sample. (printf takes almost two orders of magnitude more time than the blur call.)Adding an outer loop reveals that on my nine-year-old laptop it takes 6.5 μs to run over the whole buffer, or 1.6 nanoseconds per sample. In theory this performance does not depend on the lag and thus the blur radius, but I've only tested a single lag.The output samples are in some sense scaled by 2²⁴ = 16777216, or by 2¹⁶ = 65536 if you're interpolating a signal that's being upsampled by zero insertion by a factor of 256. This scale factor does depend on the lag (it is the cube of the lag).This version computes the results in place starting from output that is already in memory, but you can also use a ring buffer with a capacity of at least 3lag + 2 = 770 samples. If you're downsampling, an alternative is to decimate the samples as soon as they've passed through the cascade of three integrators at the top of the loop, thus reducing both the amount of memory needed and the number of operations to do.Overflow of intermediate results can happen, but it is not a problem as long as the final output is within range, because exact integer addition mod k is a group. If you were doing this in floating point, you might prefer to interleave the integrators with the comb filters to prevent intermediate overflow and consequent loss of precision leading to numerical instability.Only the parts of the result buffer over which all six pipeline stages have passed contain valid results. The simplest solution to this is padding, but you can also use special logic to run the integrators but not the combs over the beginning of the input.I would argue that 17 instructions or 1.6 nanoseconds per sample is always "efficient" and "practical" for nearly all "at-scale applications". If you were blurring a 3840×2160 display at 120 fps, one gigapixel per second, it would take 13 milliseconds of my laptop CPU cores per second for each of the R, the G, and the B; you could do it on about 6 cores of a modern CPU, without a GPU. Correction: if you want to do that in two dimensions, you have to do that twice, and the second dimension is going to have worse locality of reference, especially if you don't tile your images. I still think that's efficient and practical.GCC unrolled the loop twice, so it was 34 instructions that processed two samples per iteration. The instructions are mostly things like this:`````` addq (%rax), %rbp subq %rdi, %rcx addq %rbx, %rdi subq %rcx, %rdx movq %rcx, -2064(%rax) movq -2056(%rax), %rcx `````` That should clarify a bit how one core of my mere 2.8GHz Core i7 is managing to run over ten billion instructions per second.Godbolt link for those who want to play: https://godbolt.org/z/MfEqYn3PdGaussian blur gets used for this a lot precisely because it's so easy to compute this way — it's the only convolution kernel that's anisotropic and separable. So people press it into service doing things like camera bokeh that it's not really very good at. Ah nice, I will correct that statement, which I was sure was bound to have a counterargument. :) I see that now your page links to my comment, saying, "More efficient ways of performing Gaussian blur have been noted; the Magic Blur was convenient in that it made use of existing hardware libraries and calling code stack."But in fact what I described above is the Magic Blur algorithm, just in a slightly different guise, and it can probably take advantage of existing hardware libraries in the same way. However, I don't know what your hardware libraries are doing in the blur case! Happy to help! Search: