
Show HN: C++ Implementation of the Paper Median Filter in Constant Time - Ldpe2G
https://github.com/Ldpe2G/ArmNeonOptimization/tree/master/ConstantTimeMedianFilter
======
vanderZwan
Very nice! Starred for my own further experimentation, when (if) I ever have
time for this kind of thing again! (It's only a matter of time until one of
the regulars on Observable does a fast canvas implementation, so they'll
probably beat me to it.)

This reminds me a lot of Mario Klingemann's Stackblur [0] (and my "quadratic"
stackblur variant), which is also O(1) in the kernel radius (if I understand
the paper's metric correctly; measured in total pixels it is O(n)). The
techniques seem very similar.

The column histogram trick feels like taking what would be a two-pass approach
and merging it into a single, more memory-intensive pass. Pretty simple and
elegant! I think this could be applied to the stack blur approach too.

[0] [https://observablehq.com/@jobleonard/mario-klingemans-stackblur](https://observablehq.com/@jobleonard/mario-klingemans-stackblur)

------
GistNoesis
Nice. Is it amenable to a deep-learning GPU op? There is a trivial channel-
wise parallelization, and a slightly more complicated image-size-wise
parallelization with stitching. The real crux is computing the gradient op; I
can't seem to figure out a way to do the same histogram trick there.

(Edit: after thinking a little more, in the edge case of an image of constant
value, the adjoint of the median must add the gradient contribution to every
pixel in the radius, meaning the backward pass won't be constant time. Maybe
with a softer definition of the median, whose adjoint is allowed to pick
pseudo-randomly one index among the indices of the median values, we can
still salvage it.)
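
Something like this is what I mean by the softer adjoint (a toy sketch;
`argMedian` and the scatter are my own invention, not anything from the
paper):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical "softer" median adjoint: the forward pass records, for every
// output pixel, the flat index of the input pixel whose value was selected
// as the median (ties broken arbitrarily). The backward pass then scatters
// each output gradient to that one stored index, instead of splitting it
// across all tied pixels -- O(1) per output pixel.
void medianBackward(const std::vector<float>& gradOut,
                    const std::vector<int32_t>& argMedian,  // saved in forward
                    std::vector<float>& gradIn) {           // pre-sized, zeroed
    for (size_t i = 0; i < gradOut.size(); ++i)
        gradIn[argMedian[i]] += gradOut[i];
}
```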

~~~
throwaway_bad
How does max pool deal with this? It seems similar, in that you rank all the
pixel values and take the max (rather than the median). The problem with ties
still occurs there, no?

~~~
GistNoesis
Good idea. I've looked at the TensorFlow approach for backward max pool [0].

They seem to be using a naive algorithm, scanning over the pool area and
picking the first index whose value matches the pooled max, so the bigger the
kernel radius, the slower it gets.

They deal with the tie problem by taking the index of the first value that
matches the max value (with float32 that's an edge case that should almost
never happen, so it's a fine speedup).
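
Roughly this, in C++ (1-D with non-overlapping windows for brevity; the real
kernel [0] is a CUDA op over 4-D tensors, so take this as a sketch of the
strategy, not their code):

```cpp
#include <vector>

// For each output, rescan its pooling window and route the gradient to the
// FIRST input whose value equals the pooled max ("first match wins" on ties).
// Cost grows with the window size, unlike the histogram-based forward pass.
void maxPoolBackward(const std::vector<float>& input,       // size = outputs * window
                     const std::vector<float>& pooledMax,   // saved in forward
                     const std::vector<float>& gradOut,
                     int window,
                     std::vector<float>& gradIn) {          // pre-sized, zeroed
    for (size_t o = 0; o < pooledMax.size(); ++o) {
        for (int k = 0; k < window; ++k) {
            size_t i = o * window + k;
            if (input[i] == pooledMax[o]) {
                gradIn[i] += gradOut[o];
                break;                       // first matching index takes it all
            }
        }
    }
}
```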

My guess is that it's probably possible to apply this constant-time algorithm
with the same sort of approximation for the ties, but you'd also need to
enrich the histograms so that, in addition to the count, each bin also stores
and updates the last index associated with it.
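
I imagine the enriched bins looking something like this (pure speculation on
my part, and merging such histograms column-wise would need more care):

```cpp
#include <cstdint>

// Each bin tracks not only how many pixels in the window have that value,
// but also the image index of the most recently added one, so the backward
// pass can pick an index for the median bin in O(1).
struct IndexedBin {
    int32_t count     = 0;
    int32_t lastIndex = -1;   // flat index of the last pixel added to this bin
};

struct IndexedHistogram {
    IndexedBin bins[256];
    void add(uint8_t value, int32_t pixelIndex) {
        bins[value].count += 1;
        bins[value].lastIndex = pixelIndex;
    }
    void remove(uint8_t value) {
        bins[value].count -= 1;  // lastIndex may go stale; acceptable under
    }                            // the tie approximation discussed above
};
```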

I also just noticed that the constant-time algorithm hasn't been run on float
values. That's not really a problem, though, because we can use some order-
preserving bit tricks to convert float32 to UInt32 and then truncate to
UInt16 [1].
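
The trick from [1], for reference (my own transcription, so double-check it
before relying on it):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the float's bits, then flip the sign bit for non-negative
// values and ALL bits for negative ones. The resulting uint32_t compares in
// the same order as the original float (NaNs aside). Keeping the top 16 bits
// then gives a uint16_t suitable for a 65536-bin histogram.
uint32_t floatToOrderedU32(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));   // bit copy; avoids strict-aliasing UB
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

uint16_t floatToOrderedU16(float f) {
    return static_cast<uint16_t>(floatToOrderedU32(f) >> 16);
}
```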

[0] [https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc#L337](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc#L337)

[1] [https://fgiesen.wordpress.com/2013/01/21/order-preserving-bijections/](https://fgiesen.wordpress.com/2013/01/21/order-preserving-bijections/)

