
Sort Faster with FPGAs - szczys
http://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/
======
vvanders
I wonder how quickly routing resources will cause the fan-out time to hurt the
speed of sort as well.

Also fun is Radix Sort. If you can guarantee a fixed-size key (say 32 bits), a
Radix Sort will do O(kN) where k is usually 1-4 based on your cache line size.
Destroys any other sort due to linear data reads, by almost an order of
magnitude in some cases. I've used it on some previous gigs to sort thousands
of quads front-to-back in record time on some older ARM chips.
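For reference, a minimal C sketch of the LSD radix sort described above: 32-bit keys, four 8-bit digit passes (so k = 4), each pass a stable counting sort with purely linear reads. The function name and digit width are my own choices, not from any particular implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort for 32-bit keys: k = 4 passes over 8-bit digits.
   Each pass is a stable counting sort, so total work is O(k*N). */
void radix_sort_u32(uint32_t *a, size_t n) {
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)      /* histogram this digit */
            count[(a[i] >> shift) & 0xFF]++;
        size_t pos[256], total = 0;
        for (int d = 0; d < 256; d++) {     /* prefix sums -> output offsets */
            pos[d] = total;
            total += count[d];
        }
        for (size_t i = 0; i < n; i++)      /* stable scatter into tmp */
            tmp[pos[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);      /* next pass reads sorted-by-digit data */
    }
    free(tmp);
}
```

The cache friendliness comes from the two sequential sweeps per pass; the only scattered writes are the 256-way scatter, which is why wider digits stop paying off once the offset table blows past the cache.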

~~~
minimax
_O(kN)_

If this is trolling you're doing a very good job.

~~~
vvanders
Not particularly.

I'll probably get downvoted but I'm not a huge fan of Big-O notation (hence
the k). It's very hand-wavey about the things that have a significant impact
on performance (like cache misses).

Great from the ivory tower of academia but down in the real world we like
things that run fast :).

~~~
minimax
I agree with you and I meant it as a genuine compliment.

~~~
vvanders
Thanks, I'll have to re-train my compliment detector :).

I enjoy breaking out Radix because it's one of those things that's _so_
unintuitive when you come at it from a traditional CS perspective. The only
book I'm aware of that approaches data structures with a consideration for
this type of performance is Real-Time Collision Detection (which is basically
data structures for 3D spatial domains).

------
kbwt
A better solution which only uses local connections between adjacent cells is
Odd-Even Transposition Sort.

When the cells are arranged in a 2D grid, Shearsort sorts all the values in
snake-like order in O(sqrt(n) log(n)). A slightly more complicated algorithm
based on Shearsort is that of Schnorr and Shamir, which runs in O(sqrt(n)).
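Odd-even transposition sort is easy to model in software; here's a rough C sketch. In hardware every compare-exchange within a round happens in parallel between neighboring cells, so each round is one step and only adjacent cells need wiring. The sequential version below just makes the rounds explicit.

```c
#include <stddef.h>

/* Odd-even transposition sort: n rounds of compare-exchange between
   adjacent elements, alternating even-indexed and odd-indexed pairs.
   In hardware, all pairs in a round swap simultaneously, so only
   local neighbor-to-neighbor connections are required. */
void odd_even_transposition_sort(int *a, size_t n) {
    for (size_t round = 0; round < n; round++) {
        for (size_t i = round % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {          /* compare-exchange one pair */
                int t = a[i];
                a[i] = a[i + 1];
                a[i + 1] = t;
            }
        }
    }
}
```

n rounds always suffice, so a hardware pipeline of n stages sorts a new n-element batch every cycle.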

------
theocean154
There's a far better way to do this, which essentially decomposes into an
insertion sort. Details here:
[https://github.com/theocean154/dau/tree/master/sort](https://github.com/theocean154/dau/tree/master/sort)

~~~
hamiltonkibbe
Yeah I was thinking the same thing, just compare all the registers to the
value of the next item @posedge_clk and use that bit to tell the registers
whether or not they need to shift their value into the next register. It
replaces the `memmove()` (the abstract concept) with a single-cycle shift.

~~~
theocean154
The benefit of my way is that you get pipelining (which lets you maintain O(1)
throughput without having a fill/flush cycle), as well as better scaling (due
to said pipelining) and better energy usage due to smaller fanout.

------
alain94040
O(N) is not particularly insightful for fixed/limited values of N. This would
be more interesting if the author reported sorts/s with the FPGA vs. software.
Because obviously, as N grows, the FPGA only performs the full sort in "one
cycle" if you are willing to stretch the cycle time (i.e. lower the clock
rate) more and more (until the FPGA blows up).

~~~
zik
Yes, it's really an insertion sort where the insert step is done in parallel
hardware. That's cool for small N but it's really still an O(N^2) algorithm.

------
mabbo
> each sorting step is fast, bubbling down to N identical and _independent
> questions asked in parallel_ at every clock cycle.

> The first empty cell could take the new data since it’s empty, but since 4
> just got kicked out from the cell above, this empty cell gets the 4 instead

These are incompatible. The result in cell 3 is dependent on the result of
cell 2 (and so on up the chain of cells).

Therefore, the steps have to run in order, making this O(n) just for adding
one item to the set.

The end goal might still be possible, but I don't see how this can work.

~~~
lightcatcher
I initially thought this as well, but reading the code showed me it's actually
O(1) to add each item to the set.

Each cell computes in parallel (new_data < cell_data) & (cell_state ==
OCCUPIED). If this is true, then the cell will be pushed down, either by the
new data or by the data from the cell above. Each cell can also read this
push signal from the cell above: if the cell above's signal is true, the cell
grabs the value from the cell above; if it is false, the cell grabs the data
being inserted. This is O(1) because there's no cascading value shift. I
personally found this pretty clever, and I'm looking forward to implementing
some of these bounded-size hardware sorts on GPU.
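To convince myself there's no cascade, one clock cycle can be modeled in plain C. Every decision reads only the cells' _old_ state, and all cells latch at once, like registers on a clock edge. The EMPTY sentinel encoding is my own, not the article's.

```c
#include <limits.h>
#include <string.h>

#define NCELLS 8
#define EMPTY  INT_MAX   /* sentinel for an unoccupied cell (hypothetical encoding) */

/* One clock cycle of the sorted-insert array. push[] is combinational:
   every cell compares against new_data in parallel, so no cell's
   decision waits on another cell's *new* value. */
void insert_cycle(int cell[NCELLS], int new_data) {
    int push[NCELLS], next[NCELLS];
    for (int i = 0; i < NCELLS; i++)   /* all comparisons in parallel */
        push[i] = (cell[i] != EMPTY) && (new_data < cell[i]);
    for (int i = 0; i < NCELLS; i++) {
        /* the first empty cell also accepts data this cycle */
        int first_empty = (cell[i] == EMPTY) && (i == 0 || cell[i - 1] != EMPTY);
        if (push[i] || first_empty)
            /* take the value shoved down from above, or the new item itself */
            next[i] = (i > 0 && push[i - 1]) ? cell[i - 1] : new_data;
        else
            next[i] = cell[i];
    }
    memcpy(cell, next, sizeof next);   /* all registers latch together */
}
```

The key point is that `push[i-1]` is computed from the old state, not from cell i-1's updated value, so the combinational depth per cycle is constant regardless of N.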

~~~
mabbo
> Each cell can also grab this value from the cell above

Therein lies the problem I see. Grabbing this value from the above cell
implies that this value has already been calculated - but it hasn't, as these
cells run in parallel. The results of a cell in a single round of calculations
are not accessible to the other cells during the same round.

In short- how does a cell know whether its new value is the value being
inserted or the value being pushed down _without knowing the result of the
cell above it_?

And having asked that, I can actually think of one answer: it can calculate
whether its neighbor will be pushing or not independently. So each cell is
_really_ doing the calculation of "given these two numbers (my cell's value
and the above cell's), where does this new number fit in sorted order with
them: before, between, or after".

------
jcbeard
Here's an old but great reference on the subject:
[http://www.ccrc.wustl.edu/~roger/papers/cg09.pdf](http://www.ccrc.wustl.edu/~roger/papers/cg09.pdf)

I used to teach digital design; sorting was one of the first topics for FPGAs
:)

------
e19293001
A much faster sorting algorithm implementation for hardware is the sorting
network[1]. It can be implemented with just comparators and wires, so there's
no clock involved since it is a combinational circuit. It's quite a simple
circuit that can be built bottom-up from logic gates. Knuth's TAOCP has a
good explanation, and there's lots of information on the internet.

---

[https://en.wikipedia.org/wiki/Sorting_network](https://en.wikipedia.org/wiki/Sorting_network)
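As a tiny software model (a standard 4-input network, not drawn from TAOCP specifically): a fixed sequence of compare-exchanges, which in hardware is three layers of comparators with no data-dependent control flow at all.

```c
/* Compare-exchange: the only primitive a sorting network needs. */
static void cmpswap(int *a, int *b) {
    if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

/* 4-input sorting network: 5 comparators in 3 layers. Comparators in
   the same layer touch disjoint wires, so in hardware they operate
   simultaneously as pure combinational logic of depth 3. */
void sort4(int v[4]) {
    cmpswap(&v[0], &v[1]);  /* layer 1 */
    cmpswap(&v[2], &v[3]);
    cmpswap(&v[0], &v[2]);  /* layer 2 */
    cmpswap(&v[1], &v[3]);
    cmpswap(&v[1], &v[2]);  /* layer 3 */
}
```

Because the comparator positions are fixed in advance, the circuit needs no state and no clock, exactly as described above.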

------
melted
We'll see a lot of FPGA use in the coming years. Consider this: newest FPGAs
have 1TB/s (that's a terabyte per second, not terabit) memory bandwidth and
1KB-wide memory bus. Short of developing your own silicon (which takes years
and tons of money), you simply can't get this kind of throughput with anything
else. That's basically why Intel bought Altera.

~~~
wtallis
HBM and HMC are already in the process of bringing most of those advantages to
GPUs and probably some CPUs. Last year's high-end GPUs introduced HBM with a
4kb (0.5kB) wide memory bus with 0.5TB/s bandwidth. The only reason they
didn't have more memory bandwidth was because they didn't need it and the die
area was better put to use for compute resources. FPGAs may be the only
general-purpose chips so far to actually ship that kind of memory bus, but
they aren't the only ones that _can_.

~~~
melted
GPUs aren't custom logic. They come with significant and fundamental
programming model limitations. So I don't see much of a functional overlap
between FPGAs and GPUs. In particular, GPUs suck mightily at anything that
involves branching. They're also not ideal for things like SDN, crypto,
search retrieval/ranking, realtime DSP, and so on.

~~~
wtallis
My point was that memory bandwidth won't sell anybody on FPGAs if they didn't
already have strong reasons to be considering FPGAs.

------
bedros
I built a test chip for a class project in college that did sorting in linear
time, without all the routing issues of this algorithm.

~~~
switchbak
Care to share some details? It sounds interesting.

~~~
pkaye
Perhaps implement a sorting network?
[https://en.wikipedia.org/wiki/Sorting_network](https://en.wikipedia.org/wiki/Sorting_network)

------
jhallenworld
The article referenced in the comments is very cool:

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83....](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.83.5786&rep=rep1&type=pdf)

It seems related to certain pipeline bypass structures.

------
keithwhor
I have the pleasure of working with Josh, and first heard him talk about this
in a car ride to get lunch. :) Very cool stuff, and very excited to see the
things he builds in the future.

------
theocean154
Rearranged my repo; here is the scalable, pipelined way of doing this. It
ends up being an insertion sort with the benefit of small fanout (for lower
energy), better scalability (faster due to lower fanout and a shorter
critical path with pipelining), and it maintains O(1) throughput (no
fill/drain cycle).

[https://github.com/theocean154/dau/blob/master/docs/part_sor...](https://github.com/theocean154/dau/blob/master/docs/part_sort_block_diagram.pdf)

~~~
theocean154
more explanation is in the paper in the comments below, this is just a simple
way of visualizing how it works

------
TickleSteve
Trades time for parallelism.

This isn't going to scale to large data sets...

