

Sorting out graph processing - mrry
https://github.com/frankmcsherry/blog/blob/master/posts/2015-08-15.md

======
ibdknox
Looks like one of the X-Stream guys replied:
[https://github.com/ar104/sortingVsScanning/blob/master/Sorti...](https://github.com/ar104/sortingVsScanning/blob/master/SortingConsideredHarmful)

Related discussion on twitter:
[https://twitter.com/roy_amitabha/status/634078514698952704](https://twitter.com/roy_amitabha/status/634078514698952704)

------
jules
Brilliant writing style! Also check out his work on Naiad, which is IMO by far
the most interesting work in its area, and check out the Scalability! But at
what COST? paper:
[http://www.frankmcsherry.org/assets/COST.pdf](http://www.frankmcsherry.org/assets/COST.pdf)
which showed that a program running on a single thread often outperforms
distributed systems running on 100+ cores.

------
qqueue
>The downside to radix sort is that it just looks at bytes. If you had some
deep and semantically meaningful ordering defined over your type (or, like, a
pointer), that's great, but we're sorting by its bytes.

See Fritz Henglein's "Generic Top-down Discrimination for Sorting and
Partitioning in Linear Time" paper for a nice way to extend radix sort's
runtime to arbitrary meaningful orderings:

[http://www.diku.dk/hjemmesider/ansatte/henglein/papers/hengl...](http://www.diku.dk/hjemmesider/ansatte/henglein/papers/henglein2011a.pdf)

Beginning of a Haskell implementation here:
[http://ekmett.github.io/discrimination/index.html](http://ekmett.github.io/discrimination/index.html)
. Might be doable in Rust as well.
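
For a flavor of how it works, here's a toy Rust sketch (my own code, not the paper's or the Haskell package's API): a discriminator takes (key, value) pairs and returns the values grouped in key order without ever comparing two keys, and discriminators compose for structured key types.

```rust
// Minimal sketch of Henglein-style discrimination (illustrative only):
// group values by key in linear time, no comparisons.

// Discriminate on a single byte: one bucketing pass over 256 buckets.
fn disc_u8<T>(pairs: Vec<(u8, T)>) -> Vec<Vec<T>> {
    let mut buckets: Vec<Vec<T>> = (0..256).map(|_| Vec::new()).collect();
    for (k, v) in pairs {
        buckets[k as usize].push(v);
    }
    buckets.into_iter().filter(|b| !b.is_empty()).collect()
}

// Discriminate a pair key by discriminating on the first component,
// then recursively on the second within each group. This composition
// is what extends the linear-time bound to richer key types.
fn disc_pair<T>(pairs: Vec<((u8, u8), T)>) -> Vec<Vec<T>> {
    let first: Vec<(u8, ((u8, u8), T))> =
        pairs.into_iter().map(|(k, v)| (k.0, (k, v))).collect();
    disc_u8(first)
        .into_iter()
        .flat_map(|group| {
            let second: Vec<(u8, T)> =
                group.into_iter().map(|(k, v)| (k.1, v)).collect();
            disc_u8(second)
        })
        .collect()
}

fn main() {
    let pairs = vec![((2, 1), "c"), ((1, 2), "b"), ((1, 2), "a"), ((0, 0), "z")];
    // Groups come back in key order (0,0) < (1,2) < (2,1), and values
    // with equal keys keep their input order.
    println!("{:?}", disc_pair(pairs)); // [["z"], ["b", "a"], ["c"]]
}
```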

------
lorenzhs
Random memory accesses are very expensive. This is because TLB misses are
typically resolved by walking the page table, a tree data structure (a hash
table is not a good choice, as knowledge of the hash function would allow for
timing attacks). This tree has logarithmic depth (although with a large base).
Huge pages make this a lot cheaper, but if they are not used, you may well end
up in a situation where _n_ accesses to random positions in a very large array
of size _n_ are about as expensive as sorting the array.
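
A toy illustration of the effect (sizes and constants are my guesses for a typical laptop, and this is not a rigorous benchmark): sum the same array once in sequential order and once in a scrambled order, and compare the timings.

```rust
use std::time::Instant;

/// Sum all entries in index order: the prefetcher's favorite case.
fn sum_sequential(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// Sum all entries in a scrambled order: the length is a power of two,
/// so stepping the index by a fixed odd stride (mod n) visits every
/// element exactly once, but at cache- and TLB-hostile addresses.
fn sum_scrambled(data: &[u64]) -> u64 {
    let n = data.len();
    assert!(n.is_power_of_two());
    let mut idx = 0usize;
    let mut total = 0u64;
    for _ in 0..n {
        total += data[idx];
        idx = (idx + 0x9E37_79B1) & (n - 1);
    }
    total
}

fn main() {
    // 16M u64s = 128 MB: comfortably past L3 (and TLB reach) on most laptops.
    let data: Vec<u64> = (0..(1u64 << 24)).collect();

    let t = Instant::now();
    let a = sum_sequential(&data);
    let seq = t.elapsed();

    let t = Instant::now();
    let b = sum_scrambled(&data);
    let rnd = t.elapsed();

    assert_eq!(a, b); // same elements either way, so the same sum
    println!("sequential: {:?}  scrambled: {:?}", seq, rnd);
}
```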

------
asdfjghafsdkh
For large MSB radix sorts you should do 6 bits at a time (64 bins) to avoid
thrashing the TLB. You can switch to 8 bits once a partition fits in 64 pages
(256 KB).

Also, for sorting 64-bit integers I like to compute a bitwise OR of all the
integers while building the histogram. Then, if the bits you are looking at
are all zero (plausible for unsigned 64-bit integers), you can skip straight
to the digits that matter for the next histogram.
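
A rough sketch of both tricks in Rust (my parameters are illustrative, and I've written the LSB variant for brevity; the 64-bin TLB argument and the OR trick carry over to the MSB case):

```rust
// Radix sort on u64 keys with 6-bit digits (64 bins, so the scatter
// touches at most 64 distinct destination pages per pass), which ORs
// all keys up front and skips any digit position that is zero in
// every key.
fn radix_sort_u64(keys: &mut Vec<u64>) {
    const BITS: u32 = 6;
    const BINS: usize = 1 << BITS; // 64

    // A digit whose bits are zero in the OR is zero in every key,
    // so a counting-sort pass on it would be a no-op: skip it.
    let or_all = keys.iter().fold(0u64, |acc, &k| acc | k);

    let mut buf = vec![0u64; keys.len()];
    let mut shift = 0u32;
    while shift < 64 {
        let digit_mask = ((BINS as u64) - 1) << shift;
        if or_all & digit_mask != 0 {
            // Counting sort on this digit (stable, so earlier passes survive).
            let mut counts = [0usize; BINS];
            for &k in keys.iter() {
                counts[((k >> shift) as usize) & (BINS - 1)] += 1;
            }
            let mut offsets = [0usize; BINS];
            let mut sum = 0;
            for b in 0..BINS {
                offsets[b] = sum;
                sum += counts[b];
            }
            for &k in keys.iter() {
                let b = ((k >> shift) as usize) & (BINS - 1);
                buf[offsets[b]] = k;
                offsets[b] += 1;
            }
            std::mem::swap(keys, &mut buf);
        }
        shift += BITS;
    }
}
```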

~~~
frankmcsherry
There is this cool related work [0] that uses software write combining to
mitigate cache limitations; I wanted to talk about it, but couldn't get it to
show an improvement on my laptop. In particular, they are worried about the
fact that an 8-way associative L1 doesn't like to hold 256 specific cache
lines, so they manually buffer writes into 16KB (= 256 cache lines) of
contiguous bytes, issuing a non-temporal write whenever a cache line fills.
They report hitting 88% of their system's peak memory bandwidth.

You could do the same thing (I think) to avoid TLB limitations, manually
staging everything in a contiguous 2MB of memory (backed by one large page,
say), in order to keep the radix high and do fewer full scans. If you are
hitting memory bandwidth, your performance _should_ be determined by the
number of scans you end up doing.
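
Here's roughly what I imagine the staging looks like, in plain Rust (sizes illustrative, and ordinary copies standing in for the non-temporal stores, since that's the part needing intrinsics or inline asm):

```rust
const BUCKETS: usize = 256;
const LINE: usize = 8; // 8 u64s = one 64-byte cache line

// Scatter keys into `out` at per-bucket `offsets` (precomputed from a
// histogram), but stage writes in small per-bucket buffers and flush a
// full cache line at a time: the hot write set becomes 256 contiguous
// lines of staging memory rather than 256 far-apart destinations.
fn scatter_buffered(keys: &[u64], offsets: &mut [usize; BUCKETS], out: &mut [u64]) {
    let mut stage = [[0u64; LINE]; BUCKETS]; // 16KB, contiguous
    let mut fill = [0usize; BUCKETS];
    for &k in keys {
        let b = (k & 0xFF) as usize; // bucket by low byte, for example
        stage[b][fill[b]] = k;
        fill[b] += 1;
        if fill[b] == LINE {
            // Flush one full line; the paper does this with a
            // non-temporal (streaming) store to bypass the cache.
            out[offsets[b]..offsets[b] + LINE].copy_from_slice(&stage[b]);
            offsets[b] += LINE;
            fill[b] = 0;
        }
    }
    // Drain the partially filled buffers.
    for b in 0..BUCKETS {
        out[offsets[b]..offsets[b] + fill[b]].copy_from_slice(&stage[b][..fill[b]]);
        offsets[b] += fill[b];
    }
}
```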

All of this is "caveat: I just read other people's work and haven't done this
myself, because I haven't figured out inline asm in Rust yet". If you have
more details on engineering radix sort, I'd love to read up! :D

[0]: [http://arxiv.org/abs/1008.2849](http://arxiv.org/abs/1008.2849)

------
yzh
I'm doing GPU graph processing:
[http://gunrock.github.io](http://gunrock.github.io). I think another critical
aspect of getting high-performance graph processing is load balancing. Also,
there are tons of papers on reorganizing data to get coalesced memory
accesses. Combining these strategies can outperform pre-processing such as
sorting, at least on GPUs.

------
anonetal
As Frank notes, this is not really surprising if you have seen the recent-ish
papers on sorting. It is indeed frustrating when a paper seems to have nice
ideas but does not do a good enough job at the simple baselines; and in most
cases, there is no way for you to easily check your intuitions :(

~~~
amirouche
Why do you think it is difficult to check your intuitions?

------
elliptic
Is this claim correct? "You do four of these sequential scans, and in each you
write sequentially to one of 256 locations. Four passes, no random access.
This is great. I don't even see a log n there, do you?"

You're writing to one of 256 (presumably far apart) memory locations - why
isn't this considered random access?

~~~
frankmcsherry
The cost of writing to some number of unrelated locations is mostly determined
by the level of cache where your working set can stay resident. You'll be
doing random access to that level of cache, but it can exchange data with the
next level cache in larger contiguous blocks.

In this case, 256 cache lines fit in the L1 cache, so while it does look like
"random access" to the L1, this can end up looking more like "sequential
access" to main memory. There are additional complications, like the L1 data
cache being only 8-way associative, a quite small TLB cache sitting in front
of the L2 and up, etc. So in practice this ends up somewhere between random
access to L1 and random access to L3, with sequential access to main memory, I
think.
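
The back-of-envelope numbers, assuming a typical 32 KiB, 8-way set-associative L1d with 64-byte lines (illustrative; check your own CPU):

```rust
fn main() {
    let line_bytes = 64;
    let buckets = 256;
    let l1d_bytes = 32 * 1024;
    let ways = 8;

    // One hot line per bucket: 256 * 64 B = 16 KiB, half of L1d.
    let hot_write_set = buckets * line_bytes;
    assert!(hot_write_set <= l1d_bytes / 2);

    // Associativity is the catch: 32 KiB / 64 B = 512 lines, arranged
    // as 64 sets of 8 ways. If the 256 bucket write heads map unevenly
    // across the 64 sets, some set sees more than 8 hot lines and evicts.
    let sets = l1d_bytes / line_bytes / ways; // 64 sets
    let avg_hot_lines_per_set = buckets / sets; // 4 on average; 8 is the cap
    println!(
        "{} sets, {} hot lines/set on average, {} ways per set",
        sets, avg_hot_lines_per_set, ways
    );
}
```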

