
Graphs not Grids: How caches are corrupting algorithm designers (2010) - kuwze
http://blog.ezyang.com/2010/07/graphs-not-grids/
======
willvarfar
The article talks about the Cray XMT where each core has a run-list of tasks
that it can quickly switch between each time one of them stalls on IO.

Here's a complementary approach for massively multi-threaded CPUs: versioning
the memory so you can speculatively execute tasks:
[http://people.csail.mit.edu/sanchez/papers/2017.fractal.isca...](http://people.csail.mit.edu/sanchez/papers/2017.fractal.isca.pdf)

Putting it together sounds like a dream.
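To make the versioning idea concrete, here's a toy Python sketch (my own
simplification; the paper does this in hardware, with per-task read/write
tracking and ordered commits):

```python
# Toy model of speculative task execution over versioned memory.
# A simplified software sketch, not the paper's actual mechanism.

class VersionedMemory:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {k: 0 for k in data}  # bumped on every committed write

    def read(self, key, read_log):
        read_log[key] = self.version[key]  # remember the version we saw
        return self.data[key]

    def commit(self, read_log, write_buf):
        # Validate: abort if anything we read has since been overwritten.
        if any(self.version[k] != v for k, v in read_log.items()):
            return False
        for k, v in write_buf.items():
            self.data[k] = v
            self.version[k] += 1
        return True

def run_speculatively(mem, tasks):
    # Phase 1: every task executes optimistically against the same state.
    logs = []
    for task in tasks:
        rlog, wbuf = {}, {}
        task(mem, rlog, wbuf)
        logs.append((rlog, wbuf))
    # Phase 2: commit in task order; a task whose reads were invalidated
    # simply re-executes (the "abort and retry" of transactional memory).
    aborts = 0
    for task, (rlog, wbuf) in zip(tasks, logs):
        while not mem.commit(rlog, wbuf):
            aborts += 1
            rlog, wbuf = {}, {}
            task(mem, rlog, wbuf)
    return aborts

# Two tasks that both increment the same counter: the second one's
# speculative read is invalidated by the first one's commit.
def inc(mem, read_log, write_buf):
    write_buf["x"] = mem.read("x", read_log) + 1

mem = VersionedMemory({"x": 0})
aborts = run_speculatively(mem, [inc, inc])
print(mem.data["x"], aborts)  # → 2 1
```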

~~~
CyberDildonics
> run-list of tasks that it can quickly switch between each time one of them
> stalls on IO

Sounds like a less granular version of hyper-threading.

> versioning the memory so you can speculatively execute tasks

The more you speculatively execute, the less you actually execute. With more
threads you don't have to squeeze out ever finer-grained parallelism at the
cost of diminishing returns.

~~~
willvarfar
> The more you speculatively execute the less you actually execute

True, but there are lots of graph-type problems where it is very useful. As
explained in the paper I linked to?

~~~
CyberDildonics
> As explained in the paper I linked to?

Are you asking me if the paper you linked to explains it?

You are saying "it's useful", I'm saying it quickly becomes too wasteful. CPUs
already do speculative execution with the hopes of mitigating memory latency.

I'm saying that if you spend all your transistors and power doing speculative
execution, it is better to find an architecture that lets you use concurrency
at a much coarser level, so that you don't have to waste so much on
speculative execution only to get diminishing returns for the transistors and
power you put into it.

~~~
willvarfar
The paper was co-authored by an NVIDIA chap, although the prototyping was
x86-based.

There is lots of hardware with embarrassingly many cores that sit
underutilized on embarrassingly serial graph problems. Speculation is one way
to get all those cores humming, provided you have a good heuristic for what is
likely to be true and the hardware has versioned memory and hardware
scheduling to make the speculation cheap enough.

The paper gave examples of such problems.

~~~
CyberDildonics
Again, though, this is an answer to how to speed up a graph of lots of
extremely tiny operations. If every node in the graph were an operation on
lots more data, you wouldn't need to solve the underutilization problem in the
first place.

~~~
willvarfar
Now I know how Babbage felt about people wanting to slice pineapples ;)

Coarse-grained multithreading (in the same family as what Intel calls
'hyper-threading', though that is technically simultaneous multithreading) is
when a core swaps to a second task when one task stalls on a main memory
access.

Intel chips typically have just two tasks they can ping-pong between.

The Cray XMT takes this much further, with a very long list of tasks it can
swap between and hardware support for caching, ready-monitoring and scheduling
them.
This is excellent for graph problems with lots of main memory stalls. There
are plenty of graph problems where there are many tasks that can make progress
but they are all waiting on main memory all the time.

The research I linked to takes it further and allows the tasks to be executed
speculatively. Basically a tree of hardware transactional memory.

Everything you say about speculation being cycles you can't use for
non-speculative computation is true. But there are problems whose bottlenecks
force them to be essentially serial.

When cores are cheap, the problem doesn't fit in on-chip cache, and the
buffers needed for SMT or SMT+speculation are affordable, speculation is a
good way to utilize cores that would otherwise sit idle, if that fits your
power budget.
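To put rough numbers on the latency-hiding argument, a back-of-envelope model
(the cycle counts are illustrative assumptions, not measured XMT figures):

```python
# Back-of-envelope model of latency hiding via many hardware threads.
# Cycle counts here are illustrative assumptions, not measured figures.
def utilization(threads, compute_cycles, stall_cycles):
    # Each thread keeps the core busy compute_cycles out of every
    # (compute_cycles + stall_cycles) cycles; extra threads fill the stalls.
    return min(1.0, threads * compute_cycles / (compute_cycles + stall_cycles))

# Pointer-chasing workload: ~2 cycles of work per ~100-cycle memory miss.
print(utilization(2, 2, 100))    # two-way SMT leaves the core ~96% idle
print(utilization(128, 2, 100))  # → 1.0: 128 XMT-style threads saturate it
```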

~~~
CyberDildonics
I'm not sure what part of all this explains why it is a good use of computing
power. If you don't skip around in memory as much, you don't have to wait as
much on memory latency. If you process data in bigger chunks, you don't have
to skip around as much. On-chip cache doesn't matter as much here, since it
can be worked around by making sure data will be prefetched.

Also, OOO cores already pre-load things into cache so they can be actually
executed, not just speculatively executed. Speculative execution on modern
CPUs already runs ahead down predicted branches to prefetch data further.

What I'm saying is, if you traverse a graph with tiny amounts of data at each
node, it is wildly inefficient. Mitigating that inefficiency with more
hardware might be faster, but is still an inefficient use of hardware since it
doesn't confront the real problem of skipping around in memory.
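To illustrate what I mean by skipping around vs. bigger chunks, a small sketch
(the same reduction in two layouts; the contiguous one is what the hardware
prefetcher can stream):

```python
# The same reduction expressed two ways: pointer chasing (graph-style,
# each load's address depends on the previous load) vs. one contiguous
# scan (grid-style, easy for the hardware prefetcher to stream).
import random

random.seed(0)
N = 10_000
values = list(range(N))

# Graph-style: visit nodes in a random cycle, one dependent access at a time.
order = list(range(N))
random.shuffle(order)
next_idx = {order[i]: order[(i + 1) % N] for i in range(N)}
total_chase, node = 0, order[0]
for _ in range(N):
    total_chase += values[node]  # address depends on the previous step
    node = next_idx[node]

# Grid-style: one flat pass over contiguous data.
total_scan = sum(values)

print(total_chase == total_scan)  # → True: same work, very different locality
```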

------
jules
GPUs are massively multicore and there we see the opposite of what this
article predicts or hopes would happen: grids win even more over graphs. The
best graph algorithms for those chips try to turn the graph problem into a
grid problem as soon as possible.
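As a sketch of that trick (my example, not from the article): BFS frontier
expansion can be written as repeated adjacency-matrix-times-frontier-vector
steps over the boolean semiring, so each level becomes one flat, data-parallel
pass over the edges:

```python
# Sketch: BFS recast as repeated "adjacency matrix times frontier vector"
# over the boolean semiring. Each step is a regular scan over all edges,
# which is the grid-like shape GPUs handle well. Illustrative only.

def bfs_as_spmv(num_nodes, edges, source):
    frontier = [False] * num_nodes   # dense boolean frontier vector
    frontier[source] = True
    level = [-1] * num_nodes         # -1 means "not yet reached"
    level[source] = 0
    depth = 0
    while any(frontier):
        depth += 1
        next_frontier = [False] * num_nodes
        for src, dst in edges:       # one flat pass over the edge list
            if frontier[src] and level[dst] == -1:
                next_frontier[dst] = True
                level[dst] = depth
        frontier = next_frontier
    return level

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(bfs_as_spmv(5, edges, 0))  # → [0, 1, 1, 2, 3]
```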

------
deepnotderp
What on earth is this article talking about? Sparsity and control flow are
horrendous and basically the two things a gpu wants nothing to do with. And
graph algorithms generally run on CPUs in supercomputers, not GPUs, for
precisely those reasons.

The only way any of this makes sense is if the article is talking specifically
about graph processors (e.g. the Cray Eldorado, which, to be fair, is
mentioned), but there are very few commercial graph processors and even those
are shrinking in importance. The number one parallel processor is the gpu, the
only one with any chance of making it into the mainstream, and even that is
far from mainstream. And the gpu _hates_ control flow and sparsity.

~~~
convolvatron
if you have machines that do grids well (vectors, simd), and you have a
problem that you can get into that form, then clearly you do that. it's
actually kind of stunning how many things you wouldn't think could be phrased
that way can be.

i'm extrapolating from the talk, but i think the point is that architectures
that can tolerate dynamic flow control (mimd) are more generally performant
than ones that can't. and while no, they aren't mainstream, I think the
historical jury is still out.

since the cost of keeping an instruction pointer per unit is (largely)
absorbed by hardware, it's pretty easy to argue that they should win in the
limit. unfortunately there are other costs: the memory interface is vastly
different, and caching becomes interesting and potentially prohibitive.

but is mimd dead forever because gpus can be used for some things and are
clearly the cheapest ops you can buy today?

~~~
deepnotderp
No, MIMD is not dead; for example, see the Rex Neo or the Adapteva Epiphany,
two of my favorite processors. _However_, when you say that something works
well on "massively parallel processors", you have an obligation to specify
that you mean MIMD processors and/or graph processors, and SPECIFICALLY NOT A
GPU.

Because the programming principles espoused in this article are antithetical
to what you should do on a GPU, which is definitely the de facto "massively
parallel processor" that would first come to people's mind.

~~~
convolvatron
ok, so we're just arguing about terminology and expectations. maybe that
reinforces Feo's point.

it is worth noting that the Eldorado started out as a general purpose
computer, then a general purpose scientific computer, and only later as a
"graph processor" in an XT frame, because the MTA on its own was market-dead.
and that Feo has been a big supporter since at least stage 2.

maybe it is exactly this 'parallelism means GPU' assumption which he's trying
to argue against. I agree the overall arc of that point didn't really follow
through in his talk, or in someone's summary of that talk.

'mpp' was certainly a label adopted by TMC's CM-1 and CM-2, which were pretty
damn simd. it was also used by the Cray T3E, which was markedly not. the mta
had 128 threads per cpu; a large enough instantiation could have had a few
thousand hardware threads. later unbuilt designs backed the thread state in
dram and could have gotten pretty damn fat. arguably 'massive' should be
squishily about concurrency width.

------
tjalfi
The mentioned tech talk is available at [0].

[0] [https://vimeo.com/13860477](https://vimeo.com/13860477)

