
Dataflow computers: Their history and future (2008) [pdf] - luu
http://csrl.unt.edu/~kavi/Research/encyclopedia-dataflow.pdf
======
nachiketkapre
Cool review! We have been looking at custom dataflow for floating-point
computations --
[http://dx.doi.org/10.1109/TCAD.2011.2173199](http://dx.doi.org/10.1109/TCAD.2011.2173199)
and
[http://dx.doi.org/10.1109/DFM.2014.21](http://dx.doi.org/10.1109/DFM.2014.21)
but the big challenge has been handling dataflow graphs larger than the chip
can hold. Caching doesn't really work well here, because the dataflow
wavefront proceeds along multiple separate paths.

------
detaro
[2008], the author has quite a few more specific papers in the field as well
(just remove the filename from the URL to see a list)

------
yvdriess
I really wonder whether dataflow machines could genuinely be feasible as
commercial processors.

Software is moving more and more towards this type of programming model, from
the reactive stuff to even HPC. OoO cores are already small dataflow
processors, but their instruction window does not seem to scale. Is x86/arm
just too much of a semantic bottleneck? All the dependency information
available to a compiler/runtime is lost when code-generating to x86 (same for
ARM), yet a processor core needs to extract the same information again in
hardware.

~~~
cfallin
> but their instruction window does not seem to scale.

The issue with scaling instruction windows really has more to do with (i) the
scheduling logic and (ii) the physical register file than it does with ISA
semantics. In particular, the scheduler is usually quite expensive to scale,
because it needs to compare each completed result against the not-yet-
fulfilled sources on waiting instructions.
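That wakeup step can be sketched in toy form (illustrative Python, not any real microarchitecture): every completed result's tag is broadcast and compared against every unfulfilled source of every queued instruction, which is why the comparator cost grows with the window size.

```python
# Toy model of tag-broadcast wakeup in an out-of-order issue queue.
# Each broadcast compares the completed tag against all waiting sources
# of all queued entries -- the scaling cost described above.

class IssueQueue:
    def __init__(self):
        self.waiting = []  # entries: (name, set of unfulfilled source tags)

    def insert(self, name, source_tags):
        self.waiting.append((name, set(source_tags)))

    def broadcast(self, completed_tag):
        """Broadcast a completed result tag; return instructions now ready."""
        ready = []
        for name, sources in self.waiting:
            sources.discard(completed_tag)  # one comparison per source, per entry
            if not sources:
                ready.append(name)
        self.waiting = [(n, s) for n, s in self.waiting if s]
        return ready

iq = IssueQueue()
iq.insert("add r3, r1, r2", ["r1", "r2"])
iq.insert("mul r4, r3, r3", ["r3"])
print(iq.broadcast("r1"))  # []  (add still waits on r2)
print(iq.broadcast("r2"))  # ['add r3, r1, r2']
```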

> All the dependency information available to a compiler/runtime is lost when
> code-generating to x86 (same for ARM), yet a processor core needs to extract
> the same information again in hardware.

I think this is a bit different than what you describe above w.r.t. software
dataflow. When you talk about reactive programming in software, dependences
are things like event streams or promises that become fulfilled. That's very
different from (say) an individual 32-bit register value flowing from producer
to consumer instruction, and the compiler usually only knows about the latter
type of dependence. Perhaps you're envisioning a different type of
reactive/dataflow programming framework, though -- maybe chaining together
arithmetic transformations or something similarly lower-level?

---

All of that said, there has been some interesting work in compiling
conventional software down to a more "dataflow-like" architecture. See e.g.
the TRIPS project at UT Austin [1] from about 10 years ago -- the idea there
was to build a fabric of execution units connected by a mesh interconnect on
the chip, and then stream data through.

[1] [https://www.cs.utexas.edu/~trips/](https://www.cs.utexas.edu/~trips/)

~~~
yvdriess
I was indeed more thinking of the fine-grained dataflow or event-driven tasks,
in the style of VAL/SISAL or, very recently, OCR
([https://xstack.exascale-tech.com/wiki/index.php/Main_Page](https://xstack.exascale-tech.com/wiki/index.php/Main_Page)).

I heard there were some fundamental problems with TRIPS, but never got a
detailed explanation. Any idea what they were? Is fine-grained dataflow still
too much overhead to do in hardware, or was there something else?

There were some dataflow-critique papers back in the 80s and 90s, but most of
their points seem moot these days: low transistor occupancy (compared to the
90%+ of simple RISC microprocessors) and the high cost of the associative
memory needed for the token matchers. Today we work with billions of
transistors instead of tens of thousands, and commonly ship processors with
megabytes of associative memory in the form of SRAM cache.
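For concreteness, the token matching being priced out here can be sketched roughly like this (illustrative Python; the dict stands in for the associative matching store of a tagged-token machine, and the instruction names are made up):

```python
# Minimal sketch of tagged-token matching in a dynamic dataflow machine.
# Tokens carry (dest instruction, context, port, value); the matching store
# pairs tokens headed for the same (dest, context) and fires the instruction
# once all operands are present. The hardware analogue of this dict lookup
# is the associative memory the old critiques found expensive.

matching_store = {}

def fire(dest, operands):
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return ops[dest.split(".")[0]](*operands)

def arrive(dest, context, port, value, arity=2):
    key = (dest, context)
    slots = matching_store.setdefault(key, {})
    slots[port] = value
    if len(slots) == arity:          # all operand tokens matched: fire
        del matching_store[key]
        return fire(dest, [slots[p] for p in sorted(slots)])
    return None                      # partial match: token parks in the store

# Tokens can arrive in any order, from separate producers:
arrive("add.1", context=0, port=0, value=3)            # parks, waits
result = arrive("add.1", context=0, port=1, value=4)   # matches, fires
print(result)  # 7
```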

~~~
cfallin
Cool, I have some more reading to do -- thanks for that link!

My understanding of TRIPS is that compiler complexity is an issue -- it
depends on being able to chunk the program into hyperblocks with fairly high
control locality (i.e., remain in one hyperblock for a while), because the
transition cost is higher than a simple branch on a conventional superscalar.
In the best case this can be great (32 ALUs worth of ILP, IIRC) but in the
worst case (e.g. large code footprint) this can degrade terribly.

IIRC there are some more practical issues too -- e.g., the ISA is specific to
the microarchitecture (because it explicitly maps to ALUs in the fabric), so
binary compatibility is an issue.

All that said, still a really interesting idea, I think.

------
leoc
Write purely functional programs in continuation-passing style and you already
have something that looks very close to dynamic-graph dataflow programming,
no?

~~~
chrisseaton
If you mean a lazy functional program, then the difference is that those are
generally demand-driven. You execute a thunk, and it executes thunks that it
needs. In most dataflow models, computation is availability driven, in that
thunks which can run will do, and when they complete they will enable other
thunks which needed those values to now run.
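The contrast can be sketched in code (illustrative Python, hypothetical names): demand-driven evaluation pulls values by forcing thunks from the result backwards, while availability-driven dataflow pushes values forward, firing a node as soon as its inputs exist.

```python
# Demand-driven: nothing runs until the final result is demanded.
def lazy_add(a_thunk, b_thunk):
    return lambda: a_thunk() + b_thunk()

total = lazy_add(lambda: 1 + 2, lambda: 3 + 4)  # no work done yet
print(total())  # forcing the result demands the whole chain: 10

# Availability-driven: each completed value immediately enables consumers.
def run_dataflow(nodes, initial_tokens):
    """nodes: name -> (fn, input names). Fire each node once its inputs exist."""
    values = dict(initial_tokens)
    fired = True
    while fired:
        fired = False
        for name, (fn, inputs) in nodes.items():
            if name not in values and all(i in values for i in inputs):
                values[name] = fn(*[values[i] for i in inputs])
                fired = True
    return values

graph = {"a": (lambda: 1 + 2, []),
         "b": (lambda: 3 + 4, []),
         "sum": (lambda x, y: x + y, ["a", "b"])}
print(run_dataflow(graph, {})["sum"])  # 10
```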

~~~
leoc
You're right, I meant 'strictly' rather than 'purely' there (hate that bit of
jargon, btw). Unfortunately 'continuation-passing style' was probably the
wrong choice of words too: I meant 1) the restricted language with only tail
calls, no evaluable arguments etc., not 2) the method (thread the continuation
through the code, etc.) used to encode arbitrary functional programs into 1).
Naturally 2) produces programs which run in 1), but not usually _idiomatic_
programs in 1) (while otoh idiomatic/arbitrary programs in 1) may be side-
effect-free, but they're not usually really _functional_ in the sense of
equational reasoning/term-rewriting).

AFAIK you can encode either the lazy or eager variants of functional programs
into 1), while 1) itself is effectively "super-eager" as it just doesn't
permit any arguments that need evaluation except where explicitly delayed
inside a lambda, right?

