Some scholarly input:
> But most of the existing ones are visual languages.
The dataflow architecture guys were making textual languages from the start. Others spun off dataflow variants (synchronous dataflow, for example) for SE purposes rather than parallelism, e.g. Lustre and LabVIEW.
> Ok, so I have plenty of ideas for all levels of the stack, but not enough time to pursue them with sufficient rigor.
It could help to look at people who have tried to do the same thing. At the language level, look up Sisal, Id, Val, or pH (parallel Haskell), for example.
A small selection, from bottom to top of the stack:
To go completely off the hardware deep-end:
 J. Gurd, C. Kirkham, and I. Watson, “The Manchester prototype dataflow computer,” Commun. ACM, vol. 28, no. 1, pp. 34–52, 1985.
This imparts some intuition on how a civilised language can map onto dataflow instructions. Made by hardware guys.
 R. S. Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” Computers, IEEE Transactions on, vol. 39, no. 3, pp. 300–318, 1990.
SISAL: SSA language oriented for numerical/HPC work. Made by compiler guys.
 J. Gaudiot, T. DeBoni, J. Feo, W. Böhm, W. Najjar, and P. Miller, “The Sisal project: real world functional programming,” Compiler optimizations for scalable parallel systems, pp. 45–72, 2001.
And somewhat more high-level, for the FP crowd:
 P. Barth and R. Nikhil, “M-structures: extending a parallel, non-strict, functional language with state,” Functional Programming Languages and Computer Architecture, pp. 538–568, 1991.
 Arvind, R. S. Nikhil, and K. K. Pingali, “I-structures: data structures for parallel computing,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 11, no. 4, Oct. 1989.
The problem with textual data-flow programming languages is that I haven't seen any that aren't ugly beyond repair.
SISAL isn't all that bad, ignoring the 80's syntax:
function BottlesOfBeer(i : integer returns array[string])
  if i = 1 then
    "1", " bottle", "If that bottle", "No more", " bottles"
  elseif i = 2 then
    itoa(2), " bottles", "If one of those bottles", itoa(1), " bottle"
  else
    itoa(i), " bottles", "If one of those bottles", itoa(i-1), " bottles"
  end if
end function

and the verse-building part:

  s || bottles || " of beer on the wall",
  s || bottles || " of beer!",
  preface || " should happen to fall... ",
  n || nextbottles || " of beer on the wall!",
(Jon Riecke <http://scholar.google.com/citations?user=lSdHdpUAAAAJ&hl...; did most of the work on the language.)
P.S.: I was fascinated after reading Peter Van Roy's book some time ago, and still wonder about this, a bit disappointed.
This perspective from a compiler developer lends credence to Stanislav Datskovskiy's rants about the importance of the underlying CPU architecture. Stanislav has done some work on a new dataflow-based CPU architecture (http://www.loper-os.org/?p=846).
I wonder why Mike considers the usual blend of concepts provided by FP languages to not be worthwhile. After all, FP languages that don't allow reassignment of variables are in SSA form already, unless I misunderstand what SSA form is.
Just the fact that a language is closer to SSA form doesn't make it interesting. If a compiler is able to turn pretty much every source language into SSA, then why should I bother to do part of its work by hand? Much more relevant is how easy it is to express my intentions as a programmer and how much of that ends up helping the compiler and the CPU to execute the code faster.
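To make the "already in SSA form" point concrete, here's a tiny sketch in Python (the variable names are invented for illustration):

```python
# Imperative code with reassignment:
#   x = 1
#   x = x + 2
#   y = x * 3
#
# The same computation in SSA form: every variable is assigned exactly
# once, so it reads like a chain of pure definitions.
x1 = 1
x2 = x1 + 2      # a fresh name instead of reassigning x
y1 = x2 * 3

# An FP language that forbids reassignment forces this style directly,
# which is the sense in which such programs are "already SSA".
assert y1 == 9
```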
I really don't want this to turn into a flame-war about the merits of functional programming. The Wikipedia article has a nice list of the typical concepts: http://en.wikipedia.org/wiki/Functional_programming#Concepts ... and the particular mix simply doesn't appeal to me personally.
Imperative compilers have three steps: they compile from an imperative language to SSA (which is almost functional), then optimize the SSA, and then convert back to an imperative form.
The reason that they convert the program into SSA is because it is much easier to reason about the program in that form. My take is that imperative compilers are an admission that functional programs are easier to reason about.
If you're implying that "easier to reason about" for a compiler is equivalent to "easier to reason about" for humans, I think that's a fallacy.
Where functional programming fails is in performing tasks that are predominantly imperative; say, anything involving real-time behavior, like games or animation. And by "fails" I mean that an imperative language is easier for humans to follow when it maps more directly to the problem at hand.
I would like to know why you think that. The same predominant factor, complicated interactions, prevents easy reasoning in both cases.
I don't agree that functional programming fails in those cases.
Even the most strident FP advocates I know who are real game developers say that they wouldn't even consider using a strictly functional language for game development. IO-heavy apps in general seem like they're not a good use of functional languages.
Different programming paradigms suit different problem domains. It's not complicated interactions that would make it harder to understand for a human; it's that when you try to use a language for something that it isn't well suited for, it can be hard to figure out what the program does or how it works.
The objective isn't to write the most clever code, or even the code that's easiest to reason about, but the code that works and is easy to understand.
It's easier to reason about for the developers who are writing the compiler transforms, ie. humans.
As the FP community has consolidated around Haskell, this usually refers to the Haskell feature set: graph reduction and lazy semantics. Graph reduction is extremely imperative (see: the Spineless Tagless G-machine), which makes it map well onto a von Neumann processor, but limits the amount of exploitable parallelism. Relying on lazy semantics also wrecks parallelism, because some evaluation branches cannot be visited ahead of time without changing the program's semantics.
A functional language using eager semantics and string-reduction execution would map much better onto, say, a parallel dataflow processor. Research in this direction got buried as parallel-computing research money evaporated in the early 90s.
For a pure functional language, like Haskell, speculatively evaluating pieces of it in parallel shouldn't hurt anything (since they are guaranteed not to have side effects).
It is my understanding that the lazy semantics of Haskell mostly mean that you shouldn't make any assumptions about the order of evaluation, which would seem to be useful from an optimization perspective.
I am still new to FP and Haskell, so please correct me if I am misinterpreting things.
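The speculative-evaluation idea can be sketched in Python with a thread pool (the function names here are made up for illustration). Because both branches are pure, starting them both early is safe; the loser's result is just thrown away:

```python
# Sketch: speculatively evaluating both branches of a pure conditional
# in parallel. Safe only because the branches have no side effects.
from concurrent.futures import ThreadPoolExecutor

def branch_a(n):          # pure: no side effects
    return sum(range(n))

def branch_b(n):          # pure: no side effects
    return n * n

def speculative_if(cond, n):
    with ThreadPoolExecutor(max_workers=2) as pool:
        fa = pool.submit(branch_a, n)   # start both branches eagerly
        fb = pool.submit(branch_b, n)
        return fa.result() if cond else fb.result()

assert speculative_if(True, 10) == 45
assert speculative_if(False, 10) == 100
```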
A crude example, to impart some intuition:
True || this_crashes();
The semantics of your program rely on the fact that some branches of your AST/program are not visited, due to lazy semantics. An FP language with eager evaluation semantics is free to evaluate every branch of the program, throwing away what isn't used. Lazy evaluation means a runtime cannot be as aggressive, since it introduces a number of logical sequence points.
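Mimicking the `True || this_crashes()` example in Python, by passing the right-hand side as a thunk (a zero-argument function), shows why eager speculation over such code is unsafe:

```python
# In a lazy language, `True || this_crashes()` never evaluates the
# right-hand side. We can model that with an explicit thunk.
def this_crashes():
    raise RuntimeError("boom")

def lazy_or(lhs, rhs_thunk):
    # Only force the thunk if the left side doesn't decide the result.
    return True if lhs else rhs_thunk()

assert lazy_or(True, this_crashes) is True    # thunk never forced

# A runtime that blindly evaluated both operands would crash:
try:
    this_crashes()
except RuntimeError:
    pass  # exactly the failure eager speculation would have to handle
```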
NB. A Haskell colleague tells me that it is considered good style in Haskell these days to program as if it were eagerly evaluated. There is talk of limiting laziness or doing away with it in Haskell Prime. Can anyone confirm/deny this?
Eagerness would fix this, since such a program as above would always cause an infinite loop, and thus would no longer be a useful construction. With lazy evaluation it only spins if you try to evaluate more than the first element of the list.
Of course you could also fix this by going with total programming, which basically removes the implicit bottom value from the language. Which is a nice concept. Though then you need to explicitly distinguish your infinite data from your finite data (think of a network socket versus a string). And you lose Turing-completeness (but who needs that anyway :) )
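The "only spins if you force too much" behaviour can be modelled in Python with a generator standing in for the lazy infinite list:

```python
# A conceptually infinite structure; forcing a finite prefix is fine.
from itertools import count, islice

naturals = count(0)                         # infinite lazy stream
first_three = list(islice(naturals, 3))     # demand only three elements
assert first_three == [0, 1, 2]

# list(count(0)) would loop forever -- the analogue of evaluating more
# of the structure than the program actually needs.
```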
That kind of code is only valid in strict languages because of short-circuiting, which is functionally equivalent to laziness in this respect. If you tried to automatically parallelize C by speculatively evaluating both sides, you would have exactly the same problem.
Lazy languages may make this kind of code easier to write, but the problem is actually that of non-deterministic calculations, and would be a problem anywhere.
That's not functional though -- this_crashes() has side effects. I'm far from a Haskell expert, but I guess in practice you'd be returning some error type from that function, and then it doesn't matter whether it's evaluated or not, since True || arbitrary_expression will always just be True.
Laziness, AFAIK, only complicates compilation by making it hard to reason about memory/performance, and by requiring branches all over the place to check whether expressions have been evaluated yet.
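A minimal memoizing thunk, sketched in Python, shows roughly where those "has this been evaluated yet?" branches come from (this is a toy model, not how GHC actually represents thunks):

```python
# A thunk: a delayed computation plus a flag recording whether it has
# run. Every access pays for the flag check.
class Thunk:
    def __init__(self, fn):
        self.fn = fn
        self.evaluated = False
        self.value = None

    def force(self):
        if not self.evaluated:        # the branch a lazy runtime
            self.value = self.fn()    # executes on every access
            self.evaluated = True
        return self.value

calls = []
t = Thunk(lambda: calls.append(1) or 42)
assert t.force() == 42
assert t.force() == 42
assert len(calls) == 1    # the underlying computation ran only once
```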
> True || construct_a_1TB_object()
Unless you have infinite computation and memory, there are always visible side-effects, no matter how functional your language is.
If the 1TB object can in any way be constructed a bit at a time, then in principle you can get some work done in constructing it and back that out if you wind up not needing it.
If you're interested, someone already did some legwork on this, using an eager version of Haskell:
 S. Aditya, Arvind, J.-W. Maessen, L. Augustsson, and R. S. Nikhil, “Semantics of pH: A parallel dialect of Haskell,” presented at the FPCA 95, 1995, pp. 35–49.
Roughly, graph reduction execution represents an expression as a node, with an edge to the nodes of its subexpressions. The value of a node is obtained by walking through its children, setting the value of each node as the recursion unwinds.
String reduction simply rewrites the expression itself. This does not eliminate common subexpressions (a() * a() will reduce a() twice), which is why graph reduction is favoured. In a parallel world, though, you often don't care about the extra work if it reduces the number of synchronisations.
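The sharing difference can be made concrete by counting evaluations in Python (a crude model: sharing is simulated by binding the subexpression's value to a name):

```python
# Count how many times the subexpression a() is actually reduced.
calls = {"a": 0}

def a():
    calls["a"] += 1
    return 3

# String reduction: each textual occurrence is reduced independently.
calls["a"] = 0
string_result = a() * a()
assert string_result == 9 and calls["a"] == 2

# Graph reduction: both occurrences point at one shared node, modelled
# here by evaluating once and reusing the value.
calls["a"] = 0
shared = a()
graph_result = shared * shared
assert graph_result == 9 and calls["a"] == 1
```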
SSA is actually more appropriate for imperative languages. Compilers for functional languages tend to use other intermediate representations such as CPS / ANF. They are "equivalent" though: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34....
Of course, the advantage that ANF / CPS have is that they can naturally represent higher-order programs, but for first-order programs they don't really have any advantage over SSA.
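For a first-order function the two styles are easy to compare side by side; here's a toy sketch in Python (function names invented), where CPS names every intermediate value and makes evaluation order explicit, much as SSA does:

```python
# Direct style: the intermediate t gets a name, as in ANF.
def add_then_double(x, y):
    t = x + y
    return t * 2

# CPS: each step hands its result to an explicit continuation k,
# so control flow becomes ordinary data flow.
def add_then_double_cps(x, y, k):
    t = x + y
    return k(t * 2)

assert add_then_double(3, 4) == 14
assert add_then_double_cps(3, 4, lambda r: r) == 14
```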
Basically it is an FSM (finite state machine) that generates an output depending on its input.
I don't think you lose too much meaning of his sentence if you replace FST by Turing Machines (the more usual comparison to lambda-calculus).
I think it's a relatively simple transformation from an FP language without reassignment, but I believe it's one significant difference at the surface level.
It seems so often languages are built, and then the tools are an afterthought: things like debugging, visualizing program flow, code assist, whatever it may be. What if program flow (that is, what's going on inside the program) were in mind from the start, instead of something you try to add later?
Just my thoughts. Is this a feasible way to design a language?
The question is whether it is possible to do both at once, hence Mike's point about the ideal language needing to find the right high-level abstractions (machine-efficient and developer-intuitive). What we currently have is somewhere in-between, but more towards the machine-efficient side. It's quite easy to produce inefficient code, without realising it, in today's high-level languages.
Yeah, that's a valid point. And it's probably just my lack of experience in language design, or general naïveté. It seems like most devices (modern smartphones, tablets, traditional computers) are fast enough that we'd be better served focusing on developer productivity in newer languages. Sure, they'll never compete with C in terms of efficiency.
Besides, I can still easily create terribly inefficient code in C ;)
There will definitely always be places where software efficiency trumps programmer efficiency, but in a lot of cases it doesn't. It's still important, just not always the most important.
But it is certainly harder than it looks: how do you do recursion?
It uses dataflow to efficiently execute individual instructions while atomically committing groups of instructions to preserve understandable semantics. While tiled architectures are common now, most of them have one core per tile. TRIPS built a large, high-performance core out of simple tiles so it can run different types of code efficiently.
My impression is that TRIPS is much too complicated to be clocked at comparable frequencies to commercial processors. Their ISA also appears very hard to compile to.
I've heard some people conjecture that it's actually a pseudonym for a group of people.... but I actually prefer to think it's one guy, by himself, getting awesome shit done.
It didn't quite work out that way, but as a handy side effect it killed off PA-RISC and Alpha with little more than a few press releases.
Basically they gambled on the mother of all handwaves and lost.
* It's the evaluator's job to evaluate expressions. Expressions are primitive recursive so they always normalize.
* The executor's job is to talk to the outside world and to provide data to the evaluator and receive commands from the evaluator. It also can't get stuck in an infinite loop; it must eventually provide data to the evaluator after receiving a command.
The only way an infinite loop can happen is if the evaluator and the executor get stuck together.
When I said processor before, I was referring to the part that did the processing: i.e. the evaluator.
* from a systems point of view, an infinite loop is no worse than a routine that doesn't return after a reasonable amount of time. PR is a huge complexity class (it contains NEXP); just because a function always terminates doesn't mean it's efficient (or as efficient as you would like/need it to be.)
* PR functions are somewhat easier to reason about than general recursive functions (no need to mess around with bottom and domain theory) but I haven't seen a lot of evidence that that translates to making them easier to aggressively optimize.
* In fact, I have heard many PR functions have a more efficient GR equivalent, and I don't know of any way to automatically derive the GR version from the PR; and I expect (just on hunch) that no perfectly general PR->GR conversion could exist.
* Granted, a function can be shown to be total without necessarily being PR, but then you have the burden of proving that it is total, and it seems inelegant to move that burden to hardware. Maybe it's not, maybe that's just "normal science" talking.
* In practice, if I want to run an interpreter for an existing TC programming language on this architecture, it has to treat the architecture as TC (i.e. conceptually break its separation between executor and evaluator) anyway.
As for optimization, I believe that it may be possible to more effectively automatically parallelize a PR function than a GR one.
But at least one thing I can see is that in PR, you need to fix the upper bounds of the loops (/recursion depth) ahead of time, not "on the fly". If you're doing some kind of iterative approximation, you probably don't know what those bounds "should" be, because you're going to terminate on some other condition, like, the change in the value has become negligible. Your upper bound will be a worst-case estimate -- which you have to do extra work to compute -- and I don't see how it differs much, in practice, from a timeout, which has the advantage (again from a systems perspective) of being measured in clock time rather than cycles.
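The trade-off is easy to see in a toy iterative approximation; a sketch in Python (the bound and tolerance values are arbitrary):

```python
# Newton's method for square roots, written two ways.

def sqrt_bounded(x, max_iters=50):
    # PR-style: the iteration count is fixed before the loop runs,
    # as a worst-case estimate.
    g = x
    for _ in range(max_iters):
        g = (g + x / g) / 2
    return g

def sqrt_converging(x, eps=1e-12):
    # General-recursive style: terminate on a data-dependent condition
    # ("the change has become negligible"), unknown ahead of time.
    g = x
    while abs(g * g - x) > eps:
        g = (g + x / g) / 2
    return g

assert abs(sqrt_bounded(2.0) - 2.0 ** 0.5) < 1e-9
assert abs(sqrt_converging(2.0) - 2.0 ** 0.5) < 1e-9
```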
Not sure about parallelization. PR doesn't suggest any advantages to me for that offhand, but then, I haven't thought about it.
"I don't know for sure that this is possible yet, but I believe that the processor would be able to estimate the amount of work required to evaluate an expression. Using this ability, it would be able to automatically parallelize the evaluation of an expression by splitting it up into pieces of approximately equal size and then distributing them to sub processors."
This, and things like this, are much more difficult for untyped terms.
Plus anything you encode on the silicon is basically permanent. (Sure, you can tweak things with microcode, but only so much.)
That's debatable. By making the separation between hardware and software here, it allows future processors to make substantial changes to their internals.
> It isn't clear to me how to make the silicon, say, check for the totality of an arbitrarily-large function in any reasonable manner.
Actually, that part is easy. Essentially, you can define a typed functional language without recursion (or recursive definitions). The only looping that you do allow is the use of eliminators over inductive data types.
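A crude model of "looping only via eliminators", in Python: the only iteration allowed is a fold over Peano naturals, which peels off one constructor per step and therefore always terminates (the helper names are made up):

```python
# Eliminator for naturals: apply step to base, n times. The recursion
# is structural -- bounded by the size of the input -- so termination
# is guaranteed by construction, with no general recursion anywhere.
def nat_elim(n, base, step):
    acc = base
    for _ in range(n):
        acc = step(acc)
    return acc

# Arithmetic defined with no recursion of its own:
plus  = lambda a, b: nat_elim(a, b, lambda x: x + 1)
times = lambda a, b: nat_elim(a, 0, lambda x: plus(b, x))

assert plus(3, 4) == 7
assert times(3, 4) == 12
```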
> Plus anything you encode on the silicon is basically permanent.
The only thing that is really permanent is the interface. Future processors can change their internals, but the old programs should always work on new processors.
Yes, but I said I don't see how you can check totality on an arbitrarily large function efficiently. If a total function has 65,537 cases to be checked, silicon is not the place to do it. If the silicon isn't checking it, then it's the compiler doing it anyhow.
The thing is, your processor may be arranged differently, even very differently, but on the whole it can't be doing more than modern processors are, or it will be slower than modern processors, in which case, why not just continue compiling onto them anyhow? History is full of "better" processors from various places that, by the time they came out, were already a factor of 3 or 4 (or more) slower than just running on x86.
CPUs are extraordinarily wasteful. The reason I think a processor like this might work better is because it could do a lot more work in parallel.
I've seen this claim before, and I've even accepted it at face value before. But it seems to me that it isn't adequately supported by evidence. What evidence do we have that a substantially more efficient (therefore less wasteful) CPU architecture is possible?
But Epiphany is not a dataflow architecture, and suffers from execution efficiency problems in individual cores. Dataflow is really the way to go to lower energy consumption dramatically.
Funny this topic comes up. It just happens the EU has recently invested in such a project to co-design a dataflow processor and software stack. It was north of a million euros for the initial proof-of-concept research, and the results are slowly starting to trickle through.
(I am one of the authors; ping me for more information)
Could you expand on this for a layperson? I'm terribly interested.
However a conventional processor pipeline will usually fetch instructions and begin processing them, only to realize later on that they were not necessary. This happens upon mispredicted branches, cache misses, exceptions, etc. These correspond to circuits that get activated, spend energy, only to throw away their results because the instruction's effects must be discarded.
In contrast, in a dataflow processor, each instruction indicates explicitly which other instruction(s) will produce its input. Or conversely, which other instruction(s) get activated as the result of one instruction completing execution. This way, instructions only enter the pipeline when their operands are ready, and speculation never occurs. So there is no more energy spent than strictly necessary to do the work (instructions).
Now, the reason why we use the former forms of speculation is that it is the only way to make the pipeline fast when there is no information in the instruction stream (program) about the dataflow dependencies between instructions. Because it does not know better, the scheduler has to either: 1) try all instructions in program order, start doing work as early as possible, and sometimes discard work already started because an earlier instruction has decided a branch / fault / etc., or 2) rediscover the dataflow links by analyzing the instructions as they enter the processor -- but then the silicon logic to implement these tricks also costs energy.
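A toy simulation of the dataflow scheduling rule, in Python: each instruction names its producers, and fires only once all its operands have arrived, so no work is ever started speculatively and discarded (the instruction names and two-operation ISA are invented for illustration):

```python
# program: name -> (operation, names of the instructions producing
# its inputs). An instruction fires exactly when its operands exist.
def run_dataflow(program):
    values, pending = {}, dict(program)
    while pending:
        for name, (op, deps) in list(pending.items()):
            if all(d in values for d in deps):     # operands ready?
                values[name] = op(*(values[d] for d in deps))
                del pending[name]                  # fire exactly once
    return values

program = {
    "a":   (lambda: 2, ()),
    "b":   (lambda: 3, ()),
    "mul": (lambda x, y: x * y, ("a", "b")),    # waits for a and b
    "add": (lambda x, y: x + y, ("mul", "b")),  # waits for mul and b
}
assert run_dataflow(program)["add"] == 9
```

Note that nothing in this loop preserves program order: ready instructions fire in whatever order the scheduler visits them, which is exactly the locality problem discussed below.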
The funny thing is, all compilers know about dataflow dependencies between instructions, but they throw the information away because the existing instruction sets cannot encode it.
So really the situation should be simple: make new processors that support dataflow annotations, extend the compilers to encode this information (which they already have anyways), and off we go.
However as others have highlighted making new instruction sets is like a "moonshot" because you have to involve a lot of people: compiler implementers, but also OS devs and everyone who will need to port their code to the new ISA.
Besides, dataflow processors have a gorilla in the kitchen too. In a "pure" dataflow scheduler, all the instruction order is destroyed and as a result, cache locality is broken. So the flip side of the coin becomes 1) bad memory performance 2) extra energy expenditure on the memory system to deal with cache misses.
Now there are ways to get the best of both worlds.
One is to destroy the ordering of instructions only partially, by only applying dataflow scheduling on a window (eg the next 20 instructions). This is more or less what modern out-of-order processors do, although they still waste energy re-discovering dataflow links at run-time.
The other technique is where many of us are going right now: use multiple hardware threads interleaved; keep the instruction order within threads to exploit whichever locality is encoded by the programmer, and apply dataflow scheduling techniques across instructions from separate threads, ie exploit maximum instruction concurrency between independent threads. Sun/Oracle started it with Niagara, now ARM is going there too. This approach really works very well in terms of operations / watt, however it requires software to use threads in the first place and not much software does that (yet).
Also there is still a lot of ongoing research.
For many applications, pure power consumption isn't even the best metric anymore. Due to advances in on-chip power and clock distribution, the energy x delay product and overall silicon efficiency have gained more importance in past years.
Obviously, dataflow processors excel in these metrics. And VLIW processors fall behind, which IMHO is the primary reason for their demise.
I agree with you that a practical dataflow architecture needs to be hierarchical. Not just for cache locality, but to reduce wiring overhead and debugging complexity, too.
Really the way forward would be to extend x86 completely, with a "mode" where all instructions are annotated and go through a different pipeline front-end than legacy x86 code. But Intel already tried that with IA64, and it burned them very hard. I am not sure they are willing to do it again.
AMD64 / x64 pretty much hops into a different mode and goes on executing from there. Given how many modes and instructions these chips support, I don't see why adding another would easily upset people.
However there is a big difference: the move to 64-bit words was something that was in demand when it was introduced. There was a market for it, with a very clear value proposal.
In contrast, a new "mode" with the same computational power plus dataflow annotations would be a tough sell: larger code size, and better performance per watt only for some applications.
(Also, as far as I know AMD64 / x64 on Intel cores uses the same decode and issue unit, just with different microcode. Circuit-wise there is a lot in common with x86.
Here we would be talking about a new mode and also a new instruction+data path in the processor. The core would become larger and more expensive. Not sure how that plays.)
1) What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler.
In practice, things like Occam, SISAL and most pure functional programming languages are OK-ish.
2) any language should allow an (advanced) programmer (or compiler) to annotate the instructions to also suggest some ordering not related to data dependencies. As I explained before dataflow scheduling tends to destroy order and break locality, and for some applications this is very bad. Unfortunately all existing dataflow-ish languages (incl most functional languages) were designed with the outdated vision that all memory accesses have the same cost. We now know this is not true any more.
Other than using threads (as I explained before) a well-known theoretical way forward is to introduce optional control flow edges between instructions using "ghost" data dependencies, which impact scheduling but do not allocate registers/I-vars. However I am not aware of languages where this is possible.
> What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler.
I don't know what you mean by I-structures, or dataflow variables, or why they are not in memory, or why allowing programmers to get at them would mess with a hardware dataflow schedule. I can guess at what you mean by all of these, but a more detailed explanation would still be very interesting.
I am very interested in this! Drop me a line at my nick @ vub.ac.be
* Intel's iAPX "mainframe in a chip" project was aimed at directly supporting HLL features,
* The Rekursiv chip project was intended for an object-oriented language,
* The Transputer was meant to support Occam,
and there have probably been others.
Did he publish those ideas somewhere?