Hacker News
Mike Pall: Ramblings on languages and architectures (freelists.org)
146 points by asb on Apr 16, 2013 | 96 comments



Great to see Mike come to such conclusions. I've been interested myself in old dataflow architectures work for the same reasons.

Some scholarly input:

> But most of the existing ones are visual languages.

The dataflow architecture guys were making textual languages from the start. Others spun dataflow variants off (synchronous dataflow, for example) for software-engineering purposes rather than parallelism; e.g. Lustre and LabVIEW.

> Ok, so I have plenty of ideas for all levels of the stack, but not enough time to pursue them with sufficient rigor.

It could help to look at people who have tried to do the same thing. At the language level, look up Sisal, Id, Val, or pH (parallel Haskell), for example.

A small selection, from bottom to top of the stack:

To go completely off the hardware deep-end:

[1] J. Gurd, C. Kirkham, and I. Watson, “The Manchester prototype dataflow computer,” Commun. ACM, vol. 28, no. 1, pp. 34–52, 1985.

This imparts some intuition on how a civilised language can map onto dataflow instructions. Made by hardware guys.

[2] R. S. Nikhil, “Executing a program on the MIT tagged-token dataflow architecture,” Computers, IEEE Transactions on, vol. 39, no. 3, pp. 300–318, 1990.

SISAL: SSA language oriented for numerical/HPC work. Made by compiler guys.

[3] J. Gaudiot, T. DeBoni, J. Feo, W. Böhm, W. Najjar, and P. Miller, “The Sisal project: real world functional programming,” Compiler optimizations for scalable parallel systems, pp. 45–72, 2001.

And somewhat more high-level, for the FP crowd:

[4] P. Barth and R. Nikhil, “M-structures: extending a parallel, non-strict, functional language with state,” Functional Programming Languages and Computer Architecture, pp. 538–568, 1991.

[5] Arvind, R. S. Nikhil, and K. K. Pingali, “I-structures: data structures for parallel computing,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 11, no. 4, Oct. 1989.


Thanks for the references! I'll have a read over the weekend.

The problem with textual data-flow programming languages is that I haven't seen any that aren't ugly beyond repair.


Indeed. Many, lacking compiler research know-how, just make it a graph-encoding language, which is really not the point of dataflow. Any language that maps trivially to SSA semantics is essentially already a dataflow language.

SISAL isn't all that bad, ignoring the 80's syntax:

  function BottlesOfBeer(i : integer returns array[string])
    let
      s,bottles,preface,n,nextbottles :=
        if i = 1 then
          "1"," bottle","If that bottle","No more"," bottles"
        elseif i = 2 then
          itoa(2)," bottles","If one of those bottles",itoa(1)," bottle"
        else
          itoa(i)," bottles","If one of those bottles",itoa(i-1)," bottles"
        end if;
    in
      array[1:
        s || bottles || " of beer on the wall", 
        s || bottles || " of beer!",
        preface || " should happen to fall... ",
        n || nextbottles || " of beer on the wall!",
        ""
      ]
    end let
  end function


Look for the term "streaming". My project is distributed, asynchronous streaming (research name "System S", product name "InfoSphere Streams") which uses a dataflow language called SPL (Streams Processing Language). However, it allows imperative code at the vertices in the dataflow graph. It's appropriate for large scale streaming across whole clusters. The StreamIt project at MIT is not distributed and is synchronous, which maps to single chips much better. In fact, it grew out of making DSPs (digital signal processors) more programmable.


Working with a couple IBM consultants, I had just gotten an MVP of InfoSphere Streams running at a supermajor oil company before I left the job last year :)


We built a dataflow system for Complex Event Processing (CEP) at one of my previous startups. It included both a visual programming GUI and a non-hideous programming language called SPLASH that allowed one to code nodes of the dataflow graph. The product is now known as the SAP Sybase Event Stream Processor, and this page has an example of SPLASH: http://www.sybase.com/products/financialservicessolutions/co...

(Jon Riecke <http://scholar.google.com/citations?user=lSdHdpUAAAAJ&hl...> did most of the work on the language.)


You mentioned in your original post that pure visual programming languages have issues, and having a ton of experience with one (Simulink), I wouldn't disagree - especially as the 'codebase' size increases. However a hybrid visual/code approach can work quite well, in my experience; see Stateflow and MATLAB Coder, which are embedded within the graphical Simulink environment.


Very interesting track for somebody eager to learn. May I consider Alice ML and Mozart/Oz to be examples of textual data-flow languages? I wonder why these research projects have evidently slowed down, or never really took off.

P.S.: I was fascinated after reading Peter Van Roy's book some time ago, and I still wonder about this with some disappointment.


I still don't understand the policy of editors changing post titles here. The original title was "Mike Pall (LuaJIT) on programming language design". Maybe "Ramblings on languages and architectures" like the original posting is better, but I don't see the win in removing the author's name from it. After all, the reason we're all initially interested is that these thoughts are from Mike Pall, author of perhaps the most impressive dynamic language JIT.


At this point, why is there even a title field available?


One of the anonymous editors has added back "Mike Pall" to the title. Thanks.


EDIT: Expanded my question about Mike's opinion of FP.

This perspective from a compiler developer lends credence to Stanislav Datskovskiy's rants about the importance of the underlying CPU architecture. Stanislav has done some work on a new dataflow-based CPU architecture (http://www.loper-os.org/?p=846).

I wonder why Mike considers the usual blend of concepts provided by FP languages to not be worthwhile. After all, FP languages that don't allow reassignment of variables are in SSA form already, unless I misunderstand what SSA form is.


Converting to SSA form is comparatively easy. The interesting question is how pure a particular SSA variant actually is (hint: almost none are in practice).

Just the fact that a language is closer to SSA form doesn't make it interesting. If a compiler is able to turn pretty much every source language into SSA, then why should I bother to do part of its work by hand? Much more relevant is how easy it is to express my intentions as a programmer and how much of that ends up helping the compiler and the CPU to execute the code faster.

I really don't want this to turn into a flame-war about the merits of functional programming. The Wikipedia article has a nice list of the typical concepts: http://en.wikipedia.org/wiki/Functional_programming#Concepts ... and the particular mix simply doesn't appeal to me personally.


It's interesting that you say that you are doing the compiler's work when you program functionally.

Imperative compilers have three steps: they compile from an imperative language to SSA (which is almost functional), then optimize the SSA, and then convert back to an imperative form.
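A minimal sketch of that first step in Python (the phi node is modeled with a conditional expression; all names are illustrative):

```python
# Imperative code: x is reassigned in both branches.
def imperative(n):
    x = 0
    if n > 0:
        x = n
    else:
        x = -n
    return x

# The same function in SSA style: every name is assigned exactly once,
# and a "phi" merges the values flowing out of the two branches.
def ssa(n):
    x1 = n                      # definition from the then-branch
    x2 = -n                     # definition from the else-branch
    x3 = x1 if n > 0 else x2    # phi(x1, x2), selected by the branch taken
    return x3
```

Once every name has a single definition, data dependencies are explicit, which is what makes the optimization step tractable.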

The reason they convert the program into SSA is that it is much easier to reason about the program in that form. My take is that imperative compilers are an admission that functional programs are easier to reason about.


>My take is that imperative compilers are an admission that functional programs are easier to reason about.

If you're implying that "easier to reason about" for a compiler is equivalent to "easier to reason about" for humans, I think that's a fallacy.

Where functional programming fails is in performing tasks that are predominantly imperative; say, anything involving real-time behavior, like games or animation. And by "fails" I mean that an imperative language is easier for humans to follow when it maps more directly to the problem at hand.


> If you're implying that "easier to reason about" for a compiler is equivalent to "easier to reason about" for humans, I think that's a fallacy.

I would like to know why you think that. The same predominant factor, complicated interactions, prevents easy reasoning in both cases.

I don't agree that functional programming fails in those cases.


What kind of code do you write? I write games (and, recently, UI-heavy interactive apps that talk to servers).

Even the most strident FP advocates I know who are real game developers say that they wouldn't even consider using a strictly functional language for game development. IO-heavy apps in general seem like they're not a good use of functional languages.

Different programming paradigms suit different problem domains. It's not complicated interactions that would make it harder to understand for a human; it's that when you try to use a language for something that it isn't well suited for, it can be hard to figure out what the program does or how it works.

The objective isn't to write the most clever code, or even the code that's easiest to reason about, but the code that works and is easy to understand.


> If you're implying that "easier to reason about" for a compiler is equivalent to "easier to reason about" for humans, I think that's a fallacy.

It's easier to reason about for the developers who are writing the compiler transforms, i.e. humans.


Agreed--I think functional programming comes a lot closer (or at least it can, in theory) to expressing programmer intentions. The compiler doesn't have to deal with side effects and all of their hairy consequences (escape analysis, aliasing, inferring dependencies, ...).


> I wonder why Mike considers the usual blend of concepts provided by FP languages to not be worthwhile.

As the FP community has consolidated around Haskell, this usually refers to the Haskell feature set: graph reduction, lazy semantics. Graph reduction is extremely imperative (see: the Spineless Tagless G-machine), which makes it map well onto a von Neumann processor, but limits the amount of exploitable parallelism. Relying on lazy semantics also wrecks parallelism, because some evaluation branches cannot be visited ahead of time without changing the program's semantics.

A functional language using eager semantics and string-reduction execution would map much better onto, say, a parallel dataflow processor. Research in this direction got buried as parallel-computing research money evaporated in the early 90's.


> Relying on lazy semantics also wrecks parallelism, because some evaluation branches cannot be visited ahead of time, without changing the program's semantics.

For a pure functional language, like Haskell, speculatively evaluating pieces of it in parallel shouldn't hurt anything (since they are guaranteed not to have side effects).

It is my understanding that the lazy semantics of Haskell mostly mean that you shouldn't be making any assumptions about the order of evaluation which would seem to be useful from an optimization perspective.

I am still new to FP and Haskell, so please correct me if I am misinterpreting things.


> It is my understanding that the lazy semantics of Haskell mostly mean that you shouldn't be making any assumptions about the order of evaluation which would seem to be useful from an optimization perspective.

A crude example, to impart some intuition:

  True || this_crashes();
Imagine the above reliance on lazy semantics, but applied to Haskell's memory management or to constructs such as infinite lists.

The semantics of your program rely on the fact that some branches of your AST/program are not visited, due to lazy semantics. An FP language with eager evaluation semantics is free to evaluate every branch of the program, throwing away what isn't used. Lazy evaluation means a runtime cannot be as aggressive, as it introduces a number of logical sequence points.
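The contrast can be sketched in Python by modeling lazy values as thunks (zero-argument functions) that are forced only on demand; the names here are made up for illustration:

```python
def bottom():
    # A diverging branch: forcing this thunk never returns.
    while True:
        pass

# Lazy "or": both arguments are thunks; only the thunks actually
# demanded are ever forced.
def lazy_or(a, b):
    return True if a() else b()

result = lazy_or(lambda: True, bottom)   # bottom is never forced
```

An eager semantics would evaluate both arguments before the call and never terminate here; evaluating every branch and throwing away what isn't used is only safe when no branch diverges.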

NB. A Haskell colleague lets me know that it is considered good style in Haskell these days to program as if it was eagerly evaluated. There is talk of limiting or doing away with it in Haskell Prime. Can anyone confirm/deny this?


Crashing would be a side-effect. However your idea is correct. Consider instead:

    [True, spinForever]   -- with spinForever = spinForever
You can consider any "variable" in Haskell to have either a value of the variable's type or the bottom value. The bottom value is tech-speak for no value at all: it means the program has gone into an infinite loop trying to produce the value.

Eagerness would fix this, since such a program as above would always cause an infinite loop and thus would no longer be a useful construction, while with lazy evaluation it only spins if you try to evaluate more than the first element of the list.

Of course, you could also fix this by going with total programming, which basically removes the implicit bottom value from the language. Which is a nice concept. Though then you need to explicitly separate your infinite data from your finite data (think of a network socket, compared with a string). And you lose Turing-completeness (but who needs that anyway :) )


Fair enough. Though as later comments point out, constructing a 1 TB object is probably a better failure case for functional languages. It would also be fairly easy to avoid if the True on the lhs could be evaluated quickly, so you could abort the second calculation.

That kind of code is only valid in strict languages because of short-circuiting, which is functionally equivalent to laziness in this respect. If you tried to automatically parallelize C by speculatively evaluating both sides, you would have exactly the same problem.

Lazy languages may make this kind of code easier to write, but the problem is actually that of non-deterministic calculations, and would be a problem anywhere.


> True || this_crashes();

That's not functional though: this_crashes() has side effects. I'm far from a Haskell expert, but I guess in practice you'd return some error type from that function, and it wouldn't matter whether it's evaluated or not, since True || arbitrary_expression will always just be True.

Laziness, AFAIK, only complicates compilation by making it hard to reason about memory/performance, and by requiring branches all over to check whether expressions have been evaluated yet.


Consider

> True || construct_a_1TB_object()

Unless you have infinite computation and memory, there are always visible side-effects, no matter how functional your language is.


If the 1TB object is constructed atomically, it has to wait, yes - but really, that's the case in eager languages too if you want the same final semantics.

If the 1TB object can in any way be constructed a bit at a time, then in principle you can get some work done in constructing it and back that out if you wind up not needing it.


Would Standard ML be a suitable language for the kind of research you describe?


Very likely. At the very least a subset, or with some additions; I-structures come to mind.

If you're interested, someone already did some legwork on this, using an eager version of Haskell:

[1] S. Aditya, Arvind, J.-W. Maessen, L. Augustsson, and R. S. Nikhil, “Semantics of pH: A parallel dialect of Haskell,” presented at the FPCA 95, 1995, pp. 35–49.


But these kinds of languages do exist and it's reasonable to think about writing compilers for them and/or subsets.


I haven't heard of any recent string reduction languages. The last I heard about it was in Kent Dybvig's thesis "Three Implementation Models for Scheme", where he used it to map Scheme onto a dataflow-like FP architecture (IIRC).


I don't have a formal definition of a string reduction language, so maybe I'm mis-generalizing. That said, J's parser (http://www.jsoftware.com/help/dictionary/dicte.htm) looks like what I imagined the phrase meant.


p. 127 of www.cs.indiana.edu/~dyb/papers/3imp.pdf gives a pretty good overview, including a ton of references.

Roughly, graph reduction represents an expression as a node with edges to the nodes of its subexpressions. The value of a node is obtained by walking through its children, setting the value of each node as the recursion unwinds. String reduction simply rewrites the expression itself. This does not eliminate common subexpressions (a() * a() will reduce a() twice), which is why graph reduction is favoured. In a parallel world, you often don't care about the extra work if it reduces the number of synchronisations.
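A rough Python model of the difference, counting evaluations of a shared subexpression (the helper names are invented):

```python
calls = 0

def a():
    # A shared subexpression with an observable evaluation count.
    global calls
    calls += 1
    return 21

# String reduction: the expression a() * a() is rewritten textually,
# so the shared subexpression is reduced twice.
string_result = a() * a()
string_calls = calls                      # 2 evaluations

# Graph reduction: a() is a single shared node whose value is computed
# once and memoized; both edges of the graph reuse the result.
calls = 0
cache = {}

def a_node():
    global calls
    if "a" not in cache:
        calls += 1
        cache["a"] = 21
    return cache["a"]

graph_result = a_node() * a_node()
graph_calls = calls                       # 1 evaluation
```

Both yield the same product; graph reduction just pays for the shared node once, at the cost of the synchronisation needed to share it.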


Warning: I am far from being knowledgeable about compilers. That being said...

SSA is actually more appropriate for imperative languages. Compilers for functional languages tend to use other intermediate representations such as CPS / ANF. They are "equivalent" though: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34....


In practice, working with them can be quite different. In a lexically scoped representation (either ANF or CPS) you have to preserve lexical scoping after each transformation, which can be pretty annoying if you want to do optimizations in their full generality. With SSA, you have to preserve dominance, but you don't have to update the dominator tree (the equivalent of the information encoded in the lexical scoping of ANF / CPS) if you don't want to; you can simply recompute it.

Of course, the advantage that ANF / CPS have is that they can naturally represent higher-order programs, but for first-order programs they don't really have any advantage over SSA.


I always considered CPS and SSA to be duals. (Just like FST and Lambda-calculus are duals)


What do you mean by FST?


In that context it means Finite State Transducer.

Basically it is an FSM (Finite State Machine) that generates an output depending on its input.

I don't think you lose too much meaning of his sentence if you replace FST by Turing Machines (the more usual comparison to lambda-calculus).
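A toy transducer in Python, for concreteness (this particular machine is invented: its state is the previous input bit, and it outputs '1' whenever consecutive bits differ):

```python
def edge_detector(bits):
    # Finite-state transducer: each (state, input) pair deterministically
    # yields a next state and exactly one output symbol.
    state, out = None, []
    for b in bits:
        out.append("1" if state is not None and b != state else "0")
        state = b
    return "".join(out)

edge_detector("00110")  # "00101": a '1' at each bit flip
```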


Ah, thanks. I thought the comment had some precise dualities in mind, but I guess not? (There is such a correspondence for CPS and SSA, cf Appel's paper.)


I may be wrong, but doesn't SSA imply the usage of the strange PHI thingy?

I think it's a relatively simple transformation from an FP language without reassignment, but I believe it's one significant difference at the surface level.


Doesn't an FP language without reassignment imply the existence of a function like phi to perform nontrivial control flow?


I feel like you could do even better if you approached language design from the other side: instead of focusing on the lower-level aspects, think of the higher-level ones. If I were designing a programming language, I would start with what the development environment would look like, what tools I want to build to support development in the language, etc., and then build the language to support that.

It seems so often languages are built, and then the tools are an afterthought: things like debugging, visualizing program flow, code assist, whatever it may be. What if program flow (that is, what's going on inside the program) were in mind from the start, instead of something you try to add later?

Just my thoughts. Is this a feasible way to design a language?


Depends what your goal is. Your approach is to optimize for developer productivity. Mike's is to optimize for machine productivity.

The question is whether it is possible to do both at once, hence Mike's point about the ideal language needing to find the right high-level abstractions (machine-efficient and developer-intuitive). What we currently have is somewhere in-between, but more towards the machine-efficient side. It's quite easy to produce inefficient code, without realising it, in today's high-level languages.


>Your approach is to optimize for developer productivity. Mike's is to optimize for machine productivity.

Yeah that's a valid point. And it's probably just my lack of experience in language design, or just general naïveté. It seems like most devices (modern smartphones, tablets, traditional computers) are fast enough that we'd be better served focusing on developer productivity in newer languages. Sure, they'll never compete with C in terms of efficiency.

Besides, I can still easily create terribly inefficient code in C ;)


Devices, maybe, but servers less so: if you can get a 100x performance increase from going from, say, PHP to LuaJIT, then you can save a lot of money without a productivity loss.


Definitely. But you're also not going to stand to gain as much from writing some of the tools I'm talking about, anyway.

There will definitely always be places where software efficiency trumps programmer efficiency, but in a lot of cases it doesn't. It's still important, just not always the most important.


I think the same. I have a dream to build one myself, but lack the expertise and the time for it. However, I think of it much like building a "normal" app: start from the end-user, thinking of the most tedious work (like debugging, passing data and transformations, etc.), and then think about how that could be written. In fact, I look at my past projects and try to remember which parts wasted the most time...

But that is certainly harder than asking: how do we do recursion?


Check out the TRIPS/EDGE architecture: http://www.cs.utexas.edu/~trips/overview.html

It uses dataflow to efficiently execute individual instructions while atomically committing groups of instructions to preserve understandable semantics. While tiled architectures are common now, most of them have one core per tile. TRIPS built a large, high-performance core out of simple tiles so it can run different types of code efficiently.


That link is missing their evaluation of TRIPS, which they did in 2008 after they built the chips: http://www.cs.utexas.edu/~bmaher/pubs/asplos09.pdf

My impression is that TRIPS is much too complicated to be clocked at comparable frequencies to commercial processors. Their ISA also appears very hard to compile to.


Mike Pall was the first person I've ever met (via his postings and projects) that made me think, holy cow, this dude is way smarter than me.

I've heard some people conjecture that it's actually a pseudonym for a group of people.... but I actually prefer to think it's one guy, by himself, getting awesome shit done.


"Some people" includes Roberto Ierusalimschy, the inventor of Lua, although he told me he now knows someone who has met Mike so he believes he is actually a singleton.


I'm reminded a bit of the logic that led to Itanium: all this silicon dedicated to doing voodoo mind reading on linear machine code is wasteful. Let's instead push parallelism up into the compilers.

It didn't quite work out that way, but as a handy side effect it killed off PA-RISC and Alpha with little more than a few press releases.


I often wonder whether it was the theory that was broken or the implementation. I doubt anyone will tread that way for quite a while, given the wreckage.


Basically, Intel & HP bet on the appearance of "Sufficiently Smart Compilers", and they didn't turn up. Most of the code people want to run on high-end CPUs is written in C or Fortran. Producing correct parallel code from branchy, pointer-filled, shared-everything codebases is difficult.

Basically they gambled on the mother of all handwaves and lost.


Would a CPU architecture that executed LLVM bitcode, or some other encoding of LLVM IR, be significantly less wasteful than what we have now? I wonder because LLVM IR is already SSA-based and is designed for optimization passes in a compiler. And it exists and is used today.


I've also been thinking about processor architecture recently. My conclusion was that it is ludicrous to have processors' main input language be Turing-complete. Processors need to perform optimizations, to do that they need to make assumptions. In this sense, a Turing-complete input language is the worst possible choice.


What kind of input language would be both useful for general computation and not Turing-complete?


If I were to design a processor, it would essentially execute programs written in a total functional language with corecursion. So you would be able to create Turing-complete programs, but most of the time, the processor will be evaluating expressions in a primitive recursive subset that it can reason about and aggressively optimize.


Turing-complete but with clean subsets seems completely different from not using a Turing-complete input language at all.


In order to execute a total functional language, you have to basically split the processor into two. I'll call these two parts the evaluator and the executor.

* It's the evaluator's job to evaluate expressions. Expressions are primitive recursive so they always normalize.

* The executor's job is to talk to the outside world and to provide data to the evaluator and receive commands from the evaluator. It also can't get stuck in an infinite loop; it must eventually provide data to the evaluator after receiving a command.

The only way an infinite loop can happen is if the evaluator and the executor get stuck together.

When I said processor before, I was referring to the part that did the processing: i.e. the evaluator.
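A loose Python sketch of that split, under the assumptions above (the protocol and function names are invented for illustration):

```python
def evaluator(state, datum):
    # Total: one bounded reduction step that always terminates,
    # returning a new state plus a command for the executor.
    total = state + datum
    return total, ("emit", total)

def executor(inputs):
    # Talks to the outside world and feeds the evaluator. Unbounded
    # iteration lives only here, in the interplay between the two parts.
    state, emitted = 0, []
    for datum in inputs:
        state, (cmd, arg) = evaluator(state, datum)
        if cmd == "emit":
            emitted.append(arg)
    return emitted

executor([1, 2, 3])  # [1, 3, 6]: running sums emitted step by step
```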


I'm interested in approaches like this too, but there are some practical considerations that seem to complicate it:

* from a systems point of view, an infinite loop is no worse than a routine that doesn't return after a reasonable amount of time. PR is a huge complexity class (it contains NEXP); just because a function always terminates doesn't mean it's efficient (or as efficient as you would like/need it to be.)

* PR functions are somewhat easier to reason about than general recursive functions (no need to mess around with bottom and domain theory) but I haven't seen a lot of evidence that that translates to making them easier to aggressively optimize.

* In fact, I have heard that many PR functions have a more efficient GR equivalent, and I don't know of any way to automatically derive the GR version from the PR; and I expect (just on a hunch) that no perfectly general PR->GR conversion could exist.

* Granted, a function can be shown to be total without necessarily being PR, but then you have the burden of proving that it is total, and it seems inelegant to move that burden to hardware. Maybe it's not, maybe that's just "normal science" talking.

* In practice, if I want to run an interpreter for an existing TC programming language on this architecture, it has to treat the architecture as TC (i.e. conceptually break its separation between executor and evaluator) anyway.


You make some good points. Especially that a GR function may be able to work more efficiently than a PR function; do you have an example or can you cite a reference for this?

As for optimization, I believe that it may be possible to more effectively automatically parallelize a PR function than a GR one.


Unfortunately, it's hearsay to me; it came up in conversation (about computability and complexity) with a professor, and I took his word for it at the time; I've been meaning to ask him for a reference ever since, but never got around to it.

But at least one thing I can see is that in PR, you need to fix the upper bounds of the loops (/recursion depth) ahead of time, not "on the fly". If you're doing some kind of iterative approximation, you probably don't know what those bounds "should" be, because you're going to terminate on some other condition, like, the change in the value has become negligible. Your upper bound will be a worst-case estimate -- which you have to do extra work to compute -- and I don't see how it differs much, in practice, from a timeout, which has the advantage (again from a systems perspective) of being measured in clock time rather than cycles.
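For instance (a Python sketch; the bound of 64 iterations is an illustrative worst-case guess, not a derived one):

```python
def sqrt_gr(x, eps=1e-12):
    # General-recursive style: iterate until a runtime condition holds;
    # no bound on the number of iterations is fixed ahead of time.
    g = x or 1.0
    while abs(g * g - x) > eps:
        g = (g + x / g) / 2          # Newton step
    return g

def sqrt_pr(x, iters=64):
    # Primitive-recursive style: the loop bound is fixed up front as a
    # worst-case estimate, whether or not convergence comes sooner.
    g = x or 1.0
    for _ in range(iters):
        g = (g + x / g) / 2
    return g
```

Both compute the same answer for non-negative inputs; the PR version just replaces the convergence test with a precomputed bound.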

Not sure about parallelization. PR doesn't suggest any advantages to me for that offhand, but then, I haven't thought about it.


W.r.t parallelism, I'll copy here something I wrote elsewhere in this thread:

"I don't know for sure that this is possible yet, but I believe that the processor would be able to estimate the amount of work required to evaluate an expression. Using this ability, it would be able to automatically parallelize the evaluation of an expression by splitting it up into pieces of approximately equal size and then distributing them to sub processors."


The evaluator is the interesting part, of course. How do you stop it from getting stuck evaluating functions like this?

    def f():
        f()
If you can do general recursion, you can do infinite loops. If you can't do general recursion, I'm not sure the language will be very useful.


No, you can't do general recursion on the evaluator alone. But it would be possible to run any general recursive function across both parts, for example if you wanted to run software not designed for this architecture. The problem with this is that optimizations would only happen on per expression basis, so it wouldn't make very good use of the architecture.


What benefit would you get from this restriction at the hardware level? Don't all practical reduction algorithms work just fine on untyped terms as well?


I don't know for sure that this is possible yet, but I believe that the processor would be able to estimate the amount of work required to evaluate an expression. Using this ability, it would be able to automatically parallelize the evaluation of an expression by splitting it up into pieces of approximately equal size and then distributing them to sub processors.

This, and things like this, are much more difficult for untyped terms.


It seems like you're putting a lot of stuff on silicon that better belongs in a compiler. It isn't clear to me how to make the silicon, say, check for the totality of an arbitrarily-large function in any reasonable manner; silicon prefers rather sharp bounds.

Plus anything you encode on the silicon is basically permanent. (Sure, you can tweak things with microcode, but only so much.)


> It seems like you're putting a lot of stuff on silicon that better belongs in a compiler.

That's debatable. By making the separation between hardware and software here, it allows future processors to make substantial changes to their internals.

> It isn't clear to me how to make the silicon, say, check for the totality of an arbitrarily-large function in any reasonable manner.

Actually, that part is easy. Essentially, you can define a typed functional language without recursion (or recursive definitions). The only looping that you do allow is the use of eliminators over inductive data types.
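A hypothetical flavour of the idea in Python: looping only through an eliminator (here, a fold over the naturals), so the recursion is structural and terminates on any finite input:

```python
def nat_elim(n, base, step):
    # Eliminator for natural numbers: the only looping construct allowed.
    # Iteration is structural on n, so it always terminates.
    acc = base
    for i in range(n):
        acc = step(i, acc)
    return acc

def factorial(n):
    # Defined without general recursion, purely via the eliminator.
    return nat_elim(n, 1, lambda i, acc: acc * (i + 1))

factorial(5)  # 120
```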

> Plus anything you encode on the silicon is basically permanent.

The only thing that is really permanent is the interface. Future processors can change their internals, but the old programs should always work on new processors.


"Essentially, you can define a typed functional language without recursion (or recursive definitions). The only looping that you do allow is the use of eliminators over inductive data types."

Yes, but I said I don't see how you can check totality on an arbitrarily large function efficiently. If a total function has 65,537 cases to be checked, silicon is not the place to do it. If the silicon isn't checking it, then it's the compiler doing it anyhow.

The thing is, your processor may be arranged differently, even very differently, but on the whole it can't be doing more than modern processors are, or it will be slower than modern processors, in which case, why not just continue compiling onto them anyhow? History is full of "better" processors from various places that, by the time they came out, were already a factor of 3 or 4 (or more) slower than just running on x86.


A function is never arbitrarily large; it's made of a bounded number of sub-expressions, and those sub-expressions can be checked independently.

CPUs are extraordinarily wasteful. The reason I think a processor like this might work better is because it could do a lot more work in parallel.


A CPU can do anything. Silicon is Turing complete. The question is whether it can do it more efficiently than some rather well-tuned, if suboptimally-architected, processors. It's a high bar to leap, despite substantial rhetoric to the contrary. (I believe it to be leapable, but it's not easy.)


> CPUs are extraordinarily wasteful.

I've seen this claim before, and I've even accepted it at face value before. But it seems to me that it isn't adequately supported by evidence. What evidence do we have that a substantially more efficient (therefore less wasteful) CPU architecture is possible?


cf the discussion in the same thread: https://news.ycombinator.com/item?id=5559282


How timely that the Adapteva Parallella computer board was just made available as well. Perhaps something interesting can happen at the confluence of that hardware and these compiler thoughts.


This is where it needs to happen. The problem is that co-designing a new hardware/software stack from scratch is the computer science equivalent of a moonshot.


The word "moonshot" is close to the truth, yet there are strategies to overcome the adoption threshold. That was precisely the topic of my PhD dissertation, and I concur that the Adapteva guys have chosen a sound approach.

But Epiphany is not a dataflow architecture, and suffers from execution efficiency problems in individual cores. Dataflow is really the way to go to lower energy consumption dramatically.

Funny this topic comes up. It just happens the EU has recently invested in such a project to co-design a dataflow processor and software stack. It was north of a million euros for the initial proof-of-concept research, and the results are slowly starting to trickle through.

Some references:

- http://staff.science.uva.nl/~poss/pub/poss.12.dsd.pdf - http://staff.science.uva.nl/~poss/pub/poss.13.micpro.pdf

(I am one of the authors; ping me for more information)


Dataflow is really the way to go to lower energy consumption dramatically.

Could you expand on this for a layperson? I'm terribly interested.


Minimum energy usage is very dependent on not activating more circuits than strictly required for a given computation.

However, a conventional processor pipeline will usually fetch instructions and begin processing them, only to realize later on that they were not necessary. This happens upon mispredicted branches, cache misses, exceptions, etc. These correspond to circuits that are activated and spend energy, only to throw their results away because the instruction's effects must be discarded.

In contrast, in a dataflow processor, each instruction indicates explicitly which other instruction(s) will produce its input. Or conversely, which other instruction(s) get activated as the result of one instruction completing execution. This way, instructions only enter the pipeline when their operands are ready, and speculation never occurs. So there is no more energy spent than strictly necessary to do the work (instructions).
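A minimal sketch of that firing rule (the (name, operation, producer-indices) encoding is invented for illustration; real dataflow hardware matches tagged tokens rather than running an interpreter): nothing enters the "pipeline" until all of its producers have completed, so nothing speculative is ever executed.

```python
# Each instruction: (destination name, operation, indices of producers)
program = [
    ("a", lambda: 2,          []),      # a = 2
    ("b", lambda: 3,          []),      # b = 3
    ("c", lambda x, y: x + y, [0, 1]),  # c = a + b
    ("d", lambda x: x * x,    [2]),     # d = c * c
]

def run(program):
    done = {}                            # index -> completed result
    pending = set(range(len(program)))
    order = []                           # firing order, for inspection
    while pending:
        # fire every instruction whose producers have all completed
        ready = [i for i in sorted(pending)
                 if all(p in done for p in program[i][2])]
        for i in ready:
            name, op, deps = program[i]
            done[i] = op(*(done[p] for p in deps))
            order.append(name)
        pending -= set(ready)
    return done, order

results, order = run(program)
print(results[3], order)   # 25 ['a', 'b', 'c', 'd']
```

No instruction is ever fetched "on spec" and then squashed; the energy cost tracks the useful work exactly.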

Now, the reason we use the former forms of speculation is that it is the only way to make the pipeline fast if there is no information in the instruction stream (program) about the dataflow dependencies between instructions. Because it does not know better, the scheduler has to either: 1) try all instructions in program order, starting work as early as possible, and sometimes discard work already started because an earlier instruction has resolved a branch / fault / etc., or 2) rediscover the dataflow links by analyzing the instructions as they enter the processor; but then the silicon logic implementing these tricks costs energy too.

The funny thing is, all compilers know about dataflow dependencies between instructions, but they throw the information away because the existing instruction sets cannot encode it.
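The information in question is just the def-use chains a compiler already computes. A toy sketch of recovering per-instruction producer annotations from straight-line code (the instruction format and opcode names here are invented for illustration):

```python
# Straight-line code: (destination, opcode, source operands)
code = [
    ("t0", "load",  ["x"]),
    ("t1", "load",  ["y"]),
    ("t2", "add",   ["t0", "t1"]),
    ("t3", "mul",   ["t2", "t0"]),
    ("t4", "store", ["t3"]),
]

def dataflow_edges(code):
    """For each instruction, the index of the instruction that produced
    each of its operands (None for external inputs)."""
    last_def = {}   # variable name -> index of its defining instruction
    edges = []
    for i, (dest, _op, srcs) in enumerate(code):
        edges.append([last_def.get(s) for s in srcs])
        last_def[dest] = i
    return edges

print(dataflow_edges(code))
# [[None], [None], [0, 1], [2, 0], [3]]
```

This is exactly what a conventional out-of-order core reconstructs in hardware, at run-time, for every dynamic instruction, because the ISA dropped it.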

So really the situation should be simple: make new processors that support dataflow annotations, extend the compilers to encode this information (which they already have anyways), and off we go.

However as others have highlighted making new instruction sets is like a "moonshot" because you have to involve a lot of people: compiler implementers, but also OS devs and everyone who will need to port their code to the new ISA.

Besides, dataflow processors have a gorilla in the kitchen too. In a "pure" dataflow scheduler, the instruction order is destroyed entirely and, as a result, cache locality is broken. So the flip side of the coin becomes 1) bad memory performance and 2) extra energy expenditure on the memory system to deal with cache misses.

Now there are ways to get the best of both worlds.

One is to destroy the ordering of instructions only partially, by applying dataflow scheduling only within a window (e.g. the next 20 instructions). This is more or less what modern out-of-order processors do, although they still waste energy re-discovering dataflow links at run-time.
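A toy sketch of such window-limited scheduling (simplified issue model: each "cycle" fires every ready instruction inside the window; `deps[i]` lists the earlier instructions whose results instruction i consumes):

```python
def schedule(deps, window):
    """Group instructions into issue cycles, reordering only within
    a sliding window of `window` slots."""
    n = len(deps)
    done, cycles, head = set(), [], 0
    while head < n:
        fired = [i for i in range(head, min(head + window, n))
                 if i not in done and all(d in done for d in deps[i])]
        done.update(fired)
        cycles.append(fired)
        while head < n and head in done:
            head += 1          # slide the window past completed work
    return cycles

deps = [[], [0], [], [0]]         # 1 and 3 wait on 0; 2 is independent
print(schedule(deps, window=4))   # [[0, 2], [1, 3]]  -- 2 overtakes 1
print(schedule(deps, window=1))   # [[0], [1], [2], [3]] -- pure program order
```

Shrinking the window trades instruction-level parallelism for preserved program order (and hence locality), which is exactly the dial out-of-order cores sit on.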

The other technique is where many of us are going right now: use multiple hardware threads interleaved; keep the instruction order within threads to exploit whichever locality is encoded by the programmer, and apply dataflow scheduling techniques across instructions from separate threads, i.e. exploit maximum instruction concurrency between independent threads. Sun/Oracle started it with Niagara, and now ARM is going there too. This approach really works very well in terms of operations / watt; however, it requires software to use threads in the first place, and not much software does that (yet).

Also there is still a lot of ongoing research.


Adding to that:

For many applications, pure power consumption isn't even the best metric anymore. Due to advances in on-chip power and clock distribution, the energy × delay product and overall silicon efficiency have gained more importance in recent years.

Obviously, dataflow processors excel in these metrics. And VLIW processors fall behind, which IMHO is the primary reason for their demise.

I agree with you that a practical dataflow architecture needs to be hierarchical. Not just for cache locality, but to reduce wiring overhead and debugging complexity, too.


If compilers already have the dependency information and could provide it in instruction annotations, then why hasn't Intel done anything with this? Intel has its own C/C++ compiler, so it could extend the x86 instruction set with new instructions that contain the necessary annotations, and add support for these annotations in its compiler.


Except that would not show lower energy usage. To preserve backward compatibility, the cores would need to keep the logic that analyses data dependencies to continue delivering good performance to legacy code. To make any difference they would need to both do what you say, and also define some protocol that instructs the processor to disable the dataflow analysis unit entirely (to save energy). But that protocol would be invasive, because you need to re-activate the unit at the first instruction that is not annotated, and upon faults, branches, etc. The logic to coordinate this protocol becomes a new energy expenditure on its own!

Really the way forward would be to extend x86 completely, with a "mode" where all instructions are annotated and go through a different pipeline front-end than legacy x86 code. But Intel already tried that with IA64, and it burned them very hard. I am not sure they are willing to do it again.


IA64 was an entirely new instruction set on an entirely new architecture, having nothing whatever to do with x86. Compatibility for x86 was added later to try to improve sales.

AMD64 / x64 pretty much hops into a different mode and goes on executing from there. Given how many modes and instructions these chips support, I don't see why adding another would easily upset people.


Yes, you are right: if it were marketed like the introduction of x64, it might work.

However, there is a big difference: the move to 64-bit words was something that was in demand when it was introduced. There was a market for it, with a very clear value proposition.

In contrast, a new "mode" with the same computational power plus dataflow annotations would be a tough sell: larger code size, and better performance / watt only for some applications.

(Also, as far as I know AMD64 / x64 on Intel cores uses the same decode and issue unit, just with different microcode. Circuit-wise there is a lot in common with x86.

Here we would be talking about a new mode and also a new instruction+data path in the processor. The core would become larger and more expensive. Not sure how that plays.)


Is there something about dataflow annotated instructions that would require lots of internal changes? I mean past the decoding step. Because Intel and AMD add new instructions and modes all the time, frequently stuff that is basically totally orthogonal to whatever's gone before.


In your research, are you using a new programming language to take advantage of dataflow scheduling techniques, or are you working with one or more existing languages? If the latter, do you have any data or opinions on which languages or language features are most amenable to an effective dataflow-based architecture?


We use just C extensions for now, very close to what Cilk does.

1) What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler.

In practice, things like Occam, SISAL and most pure functional programming languages are OK-ish.
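For the unfamiliar, an I-structure cell is essentially a write-once synchronizing variable. A software sketch of the semantics only, in plain Python threads (the point above is precisely that real I-structures live in the scheduler's synchronization logic, not behind general memory indirection):

```python
import threading

class IVar:
    """Write-once cell: readers block until the single write arrives."""
    def __init__(self):
        self._written = threading.Event()
        self._value = None

    def write(self, value):
        if self._written.is_set():
            raise RuntimeError("I-structure written twice")
        self._value = value
        self._written.set()

    def read(self):
        self._written.wait()      # consumer blocks on the dataflow edge
        return self._value

iv = IVar()
threading.Thread(target=lambda: iv.write(42)).start()
print(iv.read())   # 42
```

If the language let a pointer alias such a cell through ordinary memory, the hardware scheduler could no longer tell which instruction the write wakes up, which is why the indirection must be forbidden.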

2) Any language should allow an (advanced) programmer (or compiler) to annotate the instructions to also suggest some ordering not related to data dependencies. As I explained before, dataflow scheduling tends to destroy order and break locality, and for some applications this is very bad. Unfortunately, all existing dataflow-ish languages (including most functional languages) were designed with the outdated assumption that all memory accesses have the same cost. We now know this is no longer true.

Other than using threads (as I explained before) a well-known theoretical way forward is to introduce optional control flow edges between instructions using "ghost" data dependencies, which impact scheduling but do not allocate registers/I-vars. However I am not aware of languages where this is possible.
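A sketch of what such ghost edges would do to a scheduler: they gate issue exactly like data edges, but carry no value and allocate no register/I-var. Here instruction 3 is forced to follow 2 (say, because they touch the same cache line); the dependency lists are invented for illustration.

```python
def issue_cycles(data_deps, ghost_deps):
    """Each cycle fires every instruction whose data AND ghost
    predecessors have completed."""
    n = len(data_deps)
    done, cycles = set(), []
    while len(done) < n:
        fired = [i for i in range(n) if i not in done
                 and all(d in done for d in data_deps[i] + ghost_deps[i])]
        done.update(fired)
        cycles.append(fired)
    return cycles

data = [[], [], [0], []]                      # only 2 consumes a value (from 0)
print(issue_cycles(data, [[], [], [], []]))   # [[0, 1, 3], [2]]
print(issue_cycles(data, [[], [], [], [2]]))  # [[0, 1], [2], [3]] -- 3 now follows 2
```

The ghost edge costs one annotation, not a register, and buys back whatever locality the programmer or compiler cares to encode.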


Could you expand on the following?

What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler

I don't know what you mean by I-structures, or dataflow variables, or why they are not in memory, or why allowing programmers to get at them would mess with a hardware dataflow schedule. I can guess at what you mean by all of these, but a more detailed explanation would still be very interesting.


See: [5] Arvind, R. S. Nikhil, and K. K. Pingali, “I-structures: data structures for parallel computing,” Transactions on Programming Languages and Systems (TOPLAS), vol. 11, no. 4, Oct. 1989.


Thank you.


Thank you. Great answer, great thread.


ping

I am very interested in this! Drop me a line at my nick @ vub.ac.be


Moonshots get made from time to time, however:

* Intel's iAPX "mainframe in a chip" project [1] was aimed at directly supporting HLL features,

* The Rekursiv chip project [2] was intended for an object-oriented language,

* The Transputer was meant to support Occam [3],

and there have probably been others.

[1] http://en.wikipedia.org/wiki/Intel_iAPX_432

[2] http://en.wikipedia.org/wiki/Rekursiv

[3] http://en.wikipedia.org/wiki/Transputer


In the case of Adapteva, I think it's helped along because their starting point is fairly standard: ARM, ANSI C, Ubuntu, etc. They have an FPGA on their board, and then their specialized parallel processing chip. They support OpenCL. I'm not a compiler guy, but it seems to me that various concepts can be injected at different levels of this hardware stack, slowly but surely, without having to create the entire thing from scratch.


> Ok, so I have plenty of ideas for all levels of the stack, but not enough time to pursue them with sufficient rigor.

Did he publish those ideas somewhere?



