Solving the Multicore Dilemma (flowlang.net)
102 points by tel 957 days ago | 49 comments



For massively parallel systems, functional language compilers do not address the primary implementation challenges around software design. In fact, the C++ compilers for a number of supercomputers have functional language extensions but I have almost never seen those extensions used because they do not solve any of the problems parallel systems designers have.

To dispel one misconception, the reason it is simple to write massively parallel code in C++ is that the basic model is a single process locked to a single core with no context switching at all. It is all non-blocking, coroutine-esque, cooperative scheduling. Writing this type of code is exactly like old school single-process, interrupt-driven UNIX code from back in the day when no one did threading. Simple to write, simple to analyze. Analogues of this work well even on fine-grained systems.
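
Roughly, the shape of that model, purely as an illustrative sketch (the names are invented):

    #include <cstdio>
    #include <deque>
    #include <functional>

    // Illustrative single-process, single-core task loop: no threads, no
    // locks, no preemption. Each task runs to completion and, instead of
    // blocking, enqueues the next piece of work (cooperative scheduling).
    int main() {
        std::deque<std::function<void()>> ready;

        for (int i = 0; i < 3; ++i) {
            ready.push_back([&ready, i] {
                std::printf("stage 1 of request %d\n", i);
                ready.push_back([i] { std::printf("stage 2 of request %d\n", i); });
            });
        }

        while (!ready.empty()) {
            auto task = std::move(ready.front());
            ready.pop_front();
            task();  // control returns here only when the task finishes
        }
    }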

The real parallelization challenge is often completely overlooked by people offering nominal solutions. Your application may be composed of only two algorithms, and each of those algorithms may be independently massively parallel, but that does not imply the algorithms operate on identical representations of the underlying data model, which is what is usually tacitly assumed. Representation transforms between algorithm steps inherently parallelize poorly.

A canonical example of this is the hash join algorithm. It is a two-stage algorithm, and both stages are obviously parallelizable, yet ad hoc hash joins parallelize very poorly in practice because they require a representation transform between the stages.
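
To make that concrete, here is a minimal single-node sketch (illustrative only, not anyone's production code): the probe phase is trivially parallel over its input, but the build phase is exactly the representation transform described above, and parallelizing it means either locking a shared hash table or repartitioning both inputs.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Row { std::int64_t key; std::string payload; };

    // Classic two-phase hash join. Phase 1 (build) converts one input into a
    // different representation (a hash table); phase 2 (probe) streams the
    // other input through it. Each phase parallelizes in isolation, but the
    // build phase is where parallel workers contend for the shared structure.
    std::vector<std::pair<Row, Row>> hash_join(const std::vector<Row>& build_side,
                                               const std::vector<Row>& probe_side) {
        // Phase 1: build (the representation transform).
        std::unordered_multimap<std::int64_t, const Row*> table;
        table.reserve(build_side.size());
        for (const Row& r : build_side) table.emplace(r.key, &r);

        // Phase 2: probe. Trivially parallel over probe_side once the table exists.
        std::vector<std::pair<Row, Row>> out;
        for (const Row& r : probe_side) {
            auto [lo, hi] = table.equal_range(r.key);
            for (auto it = lo; it != hi; ++it) out.emplace_back(*it->second, r);
        }
        return out;
    }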

In a sense, the real parallelism problem is a data structure problem. How do you design a single data structure such that all operators required for your algorithms and data models are both efficient and massively parallel? Using functional programming languages does not speak to that and it is the fundamental challenge regardless of the hardware.

-----


I think C++ is the right tool for some problems (single-threaded algorithms locked to a core with no context switching sound like a good fit for C++).

I think many problems are not well suited to C++. Google prefers Python for the first attempt at a problem, allegedly for ease of prototyping.

I think Chuck Moore (of Forth fame) frequently propounds an important idea when it comes to improving HPC performance. Of course, Chuck Moore doesn't do HPC optimization that I know of. But he does talk a lot about thinking about the whole problem and avoiding premature optimization. As such, it sounds like the hash join algorithm is not well suited to some parallel problems - so what? Pick the right tool for the job. Picking a hash table could be premature optimization if the problem demands massive parallel scalability.

It seems flowlang.net is right to say that massive parallel scalability will rapidly become a must-have at most companies.

-----


> To dispel one misconception, the reason it is simple to write massively parallel code in C++ [...]

Are you joking?

-----


Not at all. I've written massively parallel codes for a number of computing platforms, both conventional and exotic. The vast majority are primarily written in systems-style C++ (basically, C++ minus several popular features).

Most of the knocks against C++ concern micro-optimizations that are not material to the actual parallelization of the code. Parallelization is done by carefully selecting and designing the distributed data structures, which is simple to do in C++. For parallel systems these are required (theoretically) to be space decomposition structures, which happen to be very elegant to implement in C. Communication between cells in the space decomposition is conventional message passing.
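
As a rough sketch of that pattern (the cell layout, sizes, and tags here are invented for illustration; real codes use 2-D/3-D decompositions and non-blocking sends, but the structure is the same):

    #include <mpi.h>
    #include <vector>

    // 1-D spatial decomposition: each rank owns one cell of the domain plus a
    // ghost value at either end, and neighbouring cells exchange boundary
    // data by plain message passing.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nranks = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int n_local = 1024;                 // interior points owned by this cell
        std::vector<double> u(n_local + 2, 0.0);  // one ghost value at each end

        const int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        const int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

        // Halo exchange: send my boundary values, receive neighbours' into the ghosts.
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[n_local + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n_local], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // ... local update over u[1..n_local] goes here ...

        MPI_Finalize();
    }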

The usual compiler-based tricks such as loop parallelization are not all that helpful even though supercomputing compilers support them. By the time you get to that level of granularity, it is a single process locked to a single core, so there is not much hardware parallelism to exploit. And even then, you have to manually tune the behavior with pragmas if you want optimal performance.
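
For concreteness, this is the kind of pragma-level tuning meant (a generic example, not taken from any particular code; the schedule clause is exactly the knob that ends up hand-tuned per machine):

    #include <cstddef>
    #include <vector>

    // Generic compiler/pragma-level loop parallelism: correct, but only useful
    // once the loop body is coarse enough, and the scheduling usually has to
    // be tuned by hand for the target hardware.
    void scale(std::vector<double>& x, double a) {
        const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
        #pragma omp parallel for schedule(static)
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            x[i] *= a;
        }
    }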

Most compiler-based approaches to parallelism attack the problem at a level of granularity that is not that useful for non-trivial parallelism. The lack of multithreading in modern low-latency, high-performance software architectures makes the many deficiencies of C/C++ for heavily threaded environments largely irrelevant.

-----


For a while, I've been pondering how to solve the same problem. I also reached similar conclusions, that parallelism needs to be implicit at the language level. I've even been working on an implementation that will run on the JVM (I've only been writing code for it for a week, but I've been documenting how it works for months in my notebooks). The language is based around the idea of stream processing. By using type signatures, function calls can be parallelized based on how they are called.

As an example, given two functions, A(Item) and B(List[Item]), if you call them like:

    B << A << List[Item](x,y,z)
then A should be called once for each item in the list of x,y,z, the results of which are collected into a list and passed to B. If you add in a function C(Item) and call it like:

    B << C << A << List[Item](x,y,z)
then at any given point, A can be processing a list item, C can be processing a list item, and the total results of C are being collected. Which leads into: what if it is order dependent?

    B << serial { A << List[Item](x,y,z) }
will guarantee that A is called serially and the order of the List (and results) is preserved.
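
If it helps to see it outside the new syntax, one reading of the unordered pipeline above is roughly a parallel map followed by a collect; a C++ sketch of that interpretation (names invented):

    #include <algorithm>
    #include <execution>
    #include <type_traits>
    #include <vector>

    // One reading of "B << A << List[Item](x, y, z)": apply A to each item in
    // parallel (no ordering guarantees while in flight), collect the results,
    // then hand the whole list to B. The serial { ... } form corresponds to
    // dropping the parallel execution policy.
    template <typename Item, typename A, typename B>
    auto pipeline(const std::vector<Item>& items, A a, B b) {
        using Out = std::invoke_result_t<A, const Item&>;
        std::vector<Out> mapped(items.size());
        std::transform(std::execution::par, items.begin(), items.end(),
                       mapped.begin(), a);   // A runs once per item
        return b(mapped);                    // B sees the collected list
    }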

I really shouldn't advertise the link to the language at all here, considering I have zero documentation and there are almost no comments in the current code, but I'm going to anyway: https://bitbucket.org/ismarc/muster . It currently will parse a single file, generate the AST and generate the symbol table. I'm working on a non-parallel interpreter right now, with the intention of using the interpreter to write the compiler so the language is fully exercised to work out any potential issues with it. And on a final note, I'm targeting the JVM mainly for existing libraries, but I'm open to discussions about it.

-----


I've been thinking about a similar thing, attaching Storm-like parallel semantics to Pipes (probably Gabriel Gonzalez's version) in the Par monad.

-----


Some good ideas -- have you looked at Dryad?

-----


Maybe I'm wrong, but it seems to me that most of this guy's ideas are either a bit half-baked or simply flat out wrong.

Consider the argument for his proposed language being Turing complete on the flow manifesto page. He claims that this follows from the Böhm-Jacopini theorem. However, the Böhm-Jacopini theorem requires that variable assignment be allowed in the subprograms (which does not appear to be the case in flow), and without this the theorem does not hold, as this paper by Kozen demonstrates: www.cs.cornell.edu/~kozen/papers/bohmjacopini.pdf. Then he goes on to argue that order duality implies the existence of a selection operator in flow, which doesn't seem to make any sense.

I think it's telling that he has been working on this since 2010, yet has no decent proofs of his claims, nor any working implementation.

-----


Variable assignment is not strictly necessary; what is necessary is that one loop iteration can modify the initial state of variables in the next loop iteration. This is the second alternative to setting the current value of a variable presented in Flow (specifically, recurrence relations, e.g. x' = x + 1, which are syntactic sugar for a restricted form of "variable assignment" that can only target the values visible in the next loop iteration). So the Flow Manifesto doesn't cover the hole in the Böhm-Jacopini theorem specifically, but that doesn't mean this computational model is not Turing complete.
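
Purely as an illustration of the notation, a loop written with recurrences desugars to something like the following: the body computes "next" values from current ones, and only at the iteration boundary do they become current.

    // Illustrative desugaring of recurrences such as x' = x + 1 and
    // acc' = acc + x. Precondition: n >= 0. Returns 0 + 1 + ... + (n - 1).
    int sum_below(int n) {
        int x = 0, acc = 0;
        while (x < n) {
            int x_next = x + 1;      // x' = x + 1
            int acc_next = acc + x;  // acc' = acc + x
            x = x_next;              // become the values seen by the next iteration
            acc = acc_next;
        }
        return acc;
    }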

As far as proofs or working implementation, sometimes life gets in the way of progress :-)

-----


Since this thread is young and full of expertise, I'm going to ask a question about a nuance of Moore's Law that's been bugging me for a while. (I study CS and currently have very little background in physics and electrical engineering.)

Over the past few years I've read engineers talk about the manufacture and design of CPUs in terms of an XX nanometer process, where XX is some ever smaller integer - presumably representing the size of the smallest features, and hence inversely related to the density of transistors in the CPU.

But for many decades, we've known that there are fundamental physical limits to computation as we know it, with Moore's law predicting a smooth exponential curve (doubling every 18 months) until that point.

My question is why does it have to be a smooth increase? Don't we have enough computation and understanding now to leapfrog straight to the physical limits using standard run-of-the-mill CPU architecture? After all, we can manipulate individual atoms, and I think we have transistors made of just a few atoms.

Why do we have to go through the 20 nanometer processes, then presumably through the picometers, until finally stopping at some point maybe in the femtometers (where the lengths of many atomic nuclei can be expressed as single-digit integers)? That is development Moore's Law presumably predicts will take multiple decades to reach.

Why can't we build processors using modern CPU paradigms at the atomic scale earlier? Is it mass-production issues? Or some physical bottlenecks? Do we need to keep slowly developing ever-so-slightly denser processors at great expense, or can we skip the interval and develop straight at the atomic scale?

I just feel like at some point there should be a step function straight to the physical limits of processors as we know them. Why hasn't this happened?

-----


First, you should know that computing is done with electrons, not nucleons. That means the fundamental limit is at a length scale of the size of atoms (i.e. electron orbits) and not nuclei. Atoms are about 0.1 nm in size, whereas nuclei are about 0.00001 nm = 10 fm across. That's a difference of four orders of magnitude, so it's pretty important.

Second, your intuition is off. The jump from "being able to manipulate X in a laboratory" to "being able to build useful devices with X on a large scale" is large. We've been able to handle individual protons and neutrons with the right equipment too, but that doesn't mean we're close to building computers out of them.

-----


Much of this depends on your time scale, of course. If you stretch out the time granularity just a bit, we really did make a pretty rapid jump akin to a step function.

Indeed, I wonder if historians will look back and drop all references to Moore's law, and just recognize that from 1950-2050 we developed the whatever-the-end-game-is transistor, and pretty much ignore all the interesting (to us) developments that got us there.

-----


After reviewing http://en.wikipedia.org/wiki/MOSFET#Difficulties_arising_due... it makes sense to just point you there for a starting point.

tl;dr Current Intel processors are already at the physics-imposed limits for shrinking the size. Every process shrink Intel achieves represents a fundamental change in the way semiconductors are produced.

-----


Fortress, a programming language funded by DARPA and developed by Sun, aimed at solving this very issue by parallelizing core operations. For example, 'for' loops in the language are parallel operations that do not guarantee linear iteration. Unfortunately, development wound down this summer due to incompatibility with VM type systems.

I'm surprised that the author did not mention join lists in his discussion of parallelizable functional program structure. A join list is either empty, a single element, or contains two join lists. In a normal list, you recur on a list that is only one smaller. In a join list, each recursion passes half of the list to a new processor.
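
A rough C++ sketch of the idea (illustrative only; a real implementation would stop spawning below some grain size):

    #include <future>
    #include <memory>

    // Illustrative join list: empty (null), a leaf holding one element, or the
    // join of two sub-lists. A fold recurses on both halves, and one half can
    // be handed to another core; a cons list, by contrast, only ever peels off
    // one element per step, so the recursion is inherently sequential.
    struct JoinList {
        int value = 0;                          // used when leaf == true
        bool leaf = false;
        std::shared_ptr<JoinList> left, right;  // used when leaf == false
    };

    long long sum(const std::shared_ptr<JoinList>& l) {
        if (!l) return 0;                       // empty
        if (l->leaf) return l->value;           // single element
        auto rhs = std::async(std::launch::async, sum, l->right);  // other half elsewhere
        return sum(l->left) + rhs.get();
    }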

The switch to easy parallelization will require data structures that do not guarantee linear iteration but gain scalability "for free." In my experience, this is easy to add to lisp dialects, but difficult to implement and use in imperative languages.

-----


Factual correction: Fortress lost the DARPA competition to IBM, I think. So after the competition Sun funded it internally...maybe because it was GLS's project.

-----


It also lost to Chapel from Cray.

-----


May I make a small, unrelated suggestion if the OP is the blog owner? Smaller logo. Took me a full half-minute to figure out that I should scroll down. Took up my entire screen, and thought I was supposed to click something special to see the content.

/rant

-----


I'm not the OP but I'm the blog owner. Whoever posted the link posted the mobile link, and blogger.com has re-formatted the image to maximum width... not sure how to fix it, but I'll take a look :-)

-----


Fair enough. I didn't mean to be a sour puss--I was just surprised by the situation.

-----


A colleague of mine remarked that one of the things that bothered him about Go (and uC++, and to some extent, VHDL for that matter) was how much the languages resembled traditional imperative languages (like C). In particular, it was the cases where these languages did something that was different from C (i.e., "different from what the user expected") that caused him the most pain.

If the Flow language can figure out how to deal with this expectation, I think there might be something here.

-----


While I agree with him in theory, I doubt this will work in practice.

I very much doubt that a compiler will be able to choose appropriate parallelization schemes at runtime without hints from the user; Big-O approximation will probably be insufficient in an annoying number of cases and very hard for the compiler to implement. IO will require monads, or that DAG turns into a linked list. I also find it very telling that he doesn't show what that DAG looks like for loops or recursive functions.

-----


The flowlang mailing list has discussed these points in some detail; he definitely has loops.

https://groups.google.com/forum/?fromgroups#!forum/flowlang

-----


Right, a loop is a morphism (or sub-lattice) from the state of loop variables at the beginning of a loop iteration to the state of loop variables in the next iteration. Iteration represents unrolling copies of the loop body sub-lattice in a linear string until termination. Recursion represents expanding sub-lattices inside sub-lattices until termination.

-----


I did not understand everything in the article, but if Haskell has solved all these problems already, why isn't everyone jumping ship en masse to functional programming? Surely there must be a better reason than "It's too hard for programmers"?

-----


We don't have that many cores yet. The biggest servers have 64, some phones have 4. That's all within a range where we can parallelize by running different processes on different cores without changing our code.

I guess we will see more urgency in a couple of years, when hundreds of cores are going to be the norm. At that point software that makes use of those cores will pull away from software that doesn't.

I'm not saying functional programming will be the answer but it could be. I don't think it's that difficult to grasp.

-----


He overlooks a lot of related work on data flow languages. Also, 1) and 3) remind me of I-structures (http://dl.acm.org/citation.cfm?id=69562).

Otherwise he seems to have described asynchronous message-passing style, except with convoluted programming language concepts. But Microsoft already has a compiler that statically analyzes message passing in large programs; look up the Singularity operating system. [TL;DR: Message-passing microkernel; everything runs in supervisor mode; safety is checked statically at compile-time.]

Unordered collections are just too lax to be useful: for example, multiplication of 100 matrices can be nicely parallelized because multiplication is associative, but not commutative. So you'd probably end up defining write-once locations ("boxes") that contain (ordered) vectors of elements and using those for communication. So, back to I-structures.
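
A sketch of what I mean by order-sensitive parallelism (not any particular API): an associative reduction may regroup adjacent pairs across cores, but it must never swap operands, so an unordered bag of partial results isn't enough.

    #include <cstddef>
    #include <vector>

    // Order-preserving tree reduction: associativity lets us regroup
    // ((a*b)*(c*d)) and compute the pairs on different cores, but because the
    // operation is not commutative (e.g. matrix multiplication) operands are
    // only ever combined with their immediate neighbours, in order.
    // Precondition: xs is non-empty.
    template <typename T, typename Op>
    T reduce_ordered(std::vector<T> xs, Op op) {
        while (xs.size() > 1) {
            std::vector<T> next;
            next.reserve((xs.size() + 1) / 2);
            // Each pair below is independent and could be handed to its own core.
            for (std::size_t i = 0; i + 1 < xs.size(); i += 2)
                next.push_back(op(xs[i], xs[i + 1]));
            if (xs.size() % 2 == 1) next.push_back(xs.back());
            xs = std::move(next);
        }
        return xs.front();
    }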

So, basically, just another mash-up of old ideas.

-----


I'll never forget this exchange I had with a CS professor:

Me: ... but, in the future, everyone will learn how to program.

Neal: No, in the future, no one will need to know how to program.

I suspect Neal was right, though I dislike the idea strongly.

Anyhow, machines will probably prove to be smarter with such details. So accountants can likely just shift the cost to another column. (Sorry to anyone who can conceivably be replaced by a robot.)

Not to sound like such a douche: parallelism, concurrency, distributed process; can raw human "wet" computing ever pretend to compete with clean binary process in such gambles?

-----


> Neal: no, in the future, no one will need to know how to program

Already true if you define 'programming' as 'wiring up plugboards' or 'punching cards to feed into a mainframe'.

If you define 'programming' as 'telling non-sentient machines what to do in a way that they will actually execute your intentions' then not even developing human-level AI will kill programming. After all, why would a human-level AI be chained to the task of playing DOOM with me just because I decide I want to play DOOM?

-----


To claim this is a 'dilemma' is to ignore the increasingly common types of workloads that companies need to run, and the amazing technology stacks that have been created to run them: namely, horizontally scalable systems on commodity hardware whose performance scales linearly with node count.

Software is eating the world, and as the speed and availability of the network continues to improve dramatically, computing workloads will continue to centralize, and companies will serve their massive user bases with wimpy cores at a massive scale.

Amdahl's Law says that "the speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program."

So while you have a hard limit when trying to improve the performance of a single function through parallelization, there is no upper limit on running multiple instances of the same function to respond to multiple concurrent and independent requests.
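
To put rough, illustrative numbers on that distinction:

    #include <cstdio>

    // Amdahl's law: if a fraction p of one request's work parallelizes and the
    // rest is sequential, N cores speed that single request up by at most
    // 1 / ((1 - p) + p / N). Throughput across independent requests, by
    // contrast, scales roughly with N (ignoring shared resources).
    double amdahl_speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main() {
        std::printf("p=0.95, N=64   -> latency speedup %.1fx\n", amdahl_speedup(0.95, 64));
        std::printf("p=0.95, N=1024 -> latency speedup %.1fx\n", amdahl_speedup(0.95, 1024));
        // 64 or 1024 independent requests, however, still get ~64x / ~1024x throughput.
    }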

Individual cores need only be fast enough that the latency of a single request/response is acceptable. After that, any increase in single-core performance requires a commensurate increase in the complexity of the scheduler, leading to rapidly diminishing returns, not to mention significantly lowering 'responses per kWh'.

The fundamental shift toward multi-core processors may have been necessary due to physics, but not coincidentally (I think) it's also exactly the architecture you want if you are trying to serve large numbers of requests over a network.

-----


The difficulty I have with his manifesto is that he proposes many ideas that sound quite similar to existing ideas (e.g. graph reduction model of computing, dataflow execution models, dataflow languages), but there isn't a direct comparison: I can't tell if this is a remix of older ideas (nothing wrong with that necessarily), or if there is some additional idea he has that mean the language would be a major advance over past work. It's just difficult to know what to make of it.

-----


He has additional ideas - the problem is they're mostly nonsense. For example, if you look at the page where he tries to explain what he means by program lattices and lattice based computing, he gets the definition of a lattice wrong! (despite actually linking to a correct definition!)

-----


What specifically is wrong with the definition of a lattice?

-----


You state that:

A lattice is a partial ordering with a single unique "top element" (least upper bound) and a single unique "bottom element" (greatest lower bound). Every node in the lattice is reachable by following some directed path from the single least upper bound, and there is a directed path from every node in the lattice to the single least lower bound.

A lattice is a poset in which binary joins and meets exist. What you define is a bounded poset, not a lattice. In general a lattice need not have either a top element or a bottom element, although all finite lattices are complete and thus do have top and bottom elements. The second sentence is also redundant, because for any element x of a bounded lattice we have x ≤ ⊤, so why talk about paths?
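
For reference, the standard definitions spelled out:

    \text{Lattice: a poset } (L, \le) \text{ in which every pair } x, y \in L
    \text{ has a join } x \vee y = \sup\{x, y\} \text{ and a meet } x \wedge y = \inf\{x, y\}.

    \text{Bounded poset: a poset with elements } \top, \bot \in L \text{ such that }
    \bot \le x \le \top \text{ for all } x \in L.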

-----


Cyaegha: Since we're talking about graphs generated from source code, they will always be finite. These are not abstract infinite mathematical objects. All finite lattices are posets, but not all finite posets are lattices, so I stand by the usage. And "Program lattice" sounds a heck of a lot better than "program bounded poset" :-)

The second sentence is intended to be redundant, it gives further explanation as to the implications of the first sentence. (This doc was never intended to be a tight mathematical definition, only to give a quick introduction to the concepts, with links to the strict mathematical definitions.)

The reality is that the lattice or bounded poset formulation is only needed to understand that the specific language constraints described ensure that the dependency graph of any valid program can be statically determined to be a partial ordering. The compiler doesn't even need to build a DAG internally, it just needs to ensure the specific language constraints described are in place, and as a result, the language will be able to be parallelized in any manner consistent with the partial ordering of data dependencies.

-----


> All finite lattices are posets, but not all finite posets are lattices, so I stand by the usage.

Yes obviously, but nowhere is it clear that you actually have a lattice of any kind. Specifically, I see no reason why pruning DAGs of unused computations would yield a lattice, and this is a result you seem to rely on. Program lattice does indeed sound nicer than program bounded poset, but I would be more wary about misusing terminology than having a nice name! :)

-----


The least upper bound can be thought of as a virtual node consisting of the set of static constants (and the set of all inputs) fed into the rest of the lattice. The greatest lower bound is a virtual node consisting of the set of final output values that are "used" (written to an output file or database). If an intermediate value is computed but is not referenced anywhere, the data dependency DAG has a local lower bound node that is not connected to the greatest lower bound since the value is not written anywhere. If these unused nodes are pruned away, the result is a lattice.

-----


For that construction to work, these 'virtual' nodes need to exist in the poset. Your poset is just the vertex set of your DAG ordered by reachability, so they don't exist. I think your intuition is good, but the mathematics just doesn't add up. Perhaps you should look into things like trace monoids, pomsets and petri nets?

Please understand that I'm not trying to be mean here. You mention you've been applying for grants to work on this, but if you don't explain the mathematics adequately or demonstrate how it differs from existing formalisms then I imagine people are just going to reject you instantly. I would suggest seriously trying to formalise some of this stuff in a proof assistant like Coq. That'll quickly tell you if your theory works or not.

-----


Identifying parallelism is important, and the Flow project sounds cool. However it's by no means the only thing we have to do to solve the multicore conundrum. It turns out that there's a lot more to worry about in a cluster or a GPU than just the number and nature of their cores.

-----


This is only intended to be an optimal solution to the problem of building highly parallel data manipulation pipelines.

-----


Just a problem statement and initial idea which isn't that new, but could be interesting if he actually implements it.

There have been many failed systems that hinged on "sufficiently smart compilers". Some have been thread-parallel machines (like the Tera MTA, shipped 1998) and some have been VLIW/MIMD, like Multiflow (shipped 1987). And Itanium.

-----


Nicely written article - clarified some points for me.

Isn't the wait for the unordered collection to be "finalized" a huge bottleneck, though? I can see this working nicely for low-level parallelisation but not so well for more traditional producer/consumer threads or I/O operations.

-----


http://people.cs.umass.edu/~emery/pubs/flux-usenix06.pdf

-----


The Flux language proposed in the paper makes it easier to _manually_ decompose a large program into smaller "concurrent servers."

Flow (and note that this is not the only new language being called Flow) seems to go for automatic parallelization; if successful, that's a fundamentally different approach from Flux.

-----


Yes, I have come across no fewer than five programming languages called Flow since posting the Flow Manifesto. It already has a new name, but the new name and info about the new directions of the project haven't been posted publicly yet.

-----


From the science perspective: after LAPACK and friends are rewritten in functional languages, maybe some people will be willing to switch from FORTRAN... but I doubt it.

-----


I think the main reason why scientists haven't moved away from FORTRAN is cultural.

I remember rewriting some unreadable F77 code with copy-pasted functions from Numerical Recipes into C linking against GSL for my advisor in grad school, and making it much, much faster.

Fast linear algebra has been available in partly-functional programming languages to scientists for at least two decades (think MatLab and Mathematica), but once you learn to use one tool and it works well enough, you'll keep using it for ages.

Furthermore, there is no great advantage in parallelizing linear algebra when most HPC workflows I've seen in physics are intrinsically parallel, in that you can run 512 versions of whatever your problem is and get good statistics out of it. Your mileage may vary, though.

-----


> Furthermore there is no great advantage in parallelizing linear algebra

With all due respect, that is a ludicrous statement to make! Most nontrivial parallel scientific computing applications use parallel linear algebra of some sort.

> the main reason why scientists haven't moved away from FORTRAN is cultural

You can make that argument for Fortran vs C, not Fortran vs Matlab. Fortran is blazingly fast, and you don't have to pay an arm and a leg for a seat like with Matlab.

-----


> I think the main reason why scientists haven't moved away from FORTRAN is cultural.

What should they move to? Fortran 90 has vector and matrix notation built-in, so it's just like typed, faster Matlab. More safety, more speed, what's not to like?

Move to C? You lose the vector notation, and gain all the pointer mess, so not very lucrative unless you were a C-programmer to begin with.

Move to Python? Sure, the language is nice, but it's not faster than Matlab, so why bother? Unless you really appreciate the extra niceness. (I do.)

Move to C++? Again, no built-in vector notation, so not very lucrative. Sure, you can use libraries that overload operators to get the vector notation, but the learning curve can be steep, and after all the extra learning you just have what you had with Fortran 90 and Matlab to begin with. Of course, if you already were a competent C++ programmer, then you may like to do everything with C++. But scientists may not have that much extra time to train themselves to be even half-fluent in C++.

Move to some other language? Which? Java? Scala? Rust? Haskell? Lisp?

-----


There will always be a monkey banging the drum for fantastical soft solutions, and then there are rocket scientists who will deliver the hard solution.

Would that you could see over the horizon.

-----



