

Solving the Multicore Dilemma - tel
http://www.flowlang.net/p/solving-multicore-dilemma.html?m=1

======
jandrewrogers
For massively parallel systems, functional language compilers do not address
the primary implementation challenges around software design. In fact, the C++
compilers for a number of supercomputers have functional language extensions
but I have almost never seen those extensions used because they do not solve
any of the problems parallel systems designers have.

To dispel one misconception, the reason it is simple to write massively
parallel code in C++ is that the basic model is a single process locked to a
single core with no context switching at all. It is all non-blocking,
coroutine-esque, cooperative scheduling. Writing this type of code is exactly
like old school single-process, interrupt-driven UNIX code from back in the
day when no one did threading. Simple to write, simple to analyze. Analogues
of this work well even on fine-grained systems.

The real parallelization challenge is often completely overlooked by people
offering nominal solutions. While your application may only be composed of two
algorithms _and_ those algorithms are independently massively parallel, that
does not imply that the algorithms are operating on identical representations
of the underlying data model, something that is tacitly assumed most times.
Representation transforms between algorithm steps inherently parallelize
poorly.

A canonical example of this is the hash join algorithm. It is a two-stage
algorithm, and both stages are obviously parallelizable, yet ad hoc hash joins
parallelize very poorly in practice because they require a representation
transform between the two stages.
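
To make the two stages concrete, here is a minimal, single-threaded Haskell
sketch (the names buildPhase and probePhase are just illustrative, not taken
from any particular system). The build phase is the representation transform:
every probe-side worker needs the whole hash table, or a consistent partition
of it, and constructing and redistributing that structure is the part that
resists parallelization.

    import qualified Data.Map.Strict as M

    -- Build phase: turn one relation into a hash table keyed on the join
    -- attribute. This is the representation transform between the stages.
    buildPhase :: [(Int, a)] -> M.Map Int [a]
    buildPhase r = M.fromListWith (++) [ (k, [v]) | (k, v) <- r ]

    -- Probe phase: each probe-side tuple is looked up independently, so this
    -- stage is embarrassingly parallel once the table exists.
    probePhase :: M.Map Int [a] -> [(Int, b)] -> [(a, b)]
    probePhase ht s = [ (a, b) | (k, b) <- s, a <- M.findWithDefault [] k ht ]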

In a sense, the real parallelism problem is a data structure problem. How do
you design a single data structure such that all operators required for your
algorithms and data models are both efficient and massively parallel? Using
functional programming languages does not speak to that, and that is the
fundamental challenge regardless of the hardware.

~~~
zurn
> To dispel one misconception, the reason it is simple to write massively
> parallel code in C++ [...]

Are you joking?

~~~
jandrewrogers
Not at all. I've written massively parallel codes for a number of computing
platforms, both conventional and exotic. The vast majority are primarily
written in systems-style C++ (basically, C++ minus several popular features).

Most of the knocks against C++ are micro-optimizations that are not material
to the actual parallelization of the code. Parallelization is done by
carefully selecting and designing the distributed data structures, which is
simple to do in C++. For parallel systems these are required (theoretically)
to be space decomposition structures, which happen to be very elegant to
implement in C. Communication between cells in the space decomposition is
conventional message passing.

The usual compiler-based tricks such as loop parallelization are not all that
helpful even though supercomputing compilers support it. By the time you get
to that level of granularity, it is a single process locked to a single core,
so there is not much hardware parallelism to exploit. And even then, you have
to manually tune the behavior with pragmas if you want optimal performance.

Most compiler-based approaches to parallelism attack the problem at a level of
granularity that is not that useful for non-trivial parallelism. The lack of
multithreading in modern low-latency, high-performance software architectures
makes the many deficiencies of C/C++ for heavily threaded environments largely
irrelevant.

------
ismarc
For a while, I've been pondering how to solve the same problem. I also reached
similar conclusions: that parallelism needs to be implicit at the language
level. I've even been working on an implementation that will run on the JVM
(I'm all of a week into writing code for it, but I've been documenting how it
works for months in my notebooks). The language is based around the idea of
stream processing. By using type signatures, function calls can be
parallelized based on how they are called.

As an example, given two functions, A(Item) and B(List[Item]), if you call
them like:

        B << A << List[Item](x,y,z)

then A should be called once for each item in the list x,y,z, and the results
are collected into a list and passed to B. If you add in a function C(Item)
and call it like:

        B << C << A << List[Item](x,y,z)

then at any given point, A can be processing one list item, C can be
processing another, and the total results of C are being collected. Which
leads to the question: what if it is order-dependent?

        B << serial { A << List[Item](x,y,z) }

will guarantee that A is called serially and the order of the List (and
results) is preserved.
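
For comparison, here is a rough Haskell sketch of the order-independent case
(Item, Result and the functions a, b, c are hypothetical stand-ins for the A,
B, C above). Note that it flattens the A-then-C pipeline into a single
parallel map over the composition, so it captures the per-element parallelism
but not the stage pipelining described above.

    import Control.Parallel.Strategies (parMap, rdeepseq)

    type Item   = Int
    type Result = Int

    -- Hypothetical stand-ins for A, B and C.
    a, c :: Item -> Item
    a = (+ 1)
    c = (* 2)

    b :: [Item] -> Result
    b = sum

    -- Roughly B << C << A << List[Item](x,y,z): apply (c . a) to every
    -- element in parallel, then collect the results and hand them to b.
    pipeline :: [Item] -> Result
    pipeline xs = b (parMap rdeepseq (c . a) xs)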

I really shouldn't advertise the link to the language at all here, considering
I have zero documentation and there's almost no comments in the current code,
but I'm going to anyway: <https://bitbucket.org/ismarc/muster> . It currently
will parse a single file, generate the AST and generate the symbol table. I'm
working on a non-parallel interpreter right now, with the intention of using
the interpreter to write the compiler so the language is fully exercised to
work out any potential issues with it. And on a final note, I'm targeting the
JVM mainly for existing libraries, but I'm open to discussions about it.

~~~
tel
I've been thinking about a similar thing, attaching Storm-like parallel
semantics to Pipes (probably Gabriel Gonzalez's version) in the Par monad.

------
Cyaegha
Maybe I'm wrong, but it seems to me that most of this guy's ideas are either a
bit half-baked or simply flat out wrong.

Consider the argument for his proposed language being Turing complete on the
flow manifesto page. He claims that this follows from the Böhm-Jacopini
theorem. However, the Böhm-Jacopini theorem requires that variable assignment be
allowed in the subprograms (which does not appear to be the case in Flow), and
without this the theorem does not hold, as this paper by Kozen demonstrates:
www.cs.cornell.edu/~kozen/papers/bohmjacopini.pdf. Then he goes on to argue
that order duality implies the existence of a selection operator in flow,
which doesn't seem to make any sense.

I think it's telling that he has been working on this since 2010, yet has no
decent proofs of his claims, nor any working implementation.

~~~
lukehutch
Variable assignment is not strictly necessary; what is necessary is that one
loop iteration can modify the initial state of variables in the next loop
iteration. This is the second alternative to setting the current value of a
variable presented in Flow: specifically, recurrence relations, e.g. x' = x +
1, which are syntactic sugar that allows "variable assignment", but only to
the values variables will hold in the next loop iteration. So the Flow
Manifesto doesn't cover the hole in the Böhm-Jacopini theorem specifically,
but that doesn't mean this computational model is not Turing complete.
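
One way to read the recurrence-relation idea (a hedged sketch in Haskell, not
Flow syntax): the loop body is a pure function from this iteration's variables
to the next iteration's variables, so a while-loop becomes tail recursion and
x' = x + 1 is just the argument passed to the next call.

    -- A counting loop where the only "assignment" is the recurrence x' = x + 1,
    -- written as a tail-recursive function from one iteration's state to the
    -- next. countTo is an illustrative name, not part of Flow.
    countTo :: Int -> Int -> Int
    countTo limit x
      | x >= limit = x
      | otherwise  = countTo limit (x + 1)   -- x' = x + 1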

As far as proofs or working implementation, sometimes life gets in the way of
progress :-)

------
someperson
Since this thread is young and full of expertise, I'm going to ask a question
about a nuance of Moore's Law that's been bugging me for a while. (I study CS
and currently have very little background in physics and electrical
engineering.)

Over the past few years I've read engineers talk about the manufacture and
design of CPUs in terms of an XX nanometer process, where XX is some ever-
smaller integer - presumably representing the density of transistors in the
CPU, or something proportional to that.

But for many decades, we've known that there are fundamental physical limits
to computation as we know it, with Moore's law predicting a smooth exponential
curve (doubling every 18 months) until that point.

My question is why does it have to be a smooth increase? Don't we have enough
computation and understanding now to leapfrog straight to the physical limits
using standard run-of-the-mill CPU architecture? After all, we can manipulate
individual atoms, and I think we have transistors made of just a few atoms.

Why do we have to go through the 20 nanometer processes, then presumably
through the picometers, until finally stopping at some point maybe in the
femtometers (where the lengths of many atomic nuclei can be expressed as
single-digit integers) - development that Moore's Law presumably predicts will
take multiple decades?

Why can't we build processors using modern CPU paradigms at the atomic scale
earlier? Is it mass-production issues, or some physical bottleneck? Do we need
to keep developing ever-so-slightly denser processors, slowly and at great
expense, or can we skip the interval and jump straight to the atomic scale?

I just feel like at some point there should be a step function straight to the
physical limits of processors as we know them. Why hasn't this happened?

~~~
jessriedel
First, you should know that computing is done with electrons, not nucleons.
That means the fundamental limit is at a length scale of the size of atoms
(i.e. electron orbits) and not nuclei. Atoms are about 0.1 nm in size, whereas
nuclei are about 0.00001 nm = 10 fm across. That's a difference of _four
orders of magnitude_ , so it's pretty important.

Second, your intuition is off. The jump from "being able to manipulate X in a
laboratory" to "being able to build useful devices with X on a large scale" is
large. We've been able to handle individual protons and neutrons with the
right equipment too, but that doesn't mean we're close to building computers
out of them.

------
tbenst
Fortress, a programming language funded by DARPA and developed by Sun, aimed
at solving this very issue by parallelizing core operations. For example,
'for' loops in the language are parallel operations that do not guarantee
linear iteration. Unfortunately, development wound down this summer due to
incompatibility with VM type systems.

I'm surprised that the author did not mention join lists in his discussion of
parallelizable functional program structure. A join list is either empty, a
single element, or contains two join lists. In a normal list, you recur on a
list that is only one smaller. In a join list, each recursion passes half of
the list to a new processor.
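
A minimal Haskell sketch of the idea (my own illustration, not from the
article): the Join node is where the two halves can be handed to different
cores and combined afterwards with an associative operator.

    import Control.Parallel (par, pseq)

    -- A join list is either empty, a single element, or two join lists.
    data JoinList a = Empty | Single a | Join (JoinList a) (JoinList a)

    -- Parallel fold: reduce the two halves of a Join node on separate cores,
    -- then combine them with an associative operator f (z is f's identity).
    pfold :: (a -> a -> a) -> a -> JoinList a -> a
    pfold _ z Empty      = z
    pfold _ _ (Single x) = x
    pfold f z (Join l r) = lr `par` (rr `pseq` f lr rr)
      where
        lr = pfold f z l
        rr = pfold f z r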

The switch to easy parallelization will require data structures that do not
guarantee linear iteration but gain scalability "for free." In my experience,
this is easy to add to lisp dialects, but difficult to implement and use in
imperative languages.

~~~
seanmcdirmid
Factual correction: Fortress lost the DARPA competition to IBM, I think. So
after the competition Sun funded it internally...maybe because it was GLS's
project.

~~~
frozenport
It also lost to Chapel from Cray.

------
tomrod
May I make a small, unrelated suggestion if the OP is the blog owner? Smaller
logo. It took me a full half-minute to figure out that I should scroll down:
the logo took up my entire screen, and I thought I was supposed to click
something special to see the content.

/rant

~~~
lukehutch
I'm not the OP but I'm the blog owner. Whoever posted the link posted the
mobile link, and blogger.com has re-formatted the image to maximum width...
not sure how to fix it, but I'll take a look :-)

~~~
tomrod
Fair enough. I didn't mean to be a sour puss--I was just surprised by the
situation.

------
cbhl
A colleague of mine remarked that one of the things that bothered him about Go
(and uC++, and to some extent, VHDL for that matter) was how much the
languages resembled traditional imperative languages (like C). In particular,
it was the cases where these languages did something that was different from C
(i.e., "different from what the user expected") that caused him the most pain.

If the Flow language can figure out how to deal with this expectation, I think
there might be something here.

------
DasIch
While I agree with him in theory, I doubt this will work in practice.

I very much doubt that a compiler will be able to choose appropriate
parallelization schemes at runtime without hints from the user; Big-O
approximation will probably be insufficient in an annoying number of cases and
very hard for the compiler to implement. IO will require monads, or that DAG
turns into a linked list. I also find it very telling that he doesn't show
what that DAG looks like for loops or recursive functions.

~~~
sounds
The flowlang mailing list has discussed these points in some detail; he
definitely has loops.

<https://groups.google.com/forum/?fromgroups#!forum/flowlang>

~~~
lukehutch
Right, a loop is a morphism (or sub-lattice) from the state of loop variables
at the beginning of a loop iteration to the state of loop variables in the
next iteration. Iteration represents unrolling copies of the loop body sub-
lattice in a linear string until termination. Recursion represents expanding
sub-lattices inside sub-lattices until termination.

------
jaipilot747
I did not understand everything in the article, but if Haskell has solved all
these problems already, why isn't everyone jumping ship en masse to functional
programming? Surely there must be a better reason than "It's too hard for
programmers"?

~~~
fauigerzigerk
We don't have that many cores yet. The biggest servers have 64, some phones
have 4. That's all within a range where we can parallelize by running
different processes on different cores without changing our code.

I guess we will see more urgency in a couple of years when hundreds of cores
are going to be the norm. At that point software that makes use of those cores
will pull away from software that doesn't.

I'm not saying functional programming will be the answer but it could be. I
don't think it's that difficult to grasp.

------
zvrba
He overlooks a lot of related work on data flow languages. Also, 1) and 3)
remind me of I-structures (<http://dl.acm.org/citation.cfm?id=69562>).

Otherwise he seems to have described asynchronous message-passing style,
except with convoluted programming language concepts. But Microsoft already
has a compiler that statically analyzes message passing in large programs;
look up the Singularity operating system. [TL;DR: Message-passing microkernel;
everything runs in supervisor mode; safety is checked statically at compile-
time.]

Unordered collections are just too lax to be useful: for example,
multiplication of 100 matrices can be nicely parallelized because
multiplication is associative, but not commutative. So you'd probably end up
defining write-once locations ("boxes") that contain (ordered) vectors of
elements and use that for communication. So, back to I-structures.
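
To illustrate the associativity point, here is a hedged Haskell sketch (my
own, not from the article) of a tree-shaped reduction that keeps the operands
in order - so it is valid for matrix multiplication - while still splitting
the work across cores:

    import Control.Parallel (par, pseq)

    -- Tree-shaped reduction of an *ordered*, non-empty list with an
    -- associative (but not necessarily commutative) operator such as matrix
    -- multiplication. The two halves are reduced in parallel; order within
    -- and across the halves is preserved, so only associativity is assumed.
    treeReduce :: (a -> a -> a) -> [a] -> a
    treeReduce _ [x] = x
    treeReduce f xs  = lr `par` (rr `pseq` f lr rr)
      where
        (l, r) = splitAt (length xs `div` 2) xs
        lr     = treeReduce f l
        rr     = treeReduce f r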

So, basically, just another mash-up of old ideas.

------
baxrob
I'll never forget this exchange I had with a CS professor:

Me: ...but, in the future, everyone will learn how to program.

Neal: No, in the future, no one will need to know how to program.

I suspect Neal was right, though I dislike the idea strongly.

Anyhow, machines will probably prove to be smarter with such details. So
accountants can likely just shift the cost to another column. (Sorry to anyone
who can conceivably be replaced by a robot.)

Not to sound like such a douche: parallelism, concurrency, distributed
process; can raw human "wet" computing ever pretend to compete with clean
binary process in such gambles?

~~~
derleth
> Neal: no, in the future, no one will need to know how to program

Already true if you define 'programming' as 'wiring up plugboards' or
'punching cards to feed into a mainframe'.

If you define 'programming' as 'telling non-sentient machines what to do in a
way that they will actually execute your intentions' then not even developing
human-level AI will kill programming. After all, why would a human-level AI be
chained to the task of playing DOOM with me just because I decide I want to
play DOOM?

------
zaroth
To claim this is a 'dilemma' is to ignore the increasingly common types of
workloads that companies need to run, and the amazing technology stacks that
have been created to run them. Namely, horizontally scalable systems on
commodity hardware with linear performance vs node count.

Software is eating the world, and as the speed and availability of the network
continue to improve dramatically, computing workloads will continue to
centralize, and companies will serve their massive user bases with wimpy cores
at a massive scale.

Amdahl's Law says that "the speedup of a program using multiple processors in
parallel computing is limited by the sequential fraction of the program."

So while you have a hard limit trying to improve performance of a single
function through parallelization, there is _no upper limit_ on running
multiple instances of the same function to respond to multiple concurrent and
independent requests.
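
For reference, here is Amdahl's bound written out as a tiny Haskell helper (my
own illustration): with sequential fraction s, the speedup on n cores is
1 / (s + (1 - s) / n), which caps at 1 / s no matter how many cores you add.
Independent requests sidestep that cap entirely.

    -- Amdahl's Law: speedup of a single task on n cores, given its sequential
    -- fraction s. The result can never exceed 1 / s.
    amdahl :: Double -> Double -> Double
    amdahl s n = 1 / (s + (1 - s) / n)

    -- e.g. amdahl 0.05 64 is roughly 15.4, and no core count pushes it past
    -- 1 / 0.05 = 20.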

Individual cores need only to be fast enough such that the latency of a single
request/response is acceptable. After that, any increase in single-core
performance requires a commensurate increase in the complexity of the
scheduler, leading to rapidly diminishing returns, not to mention
significantly lowering 'Responses per kWh'.

The fundamental shift toward multi-core processors may have been necessary due
to physics, but not coincidentally (I think) it's also exactly the
architecture you want if you are trying to serve large numbers of requests
over a network.

------
army
The difficulty I have with his manifesto is that he proposes many ideas that
sound quite similar to existing ideas (e.g. graph reduction model of
computing, dataflow execution models, dataflow languages), but there isn't a
direct comparison: I can't tell if this is a remix of older ideas (nothing
wrong with that necessarily), or if there is some additional idea he has that
means the language would be a major advance over past work. It's just difficult
to know what to make of it.

~~~
Cyaegha
He has additional ideas - the problem is they're mostly nonsense. For example,
if you look at the page where he tries to explain what he means by program
lattices and lattice based computing, he gets the definition of a lattice
wrong! (despite actually linking to a correct definition!)

~~~
lukehutch
What specifically is wrong with the definition of a lattice?

~~~
Cyaegha
You state that:

 _A lattice is a partial ordering with a single unique "top element" (least
upper bound) and a single unique "bottom element" (greatest lower bound).
Every node in the lattice is reachable by following some directed path from
the single least upper bound, and there is a directed path from every node in
the lattice to the single least lower bound._

A lattice is a poset where binary joins and meets exist. You define a bounded
poset, not a lattice. In general a lattice need not have either a top element
or a bottom element, although all finite lattices are complete and thus do
have top and bottom elements. The second sentence is also redundant, because
for any element x of a lattice x ≤ ⊤, so why talk about paths?

~~~
lukehutch
Cyaegha: Since we're talking about graphs generated from source code, they
will always be finite. These are not abstract infinite mathematical objects.
All finite lattices are posets, but not all finite posets are lattices, so I
stand by the usage. And "Program lattice" sounds a heck of a lot better than
"program bounded poset" :-)

The second sentence is intended to be redundant; it gives further explanation
as to the implications of the first sentence. (This doc was never intended to
be a tight mathematical definition, only to give a quick introduction to the
concepts, with links to the strict mathematical definitions.)

The reality is that the lattice or bounded poset formulation is only needed to
understand that the specific language constraints described ensure that the
dependency graph of any valid program can be statically determined to be a
partial ordering. The compiler doesn't even need to build a DAG internally, it
just needs to ensure the specific language constraints described are in place,
and as a result, the language will be able to be parallelized in any manner
consistent with the partial ordering of data dependencies.

~~~
Cyaegha
_All finite lattices are posets, but not all finite posets are lattices, so I
stand by the usage._

Yes obviously, but nowhere is it clear that you actually have a lattice of any
kind. Specifically, I see no reason why pruning DAGs of unused computations
would yield a lattice, and this is a result you seem to rely on. Program
lattice does indeed sound nicer than program bounded poset, but I would be
more wary about misusing terminology than having a nice name! :)

~~~
lukehutch
The least upper bound can be thought of as a virtual node consisting of the
set of static constants (and the set of all inputs) fed into the rest of the
lattice. The greatest lower bound is a virtual node consisting of the set of
final output values that are "used" (written to an output file or database).
If an intermediate value is computed but is not referenced anywhere, the data
dependency DAG has a local lower bound node that is not connected to the
greatest lower bound since the value is not written anywhere. If these unused
nodes are pruned away, the result is a lattice.

~~~
Cyaegha
For that construction to work, these 'virtual' nodes need to exist in the
poset. Your poset is just the vertex set of your DAG ordered by reachability,
so they don't exist. I think your intuition is good, but the mathematics just
doesn't add up. Perhaps you should look into things like trace monoids,
pomsets, and Petri nets?

Please understand that I'm not trying to be mean here. You mention you've been
applying for grants to work on this, but if you don't explain the mathematics
adequately or demonstrate how it differs from existing formalisms then I
imagine people are just going to reject you instantly. I would suggest
seriously trying to formalise some of this stuff in a proof assistant like
Coq. That'll quickly tell you if your theory works or not.

------
cliffbean
Identifying parallelism is important, and the Flow project sounds cool.
However it's by no means the only thing we have to do to solve the multicore
conundrum. It turns out that there's a lot more to worry about in a cluster or
a GPU than just the number and nature of their cores.

~~~
lukehutch
This is only intended to be an optimal solution to the problem of building
highly parallel data manipulation pipelines.

------
zurn
Just a problem statement and initial idea which isn't that new, but could be
interesting if he actually implements it.

There have been many failed systems that hinged on "sufficiently smart
compilers". Some have been thread-parallel machines (like the Tera MTA,
shipped 1998) and some have been VLIW/MIMD, like Multiflow (shipped 1987). And
Itanium.

------
zwieback
Nicely written article - clarified some points for me.

Isn't the wait for the unordered collection to be "finalized" a huge
bottleneck, though? I can see this working nicely for low-level
parallelisation but not so well for more traditional producer/consumer threads
or I/O operations.

------
eternalban
<http://people.cs.umass.edu/~emery/pubs/flux-usenix06.pdf>

~~~
sounds
The Flux language proposed in the paper makes it easier to _manually_
decompose a large program into smaller "concurrent servers."

Flow (and note that this is not the only new language being called Flow) seems
to go for automatic parallelization; if successful, that's a fundamentally
different approach versus Flux.

~~~
lukehutch
Yes, I have come across no fewer than _five_ programming languages called
Flow since posting the Flow Manifesto. It already has a new name, but the new
name and info about the new directions of the project haven't been posted
publicly yet.

------
af3
From the science perspective: after LAPACK and friends are rewritten in
functional languages, maybe some people will be willing to switch from
FORTRAN... but I doubt it.

~~~
carlob
I think the main reason why scientists haven't moved away from FORTRAN is
cultural.

I remember rewriting some unreadable F77 code (with copy-pasted functions from
Numerical Recipes) into C linking GSL for my advisor in grad school, and making
it much, much faster.

Fast linear algebra has been available in partly-functional programming
languages to scientists for at least two decades (think MatLab and
Mathematica), but once you learn to use one tool and it works well enough,
you'll keep using it for ages.

Furthermore, there is no great advantage in parallelizing linear algebra when
most HPC workflows I've seen in physics are intrinsically parallel, in that you
can run 512 versions of whatever your problem is and get good statistics out of
it. Your mileage may vary, though.

~~~
chengiz
> Furthermore there is no great advantage in parallelizing linear algebra

With all due respect, that is a ludicrous statement to make! Most nontrivial
parallel scientific computing applications use parallel linear algebra of some
sort.

> the main reason why scientists haven't moved away from FORTRAN is cultural

You can make that argument for Fortran vs C, not Fortran vs Matlab. Fortran is
blazingly fast, and you don't have to pay an arm and a leg for a seat like with
Matlab.

------
hindsightbias
There will always be a monkey banging the drum for fantastical soft solutions,
and then there are rocket scientists who will deliver the hard solution.

Would that you could see over the horizon.

