Hacker News
Many Core processors: Everything You Know (about Parallel Programming) Is Wrong (my-inner-voice.blogspot.com)
180 points by miha123 on Jan 16, 2012 | 99 comments



"The obstacle we shall have to overcome, if we are to successfully program manycore systems, is our cherished assumption that we write programs that always get the exactly right answers."

Most of the time, this is not a trade off worth making. I can't think of a researcher that would willingly trade replicability for speed. I can't think of a mathematician who would base a proof on the idea that a number is probably prime. I can't think of a bank customer who would be fine with the idea that the balance displayed on the ATM is pretty close to where it should be. I can't think of an airline passenger who would be totally fine with the flight computer usually being pretty good.

It would be a fine trade off for games, however. And I'm sure there is room for some fudging in complex simulations that have plenty of randomness already.

But given the choice between an answer that is correct, and an answer that is probably correct, I will take the correct answer. Even if I have to wait a little.


I think I understand your point, but I don't agree with some of your examples.

"I can't think of a mathematician who would base a proof on the idea that a number is probably prime."

I can assure you that such a proof probably exists :) Just look at https://en.wikipedia.org/wiki/Probable_prime
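
For the curious, the machinery behind a "probable prime" looks roughly like this. A minimal Fermat-test sketch in C (real packages use the stronger Miller-Rabin test; the 128-bit intermediate is a GCC/Clang extension, and the base selection here is purely illustrative):

  #include <stdint.h>
  #include <stdlib.h>

  /* (a * b) % m without overflow, via a 128-bit intermediate */
  static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((__uint128_t)a * b % m);
  }

  /* a^e mod m by square-and-multiply */
  static uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    a %= m;
    while (e) {
      if (e & 1) r = mulmod(r, a, m);
      a = mulmod(a, a, m);
      e >>= 1;
    }
    return r;
  }

  /* Fermat test: if a^(n-1) != 1 (mod n), n is definitely composite;
     otherwise n is only *probably* prime (pseudoprimes exist). */
  static int probably_prime(uint64_t n, int rounds) {
    if (n < 4) return n == 2 || n == 3;
    for (int i = 0; i < rounds; ++i) {
      uint64_t a = 2 + (uint64_t)rand() % (n - 3);
      if (powmod(a, n - 1, n) != 1) return 0;
    }
    return 1;
  }
The point is that the answer "prime" comes with a small, quantifiable chance of being wrong (Miller-Rabin bounds it per round), and that is routinely accepted in practice.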

Probability theory can be an extremely powerful tool when researching things that are otherwise difficult to reason about. And the theorem statement does not have to be probabilistic for the probabilistic method to be applicable. Just see http://en.wikipedia.org/wiki/Probabilistic_method http://en.wikipedia.org/wiki/Probabilistic_proofs_of_non-pro...

As for the following:

"I can't think of an airline passenger who would be totally fine with the flight computer usually being pretty good."

Actually, I would think it's pretty much the opposite. That is, the only type of airline passenger I can think of is one who is fine with the flight computer (and the airplane in general) usually being pretty reliable. We already know that computers can malfunction and airplanes can crash. Now, of course, how reliable you want the airplane to be is up to you, but if you want it to be flawless, then you should never board an airplane.


It's not just the examples that are flawed. In most practical situations, provably correct answers do not exist. In most cases, one can only choose a level of certainty. Sometimes not even the level of certainty is possible to know.


Also, no airline passenger cares exactly what trajectory they take through the air to reach their destination. If it's easier to program an autopilot which tacks slightly off course, nobody will care.


Those are examples of problems that exist within the space of mathematics and number games we created for ourselves. Computers as we have them now are great for those.

However, when interfacing with the real, non-exactly-specified, incomplete-information world, "good enough most of the time" is not only fine, it is all you can hope for. For example, robots that need to navigate rough terrain, operate inside living organisms at the nano-scale, or communicate in human language.

There is a huge class of things that simply cannot be addressed with the current "correct" model of computation because they are not well-specified to the lowest level in the first place.

Computers, in other words, need to be more like brains. Yes, this does give up certain advantages and safety guarantees. But it does not give up everything. People have trusted their lives to animals and other humans, for example, for all of history, even though they are far from perfectly safe and/or compliant...

(and even now, it is "fallible" humans that control the technology, so you could make a point that nothing is lost, as it was never there in the first place)


> Computers, in other words, need to be more like brains.

I wonder about this. One of the most common human methods of making decisions with incomplete information and/or short time frames is to fall back on heuristics rather than try to reason through the data. Heuristics have the advantage of being fast and available, but they're also notoriously bad at producing the quality of decision that might come from a more thorough analysis of the data.

Frankly, I think I'd feel more comfortable with computers that made decisions through brute-force processing and Bayesian analysis.


Yes, we'd all like to live in the best of all worlds. But in any real-life scenario, there is a trade-off.

Brute forcing all possible combinations and finding the best one (for your specific goal, or taking into account all externalities as well) would be great. But it is impossible for two reasons:

1) incomplete, or even purposefully incorrect information

2) time constraints/complexity. For a non-linear system with more than a few variables, it is just not feasible to enumerate everything.

You end up with heuristics. The challenge that remains is finding good heuristics. If the decision is given more time, you can probably come up with better heuristics and even sort-of brute-force through the "best" options (which is what humans do when they reason).


In most applications, using an IEEE float is fine. It's a trade-off where a loss in accuracy is acceptable. Lots of digital things (images for instance) are discrete approximations of continuous phenomena. The question is the degree of inaccuracy that is tolerated and the sorts of things you have to do to prevent it from propagating (which should be familiar to anyone who has to do math and work with finite-precision floating point numbers).


Floating-point numbers have well-defined rules and perfectly accurate calculations; they are only "inaccurate" when used as a computer representation/approximation of real numbers. However, they do not exhibit any randomness (indeterminacy) and are usually not a cause of the strange, hard-to-reproduce errors that concurrent, non-sequential memory instructions often cause.


It depends on if you are using "accurate" in the STEM jargon sense or if you are using it in the common parlance. In jargon, there is a difference between accuracy and precision. http://en.wikipedia.org/wiki/Accuracy_and_precision

Floats are inaccurate (sometimes far from the real value), but precise (they always yield the same output given the same input).

Edit: reading the article, it calls out a special (different) meaning of the use of precise in the IEEE float spec

> In the case of full reproducibility, such as when rounding a number to a representable floating point number, the word precision has a meaning not related to reproducibility. For example, in the IEEE 754-2008 standard it means the number of bits in the significand, so it is used as a measure for the relative accuracy with which an arbitrary number can be represented.


Most "accuracy" issues that crop up with floats have nothing to do with either accuracy or precision, but simply with wrong expectations about decimal behaviour.


And yet our brains work that way, and we trust them.

If I understand correctly, no single neuron is necessary for your brain to work OK. The brain does not depend on any single link working correctly. Somehow it works reliably enough.

It's possible to make reliable systems from unreliable parts.


As a software developer, my brain is one thing I certainly don't trust, and the brains of others I trust even less. That's exactly why we put things like tests and code reviews in place. Still, despite our best efforts, our software is full of defects.

From the non-software perspective human brains make big mistakes all the time. Consider car accidents, gross errors in judgement, etc.


We trust our brains, but they're not, in fact, very reliable, not without lots of training anyway. In everything from long-term decision making to remembering why you walked into the room, they make little errors that sometimes get us into big trouble. A large part of the reason computers are so useful is precisely because they don't usually make those kinds of mistakes.


They are reliable for the things they are "designed" for. For example, motor coordination and coordinating the body's automatic functions - breathing, pumping blood, etc. They are also highly reliable for new things, if properly trained - like playing music, riding a bike, flying a plane, playing games, etc. Parsing natural language is not an easy task, it involves symbolic processing, yet brains are better at this than computers.

Complaining that brains without training are not reliable is like complaining computers don't work correctly without software.

BTW - many of the errors brains make are not because of unreliable parts, but because of system design - we have some firmware that made sense before and is now problematic. This firmware is reliable - it works as intended - only the problems have changed, we can't change the firmware, and patching around it in software causes some problems.


It's difficult to say how reliable our brains are at doing things like driving cars or riding bikes because we don't really have anything to compare them to. I expect that it won't take long before we can build automated cars that are far safer drivers than any human.

Parsing natural language is probably inherently non-deterministic since no natural language grammars are context-free. When we parse a sentence there is almost always at least some ambiguity in the sentence itself, plus we also have to process other things the person has said, in many cases years in the past, and what motivation they may have for saying a certain thing. It is very unlikely that two people will get exactly the same meaning out of some natural language sentence, because it comes with all kinds of side effects; for example, something somebody says may also change your opinion of them and of other things they have said.

I suspect that if we cannot get deterministic behavior out of future computers because of the amount of parallelization required to make efficient use of their CPUs, we will end up with two streams of computing: one which will stay relatively static and be interested in using computers for the types of problems we currently do, and another which will be interested in applying them to new problems that do not require determinism.

The software for these computers will be radically different, so most likely you will have two computers on your desk (or in your pocket, or in VMs), one with < 10 cores and one with > 1000 cores.

I don't know much about how the brain works, but I guess this is a process that uses a lot of heuristics and pseudo-randomness that probably lends itself well to being parallelized, which is why we set up our languages this way.


It's not that the brain works "reliably enough"; any number of experiments will show that it doesn't.

It's more that one of the properties of the brain is that it believes itself to be reliable enough.


Isn't every real computer program inherently probabilistic? Even if the code is not, the hardware always has some chance of failure. Even with ECC RAM etc, there is always some chance that multiple bits will get simultaneously corrupted into a new but still valid configuration.

You can't get to 100% reliability. You just have to choose how many nines you want. Why should this be any different?


This is different. If you get a failure, like a flipped bit in RAM, it's a failure. With many-core processors, the normal mode of operation is to provide correct results but they may be different from time to time, assuming more than one correct solution exists. Even if the results are exactly the same or only one solution exists, the process of getting there might be different.


Sure, it's different in terms of why it's not reliable, but the outcome is the same, so why does it matter?


If I understood correctly, the outcome is not the same. In a memory failure you get incorrect results. In a concurrent program, you get correct results but they might not be the same results every time (provided that there's more than one correct answer) or the computation that took us there might be different (in terms of timing, power consumption, etc).


Still seems the same to me in principle. For problems where there are multiple correct answers, probabilistic methods are frequently used to find them (e.g. genetic algorithms). All kinds of things can change timing and power consumption, especially when using a probabilistic method.


Due to round-off error, correct programs are not a subset of replicable programs. Scientists do trade replicability for speed.


You are perhaps not understanding what I mean by replicability. By replicability, I mean that if you run some statistics on a large dataset several times, you will get the same answer each time. Rounding errors in floating point arithmetic don't harm replicability, because all modern computers have that issue. There is some loss in accuracy, but floating point arithmetic is still deterministic.


> There is some loss in accuracy, but floating point arithmetic is still deterministic.

Not in a multi-core setting. When you do a floating point reduce operation the order in which you reduce your floating point numbers matters, but this order is not necessarily deterministic.


Translation for readers:

Let's say you have some floating point numbers, like...

  float A = 42.90F;
  float B = 00.01F;
  float C = /* ... */;
  float D = /* ... */;
(A + B + C) will not necessarily produce the same result as (A + C + B).

In fact, if you do (A + B + B + B ...) very much, then you will wind up exacerbating the problem --- it can be a source of bugs if you repeatedly increment a counter by small floating point values. For example:

  const int Iterations = 10;
  for ( int i = 0; i < Iterations; ++i )
    A += B;
  printf( "%2.4f\n", A );
The output is 43.0000, as you'd expect. But what if we increase Iterations to 100? We get 43.8998.

Astute readers probably would note that 00.01F can't be exactly represented in base 2, and wonder whether something like 00.0125F would still suffer from this problem. Alas, yes:

  A = 42.9000F 
  B = 00.0125F

  8    iterations ==  43.0000 
  80   iterations ==  43.9001
  800  iterations ==  52.9006
  8000 iterations == 142.8820  /* we would expect 142.9 */
So if you want to parallelize some computations, and you want to do 4000 iterations on core #1 and 4000 iterations on core #2 and then sum the results, well... let's see what happens:

  float A1 = 42.9000F;
  float B1 = 00.0125F;

  float A2 = 42.9000F;
  float B2 = 00.0125F;

  const int Iterations = 4000;
  for ( int i = 0; i < Iterations; ++i )
  {
    A1 += B1;
    A2 += B2;
  }
  float A = A1 + A2 - 42.9000F;
  printf( "%2.4f\n", A );
The output is 142.8885 -- which is different from our earlier result of 142.8820!

You may be tempted to try various things to get around this fact... for example, what about counting from zero rather than from 42.9000? (Answer: the error can be even worse, which is a topic worthy of its own separate post) ... but you will quickly discover that this is a fundamental issue. The order of operations matters.


Which is why every good numeric library should have "correct" reduce implementations that guarantee minimal errors. In the case of a sum of a floating-point array, that would be a tree-based sum, possibly coupled with sorting the array beforehand.
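
A minimal sketch of the tree-based idea in C (a sketch, not a tuned library routine; the base-case size is arbitrary):

  #include <stddef.h>

  /* Pairwise (tree-based) summation: rounding error grows roughly with
     log n instead of n, and the two halves can be summed in parallel. */
  static double pairwise_sum(const double *x, size_t n) {
    if (n <= 8) {                        /* small base case: plain loop */
      double s = 0.0;
      for (size_t i = 0; i < n; ++i) s += x[i];
      return s;
    }
    size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
  }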

Also, if you do some kind of map-reduce style work distribution where different nodes perform same operations on different data, it pays off to use a deterministic (pseudo-)random algorithm for work distribution, so that your results can be reproduced.
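
A sketch of that second point (item_id here is a hypothetical key for a unit of work; the hash is ordinary FNV-1a):

  #include <stddef.h>
  #include <stdint.h>

  /* A fixed hash of the item's key: the same item always lands on the
     same node, so a rerun distributes -- and reduces -- identically. */
  static uint32_t fnv1a(const unsigned char *key, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) { h ^= key[i]; h *= 16777619u; }
    return h;
  }

  static int node_for_item(uint64_t item_id, int num_nodes) {
    return (int)(fnv1a((const unsigned char *)&item_id, sizeof item_id)
                 % (uint32_t)num_nodes);
  }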


> When you do a floating point reduce operation the order in which you reduce your floating point numbers matters, but this order is not necessarily deterministic.

There is no such thing as a "floating point reduce operation". You can do a reduction using a floating point calculation as your joining operation, but even in that case, each individual operation is still deterministic. Moreover, a classical reduce is also deterministic, being a fold in one direction or other based on the associativity of the joining operation. The non-determinism comes only from your choice to specify the behaviour of your "parallel reduce" function in a non-deterministic way, and if you do that without understanding the consequences that's hardly floating point's fault.


In practice, the non-determinism is important for efficiency. Consider computing a norm or dot product in parallel. You sum the owned part of the distributed vector, then use MPI_Allreduce() to get the result of the global sum (the parallel implementation uses some tree structure, usually semi-redundantly, perhaps using dedicated hardware). If you run with a different number of processes, then the reduction will necessarily be done with different associativity. Of course any such result is as good as the others (unless the vector is somehow sorted in which case some orders will commit larger rounding error).
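
For reference, the pattern being described looks roughly like this (a minimal MPI sketch with placeholder local data):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank sums its owned piece of the distributed vectors... */
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};
    double local = 0.0, global = 0.0;
    for (int i = 0; i < 4; ++i) local += x[i] * y[i];

    /* ...then the global sum is combined over whatever reduction tree
       the implementation picks, whose shape depends on the process count. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("dot = %f\n", global);
    MPI_Finalize();
    return 0;
  }
With real data, running on a different number of ranks changes that tree, so the last bits of the result can change even though every run is "correct".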

The MPI standard does not require that reductions use deterministic associativity for the same number of processes (independent of the physical network topology), but they do suggest that it is desirable and (AFAIK) most implementations are deterministic on the same number of processes.


Map-reduce-type calculations are less likely to be on known or fixed numbers of processes, as they are designed for failing hardware and more flexible setups, unlike most MPI jobs. But most of the calculations are probably less sensitive to errors too.


Exactly. So long as we're talking about accuracy in terms of precision, and not "5% of the time the result will be nonsense", people already make this sacrifice every day.

There are also plenty of existing models where accuracy is ultimately assured, but not always maintained internally. Branch prediction, for example. The computer guesses which way the branch will go! This can give a big speed boost depending on the workload and the predictor, and the computer assures accuracy by fact-checking the guess down the line.


The concept is less foreign than you think: when you do a Google search, do you get a replicable, provably correct result? Or do you get a result that 99% of the time is good enough?


The user of a search engine is probably willing to get the occasional poorly-ranked result. The user of an AI system is probably willing to get the occasional wrong answer, etc.


Implicit numerical schemes tend to handle errors better.


For those making off-the-cuff judgments of how crazy this idea is: In 1990 or so, Dave Ungar told me he was going to make his crazy Self language work at practical speed by using the crazy idea of running the compiler on every method call at runtime. Then he and his crazy students based the HotSpot Java compiler on that crazy idea, which is now the industry-standard way of implementing dynamic languages. So now I tend to pay close attention to Dave's crazy ideas...


"Just as we learned to embrace languages without static type checking, and with the ability to shoot ourselves in the foot, we will need to embrace a style of programming without any synchronization whatsoever."

This is dangerous misinformation that is also being propagated by some managers of the "exascale" programs that seem to have lost sight of the underlying science. Some synchronization is algorithmically necessary for pretty much any useful application. The key is to find methods in which synchronization is distributed, with short critical paths (usually logarithmic in the problem size, with good constants).


Although this regards only the first part of the sentence, I can say I have shot myself in the foot quite a few times with statically typed languages.


I think it was Rich Hickey who said something like: "What is true of every bug ever found in a program? It passed all the compiler's type checks!"


Fair, but at the same time I've also had a compiler quickly draw my attention to a lot of potential errors thanks to static type checking more than once in the past, particularly when doing hairy refactors.

The relative strengths and weaknesses of dynamic and static languages are greatly exaggerated. Doing type checks at compile time won't make your code magically bug-free. But neither will delaying type checks until run-time free you from the shackles of the datatype hegemony. The trade-off between keystrokes and CPU cycles isn't really even all that much of a thing anymore, what with JITs closing the gap on one side of the fence, and type inference and generics closing it on the other.


Better than just saying, "We'll continue to use MPI at exascale!"


Well, we will.

Just not only MPI (at least not without significant extensions).


We'll probably end up using just MPI at Exascale. I was at Supercomputing 2011, I swear half the speakers might have just gone up on stage, stuck their fingers in their ears, and yelled "NAH NAH NAH CAN'T HEAR YOU, MPI IS ALL WE NEED NAH NAH NAH".


Heh. Well the current threading models are much worse from the perspective of libraries and hierarchical memory. I work with some of the MPICH developers and members of the MPI Forum. Nobody is satisfied with the status quo, but we need a viable threading system. Some people think we'll end up with MPI+CUDA/OpenCL, others (myself included) would generally prefer a system with in-node communicators and collectives, with a concept of compatible memory allocation. The system-level tools are basically available in pthreads and libnuma (unfortunately Linux-only), but we're working on formulating better APIs because the annotation-based approach of OpenMP isn't very good for long-lived threads or heterogeneous task distributions and systems like TBB are too intrusive and still don't really address NUMA.
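
As a rough illustration of those system-level tools (a Linux-only sketch using libnuma, glossing over error handling and cleanup):

  #include <numa.h>      /* link with -lnuma; Linux only */
  #include <stdlib.h>

  /* Give a worker memory on the NUMA node it runs on.
     (A real version would remember whether to free() or numa_free().) */
  static double *alloc_local_buffer(int node, size_t n) {
    if (numa_available() < 0)                   /* no NUMA support: fall back */
      return malloc(n * sizeof(double));
    numa_run_on_node(node);                     /* pin execution to the node */
    return numa_alloc_onnode(n * sizeof(double), node);   /* node-local pages */
  }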


"every (non-hand-held) computer’s CPU chip will contain 1,000 fairly homogeneous cores."

There are two problems with these visions: one is memory and the other is the interconnect. 1000 cores, even at a modest clock rate, can easily demand 1 terabyte of memory accesses per second. But memory has the same economies as 'cores' in that it's more cost-effective when it is in fewer chips. But the chip is limited in how fast it can send signals over its pins to neighboring chips (see Intel's work on Light bridge).
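
A rough back-of-envelope for that bandwidth figure (my own illustrative numbers, not the author's):

  1000 cores x 1 GHz x 1 byte touched per core per cycle  ~  1 TB/s
So the caches have to absorb nearly all of it before a handful of DRAM channels becomes workable.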

So you end up with what are currently exotic chip-on-chip types of deals, or little Stonehenge-like motherboards where this smoking-hot chip is surrounded by a field of RAM shooting lasers at it.

The problem with that vision is that to date, the 'gains' we've been seeing have been when the chips got better but the assembly and manufacturing processes stayed more or less the same.

So when processors got better the existing manufacturing processes were just re-used.

That doesn't mean we won't at some point in the future have 1000-core machines; it just means that other stuff will change first (like packaging) before we get them. And if you are familiar with the earlier 'everything will be VLIW (very long instruction word)' prediction, you will recognize that a lack of those changes sometimes derails the progress. (In the VLIW case there have been huge compiler issues.)

The interconnect issue is that 1000 cores can not only consume terabytes of memory bandwidth, they can also generate tens of gigabytes of data to and from the compute center. That data, if it is being ferried to the network or to non-volatile storage, needs channels that run at those rates. Given that the number of 10GbE ports on 'common computers' is still quite small, another barrier to this vision coming to pass is that these machines will be starved for bandwidth to get fresh data to work on, or to put out data they have digested or transformed.


The obvious solution to the memory bandwidth problem is to partition it and connect the pieces directly to the cores. Even if there is a shared region, cores don't have to worry about coherence when using their own private memory.

Yes, computing gets very interesting when what we have is no longer an overgrown IBM 5150.


That's done on the PS3, and I hear it isn't easy to write efficient programs for it.


Each SPE has a tiny amount of memory (256K, IIRC), severely limited connectivity to other SPEs and a downright cruel instruction set. A friendlier ISA and Transputer-like connectivity between the nodes would alleviate some of these problems.


Could you point me to a description of the VLIW compiler problems? In 1981 a small group of us coerced the Unix version 7 portable C compiler to generate VLIW assembler as a senior project. There was nothing astonishing going on; the pcc had perhaps a couple dozen things it needed to be able to generate (conditional execution, arithmetic, pointers), and it was a simple matter of not using stuff before the (very primitive and shallow) pipeline was able to deliver it. After graduating I lost touch with that kind of fun tech - I was hired to modify accounting software written in BASIC. I've recovered ;-).


The most readily accessible one is the Itanium compiler. Intel (and SGI) worked to create a compiler that could maximize the use of the VLIW instruction set of the processor to achieve application specific performance.

This was presented by Intel to its enterprise customers as the 'secret sauce' that would give Itanium the edge over SPARC and other 64 bit architectures. They have invested millions in making this effort work.

However, the reception of this workflow for the Itanium compiler was mixed at best. On some workloads it outperformed, on others it simply matched. The process for training the compiler, which seemed to me at the time to be an outgrowth of the branch prediction efforts, involved running synthetic benchmarks, collecting information about utilization of the execution units and then synthesizing new instruction mixes for the applications. The imposition of such a heavyweight process, which needed to be repeated for nearly every change to the code base, worked against the benefits promised. Since code is likely to change often, there were proposals to wait until you were 'ready to ship' before optimizing and tuning. But once shipped, patches are made, bugs fixed, anomalies corrected. Changing any line of code could create a massive stall in the pipeline and crush performance until the system was re-tuned.

I don't know if that experience was universal, but it was common enough that such stories were everywhere at places like the Microprocessor Forum and other conferences.


One of the problems with VLIW architectures is the lack of binary compatibility between CPU generations. Suppose you had a 4-way VLIW architecture and the next generation became 8-way. Even if the new CPU is able to run old 4-way code, it will only run at half the speed it could, i.e. you need to recompile your software.


I imagine just-in-time compilation has a lot to offer there. Managed runtimes already have the advantage of being able to automatically tailor the machine code to the target CPU.


I imagine that Transmeta had the same imagination.


In a sense, they did. And they did famously fail to meet expectations.

But considering that nowadays other stacks which rely on jitting regularly achieve real-world performance that is competitive with much native-compiled software, it seems safe to presume that Transmeta's performance problems stemmed from reasons beyond the basic idea behind CMS.


GPU vendors circumvent this limitation by not documenting actual ISA and only documenting virtual ISA (PTX for NVidia, CAL for AMD).

GPU drivers dynamically recompile the "virtual ISA" into the actual ISA the first time a kernel or shader is run.


GPUs today already routinely have 500+ cores and are doing quite well. There are very few real problems with these visions; they're realizing themselves as we speak.

Of course SIMD is not the same as 1K fully independent cores but they're both valid interpretations of parallel computing.

That kind of architecture only lends itself to a certain class of problems but they're a good indication of what can be done to circumvent the interconnect limitations.

A typical GPU hides access to slow resources by doing more computations.


True, a dual-card GPU system can have 1000 cores today; however, the cores provide primarily computation on a small (or procedurally generated) data set. This makes them great for simulating atomic explosions, CFD, and Bitcoin mining, but for systems which read data, do a computation on it, and then write new results, they don't have the I/O channels to support that sort of flow effectively. Back when I was programming GPUs, one issue was that channel bandwidth between the CPU's memory and the video card was insufficient, so textures had to be maximally compressed or procedurally generated or you would starve the TPUs waiting for their next bit of texel data. I'm not sure if that is still the case with 16x PCIe slots, however.


http://en.wikipedia.org/wiki/Connection_Machine

Money quote: "The CM-1, depending on the configuration, had as many as 65,536 processors"

I would suggest that when someone wants to get excited about exascale computing, they review the Connection Machine literature. Manycore is not a radically new concept.


Danny Hillis' thesis is an excellent read. I have a copy on my bookshelf behind me, right beside Knuth.

The fact is that the GPU (and the many micro cores) will be consumed by the CPU instruction set in the long run. Yes, that means we're going back to frame-buffers.


They had an interesting version of Lisp to go with those machines.


At our startup we are creating our own many-core processor SiliconSqueak and VM along the lines of David Ungar's work. Writing non-deterministic software is fun; you just need to radically change your perspective on how to program. For Lisp and Smalltalk programmers this outlook change is easy to do. We welcome coders who want to learn about it.


Sounds like a fascinating project!

I find it curious - and slightly scary - that as the world is stampeding towards increasingly-parallelized computing models, most of us in ASIC design are becoming increasingly thwarted by the limitations of functional simulation - which, by and large, is pretty much single-threaded. I mean, we're supposed to be designing to keep up with Moore, and our simulator performance has pretty much flat-lined. And even more alarming, I've heard very few ASIC people even talk about it.

I'm curious of your take on that.


First of all, we try to circumvent simulating the ASIC design by debugging the design in FPGAs. We then simulate the working design in software on our own small supercomputer built with these FPGAs. Simulating on many cores and running the design in FPGAs should bring us to the point where we can make a wafer scale integration at 180nm. Imagine 10000 cores on an 8 inch wafer.

Our software stack uses adaptive compilation to reconfigurable hardware, so we can identify hotspots in the code that can be compiled to the FPGA at runtime. Eventually we will be able to write and debug the whole ASIC in our software at runtime on the FPGA.

Simulating a single core is not too hard because our microcode processor is small. The ring network connecting cores, caches, memory and four 10 Gbps off-chip communication channels is harder to simulate though.


The way I imagine this working (please correct me if wrong) is that there is quite a bit of speculative computation, and that various things get thrown away. Given that, what happens if you measure speed in terms of results per watt, rather than per second? Does the per-watt value tend towards a limit, even as the per-second value increases? If so, will the re-introduction of determinism (via some new mechanism) be necessary to improve efficiency, down the line?


I have not encountered any need for throwing away results, speculative computation as you describe it. An example of programming non-deterministically is the swarm model (ants). You can for example write the text editor of a word processor this way, as described in http://www.vpri.org/pdf/tr2011004_steps11.pdf

Our SiliconSqueak processor is in an FPGA stage right now, we will make the first ASIC version this year. At this time I have no hard numbers on the watt/performance and price/performance of the ASIC. The energy use of the FPGA per result is an order of magnitude below that of an Intel x86 core and we expect the ASIC to be much better.

Regardless of the efficiency of the code, I see no need to re-introduce determinism down the line.


Do you have a white paper I could read?


Yes, mail me: aart at knoware dot nl and I'll send you the scientific paper.

Some older information can be found on http://www.siliconsqueak.org/


Haven't looked at the project yet, but some thoughts based on OP:

"Even lock-free algorithms will not be parallel enough. They rely on instructions that require communication and synchronization between cores’ caches."

Azul's Vega 3 with 864 cores/640GB mem (2008) with the Azul JVM apparently works fine using lock-free java.util.concurrent.* classes, and would appear to be a counterpoint to the very premise of the OP.

It is also probably more likely we will see a new, drastic rethink of memory managers and cooperation between h/w and s/w designers (kernel/compiler level). Right now, everything is sitting on top of malloc() and fencing instructions. What is more painful? Writing non-deterministic algorithms, or biting the bullet and updating h/w and kernels and compilers? See Doug Lea's talk at ScalaDays 2011 ([1] @66:30)

And this is not to mention anything about FP and STM approach to the same issue.

[1]: https://wiki.scala-lang.org/display/SW/ScalaDays+2011+Resour...


Cliff Click's presentations on the subject are quite enlightening, although Azul's hardware and target are quite different from the 1000-core Tilera idea. I doubt it is a counterpoint to Ungar's message.


A talk by David Ungar on this very subject is available at CMU-SV Talks on Computing Systems website: http://www.cmu.edu/silicon-valley/news-events/seminars/2011/...


Something of a side issue, but when was there a trade-off between static checking and performance? Fortran and C have pretty much always been the fastest languages around, haven't they? Is he referring to assembler?


I think what he's saying is that some programmers have shifted to dynamic languages for productivity as faster processors have made the use of statically-typed languages less critical. He foresees a similar shift away from lock-based concurrency models due to the increased number of cores.

He may be right about the move away from the threads&locks popularized in the 1990's, but I agree with you that it's not such a great analogy.


> I think what he's saying is that some programmers have shifted to dynamic languages for productivity as faster processors have made the use of statically-typed languages less critical.

Given that this guy seems to think having programs that result in wrong answers is generally acceptable, it's hardly surprising that he also seems to think dynamic languages took over at some point. However, most software written today, and nearly all software written today on a large scale, is still written in statically typed languages, sometimes with controlled dynamic elements. And much of it doesn't need to be parallelised within a process, because it's I/O bound one way or another anyway.


And of the major stuff that's heavily parallelized, basically all of what I've seen is written in static languages.

With good reason, too. I submit that when you're writing code like that, it's generally because you're trying to do a CPU-bound operation as fast as you can. Which means that unnecessarily spending cycles on type checks that could have been put by the wayside at compile time is probably not your cup of tea.


I thought making as many checks as possibly static instead of dynamic generally made things faster, maybe it's worded backwards?


Static typing certainly makes things safer, which could be a good thing when talking about distributed systems with many communicating components.


It's also what makes it possible for you to write, say, a high-performance floating-point matrix multiplication function and have the compiler emit code that accepts floating-point inputs without having to consider the possibility that the arguments were supplied as strings.


These edgy attention-grabbing titles are getting a bit too strong in my opinion.


Well, no real progress has been made in parallel programming in the decades of research (apart from the embarrassingly parallelizable), so we're probably going to have to give up something in our concept of the problem. But I really like determinism. If the proposal works out, future computer geeks will have a very different cognitive style.

Another approach might be to recast every problem as finding a solution in a search-space - and then have as many cores as you like trying out solutions. Ideally, a search-space enables some hill-climbing (i.e. if you hit on a good solution, there's a greater than average probability that other good solutions will be nearby), and for this, it is very helpful to know the result of previous searches and thus sequential computation is ideal. But, if the hills aren't that great as predictors, and if you do feed-in other results as they become available, the many-cores would easily overcome this inefficiency.
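
A toy sketch of that in C with pthreads (toy one-dimensional objective, arbitrary constants, compile with -lpthread): many climbers, each occasionally restarting from the shared best-so-far, so results feed back in as they become available. The answer is good rather than bit-for-bit reproducible, which fits the theme of the article.

  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Toy objective: higher is better.  Stands in for "declare the
     qualities a solution must have, rather than how to compute it". */
  static double score(double x) { return -(x - 3.0) * (x - 3.0); }

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static double best_x = 0.0, best_score = -1e300;

  static double frand(unsigned *s) {            /* tiny LCG in [0,1) */
    *s = *s * 1103515245u + 12345u;
    return (*s >> 8) / 16777216.0;
  }

  static void *climber(void *arg) {
    unsigned seed = (unsigned)(uintptr_t)arg;
    for (int restart = 0; restart < 100; ++restart) {
      pthread_mutex_lock(&lock);                /* start near the shared best */
      double x = best_x;
      pthread_mutex_unlock(&lock);
      x += (frand(&seed) - 0.5) * 10.0;
      for (int step = 0; step < 1000; ++step) { /* local hill climb */
        double cand = x + (frand(&seed) - 0.5) * 0.1;
        if (score(cand) > score(x)) x = cand;
      }
      pthread_mutex_lock(&lock);                /* publish improvements */
      if (score(x) > best_score) { best_score = score(x); best_x = x; }
      pthread_mutex_unlock(&lock);
    }
    return NULL;
  }

  int main(void) {
    pthread_t t[8];                             /* imagine 1000 of these */
    for (uintptr_t i = 0; i < 8; ++i)
      pthread_create(&t[i], NULL, climber, (void *)(i + 1));
    for (int i = 0; i < 8; ++i)
      pthread_join(t[i], NULL);
    printf("best x ~ %f (score %f)\n", best_x, best_score);
    return 0;
  }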

An appealing thing about a search-space approach is that it may lend itself to mathematically described problems, i.e. declare the qualities that a solution must have, rather than how to compute it.


Ok, let me understand how this is going to work. If I want to deposit 13 cents into my bank account, and this transaction is mixed in with a blast of other parallelized transactions, sometimes the right answer gets there, and other times I get only 12?

Somehow, I don't think that is going to fly.

Additionally, the statement about type checking and program correctness is not really correct.

Let's try another thought experiment. Let's compile a linux kernel with this beast. We should be happy with sometimes getting the right answers? I am not sure that they have thought this through.

Does anyone remember in the early days of MySQL where it was really really really really fast because it didn't have locks. Some wiser heads said "but it is often giving the wrong answer!" The reply was "well it is really really really fast!" And we know how that came out.

Perhaps the expected output of this sea of devices is poetry, which in the minds of those on the project, might require less precision. But even there, some poetry does require lots of precision.


You are missing the "and repair" part.

Consider sorting data, because that's what computer scientists do…

If you can sort your data on a sea of 1000 processors in O(n log n) with minimal errors (defined as out-of-order elements in the result set), then check the result in O(n) over 1000 processors and fix the handful of problems in a couple of O(n) moves, then you will beat an error-free sort that can only make use of 10 processors because of memory/lock contention.
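
A sketch of the "check and repair" step (sequential here for clarity; the checking pass itself splits across processors trivially). If only a handful of elements ended up out of place, each fix is a short insertion-style shuffle, far cheaper than redoing the sort:

  #include <stddef.h>

  /* One O(n) scan; each out-of-order element gets slid back into place. */
  static void check_and_repair(double *a, size_t n) {
    for (size_t i = 1; i < n; ++i) {
      if (a[i - 1] > a[i]) {             /* out-of-order element found */
        double v = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > v) {  /* slide it back to where it belongs */
          a[j] = a[j - 1];
          --j;
        }
        a[j] = v;
      }
    }
  }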

For your bank (which should certainly start charging you a transaction fee on deposits), it might take the form of processing batches of transactions in a massively parallel way with a relatively fast audit step at the end to tell if something went poorly. In this case the repair strategy might be as simple as split and redo the halves.


Not to quote the watchmen, but who repairs the repair?

I am kind of with Knuth on this, who seems to say that this whole parallelization/multicore stuff may be a result of a failure of imagination on the part of the hardware designers.


"Just as we learned to embrace languages without static type checking, and with the ability to shoot ourselves in the foot"

I've moved on from dynamic languages to "static typing that doesn't suck" (Haskell).


I think that this technology will eventually replace the current GPU processing that people have been doing. It has all sorts of cool crazy applications, but people will probably still need their good old fashioned deterministic CPU as an option.


Indeed, this is the low-hanging fruit. With our own many-core SiliconSqueak I can run all the 2D and 3D graphics rendering in software. Also the video encoding, etc.


Technology might allow us to produce 1000-core computers, but does the market need it?

Will they be common enough? Or will other solutions dominate, as happened with dirigibles?


GeForce GTX 590 has 1024 cores, and it's definitely wanted by gamers.


That's only "marketing" cores. It's a 32-core GPU.


What's a "marketing" core?


This form of exploratory computing has existed for a while in CPU instruction scheduling. Branch Prediction etc. is widely used.

The big trade off has to be power consumption.

If you diminish accuracy, fine. But if your handset dies because some genetic algorithm didn't converge in 3 minutes, that'll be a problem.


My email client is pretty much I/O-bound.

My word processor is perfectly well able to keep up with my typing speed.

My web browser is largely I/O-bound, except on pages that do stupid things with JavaScript.

There is no reason to try to rewrite any of these to use funny algorithms that can spread work over tons of cores. They generally don't provide enough work to even keep one core busy, the only concern is UI latency (mostly during blocking I/O).

Compiling things can take a while, but that's already easily parallelized by file.

I'm told image/video processing can be slow, but there are already existing algorithms that work on huge numbers of cores (or on GPUs).

Recalcing very large spreadsheets can be slow, but that should be rather trivial to parallelize (along the same lines as image processing).

...

So isn't the article pretty much garbage?


When I was in 5th grade the Nintendo 64 came out and a friend of mine commented that this was going to be the last major video game console release of all time. After all, the graphics were so mind bogglingly good that no one would ever have any desire to create a better system. The N64 perfectly fulfilled everyone's game requirements.

Your argument is similar. Everything you have on your machine right now works fine as it is. What you aren't taking into account is that the list of things on your machine is defined by the machine's limitations. A new style of computation will open doors and allow things onto your machine that are currently not possible because of limitations in your processor.


Heh, the N64 was the first game console I ever used. I thought it was pretty spiffy. I still think it's pretty spiffy. In fact, looking at some screenshots, I would totally play some of those games over ones we have now.

Ignoring that, I agree with your point wholeheartedly: just because things work now does not mean we can't optimize. In fact, a very large part of the software field is all about optimizing things that already exist.


With a little imagination, you can come up with countless ideas on how to use a thousand cores for each of the applications you listed. I've listed a few, which are likely a bunch of crap, but I'm just one person. There are millions and millions of other minds coming up with ideas better than these every day.

And who are you to say that today's parallel algorithms for solving things like these are optimal? To me, it certainly seems worthwhile to consider possible alternate parallelization models.

> My email client is pretty much I/O-bound.
> My word processor is perfectly well able to keep up with my typing speed.

Better speech to text? Better prediction (so you have to type less)? Better compression for attachments? Better grammar checking?

> My web browser is largely I/O-bound, except on pages that do stupid things with JavaScript.

It's clear that web sites are slowly transitioning into a role that is similar to desktop apps, so there will be a million applications that can use hefty computational power.

> Compiling things can take a while, but that's already easily parallelized by file.

Better optimizations? Things like evolutionary algorithms for assembly generation come to mind, which could use any conceivable number of cores. Better static analysis?

> I'm told image/video processing can be slow, but there are already existing algorithms that work on huge numbers of cores (or on GPUs).

GPUs are typically not as good as CPUs for integer math, in particular they're not deeply pipelined, etc. This could conceivably become much faster.

> Recalcing very large spreadsheets can be slow, but that should be rather trivial to parallelize (along the same lines as image processing).

There are a ton of things that could happen here with massive computational power. Spreadsheets that use machine learning techniques to predict things? More powerful models?

> So isn't the article pretty much garbage?

Come on, people have been coming up with novel uses for our ever-increasing computational power for well over a hundred years now (from mechanical computers to octo-core desktop machines). Why would they suddenly stop now?


You've failed to consider tons of other applications. What about AI programs like Watson? Also, scientific computing isn't always of the "embarrassingly parallel" sort.


Have you never written a server application?


Yes I have.

Each client connection is entirely independent of the other client connections. That's trivial to parallelize (thread pool, Erlang, whatever).

I also have one where the clients are almost entirely independent; that one connects to a database which handles the "almost" part.


Not all server applications are so easy to parallelize. For example there's the database server itself, which is essentially a box that you're shoving all your data concurrency into hoping that once you start to hit its limits you will be able to rearchitect your app faster than your load is growing.

But maybe you're someone who's happy with the cores and algorithms he already has. That's OK with me. There will certainly always be problems where shared-nothing parallelism over commodity hardware is the most cost effective. But not everyone is mining Bitcoin or computing the Mandelbrot set.


For most applications, for a little while, I suspect the benefit will be in having an OS that can schedule processes onto more processors. That's nice and will improve the experience for people. OS designers can probably use more daemons as a result to do other fun things. The browser, for its part as some slow, broken version of an operating system, gets to run each page as a process, which helps that experience some too (this is already happening, with each tab as a process).

I don't think your typical application will have to take advantage of all 200 cores in an explicit way. The use of dynamic languages on the desktop signifies that developers have been willing to throw away performance for development time for a while now. Why is that suddenly going to change, especially with the more difficult concurrent landscape before us?


Watch the video-- the interviewer literally asks the question about application of the technology, and Ungar gives several including business analytics, intelligent cities, associate search, basically anything that has to do with extracting meaningful information from massive data sets.



