Most of the time, this is not a trade-off worth making. I can't think of a researcher who would willingly trade replicability for speed. I can't think of a mathematician who would base a proof on the idea that a number is probably prime. I can't think of a bank customer who would be fine with the idea that the balance displayed on the ATM is pretty close to where it should be. I can't think of an airline passenger who would be totally fine with the flight computer usually being pretty good.
It would be a fine trade-off for games, however. And I'm sure there is room for some fudging in complex simulations that have plenty of randomness already.
But given the choice between an answer that is correct, and an answer that is probably correct, I will take the correct answer. Even if I have to wait a little.
"I can't think of a mathematician who would base a proof on the idea that a number is probably prime."
I can assure you that such a proof probably exists :) Just look at https://en.wikipedia.org/wiki/Probable_prime
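For the curious, here is a minimal sketch of what a Fermat probable-prime check looks like. This is my own illustrative C, not from the linked article; 104729 is just a well-known prime and 341 a well-known base-2 pseudoprime:

    #include <stdint.h>
    #include <stdio.h>

    /* Modular exponentiation: (base^exp) mod m, overflow-free for m < 2^32. */
    static uint64_t powmod(uint64_t base, uint64_t exp, uint64_t m)
    {
        uint64_t result = 1;
        base %= m;
        while (exp > 0) {
            if (exp & 1)
                result = (result * base) % m;
            base = (base * base) % m;
            exp >>= 1;
        }
        return result;
    }

    /* Fermat check: a^(n-1) == 1 (mod n) holds for every prime n with
       gcd(a, n) = 1. Passing makes n a "probable prime" to base a;
       composites that slip through are called pseudoprimes, which is
       exactly the hedge in the word "probable". */
    static int is_probable_prime(uint64_t n, uint64_t a)
    {
        if (n < 4)
            return n == 2 || n == 3;
        return powmod(a, n - 1, n) == 1;
    }

    int main(void)
    {
        printf("%d\n", is_probable_prime(104729, 2)); /* 1: 104729 is prime    */
        printf("%d\n", is_probable_prime(341, 2));    /* 1: but 341 = 11 * 31! */
        return 0;
    }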
Probability theory can be an extremely powerful tool when researching things that are otherwise difficult to reason about. And the theorem statement does not have to be probabilistic for the probabilistic method to be applicable. Just see https://en.wikipedia.org/wiki/Probabilistic_method
As for the following:
"I can't think of an airline passenger who would be totally fine with the flight computer usually being pretty good."
Actually, I would think it's pretty much the opposite. That is, the only type of airline passenger I can think of is one who is fine with the flight computer (and the airplane in general) usually being pretty reliable. We already know that computers can malfunction and airplanes can crash. Now, of course, how reliable you want the airplane to be is up to you, but if you want it to be flawless, then you should never board an airplane.
However, when interfacing with the real, inexactly specified, incomplete-information world, "good enough most of the time" is not only fine, it is all you can hope for. For example, robots that need to navigate rough terrain, or operate inside living organisms at the nano-scale, or communicate in human language.
There is a huge class of things that simply cannot be addressed with the current "correct" model of computation because they are not well-specified to the lowest level in the first place.
Computers, in other words, need to be more like brains. Yes, this does give up certain advantages, and safety guarantees. But it does not give up everything. People have trusted their lives to animals and other humans, for example, for all of history, even though they are far from perfectly safe and/or compliant...
(and even now, it is "fallible" humans that control the technology, so you could make a point that nothing is lost, as it was never there in the first place)
I wonder about this. One of the most common human methods of making decisions with incomplete information and/or short time frames is to fall back on heuristics rather than try to reason through the data. Heuristics have the advantage of being fast and available, but they're also notoriously bad at producing the quality of decision that might come from a more thorough analysis of the data.
Frankly, I think I'd feel more comfortable with computers that made decisions through brute-force processing and Bayesian analysis.
Brute forcing all possible combinations and finding the best one (for your specific goal, or taking into account all externalities as well) would be great. But it is impossible for two reasons:
1) incomplete, or even purposefully incorrect information
2) time constraints/complexity. For a non-linear system with more than a few variables, it is just not feasible to enumerate everything.
You end up with heuristics. The challenge that remains is finding good heuristics. If the decision is given more time, you can probably come up with better heuristics and even sort-of brute-force through the "best" options (which is what humans do when they reason).
Floats are inaccurate (sometimes far from the real value) but precise (they always yield the same output given the same input).
Edit: reading the article, it calls out a special (different) meaning of the use of precise in the IEEE float spec
> In the case of full reproducibility, such as when rounding a number to a representable floating point number, the word precision has a meaning not related to reproducibility. For example, in the IEEE 754-2008 standard it means the number of bits in the significand, so it is used as a measure for the relative accuracy with which an arbitrary number can be represented.
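To see that inaccurate-but-precise distinction concretely, here is a tiny sketch of my own: the stored value is wrong, but it is wrong in exactly the same way every time.

    #include <stdio.h>

    int main(void)
    {
        float f = 0.1F;
        /* Inaccurate: the nearest float to 0.1 is not 0.1 ...      */
        printf("%.20f\n", f); /* prints 0.10000000149011611938...   */
        /* ... but precise in the parent's sense: recomputing it
           always yields exactly the same bit pattern.              */
        return 0;
    }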
If I understand correctly, no single neuron is necessary for your brain to work OK. The brain does not depend on any single link working correctly. Somehow it works reliably enough.
It's possible to make reliable systems from unreliable parts.
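Triple modular redundancy is the classic trick here: run the computation three times and take the majority answer. A toy sketch (the unreliable_add is purely hypothetical, standing in for a flaky component):

    #include <stdio.h>

    /* Hypothetical flaky component: pretend the second call
       suffers a transient fault. */
    static int unreliable_add(int a, int b, int callno)
    {
        return (callno == 2) ? a + b + 1 : a + b;
    }

    static int majority(int x, int y, int z)
    {
        if (x == y || x == z) return x;
        return y; /* y == z, or no majority; a real system would flag that */
    }

    int main(void)
    {
        int r = majority(unreliable_add(2, 3, 1),
                         unreliable_add(2, 3, 2),
                         unreliable_add(2, 3, 3));
        printf("%d\n", r); /* 5, despite one faulty result */
        return 0;
    }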
From the non-software perspective human brains make big mistakes all the time. Consider car accidents, gross errors in judgement, etc.
Complaining that brains without training are not reliable is like complaining computers don't work correctly without software.
BTW - many of the errors brains make are not because of unreliable parts, but because of system design - we have some firmware that made sense before, and now is problematic. This firmware is reliable - it works as intended, only now the problems changed, and we can't change the firmware, and patching this in software causes some problems.
Parsing natural language is probably inherently non-deterministic, since no natural language grammars are context-free. When we parse a sentence there is almost always at least some ambiguity in the sentence itself; plus we also have to process other things that the person has said, in many cases years in the past, and what motivation they may have for saying a certain thing. It is very unlikely that two people will get exactly the same meaning out of some natural language sentence, because it comes with all kinds of side effects; for example, something somebody says may also change your opinion of them and of other things they have said.
I suspect that if we cannot get deterministic behavior out of future computers because of the amount of parallelization required to make efficient use of their CPUs, we will end up with two streams of computing: one which will stay relatively static and be interested in using computers for the types of problems we currently do, and another which will be interested in applying them to new problems that do not require determinism.
The software for these computers will be radically different, so most likely you will have two computers on your desk (or in your pocket, or in VMs): one with < 10 cores and one with > 1000 cores.
I don't know much about how the brain works, but I guess it is a process that uses a lot of heuristics and pseudo-randomness, which probably lends itself well to being parallelized, which is why we set up our languages this way.
It's more that one of the properties of the brain is that it believes itself to be reliable enough.
You can't get to 100% reliability. You just have to choose how many nines you want. Why should this be any different?
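(For scale, the standard availability arithmetic: a year is about 525,600 minutes, so three nines of uptime, 99.9%, allows roughly 8.8 hours of downtime per year, while five nines, 99.999%, allows roughly 5.3 minutes.)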
Not in a multi-core setting. When you do a floating point reduce operation the order in which you reduce your floating point numbers matters, but this order is not necessarily deterministic.
Let's say you have some floating point numbers, like...
float A = 42.90F;
float B = 00.01F;
float C = /* ... */;
float D = /* ... */;
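To make this concrete, here is a minimal self-contained sketch (the values are my own illustrative picks, not from the snippet above) showing that float addition is not associative, which is exactly why reduction order matters:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative values: a huge magnitude and a tiny one,
           chosen so the rounding is visible. */
        float A = 1e20F;
        float B = 1.0F;
        float C = -1e20F;

        float order1 = (A + B) + C; /* B is absorbed into A's rounding: 0.0 */
        float order2 = (A + C) + B; /* A and C cancel exactly first:    1.0 */

        printf("order1 = %g\n", order1);
        printf("order2 = %g\n", order2);
        return 0;
    }

Same three numbers, two different groupings, two different answers. A parallel reduce that doesn't pin down the grouping can return either one.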
In fact, if you do (A + B + B + B ...) very much, then you will wind up exacerbating the problem --- it can be a source of bugs if you repeatedly increment a counter by small floating point values. For example:
const int Iterations = 10;
for ( int i = 0; i < Iterations; ++i )
A += B;
printf( "%2.4f\n", A );
Astute readers would probably note that 00.01F can't be exactly represented in base 2, and wonder whether something like 00.0125F would still suffer from this problem. Alas, yes (0.0125 = 1/80 has a factor of five in the denominator, so it isn't exactly representable either):
A = 42.9000F
B = 00.0125F
8 iterations == 43.0000
80 iterations == 43.9001
800 iterations == 52.9006
8000 iterations == 142.8820 /* we would expect 142.9 */
float A1 = 42.9000F;
float B1 = 00.0125F;
float A2 = 42.9000F;
float B2 = 00.0125F;
const int Iterations = 4000;
for ( int i = 0; i < Iterations; ++i )
{
    A1 += B1;
    A2 += B2;
}
float A = A1 + A2 - 42.9000F;
printf( "%2.4f\n", A );
You may be tempted to try various things to get around this fact... for example, what about counting from zero rather than from 42.9000? (Answer: the error can be even worse, which is a topic worthy of its own separate post) ... but you will quickly discover that this is a fundamental issue. The order of operations matters.
Also, if you do some kind of map-reduce style work distribution where different nodes perform same operations on different data, it pays off to use a deterministic (pseudo-)random algorithm for work distribution, so that your results can be reproduced.
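A minimal sketch of that idea, assuming a toy LCG and a made-up item-to-node mapping (a real scheduler would hash real work units): fix the seed and the assignment becomes reproducible across runs.

    #include <stdint.h>
    #include <stdio.h>

    /* Simple LCG (Numerical Recipes constants); deterministic given a seed. */
    static uint32_t lcg(uint32_t *state)
    {
        *state = *state * 1664525u + 1013904223u;
        return *state;
    }

    int main(void)
    {
        const int nodes = 4;
        uint32_t seed = 42; /* fixed seed => reproducible work distribution */
        for (int item = 0; item < 8; ++item)
            printf("item %d -> node %u\n", item, lcg(&seed) % nodes);
        return 0;
    }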
There is no such thing as a "floating point reduce operation". You can do a reduction using a floating point calculation as your joining operation, but even in that case, each individual operation is still deterministic. Moreover, a classical reduce is also deterministic, being a fold in one direction or other based on the associativity of the joining operation. The non-determinism comes only from your choice to specify the behaviour of your "parallel reduce" function in a non-deterministic way, and if you do that without understanding the consequences that's hardly floating point's fault.
The MPI standard does not require that reductions use deterministic associativity for the same number of processes (independent of the physical network topology), but they do suggest that it is desirable and (AFAIK) most implementations are deterministic on the same number of processes.
There are also plenty of existing models where accuracy is ultimately assured, but not always maintained internally. Branch prediction, for example. The computer guesses which way the branch will go! This can give a big speed boost depending on the workload and the predictor, and the computer assures accuracy by fact-checking the guess down the line.
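The classic way to watch branch prediction at work is to run the same data-dependent branch over random vs. sorted data. A rough sketch (numbers vary by CPU, and an optimizing compiler may turn the branch into a conditional move and flatten the difference):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    static long sum_over_threshold(const int *a)
    {
        long sum = 0;
        for (int i = 0; i < N; ++i)
            if (a[i] >= 128) /* hard to predict on random data */
                sum += a[i];
        return sum;
    }

    static int cmp(const void *x, const void *y)
    {
        return *(const int *)x - *(const int *)y;
    }

    int main(void)
    {
        static int a[N];
        for (int i = 0; i < N; ++i)
            a[i] = rand() % 256;

        clock_t t0 = clock();
        long s1 = sum_over_threshold(a); /* unpredictable branch */
        clock_t t1 = clock();

        qsort(a, N, sizeof a[0], cmp);

        clock_t t2 = clock();
        long s2 = sum_over_threshold(a); /* same work, predictable branch */
        clock_t t3 = clock();

        printf("unsorted: %ld (%.0f ms), sorted: %ld (%.0f ms)\n",
               s1, (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
               s2, (t3 - t2) * 1000.0 / CLOCKS_PER_SEC);
        return 0;
    }

The sums are identical; only the speed differs, because the guess is fact-checked and corrected either way.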
This is dangerous misinformation that is also being propagated by some managers of the "exascale" programs that seem to have lost sight of the underlying science. Some synchronization is algorithmically necessary for pretty much any useful application. The key is to find methods in which synchronization is distributed, with short critical paths (usually logarithmic in the problem size, with good constants).
The relative strengths and weaknesses of dynamic and static languages are greatly exaggerated. Doing type checks at compile time won't make your code magically bug-free. But neither will delaying type checks until run-time free you from the shackles of the datatype hegemony. The trade-off between keystrokes and CPU cycles isn't really even all that much of a thing anymore, what with jitters closing the gap on one side of the fence, and type inference and generics closing it on the other.
Just not only MPI (at least not without significant extensions).
There are two problems with these visions: one is memory and the other is the interconnect. 1000 cores, even at a modest clock rate, can easily demand 1 terabyte of memory accesses per second. But memory has the same economies as 'cores' in that it's more cost effective when it is in fewer chips. But the chip is limited in how fast it can send signals over its pins to neighboring chips (see Intel's work on Light Peak).
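(A quick back-of-the-envelope version of that demand, assuming for illustration 1 GHz cores that each touch one byte per cycle: 1000 cores × 10^9 cycles/s × 1 byte = 10^12 bytes/s, i.e. 1 TB/s.)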
So you end up with what are currently exotic chip-on-chip types of deals, or little Stonehenge-like motherboards where this smoking hot chip is surrounded by a field of RAM shooting lasers at it.
The problem with that vision is that to date, the 'gains' we've been seeing have been when the chips got better but the assembly and manufacturing processes stayed more or less the same.
So when processors got better the existing manufacturing processes were just re-used.
That doesn't mean we won't at some point in the future have 1000-core machines, it just means that other stuff will change first (like packaging) before we get them. And if you are familiar with the previous 'everything will be VLIW (very long instruction word)' prediction you will recognize that a lack of those changes sometimes derails the progress. (in the VLIW case there have been huge compiler issues)
The interconnect issue is that 1000 cores can not only consume terabytes per second of memory bandwidth, they can also generate tens of gigabytes of data flowing to and from the compute center. That data, if it is being ferried to the network or to non-volatile storage, needs channels that run at those rates. Given that the number of 10GbE ports on 'common computers' is still quite small, another barrier to this vision coming to pass is that these machines will be starved for bandwidth to get fresh data to work on, or to put out data they have digested or transformed.
Yes, computing gets very interesting when what we have is no longer an overgrown IBM 5150.
This was presented by Intel to its enterprise customers as the 'secret sauce' that would give Itanium the edge over SPARC and other 64 bit architectures. They have invested millions in making this effort work.
However, reception of the workflow for the Itanium compiler was mixed at best. On some workloads it outperformed; on others it simply matched. The process for training the compiler, which seemed to me at the time to be an outgrowth of the branch prediction efforts, involved running synthetic benchmarks, collecting information about utilization of the execution units, and then synthesizing new instruction mixes for the applications. The imposition of such a heavyweight process, which needed to be repeated for nearly every change to the code base, worked against the benefits promised. Since code is likely to change often, this led to proposals of waiting until you were 'ready to ship' before optimization and tuning. But once shipped, patches are made, bugs fixed, anomalies corrected. Changing any line of code could create a massive stall in the pipeline and crush performance until the system was re-tuned.
I don't know if that experience was universal, but it was common enough that such stories were everywhere at places like the Microprocessor Forum and other conferences.
But considering that nowadays other stacks which rely on jitting regularly achieve real-world performance that is competitive with much native-compiled software, it seems safe to presume that Transmeta's performance problems stemmed from reasons beyond the basic idea behind CMS.
GPU drivers dynamically recompile a "virtual ISA" into the actual ISA when running a kernel or shader for the first time.
Of course SIMD is not the same as 1K fully independent cores but they're both valid interpretations of parallel computing.
That kind of architecture only lends itself to a certain class of problems but they're a good indication of what can be done to circumvent the interconnect limitations.
A typical GPU hides the latency of access to slow resources by doing more computation in the meantime.
Money quote: "The CM-1, depending on the configuration, had as many as 65,536 processors"
I would suggest that when someone wants to get excited about exascale computing, they review the Connection Machine literature. Manycore is not a radically new concept.
The fact is that the GPU (and the many micro cores) will be consumed by the cpu instruction set in the long run. Yes, that means we're going back to frame-buffers.
I find it curious - and slightly scary - that as the world is stampeding towards increasingly-parallelized computing models, most of us in ASIC design are becoming increasingly thwarted by the limitations of functional simulation - which, by and large - is pretty much single-threaded. I mean, we're supposed to be designing to keep up with Moore, and our simulator performance has pretty much flat-lined. And even more alarming, I've heard very few ASIC people even talk about it.
I'm curious of your take on that.
Our software stack uses adaptive compilation to reconfigurable hardware, so we can identify hotspots in the code that can be compiled to the FPGA at runtime. Eventually we will be able to write and debug the whole ASIC in our software at runtime on the FPGA.
Simulating a single core is not too hard because our microcode processor is small. The ring network connecting cores, caches, memory, and four 10 Gbps off-chip communication channels is harder to simulate though.
Our SiliconSqueak processor is in an FPGA stage right now, we will make the first ASIC version this year. At this time I have no hard numbers on the watt/performance and price/performance of the ASIC. The energy use of the FPGA per result is an order of magnitude below that of an Intel x86 core and we expect the ASIC to be much better.
Regardless of the efficiency of the code, I see no need to re-introduce determinism down the line.
Some older information can be found on http://www.siliconsqueak.org/
"Even lock-free algorithms will not be parallel enough. They rely on instructions that require communication and synchronization between cores’ caches."
Azul's Vega 3 with 864 cores/640GB mem (2008) with the Azul JVM apparently works fine using lock-free java.util.concurrent.* classes and would appear to be a counterpoint to the very premise of the OP.
It is also probably more likely that we will see a drastic rethink of memory managers and cooperation between h/w and s/w designers (kernel/compiler level). Right now, everything is sitting on top of malloc() and fencing instructions. What is more painful: writing non-deterministic algorithms, or biting the bullet and updating h/w, kernels, and compilers? See Doug Lea's talk at ScalaDays 2011 (@66:30)
And this is not to mention anything about the FP and STM approaches to the same issue.
He may be right about the move away from the threads & locks model popularized in the 1990s, but I agree with you that it's not such a great analogy.
Given that this guy seems to think having programs that result in wrong answers is generally acceptable, it's hardly surprising that he also seems to think dynamic languages took over at some point. However, most software written today, and nearly all software written today on a large scale, is still written in statically typed languages, sometimes with controlled dynamic elements. And much of it doesn't need to be parallelised within a process, because it's I/O bound one way or another anyway.
With good reason, too. I submit that when you're writing code like that, it's generally because you're trying to do a CPU-bound operation as fast as you can. Which means that unnecessarily spending cycles on type checks that could have been put by the wayside at compile time is probably not your cup of tea.
Another approach might be to recast every problem as finding a solution in a search-space - and then have as many cores as you like trying out solutions. Ideally, a search-space enables some hill-climbing (i.e. if you hit on a good solution, there's a greater than average probability that other good solutions will be nearby), and for this, it is very helpful to know the result of previous searches and thus sequential computation is ideal. But, if the hills aren't that great as predictors, and if you do feed-in other results as they become available, the many-cores would easily overcome this inefficiency.
An appealing thing about a search-space approach is that it may lend itself to mathematically described problems, i.e. declare the qualities that a solution must have, rather than how to compute it.
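A toy single-threaded sketch of the idea (made-up one-peak objective; in the many-core version each searcher would run on its own core and occasionally share its best-so-far):

    #include <stdio.h>
    #include <stdlib.h>

    /* Made-up 1-D objective with a single peak at x = 3. */
    static double objective(double x)
    {
        return -(x - 3.0) * (x - 3.0);
    }

    int main(void)
    {
        double best_x = 0.0, best = objective(0.0);
        srand(1234); /* deterministic seed, for reproducibility */

        /* 1000 independent searchers; trivially parallelizable. */
        for (int searcher = 0; searcher < 1000; ++searcher) {
            double x = (rand() % 2000) / 100.0 - 10.0; /* start in [-10, 10) */
            double step = 0.1;
            for (int iter = 0; iter < 100; ++iter) {
                if (objective(x + step) > objective(x))      x += step;
                else if (objective(x - step) > objective(x)) x -= step;
            }
            if (objective(x) > best) { best = objective(x); best_x = x; }
        }
        printf("best x ~= %.2f\n", best_x); /* close to 3 */
        return 0;
    }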
Somehow, I don't think that is going to fly.
Additionally, the statement about type checking and program correctness is not really correct.
Let's try another thought experiment. Let's compile a Linux kernel with this beast. We should be happy with sometimes getting the right answers? I am not sure that they have thought this through.
Does anyone remember the early days of MySQL, when it was really really really really fast because it didn't have locks? Some wiser heads said "but it is often giving the wrong answer!" The reply was "well, it is really really really fast!" And we know how that came out.
Perhaps the expected output of this sea of devices is poetry, which in the minds of those on the project, might require less precision. But even there, some poetry does require lots of precision.
Consider sorting data, because that's what computer scientists do…
If you can sort your data on a sea of 1000 processors in O(n log n) with minimal errors (defined as out-of-order elements in the result set), then check the result in O(n) over 1000 processors and fix the handful of problems in a couple of O(n) moves, then you will beat an error-free sort that can only make use of 10 processors because of memory/lock contention.
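A single-threaded sketch of the repair step (my own illustration, not the parent's exact algorithm): one insertion-sort pass over a nearly-sorted array costs roughly O(n + number of inversions), so it's cheap when only a handful of elements are misplaced.

    #include <stdio.h>

    /* "Check then repair": after an approximate parallel sort, an
       insertion-sort pass is nearly O(n) when few elements are
       out of order. */
    static void repair_nearly_sorted(int *a, int n)
    {
        for (int i = 1; i < n; ++i) {
            int v = a[i];
            int j = i - 1;
            /* Shift the few misplaced elements into position. */
            while (j >= 0 && a[j] > v) {
                a[j + 1] = a[j];
                --j;
            }
            a[j + 1] = v;
        }
    }

    int main(void)
    {
        /* Pretend this came back from an approximate parallel sort:
           almost ordered, with two misplaced elements. */
        int a[] = { 1, 2, 4, 3, 5, 7, 6, 8 };
        int n = sizeof a / sizeof a[0];

        repair_nearly_sorted(a, n);
        for (int i = 0; i < n; ++i)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }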
For your bank (which should certainly start charging you a transaction fee on deposits), it might take the form of processing batches of transactions in a massively parallel way with a relatively fast audit step at the end to tell if something went poorly. In this case the repair strategy might be as simple as split and redo the halves.
I kind of agree with Knuth on this, who seems to say that this whole parallelization/multicore stuff may be a result of a failure of imagination on the part of the hardware designers.
I've moved on from dynamic languages to "static typing that doesn't suck" (Haskell).
Will they be common enough? Or, as with dirigibles, will other solutions dominate?
The big trade off has to be power consumption.
If you diminish accuracy, fine. But if your handset dies because some genetic algorithm didn't converge in 3 minutes, that'll be a problem.
My email client is pretty much I/O-bound.
My word processor is perfectly well able to keep up with my typing speed.
There is no reason to try to rewrite any of these to use funny algorithms that can spread work over tons of cores. They generally don't provide enough work to even keep one core busy, the only concern is UI latency (mostly during blocking I/O).
Compiling things can take a while, but that's already easily parallelized by file.
I'm told image/video processing can be slow, but there are already existing algorithms that work on huge numbers of cores (or on GPUs).
Recalcing very large spreadsheets can be slow, but that should be rather trivial to parallelize (along the same lines as image processing).
So isn't the article pretty much garbage?
Your argument is similar. Everything you have on your machine right now works fine as it is. What you aren't taking into account is that the list of things on your machine is defined by the machine's limitations. A new style of computation will open doors and allow things onto your machine that are currently not possible because of limitations in your processor.
Ignoring that, I agree with your point wholeheartedly: just because things work now does not mean we can't optimize. In fact, a very large part of the software field is all about optimizing things that already exist.
And who are you to say that today's parallel algorithms for solving things like these are optimal? To me, it certainly seems worthwhile to consider possible alternate parallelization models.
> My email client is pretty much I/O-bound.
> My word processor is perfectly well able to keep up with my typing speed.
Better speech to text? Better prediction (so you have to type less)? Better compression for attachments? Better grammar checking?
It's clear that web sites are slowly transitioning into a role that is similar to desktop apps, so there will be a million applications that can use hefty computational power.
> Compiling things can take a while, but that's already easily parallelized by file.
Better optimizations? Things like evolutionary algorithms for assembly generation come to mind, which could use any conceivable number of cores. Better static analysis?
> I'm told image/video processing can be slow, but there are already existing algorithms that work on huge numbers of cores (or on GPUs).
GPUs are typically not as good as CPUs for integer math, in particular they're not deeply pipelined, etc. This could conceivably become much faster.
> Recalcing very large spreadsheets can be slow, but that should be rather trivial to parallelize (along the same lines as image processing).
There are a ton of things that could happen here with massive computational power. Spreadsheets that use machine learning techniques to predict things? More powerful models?
> So isn't the article pretty much garbage?
Come on, people have been coming up with novel uses for our ever-increasing computational power for well over a hundred years now (from mechanical computers to octo-core desktop machines). Why would they suddenly stop now?
Each client connection is entirely independent of the other client connections. That's trivial to parallelize (thread pool, Erlang, whatever).
I also have one where the clients are almost entirely independent; that one connects to a database which handles the "almost" part.
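A minimal sketch of how trivially that parallelizes with plain threads (hypothetical client IDs standing in for real sockets; no shared state, so no locks):

    #include <pthread.h>
    #include <stdio.h>

    /* One worker per client; each connection is independent, so the
       handlers never touch shared state. A real server would use a
       bounded thread pool and real sockets. */
    static void *handle_client(void *arg)
    {
        int id = *(int *)arg;
        printf("serving client %d\n", id); /* stand-in for real work */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[4];
        int ids[4] = { 0, 1, 2, 3 };

        for (int i = 0; i < 4; ++i)
            pthread_create(&threads[i], NULL, handle_client, &ids[i]);
        for (int i = 0; i < 4; ++i)
            pthread_join(threads[i], NULL);
        return 0;
    }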
But maybe you're someone who's happy with the cores and algorithms he already has. That's OK with me. There will certainly always be problems where shared-nothing parallelism over commodity hardware is the most cost effective. But not everyone is mining Bitcoin or computing the Mandelbrot set.
I don't think your typical application will have to take advantage of all 200 cores in an explicit way. The use of dynamic languages on the desktop signifies that developers have been willing to throw away performance for development time for a while now. Why is that suddenly going to change, especially with the more difficult concurrent landscape before us?