Linus: Parallel computing is a huge waste of everybody's time (realworldtech.com)
118 points by zurn on Dec 12, 2014 | 128 comments

I read the headline, thought to myself "that can't be right, Linus is too smart for that" — then read the article. And the headline is wrong. So are most of the comments.

He explicitly says "Parallel code makes sense in the few cases I mentioned, where we already largely have it covered, because in the server space, people have been parallel for a long time" — so most of that rant is about desktop and mobile computers, often called "client-side". Where indeed four cores seem to be more than enough for most needs.

On the "server" side, though (I hate the outdated name), the story is quite different. I can easily make use of many cores, especially with tools like Clojure, that make writing correct concurrent programs much easier.

Why do you feel the distinction between client and server is outdated?

It seems pretty clear-cut to me... A "client" computer is one that's designed around single-user interaction sessions. It has human interface devices (display, input) on fast interfaces with prioritized OS-level support (interrupts, graphics acceleration, etc).

A "server" computer is one where interaction happens over a network connection, and multiple sessions are typically taking place simultaneously.

The difference between client and server can be a bit grey, but I agree, the distinctions are important. A client is about mobility (small size), low power use (low heat, small battery), and very low latency, with a focus on the human-computer interaction experience.

A server has fewer power concerns, and is all about throughput. Heat and size are not as relevant (but not irrelevant).

iPhone and iPad are obviously clients. A Mac Pro is a server. A MacBook Air is probably a client (though, ironically, it is far more powerful than the $10K servers I had in 1999). A MacBook Pro falls somewhere in the middle, but leans towards client. An iMac is also in the middle, but leans towards server.

I think the angst (which I have to some degree as well) about the distinction between client vs server is because it's not completely clear how to position the Mac Pro and iMac. In some benchmarks, the iMac meets or exceeds the Mac Pro. But, in terms of sheer throughput for bulk tasks (video rendering), you can get higher top-line performance from the Mac Pro. And, for day-to-day interaction, the 5K display of the iMac (human-computer interaction) beats out the Mac Pro.

[edit: 99.99% of the time, downvotes are immaterial to me, but I'm genuinely intrigued in what the contrary opinion is here - I expect to learn something, please share!]

I certainly didn't downvote you, but I don't understand why you'd consider the iMac or Mac Pro as servers. Very few people use them in that fashion.

The iMac has a 5K display. The Mac Pro has two high-power GPUs. These are not server features because they're designed to provide explosive graphics power for a single user at a time. Using '80s terminology, both those Apple computers would qualify as workstations, IMO...

I think an interesting analogy could be made with physical training. Some athletes train for endurance, e.g. running a marathon. Others train for muscle strength, e.g. powerlifting.

A server is an "endurance-oriented computer" -- its power needs to be distributed evenly over multiple active sessions. One remote client can't be allowed to hobble the server.

In contrast, a client is a "strength-oriented computer". For much of the time, it's sitting idle because the human in front of it is so slow. But when the human makes a decision, the computer needs to do its best to fulfill the task immediately (compositing windows, rendering a web page or 4K video effects, etc.)

It's not as simple as that WRT the Mac Pro. One of the graphics cards can't actually be used to drive displays; it's only there to handle intensive compute tasks.

The Mac Pro also scales to 12 cores. If that's not about distributing power evenly over multiple active 'sessions' (not sure what 'sessions' are supposed to mean in this context), I don't know what is.

I agree about the iMac though. It is a very powerful computer, but it's clearly optimised for tasks requiring immediate responsiveness. The fact that it also happens to have a very powerful CPU is not the factor that is driving the overall design.

The Mac Pro has those capabilities because it's optimized to be a graphics/video workstation, and those tasks are very amenable to parallelization. Those 12 cores are not there to serve 100 different clients per second, but rather to provide maximum render power for the user sitting in front of the computer. (That's what I meant by a single "session".)

Of course you could put a Mac Pro in a server room or data center, but realistically very few people do that... It's just not designed for that.

There are plenty of large graphics/3D/video render farms, and they don't use expensive workstations like the Mac Pro. You get more bang for the buck by going with traditional server form factors. Cool black cylinders with ridiculous amounts of desktop-oriented I/O don't make ideal servers.

Ironically, given your definition, a Mac Mini, which is about 1/4th to 1/8th the power of a Mac Pro, would be considered a server. There are even data centers (MacMini Colo) that are dedicated to shelving Mac Minis for such a purpose.

Perhaps the whole "Server/Workstation/Client" segmentation doesn't make sense after all.

When I try and differentiate a "client" from a "server" - I usually think in terms of "Display/Interact" versus "Processing/Transacting" - the more emphasis there is on displaying, and interacting with the user, the more I consider it a client. The more emphasis there is on "Processing/Transacting" - the more I consider it to be a server.

But, you are right - there is an interesting middle ground that I did not consider - the device that is designed to simultaneously do massive amounts of processing/storing/transacting while interacting with the user. Workstation should probably enter the nomenclature - but then things get very grey - your average 2013 MacBook Air is more powerful than any workstation any engineer in the world had in 1998. Does this mean the only difference between client and workstation is comparison with its peers? And a client today is a workstation 15 (heck, maybe only 10) years from now?

"Client" and "Server" has nothing to do with the form factor of the device. It's a designation of roles. Typically, servers 'provide' and clients 'consume'.


The terms are self-descriptive: servers serve, clients request. So I don't understand why anyone would bring form factors into the equation. Particularly when it's well documented that you can turn anything into a server (old laptops / desktops, developer boards such as the Raspberry Pi, etc), and equally a server can act as a client (eg server-to-server API calls)

Being a server is not about power; a $5 Digital Ocean VPS is a server and it has way less power than the MacBook Air you mention as a client. The only thing about a server is that it serves non-local user(s) - it could be a low-powered ARM-based RPi or a multi-core Intel thing; it really just depends on its workload. The Mac Pro is designed as a client, and the same goes for all current Macs (they don't sell a server anymore), which is how they are typically used.

A server usually offers some kind of service to other computers over a network and runs an OS optimized for that (different scheduler, for example). Neither Mac Pros nor Macbooks are designed for that, so they are clearly not servers.

The notion of client and server is dependent on the kind of application. For a single-player game, an iPad is both a client and a server.

To your comment, (I haven't read the article) every endpoint should have server functionality. Even if it's a phone. That's how the internet should work, IMO.

Totally agree with you in principle, as a person who has always been an ultra-full-stack, jack-of-all-trades developer (lots of browsers, hundreds of tabs, IDEs, servers and the occasional WoW window). The huge benefit of multi-core was the jump from 1 to 2 (mostly to fix the annoying 100% CPU usage with no way to kill). The benefits from 2 to 4 were marginal at best. So we just don't have to climb much more from there. And for really parallelizable loads we have the GPU.

Site seems to be down, saved copy: http://pastebin.com/Z08gtRFj

Should have had a parallel server there..

I actually thought the link was a practical joke until I looked at the comments.

Thanks! It's still down

ty. But a site down for so long these days? Sort of unusual.

Thanks man.

Heavily editorialized title. Actual quote:

> Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics).


> The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.

Edit: "editorialized" meaning "choosing the most extreme quote of a rant"

As zurn pointed out, it's not taken out of context at all. Right before your excerpt comes:

"The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage."

The excerpt talks about widespread beliefs about how parallelizing everything is supposed to be good. OP's title suggests parallel computing in general. How's that not out of context?

True, that is in there.

But in a Linus rant, the most inflammatory quotes are not necessarily what he is trying to say. (That statement is also editorializing, of course)

I intuitively tried to come up with more examples to counter that claim, but all I can think of on the client side is:

- 3D Graphics

- 2D Graphics (Photoshop, video editing and encoding)

- Web Browsers

This is where we really use parallel computing. Maybe compilers, file compression and PAR2 on top of that. What else? So there's some truth to that.

As for the other argument: that a small number of complex OoO cores is better for client PCs than a huge number of cores with slow single-thread performance should be obvious, imho.

How many processes do most users have on their desktops? Hundreds. I have 397 on my OS X laptop, and many of them are multithreaded. Firefox alone is using 61 threads.

Dedicated compression/decompression threads, threads for encryption, threads for sound, threads for... well lots of things. Remove the caches and you get more consistent performance. Dedicate tasks to their own core, and they can use that cache for themselves. They don't have to swap it out every time.

- Science...

And those processes and threads will run faster on a four-core machine than on an equivalent eight-core machine. Keep in mind, going from four cores to eight cores does not buy you any extra transistors - it's the same number of transistors. What it does do is introduce extra cache contention, and likely smaller caches. It also introduces extra layout complexity.

Most laptops / desktops perform much better with at least two cores than with one core, particularly if you have a CPU-hogging process - your UI can still remain buttery-smooth and interactive. There is even some argument to be made for going from two to four. (A background application like a render, plus another app doing something nasty killing your CPU - you can still launch Activity Monitor and kill the errant task.)

But it's not clear to me that there's any value for the client in taking a four-core machine and breaking it up any further, as opposed to taking the increased transistor budget and improving those existing four cores.

Remember - the tradeoff is not "do we want more, faster transistors" - of course we do. Rather, the tradeoff is: do we make smaller, lower-cache cores, or do we make larger, bigger-cache cores?

The evidence tends to suggest for the near future, that bigger cache, faster cores are the way to go on your average desktop/laptop.

> taking the increased transistor budget and improving those existing four cores

There is not much room for improvement left. ILP has been stagnating for quite a while, and there is no hope of improvement with the current ISAs. More OoO-friendly ISAs are being developed, but are unlikely to hit the market any time soon.

Caches are also approaching the upper limit, and for the bigger caches we need either some much smarter cache management techniques (explicit prefetching, etc.), or a totally different programming model (e.g., using a flat scratchpad memory explicitly instead of transparent caches).

Core count grows for a reason - there is not really much stuff we can put into a single core any further, not without breaking the whole architecture.

In order to utilize those increased cores, though, developers will need to write code that can take advantage of them. Is there any indication of that happening? That's the second angle on this - even though writing parallel/concurrent code might in theory improve performance, it won't help if people continue to write single-threaded applications for everything except IO/Graphics.

I.e., it might still be more useful to build bigger 4-core processors, simply because all the developers and code out there are designed to take advantage of them - and a theoretically better 6-core processor would just end up with 2-3 cores always idling.

> developers will need to write code that can take advantage of them

In the aforementioned case of parallel compilation of, say, JS, it's only a tiny minority of developers who need to be able to do so.

> It might be still more useful to build bigger 4-core processors

It would have been great if we could build any bigger cores. Unfortunately, there is not much scope for improvement left, not unless we ditch the existing ISAs and programming models.

But surely it does buy you extra transistors in practice, since the number of cores tends to increase only every few years - meaning the additional cores come with new technological advances and are how the new transistors are put to use.

His argument is that running each of those 400 threads on its own tiny simple core will be a lot slower than splitting them up over 2-4 really fast cores. Even assuming a generous power budget of 100W, that leaves only 0.25W per core, and that doesn't give you much processing power. It's probably a better use of time to focus on writing better context switching and multitasking algorithms for 2-4 core CPUs.

>Firefox alone is using 61 threads.

If you're getting this number from Activity Monitor, then I doubt it has any relevance to the current conversation.

Activity Monitor tells me Emacs is currently using 4 "threads", and Emacs is famously non-threaded.

The "threads" mentioned by Activity Monitor probably have something to do with OS X's being implemented on a Mach microkernel rather than the kind of threads that would matter for this conversation.

I have 150 processes on my Linux laptop, but very rarely are more than 4 of them runnable at the same time.

Most of those will be waiting on IO.

But I sort of agree with you (see my long rant elsewhere in this thread).

Another thing is how many of these threads are not blocked waiting for something to happen.

Web browsers don't belong on the list yet. They have made some fits and starts in that direction; Servo is one attempt.

Yeah, it's not quite there yet, but I included them because e.g. in Chrome we have web workers, separate processes for plugins, a separate renderer process, separate processes for tabs etc., and multithreading via NaCl. Maybe a corner case.

Chrome is using separate sandboxed processes for the security boundaries, and it actually costs some performance. Tabs get mostly suspended when they're not active.

This is really the gist of the parallelism problem: it only helps when applied to the bottlenecks. Browsers aren't really using it to lay out single pages faster, or to run JS faster.

Yeah, there's some work on parallelism in rendering and GC in production browsers too but it's so far nibbling at the edges. Long way to go to get even 2x speedup compared to single core.

Actual quotes:

> The whole "let's parallelize" thing is a huge waste of everybody's time.


> Give it up. The whole "parallel computing is the future" is a bunch of crock.

Actual quotes, but out of context. Make the title "Linus: Parallelizing everything is a huge waste of everybody's time", and that's what he's actually talking about.

Can a mod edit the title to something less inflammatory?

Heavily editorialized title.

lkml in a news post is like clickbait for a certain subset of the techy crowd. I think we need a collective understanding that if someone other than LWN is reposting lkml, it's just drama-mongering, pot-stirring or muckraking.

I think it attracts people because it carries the possibility of being something significant given the crowd there and kernel itself. But it's just the mundane day to day of people working on subject matter too dry for one to subscribe to in the first place.

This is the forum at David Kanter's Real World Tech site where people come to discuss computer architecture (think comp.arch heir), not LKML.

The site was down, and the pastebin mirror was just an email message, so I assumed LKML. I think the point still stands though.

This reads as a fairly direct attack on the idea that many-core architectures such as the Intel Xeon Phi (which does exactly that: replaces 8 powerful Xeon cores with 60+ slower ones per socket) will ever become the norm on the client side.

It's an interesting argument but rests upon all new algorithms (he brings up machine vision as an example) having dedicated hardware. Ultimately, sure, but there's still a hell of a gap between viable algorithm and dedicated mobile-ready hardware. If the pace of invention slows I'd agree with Linus.

I think the pace of invention will continue to accelerate and parallel processing on the client will be a valuable resource to have.

> This reads as a fairly direct attack on the idea of that many core architectures such as the Intel Xeon Phi

Whoever's pushing that should read the overview page for Intel Xeon Phi [1]:

> While a majority of applications (80 to 90 percent) will continue to achieve maximum performance on Intel Xeon processors, certain highly parallel applications will benefit dramatically by using Intel Xeon Phi coprocessors. To take full advantage of Intel Xeon Phi coprocessors, an application must scale well to over 100 software threads and either make extensive use of vectors or efficiently use more local memory bandwidth than is available on an Intel Xeon processor. Examples of segments with highly parallel applications include: animation, energy, finance, life sciences, manufacturing, medical, public sector, weather, and more.

Following that is a picture that says "pick the right tool for the job".

[1] http://www.intel.com.au/content/www/au/en/processors/xeon/xe...

So, it's a rebirth of Larrabee?

(Essentially an Intel "GPGPU")

Yes, although the graphics card version (which was to be the first product based on the design) got cancelled, they persevered with the chip design itself.

Yes, it's essentially Larrabee without the framebuffer.

OK, I admit I'm commenting based on a pastebin page purporting to be a comment from Linus on RealWorldTech and I can't get to the forums to read the actual content... but having said all that, I don't agree.

While your characterization of Phi as "replacing 8 powerful Xeon cores with 60+ slower ones per socket" is sort of/mostly fair, Phi isn't really the dumb cacheless cores that are mentioned. Moreover, Phi very much is a niche product, designed specifically for a market that is not exactly enjoying substantial performance gains from the last few generations of mass-market Xeons.

Instead I thought of some other products, like the 8-core ARM SoCs for phones and other mobile devices - which, let's be honest, are much less than a well-designed and well-implemented SoC ideally suited for the use they're being marketed for.

The Adapteva Epiphany chips (available on their Parallella boards) fit this description almost perfectly, as do Kalray and GreenArrays chips. Though I guess one could make the argument that they're all very much niche products.

Anyway, I'm not sure we've really got enough information to comment further.

I make 'parallel' stuff for a living and I sort of agree. Often optimizing your code is a better choice than going parallel. Compact code can fit into a single CPU cache, which brings a huge performance boost.

On the other side, parallel programming does not have to be hard. The Fork-Join framework in Java and the parallel collections in Java 8 are trivial to use and scale vertically pretty well.
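To make the "trivial to use" claim concrete, here is a minimal sketch (my own example, not the commenter's prototype) using Java 8 parallel streams, which hand the work to the common Fork-Join pool behind the scenes:

```java
import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        // The same reduction, sequentially and in parallel; .parallel()
        // splits the range across the common ForkJoinPool automatically.
        long sequential = LongStream.rangeClosed(1, 10_000_000).sum();
        long parallel = LongStream.rangeClosed(1, 10_000_000).parallel().sum();
        System.out.println(sequential == parallel); // same result either way
    }
}
```

Whether this actually scales depends on the workload; a reduction this cheap is mostly bound by memory bandwidth, not core count.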

And finally, I think there is no actual demand for parallel programming. 99% of computers have 4 CPUs or fewer. GPUs are useless for most tasks. I have a prototype of my program which scaled well to 20 cores, but nobody is interested.

I don't like predicting the future.

> So give up on parallelism already. It's not going to happen.

> End users are fine with roughly on the order of four cores,

> and you can't fit any more anyway without using too much

> energy to be practical in that space.

End users were fine with a single-core Pentium 4 on their workstation. We progressed. How would even Linus know that we won't find a way to make parallelism work en masse?

First of all, I think many people are still fine with single-core Pentium 4 workstations, and what we have today is not that much better. For example, the "death of the PC" was greatly exaggerated by the simple fact that people aren't upgrading so often, because their 4-6 year old workstations are still good enough for most purposes.

Of course, many of us do need the extra power. But what Linus is saying and I agree with him, is that for mobile devices (phones, tablets, laptops), Moore's law doesn't work so well, as batteries aren't keeping up with Moore's law. A mobile device that doesn't last for 2 hours of screen-on usage is a completely useless mobile device (and here I'm including laptops as well).

End users were fine with a single core Pentium 4 on their workstation.

Not really. 2-4 cores have been available on workstations for decades, so no one is arguing that 1 core is all you need. Even Linus is saying that having 4 cores is probably a good thing in many cases. The argument is not 1 vs 4, but more 4 vs 64, especially if you assume a fixed power budget.

I'm not saying they would be fine now, but once (10 years ago), that was what you had. And now we have 4-8 cores on our laptops, and 2-4 on our phones. Why shouldn't we have 64 cores in the future if we solve the programming problems, and can make them energy efficient? Just because 4 cores are fine now, doesn't mean we shouldn't try to increase that.

but once (10 years ago), that was what you had.

No it wasn't. Multi-processor Intel-based workstations have been available since the very early '90s. People have realized for a very long time that having 2-4 cores is useful.

I'm still not convinced that, given that I have X Watts to spend, that I'm not better off with 4 CPUs using X/4 Watts each rather than 64 cores using X/64 Watts each. But I'm willing to be proven wrong.

Really? Do you mean people chained together multiple processors, or that Intel produced something? Because I can't find it, and I would be interested in reading about those old systems.

Intel was relatively late to the game, their multiprocessor support started getting decent around the Pentium Pro. The Unix workstation vendors (Sun etc) had dual CPU workstations a while earlier, but mostly SMP was used in servers.


Feather in the hat for first multi-core CPU on single die goes to IBM and the Power4 in 2001, preceding Intel's attempt by ~4 years. (Trivia: IBM also sold a Power4 MCM with 4 Power4 chips in a single package).

(Yes some people managed to stitch together earlier x86 processors too with custom hardware, but it wasn't pretty or cheap or fast).

Sequent had proprietary multi-processor 386 systems. The Intel MultiProcessor Specification dates back to 1993. Most Pentium II chipsets supported dual processors, which drove pricing down enough for enthusiasts to build them for home use.

There were a handful of companies making 2-8 socket motherboards for 486 processors. I know Compaq and NCR were early in offering workstations based around those motherboard designs.

I understand Linus's position.

We have used parallel computing quite extensively for: 1) 3D and 2D vector graphics, 2) audio processing, 3) image processing, 4) all of the above (video).

We could parallelize something to be more than 100 times more efficient (ops per watt) than on the CPU (*). But proper parallelization comes at a cost: efficient memory management is hell.

I mean, people are afraid of C's manual memory management; that is nothing compared with the complexity of parallel memory management. You need semaphores or mutexes to access common memory, but the most important thing is that you need to make as much memory as you can independent of the rest, replace sequential steps, etc...
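One way to picture the "make as much memory as you can independent" advice is `java.util.concurrent.atomic.LongAdder`, which keeps per-thread cells internally and only combines them on read, instead of funnelling every update through one contended location. A small sketch (thread and iteration counts are made up for illustration):

```java
import java.util.concurrent.atomic.LongAdder;

public class IndependentMemory {
    public static void main(String[] args) throws InterruptedException {
        // LongAdder keeps per-thread cells ("independent memory") and only
        // sums them when read, avoiding a single lock all threads fight over.
        LongAdder total = new LongAdder();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) total.increment();
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(total.sum()); // 4000000
    }
}
```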

So if the reason for making the kernel parallel is a 10% THEORETICAL increase, forget about it; 10% is nothing given the complexity you have to add.

(*) Power consumption normally increases with the square of frequency, so by using more cores instead of higher frequency you get very efficient. The brain itself runs very slowly but with a huge number of cores (neuron clusters).
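The footnote's efficiency argument can be made concrete with the textbook dynamic-power relation for CMOS (a rough model, not from the thread; the quadratic the footnote mentions is the voltage term, and the cubic below additionally assumes voltage must scale roughly with frequency):

```latex
% Dynamic power of CMOS logic: activity factor \alpha, switched
% capacitance C, supply voltage V, clock frequency f
P_{\text{dyn}} \approx \alpha C V^{2} f

% If supply voltage must rise roughly with frequency to meet timing,
% i.e. V \propto f, then power grows superlinearly in f:
P_{\text{dyn}} \propto f^{3}

% Two cores at f/2 (and correspondingly lower V) can then match the
% aggregate throughput of one core at f for about a quarter of the
% power -- if, and only if, the workload parallelizes:
2 \cdot \alpha C \left(\tfrac{V}{2}\right)^{2} \tfrac{f}{2}
  = \tfrac{1}{4}\, \alpha C V^{2} f
```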

In the case of software which is widely used a complexity increase matters little. Imagine what companies like Google or Amazon save with a 1% improvement in their data centers. They could employ 10 more kernel developers to deal with the complexity in return.

Consider that we have parallel stupid small cores, and have had them for years: most hard drive controllers have full CPUs on board these days, and there are full CPU cores all over the place outside of our "view". E.g. consider things like the Transcend Wifi SD cards with ARM SoCs on board. You find embedded CPUs "hidden" in all kinds of PC hardware these days.

The PC architecture started out being hampered by the CPU + dumb peripherals architecture, but this was a big departure from the norm during that era. In the "home" market, most machines either had CPUs too weak to drive the peripherals (my first "parallel stupid small cores" system was a Commodore 64 + a 1541 disk drive - you could, and people did, and wrote books about it, download your own code to the 1541 and do computations on it) or explicitly used CPUs or co-processors all over the place to offload things, like the Amiga.

My A2000 of course had the 68000 (and later a 68020 accelerator), but it also had a 6502-compatible core running the keyboard and a Z80 on the SCSI controller card, on top of the Amiga's custom chips, which included the copper, blitter and sound chips that all had limited programmability. The irony is that it perhaps made a lot of us overly gung-ho on the 680x0 line (though it does have a beautiful instruction set), because our machines felt so much faster than comparably clocked PCs - hence many of us were OK with not upgrading CPU models and clock rates as fast as in the PC world.

It was only when PCs started sprouting co-processors (graphics cards and sound cards with advanced capabilities first) that the Amiga truly lost its edge (at the same time, Motorola failed to deliver fast enough versions of their newest 680x0 models, though faster CPUs alone would have been insufficient) - until then, the co-processors and the philosophy of offloading everything possible had compensated for the by-then anaemic average CPU speeds.

Though the 3rd party expansions gave one more fascinating multi-processing step: Systems that would let you run 680x0 and PPC code on the same machine (PPC cpu on the expansion card, 680x0 on the motherboard).

PCs have been steadily sprouting more small cores. They're just not as visible.

On the server side it is extreme: I have single-CPU servers at work where the main CPU may have 6-8 x86 cores, but where there may be 30+ ARM or MIPS cores when you tally up the hard drives (dual- or tri-core in many cases), RAID controllers, IPMI cards, some networking hardware, etc. But these cores are getting so cheap and so small that we should expect to see continued offloading of more functionality.

On the Amiga, this was our day-to-day reality. SCSI was always favoured, for example, because it was lighter on the CPU than IDE, since it offloaded more logic (and hence was more expensive, hence the cut) - thus the universal disdain for the guy who forced engineering to put IDE in the last Amiga models, the A4000 and 1200, which helped cripple them at a time when they were lagging in overall CPU performance; the man in charge at Commodore at the time was the guy who had been responsible for the PCjr disaster.

Rather than being exposed to the parallelism, expect that we'll see more higher-level functionality being subsumed into peripherals, so that the main CPU gets to focus on running your code. E.g. there is TCP offload functionality on some high-end network cards.

Just like in the architectures born in the '60s and '70s out of necessity, because of wimpy CPUs...

The "one CPU to rule them all" PC is an aberration, born out of a time when CPUs where getting to a performance level where it was possible, and where the norm of single-tasking OSs for personal computers meant it for a short few years seemingly made little difference if the main CPU spent all its time serving disk interrupts while loading stuff.

The norm in computing has been multiple parallel cores. Even multiple parallel CPUs. Often on multiple buses.

And over the last few years we've gone full circle.

I agree with everything you've written here (mostly) but I think you are dodging the argument Linus is making. He isn't (at least in this post) discussing the value of offloading certain processes (indeed, he highlights GPUs and Vision Processing as great places for offloading).

Instead, he talks about general processing tasks as not being great targets for parallelism.

The point is "general processing tasks" tends to include lots of very generic parts that are great to offload.

E.g. go back to the 80's and it was not uncommon for loading data to involve having your main CPU load data a bit at a time via a GPIO pin, during which time your application code was blocked until the transfer was complete. Then a byte at time. Then a block at a time. Then suddenly we got DMA.

These days we expect other threads to go on executing, and application code that wants it will expect async background transfer of data not to consume all that much CPU.

Here's a contrived example of possible many-wimp-core offloading for you (that would be complicated, but possible):

Consider an architecture where many wimpy cores can "hang" on reads from a memory location, and start executing as soon as there's an update. A "smart" version would use a hyper-threading type approach to get better utilization.

Congratulations, you now have a "spreadsheet CPU" that automatically handles dependencies (it would need some deadlock/loop resolution mechanism) and does recalculation as parallel as possible, given the combination of data dependencies and number of cores/threads.

It's also incidentally an architecture where caches would be a nightmare, and where the performance of individual cores would not be a big deal.

Of course, not many of us have spreadsheets where recalculation time is an issue, and it's easy to handle recalculation on a single big CPU too. Where the sweet spot is in terms of power usage, latency and throughput is hard to say, though.

(not saying it'd necessarily be a good idea, but I now want to do a test implementation on my 16-core Parallellas just because).

But PC users dismissed the parallelism of the Amiga too. Until Windows 95 had proper multi-tasking and they all had graphics and sound cards. Suddenly they saw the value.

I don't think we'll know how far we can push offloading without trying.

I think the line between "offloading" as opposed to "Breaking up General Purpose task" is clearer than you are making it out to be.

IO controllers, Graphics, Sound, are all obvious targets for offloading.

Perhaps the grey area (which is much more specific than the things you are talking about) is things like TCP Offloading (TOE) - Linux currently seems to be opposed to the concept.


Anything that takes time and that can be farmed out without creating lots of contention over access to the same memory is an "obvious target for offloading".

Consider that many people were arguing that IO, graphics and sound offloading was totally unnecessary even in the face of seeing what it did for architectures like the Amiga until costs came down for it and CPU speeds remained unable to do the stuff that the offloading made possible.

IO in particular seemed pointless to many people: After all you're still going to wait for your file to load, aren't you? But loading data can often be made into a massively parallel task: For starters, you can widen the amount of data transferred with each unit of time. But secondly: you rarely just want to dump your data into memory; you usually want to do some processing on it (e.g. build up data structures).

AmigaOS went further, and demonstrated that there were architectural benefits from increased OS-level parallelism via multitasking even for basic stuff like terminal handling. One of the reasons the Amiga felt so fast for its time was that the OS pretty consistently traded total throughput for reduced latency by removing blocking throughout. E.g. the terminal/shell windows on the Amiga typically involved half a dozen tasks (Amiga threads/processes - no MMU, so not really a distinction) at a minimum: one to handle keyboard and mouse inputs, one "cooking" the raw events into higher level events and responding to low level requests for windowing updates, one handling higher level events and responding with things like cursor movements, the shell itself, and one mediating cut and paste requests (which again would involve multiple other tasks to store the cut/copied data to the clipboard device - usually mapped on top of a ram disk, but it could be put on any filesystem, potentially involving even more separate tasks).

Many of the "primitives" of that kind of architecture can be offloaded:

The process appears largely sequential, but it has numerous points where it's possible to do things in parallel, and more importantly: even sequential operations can be interleaved to a large extent, so you can start processing later events sooner. Many contemporary systems appeared laggy in comparison despite higher throughput because AmigaOS interleaved so many operations by processing smaller subsets of the total processing in small slices in many individual tasks. While you can do that by simply taking the CPU away from a bigger tasks doing everything in parallel, that takes control over the "chunking" away from the developer. I did some work on the AROS (AmigaOS reimplementation) console handling a few years back, and it was amazing how much difference tuning the interactions between those components affected responsiveness (running on the same single core of an x86 box).

The limit is whether 1) you can do things faster (reduce latency) - if you can do things faster with offloading, it's a candidate, 2) if your main CPU has other stuff it can do while waiting, if so, you have a candidate.

Consider the complex font engines we run these days, for example. Prime candidate for offloading, because it's largely a pipeline: "Put this text here rendered with this font", and we usually render a lot of text with a small set of fonts. We treat it as a sequential task when we usually we can interleave it with other work and just need to be able to have sequencing points where we say "don't return until all the rendering tasks are complete".
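The pattern described - fire off rendering jobs, keep working, and only block at a sequencing point - can be sketched like this (`render_glyphs` is a made-up stand-in for a real rasterizer):

```python
# Queue text-rendering jobs on worker "cores" and only block at the
# point where the results are actually needed ("don't return until
# all the rendering tasks are complete").
from concurrent.futures import ThreadPoolExecutor, wait

def render_glyphs(text, font):
    # Placeholder for expensive shaping/rasterization work.
    return [(font, ch) for ch in text]

with ThreadPoolExecutor(max_workers=4) as pool:   # the offload cores
    jobs = [pool.submit(render_glyphs, line, "Mono")
            for line in ["hello", "world"]]
    # ... the main thread keeps doing other work here ...
    wait(jobs)                   # sequencing point: all rendering done
    rendered = [j.result() for j in jobs]

print(len(rendered))  # 2
```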

We can do this with multi-core architectures, but it's hard to do it efficiently without extremely cheap context switches (which are hard to do if you do it as a user-level task running under a memory protected general purpose OS), and we rarely want to dedicate cores of our big expensive CPUs to tasks like that.

Have an array of cheap, wimpy cores, and it becomes a different calculation.

> The PC architecture started out being hampered by the CPU + dumb peripherals architecture

So it's a desktop scale retelling of the mainframe (centralized processing with passive terminals) architecture cost reduction benefits.

Also people willingly put a sequential abstraction on top of event based systems, its all signals, interruptions below. Despite this, events do reappear in kernel, user space and applications ...

Sort of.

But mainframes were/are centralized, but not sequential/single threaded apart from the very earliest systems. On the contrary, one of the key aspects of typical mainframes is heavy reliance of offloading - e.g. dedicated IO processors.

Basically we have "single big core" computing usually mainly in niches when/where there are for short periods an intersection between sufficiently high performance CPUs to be desirable, and sufficiently high cost not to justify putting cheaper CPUs or dedicated co-processors around to offload, and either operating systems that are single tasking, or where there's sufficient control to not expose the user to delays (e.g. embedded use).

The moment you can't easily accelerate a system (within cost constraints) by adding faster big cores, the number of small, wimpy cores in the system tends to start adding up very quickly.

True, I didn't mean the internal[1] arch of a mainframe, more the 'network' arch.

Your last sentence summarize the regular cycles between centralized and decentralized trends.

[1] Not long, I found some old IBM Z specs, and was surprised how many coprocessors were there.

Totally agree with Linus. These >10-core smartphones etc. look like just a marketing trick; not that the user would feel any meaningful difference. Same goes for PCs -- usually you won't feel any difference between running, let's say, a 4- and an 8-core CPU (of the same architecture), except in synthetic benchmark tests that don't have much to do with actual performance. Of course, there are some corner cases where it makes sense (scientific calculations, heavy graphics, simulations, compiling etc.), but a common user does not benefit much.

Compiling is not such a corner case. Even non-developers, "common users", often enjoy the source-based Linux distributions, things like MacPorts, and the like.

Even so, I doubt it's an everyday task for them. Also, most popular open source projects provide ready binaries of the stable versions for such users. Still, I can hardly imagine my grandma compiling some open source project for her own use.

Almost every time I install or update something from MacPorts, it ends up compiling; binaries are very rarely cached. I doubt my setup is unusual (the latest OS X and the latest Xcode).

And I suspect that, for example, JS engines in the browsers are going to be more and more parallel. And that kind of compilation is something that every user does on a daily basis. More cores => smoother web browsing experience. Not that I approve of this whole thing that's going on with the JS craze, but it's a fact everyone has to cope with, unfortunately.

More cores => smoother web browsing experience.

When going from 1->4 cores, agreed. When going from 4->16->64 cores I'm not convinced. Especially if those 64 cores are slower and have less cache.

My point is that because of the stagnation of ILP and cache size, the number of cores will increase anyway, so we'll have to find a way to utilise that resource. Of course, trading single core performance for the number of cores is pointless in most of the desktop use cases.

I think what people seem to not understand about neural networks is memory locality. Neurons pass messages directly between each other, so they always retain their own memory state, which is very different from a CPU or a GPU. CPUs and GPUs have a shared memory architecture, and thus memory access is generally slower overall, because of how they are designed for programming simplicity. Don't ever forget that computer speed is never about calculation, but always about memory access. A computer's speed is always limited by its L1 and L2 cache size.

If you want to draw one image, GPUs are fine, but they are not really able to simulate neural networks because they don't have as much parallelism or memory locality as a neural network needs. GPUs are more parallelized than a CPU, yes, thanks to OpenCL. But they're still specialized towards image blitting, not towards massive parallelism.

Neural networks are not very hard to understand, but if you think about simulating one, you quickly understand that most hardware is just not adequate and is unable to run a neural network properly; it will just be too slow. Computers were never designed or intended to simulate something like a brain.

Also don't forget that most algorithms we use today are not parallelizable; most are sequential. We could rethink many algorithms and adapt them, but parallelism is often a huge constraint. Sequentiality is a special case of computability if you think about it.

Absolutes are dumb. Practically, a CPU is throttled on memory access, but it is not hard to imagine scenarios where the computation is the bottleneck. For instance, if most of the program runs in tight loops, where cache misses are few and the whole loop can fit in L1, almost the whole overhead would be in computation.
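The two cases can be sketched roughly like this, with the caveat that in CPython interpreter overhead swamps much of the cache effect; the access patterns are what matter:

```python
# Compute-bound vs memory-bound: the first loop does all its work on a
# handful of values that stay in registers/L1; the second chases data
# scattered across a large array, so its cost is dominated by memory
# access rather than arithmetic.
import random
import time

N = 1_000_000
data = list(range(N))
order = list(range(N))
random.shuffle(order)          # random access pattern defeats the cache

t0 = time.perf_counter()
acc = 0
for _ in range(N):             # tight loop, tiny working set
    acc = (acc * 31 + 7) % 1000003
t_compute = time.perf_counter() - t0

t0 = time.perf_counter()
total = 0
for i in order:                # same iteration count, cache-hostile reads
    total += data[i]
t_memory = time.perf_counter() - t0
```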

Have you looked at the Parallela / Epiphany from Adapteva?

Their current top model is 64 cores, but their roadmap is targeting 4K cores or more per chip.

They have local (on core) memory, and a routing mesh to let all cores access each other's memory - either for communication or for extra storage - and optionally access the host system's RAM. The chips also expose multiple 10Gbps links that can be used to connect the chips themselves into a bigger mesh.

You "pay" for accessing remote memory with additional cycles latency based on distance to the node you want to address, so it massively favours focusing on memory locality.

I have two 16 core Parallelas sitting around, but have had less time than I'd hoped to play with them.

That seems really interesting, I still wonder about the latency of the board connector, but 6GB/s is really great.

Although the real issue with this board is being able to program it effectively.

Yeah, the current boards are very much a means for people to experiment with the architecture first and foremost. I'm very curious to see what will start to happen once they get one step further up from the 64 core chips and start getting more per-core memory... Even more so if they get it onto a PCIe card so I can stick one in my home server to play with instead of yet another small ARM machine (I have a drawer full of ARM computers at this point).

I'm more interested about inter card latency...

>Their current top model is 64 core.

Sadly it isn't. They only made prototypes for the Kickstarter backers; it isn't mass produced or available.

Yes, but human brains are not random access -- computer memory is. Your analogy holds over distributed networks, where transmission time and data loss are positively correlated with distance. Your attempt to compare human neurons to a CPU core, however, has little value.

Neural nets != Human brains

I can't seem to get access to the site, "Database Error". Did we just cause the site to crash?

Naw. Wordpress on a budget VPS caused it to crash.

Maybe if his database had some kind of parallelism...

Or caching...

Or load balancing

Parallel programming is a waste of time, obviously.

And I was wondering in the morning, who broke RWT :)

I found the last paragraph the most interesting one:

"[Parallelism] does not necessarily make sense elsewhere. Even in completely new areas that we don't do today because you cant' afford it. If you want to do low-power ubiquotous computer vision etc, I can pretty much guarantee that you're not going to do it with code on a GP CPU. You're likely not even going to do it on a GPU because even that is too expensive (power wise), but with specialized hardware, probably based on some neural network model."

Esp the last bit. I wonder what he means by "specialized hardware, probably based on some neural network model". These ones? http://www.research.ibm.com/cognitive-computing/neurosynapti...

I'm doing Machine Learning / Deep Neural Network research, and of course, parallelism is very important for us. Our chair mostly trains on GPUs at the moment, but the big companies use far more parallelism and much bigger model sizes, e.g. look at DistBelief from Google.

No, it's not a huge waste. Maybe it is in the short run (<5 years), but there isn't any other way forward (quantum computing is not something we can count on for a while). Yes, with the current architectures and state of compilers it is kind of hopeless, but that doesn't mean we should give up on it.

The point is that spending huge effort turning general-purpose programs into parallel programs is a waste of time. It's better to write simple single-threaded, single-purpose programs to run on dedicated cores. Since we will get more cores, offloading a single-purpose program to a dedicated core is pretty cheap.

Parallelism usually occurs at a very coarse-grained level. Forcing fine-grained parallelism at the language level is not very productive. The sad state of parallel support in languages and compilers is probably because we are pursuing the wrong goal.

We can scale horizontally (more machines) as well as vertically (faster machines). If reasoning about more machines is easier than imagining faster machines (implicitly here: ones with more cores), then going the horizontal route could be better.

The problem is exactly that we can't scale vertically. Look at single threaded performance charts -- they're already far below Moore's Law. STP/watt is increasing quickly, but peak performance per core isn't really. (It is, but the second derivative of performance vs time is already negative.)

"Four cores ought to be enough for anybody!"

my thought exactly. Unless we see some big breakthrough in GHz for single cores, more cores will come, and be properly used, in time.

> more cores will come, and be properly used

This is usually not possible; see Amdahl's law :(
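For reference, the law the comment cites, as a two-line sketch: if a fraction p of a program can be parallelized, the best possible speedup on n cores is S(n) = 1 / ((1 - p) + p / n).

```python
# Amdahl's law: the serial fraction (1 - p) bounds the speedup no
# matter how many cores you throw at the parallel part.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel code, 64 cores give well under 64x:
print(round(amdahl_speedup(0.95, 64), 1))   # 15.4
```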

Amdahl's law is almost useless in practice. If a resource comes, people will consume it. If they can't make their code run faster, they'll use it to run more code at the same time.

Yet, all that seems to miss Linus's point.

Maybe he means more like "four cores is as good as it gets" (for certain domains).

We can always want things to get harder, faster, stronger, but delivering on that is another matter.

I cannot access the article because of an error "Error establishing a database connection". Perhaps if there were parallel computing resources set up to handle increased load, I could read Linus's argument against parallel computing.

>Where the hell do you envision that those magical parallel algorithms would be used?

Parallel sort. Parallel n-ary search. Parallel linear search. A stupid core for the zero-page thread. Hardware BLAS. Time-stepping multi-object systems.
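The first item on that list is simple enough to sketch: split the input across worker processes, sort the chunks independently, then merge the sorted runs (`parallel_sort` is an illustrative name, not a standard API):

```python
# A minimal parallel merge sort: sort chunks in parallel across
# processes, then do a k-way merge of the sorted runs.
from concurrent.futures import ProcessPoolExecutor
from heapq import merge

def parallel_sort(xs, workers=4):
    chunk = max(1, len(xs) // workers)
    parts = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        sorted_parts = list(pool.map(sorted, parts))   # parallel phase
    return list(merge(*sorted_parts))                  # sequential merge

if __name__ == "__main__":
    print(parallel_sort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Note that the final merge is still sequential, which is exactly the Amdahl's-law-style serial fraction discussed elsewhere in this thread.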

i really can't get where his comments are directed. is it the 6-core or 8-core marketing ARMs? very high core count niche processors? then he says on the server side we already have it. what is it that we have?

i don't know the optimum number of cores in mobile CPUs. when the phone is in your pocket and not used, a very low power core can keep the phone going even with background notifications.

What Linus means is that "You guys are so stupid to write parallel code." Learn physics first!

Site seems down. Umm... should have run some parallel instances and loadbalanced?

He simply has to be talking about the kernel, not parallelism in general.

It says "Error establishing a database connection".

same here.

Of course Linux wouldn't run on a parallel system, so "four cores ought to be enough for everybody". Linus is correct, as far as Linux goes.

holy crap, seems science is going in the wrong direction!

Reading his comment, he seems to be referring to normal users, not things like scientific cluster jobs.

Still, I'm kind of upset with his comment.

Even as someone who does large parallel cluster calculations I kind of agree with him. A CPU with 64 tiny cores and basically no cache will never be as useful to the vast majority of people as a CPU with 4 large cores and lots of cache, even if the 64 cores are in theory able to produce more flops.

Also a lot of scientific computation is split up into lots of smaller jobs with each job running on 1-4 cores, rather than a few jobs running on 64 cores. So even that won't benefit much from a heavy focus on parallelism. And the single jobs that are best parallelized across lots and lots of cores are almost always best handled by something like a GPU and CUDA, and so fall outside the domain of the Linux kernel.

Why are you upset with it? Without having seen his comment: so far parallel computing has brought little to consumer end users. Out of that little, by far most of the benefit is graphics processing on GPUs.

The way I see it, from a practical point of view, usually the program does not need to execute faster than the user can issue modifications.

Usually, the user does only a few tens of operations per minute at most. There is no need for the processes to execute faster than that. Most stuff people generally do with computers is nowadays blazingly fast on a single core. Of course, we can add architectural baggage to slow this to a crawl. Or ignore memory optimization and waste most of the cycles on cache flushes, reads, writes, allocations, etc., which do not magically go away once there are more cores...

There is a time and place for multicore, but that is not everywhere, as Linus wrote.

He also excluded ASICs and the like and focused on GP CPUs. Not sure why that'd upset you - what would you use a desktop (assuming x86 or x64) CPU to do that actually requires explicit parallelism?

We don't need parallelism, we need performance. We have been sitting at 4GHz for the last 4 years now. Things like games require more and more CPU cycles, and the only way to add more is additional cores.

Linus's post reads like he thinks there is a choice between a few fast fat cores versus lots of thin ones. There is no choice; there are no 6-8GHz CPUs around the corner. We are where we are, and there is only one way to "progress".

I'm not sure what you mean. Intel's Core progressed about 30-50% in per-clock performance over the last 4 years (Nehalem -> Haswell).

I think robots are an interesting use case: Image/speech recognition, motion/task planning, etc. Lots of things to do in parallel in (soft) real time on a mobile device. How to do it is still very unclear and for experimentation you want general purpose processors.

FPGAs are programmable and faster than GP processors...

If you have a saved copy, could you just paste it as a edit to your comment? I'd definitely like to read it.

It's interesting to note that "science" and scientific computing has been back and forth along this road several times already. Back when I started out it was much more common to have large single image machines with 64-512 cores each. Then clusters (imagine a Beowulf cluster of... and all that) became a thing and all of a sudden people were buying machines with 1-4 cores. Then we started getting more and more cores, but even then rarely more than 24-48 cores. Now we're seeing a turn back down again with more and more people buying machines with 4-16 cores and a bunch of coprocessor cards.

actually my comment was sarcastic! I don't know why people downvoted it.

sarcasm gets lost in translation into text.

combine that with people whose only form of aggression is passive aggression... well, you're gonna have a bad time. i too am sometimes guilty of this..

People don't like sarcastic comments.

Hmm site is down, must have needed moar coars.
