He explicitly says "Parallel code makes sense in the few cases I mentioned, where we already largely have it covered, because in the server space, people have been parallel for a long time" — so most of that rant is about desktop and mobile computers, often called "client-side". Where indeed four cores seem to be more than enough for most needs.
On the "server" side, though (I hate the outdated name), the story is quite different. I can easily make use of many cores, especially with tools like Clojure, that make writing correct concurrent programs much easier.
It seems pretty clear-cut to me... A "client" computer is one that's designed around single-user interaction sessions. It has human interface devices (display, input) on fast interfaces with prioritized OS-level support (interrupts, graphics acceleration, etc).
A "server" computer is one where interaction happens over a network connection, and multiple sessions are typically taking place simultaneously.
A server has fewer power concerns, and is all about throughput. Heat and size are not as relevant (but not irrelevant).
iPhone and iPad are obviously clients. A Mac Pro is a server. A MacBook Air is probably a client (though, ironically, it is far more powerful than the $10K servers I had in 1999). A MacBook Pro falls somewhat in the middle, but leans towards client. An iMac is also in the middle, but leans towards server.
I think the angst (which I have to some degree as well) about the distinction between Client vs Server is because it's not completely clear how to position the Mac Pro and iMac. In some benchmarks, the iMac meets or exceeds the Mac Pro. But in terms of sheer throughput for bulk tasks (video rendering), you can get higher top-line performance from the Mac Pro. And for day-to-day interaction, the 5K display of the iMac (human-computer interaction) beats out the Mac Pro.
[edit: 99.99% of the time, downvotes are immaterial to me, but I'm genuinely intrigued in what the contrary opinion is here - I expect to learn something, please share!]
The iMac has a 5K display. The Mac Pro has two high-power GPUs. These are not server features because they're designed to provide explosive graphics power for a single user at a time. Using '80s terminology, both those Apple computers would qualify as workstations, IMO...
I think an interesting analogy could be made with physical training. Some athletes train for endurance, e.g. running a marathon. Others train for muscle strength, e.g. powerlifting.
A server is an "endurance-oriented computer" -- its power needs to be distributed evenly over multiple active sessions. One remote client can't be allowed to hobble the server.
In contrast, a client is a "strength-oriented computer". For much of the time, it's sitting idle because the human in front of it is so slow. But when the human makes a decision, the computer needs to do its best to fulfill the task immediately (compositing windows, rendering a web page or 4K video effects, etc.)
The Mac Pro also scales to 12 cores. If that's not about distributing power evenly over multiple active 'sessions' (not sure what 'sessions' are supposed to mean in this context), I don't know what is.
I agree about the iMac though. It is a very powerful computer, but it's clearly optimised for tasks requiring immediate responsiveness. The fact that it also happens to have a very powerful CPU is not the factor that is driving the overall design.
Of course you could put a Mac Pro in a server room or data center, but realistically very few people do that... It's just not designed for that.
There are plenty of large graphics/3D/video render farms, and they don't use expensive workstations like the Mac Pro. You get more bang for the buck by going with traditional server form factors. Cool black cylinders with ridiculous amounts of desktop-oriented I/O don't make ideal servers.
Perhaps the whole "Server/Workstation/Client" segmentation doesn't make sense after all.
But you are right - there is an interesting middle ground that I did not consider: the device that is designed to simultaneously do massive amounts of processing/storing/transacting while interacting with the user. Workstation should probably enter the nomenclature - but then things get very grey - your average 2013 MacBook Air is more powerful than any workstation any engineer in the world had in 1998. Does this mean the only difference between client and workstation is comparison with its peers? And is a client today simply a workstation from 15 (heck, maybe only 10) years ago?
The terms are self-descriptive: servers serve, clients request. So I don't understand why anyone would bring form factors into the equation. Particularly when it's well documented that you can turn anything into a server (old laptops / desktops, developer boards such as the Raspberry Pi, etc.), and equally a server can act as a client (e.g. server-to-server API calls).
> Parallel stupid small cores without caches are horrible unless you have a very specific load that is hugely regular (ie graphics).
> The only place where parallelism matters is in graphics or on the server side, where we already largely have it. Pushing it anywhere else is just pointless.
Edit: "editorialized" meaning "choosing the most extreme quote of a rant"
"The whole "let's parallelize" thing is a huge waste of everybody's time. There's this huge body of "knowledge" that parallel is somehow more efficient, and that whole huge body is pure and utter garbage."
But in a Linus rant, the most inflammatory quotes are not necessarily what he is trying to say. (That statement is also editorializing, of course)
- 3D Graphics
- 2D Graphics (Photoshop, video editing and encoding)
- Web Browsers
This is where we really use parallel computing. Maybe compilers, file compression and PAR2 on top of that. What else? So there's some truth to that.
As for the other argument: that a small number of complex OoO cores is better for client PCs than a huge number of cores with slow single-thread performance should be obvious, imho.
Dedicated compression/decompression threads, threads for encryption, threads for sound, threads for... well lots of things. Remove the caches and you get more consistent performance. Dedicate tasks to their own core, and they can use that cache for themselves. They don't have to swap it out every time.
Most laptops / desktops perform much better with at least two cores than with one. Particularly if you have a CPU-hogging process - with a single core, your UI can't remain buttery smooth and interactive. There is even some argument to be made for going from two to four. (A background application like a render, plus another app doing something nasty killing your CPU - you can still launch Activity Monitor and kill the errant task.)
But it's not clear to me there's any value for the client in taking a four-core machine and breaking it up any further, as opposed to taking the increased transistor budget and improving those existing four cores.
Remember - the tradeoff is not "do we want more, faster transistors" - of course we do. Rather, the tradeoff is: do we make smaller, lower-cache cores, or do we make larger, bigger-cache cores?
The evidence tends to suggest for the near future, that bigger cache, faster cores are the way to go on your average desktop/laptop.
There is not much room for improvement left. ILP has been stagnating for quite a while, and there is no hope of improvement with the current ISAs. More OoO-friendly ISAs are being developed, but are unlikely to hit the market any time soon.
Caches are also approaching the upper limit, and for the bigger caches we need either some much smarter cache management techniques (explicit prefetching, etc.), or a totally different programming model (e.g., using a flat scratchpad memory explicitly instead of transparent caches).
Core count grows for a reason - there is not really much stuff we can put into a single core any further, not without breaking the whole architecture.
I.e., it might still be more useful to build bigger 4-core processors, simply because all the developers and code out there are designed to take advantage of them - a theoretically better 6-core processor would just end up with 2-3 cores always idling.
In the case of parallel compilation of, say, JS, it's only a tiny minority of developers who need to be able to do so.
> It might be still more useful to build bigger 4-core processors
It would have been great if we could build any bigger cores. Unfortunately, there is not much scope for improvement left, not unless we ditch the existing ISAs and programming models.
If you're getting this number from Activity Monitor, then I doubt it has any relevance to the current conversation.
Activity Monitor tells me Emacs is currently using 4 "threads", and Emacs is famously non-threaded.
The "threads" mentioned by Activity Monitor probably have something to do with OS X's being implemented on a Mach microkernel rather than the kind of threads that would matter for this conversation.
But I sort of agree with you (see my long rant elsewhere in this thread).
This is really the gist of the parallelism problem: it only helps when applied to the bottlenecks. Browsers aren't really using it to layout single pages faster, or running JS faster.
Yeah, there's some work on parallelism in rendering and GC in production browsers too but it's so far nibbling at the edges. Long way to go to get even 2x speedup compared to single core.
> The whole "let's parallelize" thing is a huge waste of everybody's time.
> Give it up. The whole "parallel computing is the future" is a bunch of crock.
lkml in a news post is like clickbait for a certain subset of the techy crowd. I think we need a collective understanding that if someone other than LWN is reposting lkml, it's just drama-mongering, pot-stirring or muckraking.
I think it attracts people because it carries the possibility of being something significant given the crowd there and kernel itself. But it's just the mundane day to day of people working on subject matter too dry for one to subscribe to in the first place.
It's an interesting argument but rests upon all new algorithms (he brings up machine vision as an example) having dedicated hardware. Ultimately, sure, but there's still a hell of a gap between viable algorithm and dedicated mobile-ready hardware. If the pace of invention slows I'd agree with Linus.
I think the pace of invention will continue to accelerate and parallel processing on the client will be a valuable resource to have.
Whoever's pushing that should read the overview page for Intel Xeon Phi:
> While a majority of applications (80 to 90 percent) will continue to achieve maximum performance on Intel Xeon processors, certain highly parallel applications will benefit dramatically by using Intel Xeon Phi coprocessors. To take full advantage of Intel Xeon Phi coprocessors, an application must scale well to over 100 software threads and either make extensive use of vectors or efficiently use more local memory bandwidth than is available on an Intel Xeon processor. Examples of segments with highly parallel applications include: animation, energy, finance, life sciences, manufacturing, medical, public sector, weather, and more.
Following that is a picture that says "pick the right tool for the job".
(Essentially an Intel "GPGPU")
While your characterization of Phi as "Replacing 8 powerful Xeon cores with 60+ slower ones per socket" is sort of/mostly fair. Phi isn't really the dumb cacheless cores that are mentioned. Moreover, Phi very much is a niche product designed specifically for a market that is not exactly enjoying substantial performance gains in the last few generations of mass market Xeons.
Instead I thought of some other products... like the 8-core ARM SoCs for handsets and other mobile devices; which, let's be honest, deliver much less than a well designed & implemented SoC that is ideally suited for the use it's being marketed for.
The Adapteva Epiphany chips (available on their Parallella boards) fit this description almost perfectly, as do Kalray and GreenArray chips. Though I guess one could make the argument that they're all very much niche products.
Anyway, I'm not sure we've really got enough information to comment further.
On the other hand, parallel programming does not have to be hard. The Fork-Join framework in Java and the parallel collections in Java 8 are trivial to use and scale vertically pretty well.
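To make the "trivial to use" point concrete, here is the same fork/join-style pattern sketched in Python with `concurrent.futures` (the task is a made-up stand-in; Java's `parallelStream().map(...)` has the same shape):

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Stand-in for any independent, side-effect-free work item.
    return n * n

# Hand a pure function and a collection to a pool and let it split
# the work. (For CPU-bound work in CPython you'd reach for
# ProcessPoolExecutor, since threads share the GIL; the structure
# is identical.)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(10)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The point being that the decomposition is expressed once, declaratively, and the pool decides how to spread it over cores.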
And finally, I think there is no actual demand for parallel programming. 99% of computers have 4 CPUs or fewer. GPUs are useless for most tasks. I have a prototype of my program which scaled well to 20 cores, but nobody is interested.
> So give up on parallelism already. It's not going to happen.
> End users are fine with roughly on the order of four cores,
> and you can't fit any more anyway without using too much
> energy to be practical in that space.
End users were fine with a single-core Pentium 4 on their workstation. We progressed. How would even Linus know that we won't find a way to make parallelism work en masse?
Of course, many of us do need the extra power. But what Linus is saying and I agree with him, is that for mobile devices (phones, tablets, laptops), Moore's law doesn't work so well, as batteries aren't keeping up with Moore's law. A mobile device that doesn't last for 2 hours of screen-on usage is a completely useless mobile device (and here I'm including laptops as well).
Not really. 2-4 cores have been available on workstations for decades, so no one is arguing that 1 core is all you need. Even Linus is saying that having 4 cores is probably a good thing in many cases. The argument is not 1 vs 4, but more 4 vs 64, especially if you assume a fixed power budget.
No it wasn't. Multi-processor Intel-based workstations have been available since the very early 90s. People have realized for a very long time that having 2-4 cores is useful.
I'm still not convinced, given that I have X watts to spend, that I'm not better off with 4 CPUs using X/4 watts each rather than 64 cores using X/64 watts each. But I'm willing to be proven wrong.
Feather in the cap for the first multi-core CPU on a single die goes to IBM and the POWER4 in 2001, preceding Intel's attempt by ~4 years. (Trivia: IBM also sold a POWER4 MCM with 4 POWER4 chips in a single package.)
(Yes some people managed to stitch together earlier x86 processors too with custom hardware, but it wasn't pretty or cheap or fast).
We have used parallel computing quite extensively for:
1 - 3D, 2D vector graphics.
4 - All of the above (video).
We could parallelize something to be more than 100 times more efficient (ops per watt) than on the CPU (*). But proper parallelization comes at a cost: efficient memory management is hell.
I mean, people are afraid of C manual memory management; that is nothing compared with the complexity of parallel memory management. You need semaphores or mutexes to access common memory, but the most important thing is that you need to make as much memory as you can independent of the rest, replace sequential steps, etc...
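A minimal sketch (in Python, with illustrative names) of the two approaches just described: guarding shared memory with a mutex versus making each worker's memory independent so no lock is needed on the hot path:

```python
import threading

def sum_shared(values, n_threads=4):
    # Approach 1: one shared accumulator, every update takes a lock.
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for v in chunk:
            with lock:          # serializes the hot path
                total += v

    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def sum_partitioned(values, n_threads=4):
    # Approach 2: each thread owns its own slot and slice of the data;
    # no shared state until the final combine, so no lock at all.
    partials = [0] * n_threads

    def worker(i, chunk):
        s = 0
        for v in chunk:
            s += v
        partials[i] = s

    chunks = [values[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Both compute the same answer; the second shape is the one that scales, which is exactly the "make memory independent" work the comment calls hell.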
So if the reason for making the kernel parallel is a 10% THEORETICAL increase, forget about it; 10% is nothing compared with the complexity you have to add.
(*) Power consumption normally increases with the square of frequency, so by using more cores instead of higher frequency you get very efficient. The brain itself runs very slowly but with a huge number of cores (neuron clusters).
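Taking that square law at face value, the arithmetic behind "more cores instead of higher frequency" looks like this (a toy model that assumes the workload parallelizes perfectly, which is exactly the contested part):

```python
def throughput_at_fixed_power(n_cores, power_budget=1.0):
    # Per the footnote's model: power per core ~ f**2, so at a fixed
    # total budget each core runs at f = sqrt(power_budget / n_cores).
    # Aggregate throughput (ops/s) ~ n_cores * f = sqrt(n * budget).
    per_core_freq = (power_budget / n_cores) ** 0.5
    return n_cores * per_core_freq

# Relative to a single big core at the same power budget:
speedup_64 = throughput_at_fixed_power(64) / throughput_at_fixed_power(1)
print(speedup_64)  # 8.0: 64 wimpy cores, 8x the throughput - IF the work parallelizes
```

The model says nothing about whether your code can actually use 64 threads, which is the part Linus is attacking.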
The PC architecture started out being hampered by the CPU + dumb peripherals architecture, but this was a big departure from the norm during that era. In the "home" market, most machines either had CPUs too weak to drive the peripherals (my first "parallel stupid small cores" system was a Commodore 64 + a 1541 disk drive - you could, and people did, and wrote books about it, download your own code to the 1541 and do computations on it) or explicitly used CPUs or co-processors all over the place to offload things, like the Amiga.
My A2000 of course had the 68000 (and later a 68020 accelerator), but also had a 6502-compatible core running the keyboard and a Z80 on the SCSI controller card, on top of the Amiga's custom chips - the copper, blitter and sound chips - which all had limited programmability. The irony is that it perhaps made a lot of us overly gung-ho on the 680x0 line (though it does have a beautiful instruction set), because our machines felt so much faster than comparably clocked PCs - hence many of us were OK with not upgrading CPU models and clock rates as fast as the PC world did.
It was only when PCs started sprouting co-processors (graphics cards and sound cards with advanced capabilities first) that the Amiga truly lost its edge (at the same time, Motorola failed to deliver fast enough versions of their newest 680x0 models, and faster CPUs alone would have been insufficient) - until then, the co-processors and the philosophy of offloading everything possible had compensated for the by-then anaemic average CPU speeds.
Though the 3rd party expansions gave one more fascinating multi-processing step: Systems that would let you run 680x0 and PPC code on the same machine (PPC cpu on the expansion card, 680x0 on the motherboard).
PCs have been steadily sprouting more small cores. They're just not as visible.
On the server side it is extreme: I have single-CPU servers at work where the main CPU may have 6-8 x86 cores, but where there may be 30+ ARM or MIPS cores when you tally up the hard drives (dual- or tri-core in many cases), RAID controllers, IPMI cards, some networking hardware, etc. These cores are getting so cheap and so small that we should expect to continue to see offloading of more functionality.
On the Amiga, this was our day-to-day reality. SCSI was always favoured, for example, because it was lighter on the CPU than IDE: it offloaded more logic (and hence was more expensive, hence the cut). Hence the universal disdain for the guy who forced engineering to put IDE into the last Amiga models - the A4000 and 1200 - which helped cripple them at a time when they were lagging in overall CPU performance. (The man in charge at Commodore at the time was the guy who had been responsible for the PCjr disaster...)
Rather than be exposed to the parallelism, expect that we'll see more higher level functionality being subsumed into peripherals so that the main CPU gets to focus on running your code. E.g. there are TCP offload functionality on some high end network cards.
Just like in the architectures born in the 60's and 70's out of necessity because of wimpy CPUs...
The "one CPU to rule them all" PC is an aberration, born out of a time when CPUs were getting to a performance level where it was possible, and when the norm of single-tasking OSs for personal computers meant that, for a few short years, it seemingly made little difference if the main CPU spent all its time serving disk interrupts while loading stuff.
The norm in computing has been multiple parallel cores. Even multiple parallel CPUs. Often on multiple buses.
And over the last few years we've gone full circle.
Instead, he talks about general processing tasks as not being great targets for parallelism.
E.g. go back to the 80s and it was not uncommon for loading data to involve having your main CPU load data a bit at a time via a GPIO pin, during which time your application code was blocked until the transfer was complete. Then a byte at a time. Then a block at a time. Then suddenly we got DMA.
These days we expect other threads to go on executing, and application code that wants to, will expect async background transfer of data to not consume all that much CPU.
Here's a contrived example of possible many-wimp-core offloading for you (that would be complicated, but possible):
Consider an architecture where many wimpy cores can "hang" on reads from a memory location, and start executing as soon as there's an update. A "smart" version would use a hyper-threading type approach to get better utilization.
Congratulations, you now have a "spreadsheet CPU" that automatically handles dependencies (it would need some deadlock/loop resolution mechanism) and does recalculation as parallel as possible given the combination of data dependencies and number of cores/threads.
It's also incidentally an architecture where caches would be a nightmare, and where the performance of individual cores would not be a big deal.
Of course, not many of us have spreadsheets where recalculation time is an issue, and it's easy to handle recalculation on a single big CPU too. Where the sweet spot is in terms of power usage, latency and throughput is hard to say, though.
(not saying it'd necessarily be a good idea, but I now want to do a test implementation for my 16-core Parallellas just because).
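For what it's worth, here's a toy single-threaded model of that "spreadsheet CPU" semantics (all names invented), just to pin down what "hang on a read, wake on an update" would compute; a real version would map each cell onto one of the wimpy cores:

```python
class Sheet:
    # Toy dependency-driven recalculation: a sequential model of cells
    # that "wake up" when a location they watch is written.
    def __init__(self):
        self.values = {}
        self.formulas = {}   # cell -> (input cells, function)
        self.watchers = {}   # cell -> cells that read from it

    def define(self, cell, inputs, fn):
        self.formulas[cell] = (inputs, fn)
        for src in inputs:
            self.watchers.setdefault(src, []).append(cell)

    def write(self, cell, value):
        self.values[cell] = value
        # Propagate: each watcher recomputes, possibly waking others.
        # (A real design would need the deadlock/loop resolution
        # mentioned above; this toy assumes an acyclic sheet.)
        for w in self.watchers.get(cell, []):
            inputs, fn = self.formulas[w]
            if all(i in self.values for i in inputs):
                self.write(w, fn(*(self.values[i] for i in inputs)))

sheet = Sheet()
sheet.define("C", ["A", "B"], lambda a, b: a + b)
sheet.define("D", ["C"], lambda c: c * 2)
sheet.write("A", 3)
sheet.write("B", 4)
print(sheet.values["D"])  # 14
```

In the hardware version, each `write` would be an update to a watched memory location and each watcher a core unblocking, so independent branches of the dependency graph recompute in parallel for free.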
But PC users dismissed the parallelism of the Amiga too. Until Windows 95 had proper multi-tasking and they all had graphics and sound cards. Suddenly they saw the value.
I don't think we'll know how far we can push offloading without trying.
IO controllers, Graphics, Sound, are all obvious targets for offloading.
Perhaps the grey area (which is a much more specific than things you are talking about) are things like TCP Offloading (TOE) - Linux, currently seems to be opposed to the concept.
Consider that many people argued that IO, graphics and sound offloading was totally unnecessary, even in the face of seeing what it did for architectures like the Amiga, until costs came down and CPU speeds remained unable to do the stuff that offloading made possible.
IO in particular seemed pointless to many people: after all, you're still going to wait for your file to load, aren't you? But loading data can often be made into a massively parallel task. For starters, you can widen the amount of data transferred per unit of time. But secondly: you rarely just want to dump your data into memory; you usually want to do some processing on it (e.g. build up data structures).
AmigaOS went further, and demonstrated that there were architectural benefits to increased OS-level parallelism via multitasking, even for basic stuff like terminal handling. One of the reasons the Amiga felt so fast for its time was that the OS pretty consistently traded total throughput for reduced latency by removing blocking throughout. E.g. the terminal/shell windows on the Amiga typically involved half a dozen tasks (Amiga threads/processes - no MMU, so not really a distinction) at a minimum: one to handle keyboard and mouse inputs, one "cooking" the raw events into higher-level events and responding to low-level requests for windowing updates, one handling higher-level events and responding with things like cursor movements, the shell itself, and one mediating cut-and-paste requests (which again would involve multiple other tasks to store the cut/copied data to the clipboard device, which would usually be mapped on top of a RAM disk but could be put on any filesystem - potentially involving even more separate tasks).
Many of the "primitives" of that kind of architecture can be offloaded:
The process appears largely sequential, but it has numerous points where it's possible to do things in parallel, and more importantly: even sequential operations can be interleaved to a large extent, so you can start processing later events sooner. Many contemporary systems appeared laggy in comparison despite higher throughput, because AmigaOS interleaved so many operations by processing smaller subsets of the total work in small slices across many individual tasks. While you can do that by simply taking the CPU away from a bigger task doing everything in parallel, that takes control over the "chunking" away from the developer. I did some work on the AROS (AmigaOS reimplementation) console handling a few years back, and it was amazing how much tuning the interactions between those components affected responsiveness (running on the same single core of an x86 box).
The limit is whether 1) you can do things faster (reduce latency) - if you can do things faster with offloading, it's a candidate - and 2) your main CPU has other stuff it can do while waiting - if so, you have a candidate.
Consider the complex font engines we run these days, for example. Prime candidate for offloading, because it's largely a pipeline: "put this text here rendered with this font", and we usually render a lot of text with a small set of fonts. We treat it as a sequential task when usually we can interleave it with other work and just need sequencing points where we say "don't return until all the rendering tasks are complete".
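That "enqueue render requests, then hit a sequencing point" pattern can be sketched like this (Python threads standing in for an offload core; the rasterizing step is faked):

```python
import queue
import threading

render_queue = queue.Queue()
rendered = []

def render_worker():
    # Stand-in for an offload core draining "put this text here" requests.
    while True:
        item = render_queue.get()
        if item is None:         # shutdown sentinel
            break
        text, font = item
        rendered.append(f"{text}@{font}")  # pretend this rasterizes glyphs
        render_queue.task_done()

worker = threading.Thread(target=render_worker, daemon=True)
worker.start()

# The main CPU just enqueues work and keeps going...
for line in ["hello", "world"]:
    render_queue.put((line, "Mono-12"))

# ...until it hits a sequencing point: "don't return until rendering is done".
render_queue.join()

render_queue.put(None)           # shut the worker down
worker.join()
print(len(rendered))
```

The `join()` is the sequencing point; everything before it overlaps with whatever else the main thread wants to do.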
We can do this with multi-core architectures, but it's hard to do it efficiently without extremely cheap context switches (which are hard to do if you do it as a user-level task running under a memory protected general purpose OS), and we rarely want to dedicate cores of our big expensive CPUs to tasks like that.
Have an array of cheap, wimpy cores, and it becomes a different calculation.
So it's a desktop-scale retelling of the mainframe architecture's cost-reduction benefits (centralized processing with passive terminals).
Also, people willingly put a sequential abstraction on top of event-based systems; it's all signals and interrupts below. Despite this, events do reappear in kernel space, user space and applications...
But mainframes were/are centralized, not sequential/single-threaded, apart from the very earliest systems. On the contrary, one of the key aspects of typical mainframes is heavy reliance on offloading - e.g. dedicated IO processors.
Basically, we get "single big core" computing mainly in niches where, for a short period, there is an intersection between CPUs performant enough to make it desirable and costs high enough not to justify scattering cheaper CPUs or dedicated co-processors around for offloading - combined with either single-tasking operating systems, or enough control to avoid exposing the user to delays (e.g. embedded use).
The moment you can't easily accelerate a system (within cost constraints) by adding faster big cores, the number of small, wimpy cores in the system tends to start adding up very quickly.
Your last sentence summarizes the regular cycles between centralized and decentralized trends.
Not long ago, I found some old IBM Z specs, and was surprised how many coprocessors were in there.
And I suspect that, for example, JS engines in browsers are going to be more and more parallel. And that kind of compilation is something that every user does on a daily basis. More cores => smoother web browsing experience. Not that I approve of this whole JS craze, but it's a fact everyone has to cope with, unfortunately.
When going from 1->4 cores, agreed. When going from 4->16->64 cores I'm not convinced. Especially if those 64 cores are slower and have less cache.
If you want to draw one image, GPUs are fine, but they are not really able to simulate neural networks because they don't have as much parallelism or memory locality as a neural network needs. GPUs are more parallelized than CPUs, yes, thanks to OpenCL. But they're still specialized towards image blitting, not towards massive parallelism.
Neural networks are not very hard to understand, but if you think about simulating one, you quickly realize that most hardware is just inadequate and unable to run a neural network properly; it will just be too slow. Computers were never designed or intended to simulate something like a brain.
Also, don't forget that most algorithms we use today are not parallelizable; most are sequential. We could rethink many algorithms and adapt them, but parallelism is often a huge constraint. Sequentiality is a specialized case of computability, if you think about it.
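A micro-example of that distinction: a reduction over an associative operation splits cleanly across cores, while a loop-carried dependence is stuck being sequential (both functions are invented for illustration):

```python
from functools import reduce

def sequential_only(x0, steps):
    # Loop-carried dependence: each iteration needs the previous
    # result, so no number of cores helps.
    x = x0
    for _ in range(steps):
        x = (3 * x + 1) % 97
    return x

def reducible(values):
    # Associative op (+): chunks can be summed on separate cores and
    # the partial sums combined - the parallelizable shape.
    return reduce(lambda a, b: a + b, values, 0)
```

"Adapting" an algorithm for parallelism usually means finding a way to recast dependences of the first kind into combines of the second kind, and often there isn't one.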
Their current top model is 64-core, but their roadmap is targeting 4K cores or more per chip.
They have local (on-core) memory, and a routing mesh that lets all cores access each other's memory - either for communication or for extra storage - and optionally access the host system's RAM. The chips also expose multiple 10Gbps links that can be used to connect the chips themselves into a bigger mesh.
You "pay" for accessing remote memory with additional cycles of latency based on distance to the node you want to address, so it massively favours focusing on memory locality.
I have two 16 core Parallelas sitting around, but have had less time than I'd hoped to play with them.
Although the real issue with this board is being able to program it effectively.
Sadly it isn't. They only made prototypes for the Kickstarter backers; it isn't mass-produced or available.
Neural nets != Human brains
"[Parallelism] does not necessarily make sense elsewhere. Even in completely new areas that we don't do today because you cant' afford it. If you want to do low-power ubiquotous computer vision etc, I can pretty much guarantee that you're not going to do it with code on a GP CPU. You're likely not even going to do it on a GPU because even that is too expensive (power wise), but with specialized hardware, probably based on some neural network model."
Esp the last bit. I wonder what he means by "specialized hardware, probably based on some neural network model". These ones? http://www.research.ibm.com/cognitive-computing/neurosynapti...
I'm doing Machine Learning / Deep Neural Network research, and of course parallelism is very important for us. Our research group mostly trains on GPUs at the moment, but the big companies use much more parallelism and much bigger sizes - e.g. look at DistBelief from Google.
Parallelism usually occurs at a very coarse-grained level. Forcing fine-grained parallelism at the language level is not very productive. The sad state of parallel support in languages and compilers is probably because we are pursuing the wrong goal.
This is usually not possible; see Amdahl's law :(
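For concreteness, Amdahl's law: if a fraction p of the runtime parallelizes, n cores can never give you more than a 1/(1-p) speedup, no matter how large n gets:

```python
def amdahl_speedup(p, n):
    # p: parallelizable fraction of the runtime, n: number of cores.
    # The serial fraction (1 - p) is untouched; only p/n shrinks.
    return 1.0 / ((1.0 - p) + p / n)

# Even a 90%-parallel workload tops out below 10x:
print(round(amdahl_speedup(0.9, 64), 2))     # 8.77 on 64 cores
print(round(amdahl_speedup(0.9, 10**6), 2))  # 10.0 - the asymptote
```

Which is why the coarse-grained decomposition matters so much more than language-level fine-grained parallelism: it's the only lever that moves p.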
Yet, all that seems to miss Linus point.
We can always want things to get harder, faster, stronger, but delivering on that is another matter.
Parallel sort. Parallel n-ary search. Parallel linear search. A stupid core for the zero-page thread. Hardware BLAS. Time-stepping multi-object systems.
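The parallel sort item, sketched: sort chunks concurrently, then do a k-way merge (`ThreadPoolExecutor` keeps this self-contained and portable; for real CPU parallelism in CPython you'd swap in `ProcessPoolExecutor`, and at this toy size the overhead would swamp any gain - the point is the shape):

```python
import random
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def parallel_sort(values, n_workers=4):
    # Split the input, sort each chunk in its own worker, then
    # combine the sorted runs with a k-way merge.
    size = (len(values) + n_workers - 1) // n_workers
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    return list(merge(*sorted_chunks))

data = [random.randint(0, 999) for _ in range(1000)]
assert parallel_sort(data) == sorted(data)
```

The sort phase is embarrassingly parallel; the merge is the sequential tail that Amdahl's law charges you for.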
I don't know the optimum number of cores in mobile CPUs. When the phone is in your pocket and not being used, a very low power core can keep the phone going, even with background notifications.
Still, I'm kind of upset with his comment.
Also, a lot of scientific computation is split up into lots of smaller jobs, with each job running on 1-4 cores, rather than a few jobs running on 64 cores. So even that won't benefit much from a heavy focus on parallelism. And the single jobs that are best parallelized across lots and lots of cores are almost always best handled by something like a GPU and CUDA, and so fall outside the domain of the Linux kernel.
Usually, the user performs at most a few tens of operations per minute. There is no need for the processes to execute faster than that. Most stuff people generally do with computers is nowadays blazingly fast on a single core. Of course, we can add architectural baggage to slow this to a crawl. Or ignore memory optimization and waste most of the cycles on cache flushes, reads, writes, allocations, etc. - which do not magically go away once there are more cores...
There is a time and place for multicore, but that is not everywhere, as Linus wrote.
Linus' post reads like he thinks there is a choice between fast fat cores and lots of thin ones. There is no choice - there are no 6-8 GHz CPUs around the corner. We are where we are, and there is only one way to "progress".
Combine that with people whose only form of aggression is passive aggression... well, you're gonna have a bad time. I too am sometimes guilty of this.