
Linus: Parallel computing is a huge waste of everybody's time - zurn
http://www.realworldtech.com/forum/?threadid=146066&curpostid=146227
======
jwr
I read the headline, thought to myself "that can't be right, Linus is too
smart for that" — then read the article. And the headline is wrong. So are
most of the comments.

He explicitly says "Parallel code makes sense in the few cases I mentioned,
where we already largely have it covered, because in the server space, people
have been parallel for a long time" — so most of that rant is about desktop
and mobile computers, often called "client-side". Where indeed four cores seem
to be more than enough for most needs.

On the "server" side, though (I hate the outdated name), the story is quite
different. I can easily make use of many cores, especially with tools like
Clojure, which make writing correct concurrent programs much easier.

~~~
pavlov
Why do you feel the distinction between client and server is outdated?

It seems pretty clear-cut to me... A "client" computer is one that's designed
around single-user interaction sessions. It has human interface devices
(display, input) on fast interfaces with prioritized OS-level support
(interrupts, graphics acceleration, etc).

A "server" computer is one where interaction happens over a network
connection, and multiple sessions are typically taking place simultaneously.

~~~
ghshephard
The difference between client and server can be a bit grey, but I agree, the
distinctions are important. A client is about mobility (small size), low power
use (low heat, small battery), and very low latency, with a focus on the
human-computer interaction experience.

A server has fewer power concerns, and is all about throughput. Heat and size
are not as relevant (though not irrelevant).

iPhone and iPad are obviously clients. A Mac Pro is a server. A MacBook Air is
probably a client (though, ironically, it is far more powerful than the $10K
servers I had in 1999). A MacBook Pro falls somewhat in the middle, but leans
towards client. An iMac is also in the middle, but leans towards server.

I think the angst (which I have to some degree as well) about the distinction
between client vs server is because it's not completely clear how to position
the Mac Pro and iMac. In some benchmarks, the iMac meets or exceeds the Mac
Pro. But in terms of sheer throughput for bulk tasks (video rendering), you
can get higher top-line performance from the Mac Pro. And for day-to-day
interaction, the 5K display of the iMac (human-computer interaction) beats out
the Mac Pro.

[edit: 99.99% of the time, downvotes are immaterial to me, but I'm genuinely
intrigued by what the contrary opinion is here - I expect to learn something,
please share!]

~~~
pavlov
I certainly didn't downvote you, but I don't understand why you'd consider the
iMac or Mac Pro as servers. Very few people use them in that fashion.

The iMac has a 5K display. The Mac Pro has two high-power GPUs. These are not
server features because they're designed to provide explosive graphics power
for a single user at a time. Using '80s terminology, both those Apple
computers would qualify as workstations, IMO...

I think an interesting analogy could be made with physical training. Some
athletes train for endurance, e.g. running a marathon. Others train for muscle
strength, e.g. powerlifting.

A server is an "endurance-oriented computer" -- its power needs to be
distributed evenly over multiple active sessions. One remote client can't be
allowed to hobble the server.

In contrast, a client is a "strength-oriented computer". For much of the time,
it's sitting idle because the human in front of it is so slow. But when the
human makes a decision, the computer needs to do its best to fulfill the task
immediately (compositing windows, rendering a web page or 4K video effects,
etc.)

~~~
simonh
It's not as simple as that WRT the Mac Pro. One of the graphics cards can't
actually be used to drive displays; it's only there to handle intensive
compute tasks.

The Mac Pro also scales to 12 cores. If that's not about distributing power
evenly over multiple active 'sessions' (not sure what 'sessions' are supposed
to mean in this context), I don't know what is.

I agree about the iMac though. It is a very powerful computer, but it's
clearly optimised for tasks requiring immediate responsiveness. The fact that
it also happens to have a very powerful CPU is not the factor that is driving
the overall design.

~~~
pavlov
The Mac Pro has those capabilities because it's optimized to be a
graphics/video workstation, and those tasks are very amenable to
parallelization. Those 12 cores are not there to serve 100 different clients
per second, but rather to provide maximum render power for the user sitting in
front of the computer. (That's what I meant by a single "session".)

Of course you could put a Mac Pro in a server room or data center, but
realistically very few people do that... It's just not designed for that.

There are plenty of large graphics/3D/video render farms, and they don't use
expensive workstations like the Mac Pro. You get more bang for the buck by
going with traditional server form factors. Cool black cylinders with
ridiculous amounts of desktop-oriented I/O don't make ideal servers.

~~~
ghshephard
Ironically, given your definition, a Mac Mini, which is about 1/4th to 1/8th
the power of a Mac Pro, would be considered a server. There are even data
centers (MacMini Colo) that are dedicated to shelving Mac Minis for that
purpose.

Perhaps the whole "Server/Workstation/Client" segmentation doesn't make sense
after all.

------
zurn
Site seems to be down, saved copy:
[http://pastebin.com/Z08gtRFj](http://pastebin.com/Z08gtRFj)

~~~
JoachimS
Should have had a parallel server there..

~~~
kokey
I actually thought the link was a practical joke until I looked at the
comments.

------
VMG
Heavily editorialized title. Actual quote:

> Parallel stupid small cores without caches are horrible unless you have a
> very specific load that is hugely regular (ie graphics).

And

> The only place where parallelism matters is in graphics or on the server
> side, where we already largely have it. Pushing it anywhere else is just
> pointless.

Edit: "editorialized" meaning "choosing the most extreme quote of a rant"

~~~
ezequiel-garzon
As zurn pointed out, it's not taken out of context at all. Right before your
excerpt comes:

"The whole "let's parallelize" thing is a huge waste of everybody's time.
There's this huge body of "knowledge" that parallel is somehow more efficient,
and that whole huge body is pure and utter garbage."

~~~
patal
The excerpt talks about widespread beliefs about how parallelizing everything
is supposed to be good. OP's title suggests parallel computing in general.
How's that not out of context?

------
moconnor
This reads as a fairly direct attack on the idea that many-core architectures
such as the Intel Xeon Phi (which does exactly that: replaces 8 powerful Xeon
cores with 60+ slower ones per socket) will ever become the norm on the client
side.

It's an interesting argument but rests upon all new algorithms (he brings up
machine vision as an example) having dedicated hardware. Ultimately, sure, but
there's still a hell of a gap between viable algorithm and dedicated mobile-
ready hardware. If the pace of invention slows I'd agree with Linus.

I think the pace of invention will continue to accelerate and parallel
processing on the client will be a valuable resource to have.

~~~
taspeotis
> This reads as a fairly direct attack on the idea that many-core
> architectures such as the Intel Xeon Phi

Whoever's pushing that should read the overview page for Intel Xeon Phi [1]:

> While a majority of applications (80 to 90 percent) will continue to achieve
> maximum performance on Intel Xeon processors, certain highly parallel
> applications will benefit dramatically by using Intel Xeon Phi coprocessors.
> To take full advantage of Intel Xeon Phi coprocessors, an application must
> scale well to over 100 software threads and either make extensive use of
> vectors or efficiently use more local memory bandwidth than is available on
> an Intel Xeon processor. Examples of segments with highly parallel
> applications include: animation, energy, finance, life sciences,
> manufacturing, medical, public sector, weather, and more.

Following that is a picture that says "pick the right tool for the job".

[1]
[http://www.intel.com.au/content/www/au/en/processors/xeon/xe...](http://www.intel.com.au/content/www/au/en/processors/xeon/xeon-phi-coprocessor-overview.html)

~~~
raverbashing
So, it's a rebirth of Larrabee?

(Essentially an Intel "GPGPU")

~~~
pmjordan
Yes, although the graphics card version (which was to be the first product
based on the design) got cancelled, they persevered with the chip design
itself.

------
qwerta
I make 'parallel' stuff for a living and I sort of agree. Often optimizing
your code is a better choice than going parallel. Compact code can fit into a
single CPU cache, which brings a huge performance boost.

On the other hand, parallel programming does not have to be hard. The
Fork-Join framework in Java and the parallel collections in Java 8 are trivial
to use and scale vertically pretty well.
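
Just to illustrate what I mean by trivial - a minimal Java 8 sketch where the
sequential and the parallel version of a CPU-bound reduction differ by a
single call (the numbers and the workload are made up, it's just the shape):

    import java.util.stream.LongStream;

    public class ParallelSumDemo {
        public static void main(String[] args) {
            long n = 50_000_000L;

            // Sequential reduction over a range of longs.
            long seq = LongStream.rangeClosed(1, n)
                                 .map(x -> x * x % 7)
                                 .sum();

            // Same pipeline spread across the common ForkJoinPool,
            // just by adding .parallel().
            long par = LongStream.rangeClosed(1, n)
                                 .parallel()
                                 .map(x -> x * x % 7)
                                 .sum();

            System.out.println(seq == par); // true - same result, more cores busy
        }
    }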

And finally, I think there is no actual demand for parallel programming. 99%
of computers have 4 CPUs or fewer. GPUs are useless for most tasks. I have a
prototype of my program which scaled well to 20 cores, but nobody is
interested.

------
calibwam
I don't like predicting the future.

> So give up on parallelism already. It's not going to happen.
>
> End users are fine with roughly on the order of four cores, and you can't
> fit any more anyway without using too much energy to be practical in that
> space.

End users were fine with a single-core Pentium 4 on their workstation. We
progressed. How would even Linus know that we won't find a way to make
parallelism work en masse?

~~~
dagw
_End users were fine with a single-core Pentium 4 on their workstation._

Not really. 2-4 cores have been available on workstations for decades, so no
one is arguing that 1 core is all you need. Even Linus is saying that having 4
cores is probably a good thing in many cases. The argument is not 1 vs 4, but
more 4 vs 64, especially if you assume a fixed power budget.

~~~
calibwam
I'm not saying they would be fine now, but once (10 years ago), that was what
you had. And now we have 4-8 cores on our laptops, and 2-4 on our phones. Why
shouldn't we have 64 cores in the future if we solve the programming problems,
and can make them energy efficient? Just because 4 cores are fine now, doesn't
mean we shouldn't try to increase that.

~~~
dagw
_but once (10 years ago), that was what you had._

No it wasn't. Multi-processor Intel-based workstations have been available
since the very early 90s. People have realized for a very long time that
having 2-4 cores is useful.

I'm still not convinced that, given X watts to spend, I'm not better off with
4 CPUs using X/4 watts each rather than 64 cores using X/64 watts each. But
I'm willing to be proven wrong.

~~~
calibwam
Really? Do you mean people chained together multiple processors, or that Intel
produced something? Because I can't find it and would be interested in reading
about those old systems.

~~~
fulafel
Intel was relatively late to the game, their multiprocessor support started
getting decent around the Pentium Pro. The Unix workstation vendors (Sun etc)
had dual CPU workstations a while earlier, but mostly SMP was used in servers.

[https://en.wikipedia.org/wiki/SPARCstation_10](https://en.wikipedia.org/wiki/SPARCstation_10)

The feather in the cap for the first multi-core CPU on a single die goes to
IBM and the POWER4 in 2001, preceding Intel's attempt by ~4 years. (Trivia:
IBM also sold a POWER4 MCM with 4 POWER4 chips in a single package.)

(Yes some people managed to stitch together earlier x86 processors too with
custom hardware, but it wasn't pretty or cheap or fast).

------
Htsthbjig
I understand Linus's position.

We have used parallel computing quite extensively for: 1) 3D and 2D vector
graphics, 2) audio processing, 3) image processing, 4) all of the above
(video).

We could parallelize something to be more than 100 times more efficient (ops
per watt) than on the CPU (*). But proper parallelization comes at a cost:
efficient memory management is hell.

I mean, people are afraid of C manual memory management, but that is nothing
compared with the complexity of parallel memory management. You need
semaphores or mutexes to access common memory, but the most important thing is
that you need to make as much memory as you can independent from the rest,
replace sequential steps, etc...
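
A toy Java sketch of the kind of difference I mean (the workload is made up;
the point is only the memory layout): the first version funnels every update
through one lock on shared memory, the second gives each thread effectively
independent memory and merges at the end.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.LongAdder;

    public class SharedVsPartitioned {
        static final int THREADS = 4;
        static final int N = 10_000_000;

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);

            // Version 1: all threads hammer one shared counter behind a lock.
            final long[] shared = {0};
            final Object lock = new Object();
            runAll(pool, () -> {
                for (int i = 0; i < N; i++) {
                    synchronized (lock) { shared[0]++; }   // contention on every step
                }
            });

            // Version 2: each thread updates its own cell; merging happens lazily.
            LongAdder partitioned = new LongAdder();
            runAll(pool, () -> {
                for (int i = 0; i < N; i++) {
                    partitioned.increment();               // mostly touches thread-local memory
                }
            });

            System.out.println(shared[0] + " " + partitioned.sum());
            pool.shutdown();
        }

        static void runAll(ExecutorService pool, Runnable task) throws Exception {
            CountDownLatch done = new CountDownLatch(THREADS);
            for (int t = 0; t < THREADS; t++) {
                pool.execute(() -> { task.run(); done.countDown(); });
            }
            done.await();
        }
    }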

So if the reason for making the kernel parallel is a 10% THEORETICAL increase,
forget about it; 10% is nothing compared with the complexity you have to add.

(*) Power consumption normally increases with the square of the frequency, so
by using more cores instead of a higher frequency you get very efficient. The
brain itself runs very slowly but with a huge number of cores (neuron
clusters).

~~~
qznc
For widely used software, a complexity increase matters little. Imagine what
companies like Google or Amazon save with a 1% improvement in their data
centers. They could employ 10 more kernel developers to deal with the
complexity in return.

------
vidarh
Consider that we _have_ parallel stupid small cores, and have had them for
years: most hard drives have full CPUs on their controllers these days, and
there are full CPU cores all over the place outside of our "view". E.g.
consider things like the Transcend Wifi SD cards with ARM SoCs on board. You
find embedded CPUs "hidden" in all kinds of PC hardware these days.

The PC architecture started out being hampered by the CPU + dumb peripherals
architecture, but this was a big departure from the norm during that era. In
the "home" market, most machines either had CPUs too weak to drive the
peripherals (my first "parallel stupid small cores" system was a Commodore 64
+ a 1541 disk drive - you could, and people did, and wrote books about,
download your own code to the 1541 and do computations on it) or explicitly
used CPUs or co-processors all over the place to offload things, like the
Amiga.

My A2000 of course had the 68000 (and later a 68020 accelerator), but also had
a 6502-compatible core running the keyboard and a Z80 on the SCSI controller
card, on top of the Amiga's custom chips, which included the copper, blitter
and sound hardware that all had limited programmability. The irony is that it
perhaps made a lot of us overly gung-ho on the 680x0 line (though it does have
a beautiful instruction set) because our machines felt so much faster than
comparably clocked PCs - hence many of us were OK with not upgrading CPU
models and clock rates as fast as in the PC world.

It was only when PCs started sprouting co-processors (graphics cards and sound
cards with advanced capabilities first) that the Amiga truly lost its edge (at
the same time Motorola failed to deliver fast enough versions of their newest
680x0 models, though faster CPUs alone would have been insufficient) - until
then the co-processors and the philosophy of offloading everything possible
had compensated for the by then anaemic average CPU speeds.

Though the 3rd-party expansions gave one more fascinating multi-processing
step: systems that would let you run 680x0 and PPC code on the same machine
(PPC CPU on the expansion card, 680x0 on the motherboard).

PCs have been steadily sprouting more small cores. They're just not as
visible.

On the server side it is extreme: I have single CPU servers at work where the
main CPU may have 6-8 x86 cores, but where there may be 30+ ARM or MIPS cores
when you tally up the harddrives (dual or tri-core in many cases), RAID
controllers, IPMI cards, some networking hardware etc.. But these cores are
getting so cheap and so small that we should expect to continue to see
offloading of more functionality.

On the Amiga, this was our day-to-day reality. SCSI was always favoured, for
example, because it was lighter on the CPU than IDE was (hence the universal
disdain for the guy who forced engineering to put IDE in the last Amiga models
- the A4000 and 1200 - which helped cripple them at a time when they were
lagging in overall CPU performance; the man in charge at Commodore at the time
was the guy who had been responsible for the PCjr disaster..), because SCSI
offloaded more logic (and hence was more expensive, hence the cut).

Rather than being exposed to the parallelism, expect that we'll see more
higher-level functionality being subsumed into peripherals so that the main
CPU gets to focus on running your code. E.g. there is TCP offload
functionality on some high-end network cards.

Just like in the architectures born in the 60's and 70's out of necessity
because of wimpy CPUs...

The "one CPU to rule them all" PC is an aberration, born out of a time when
CPUs where getting to a performance level where it was possible, and where the
norm of single-tasking OSs for personal computers meant it for a short few
years seemingly made little difference if the main CPU spent all its time
serving disk interrupts while loading stuff.

The norm in computing has been multiple parallel cores. Even multiple parallel
CPUs. Often on multiple buses.

And over the last few years we've gone full circle.

~~~
ghshephard
I agree with everything you've written here (mostly) but I think you are
dodging the argument Linus is making. He isn't (at least in this post)
discussing the value of offloading certain processes (indeed, he _highlights_
GPUs and Vision Processing as great places for offloading).

Instead, he talks about _general processing tasks_ as not being great targets
for _parallelism_.

~~~
vidarh
The point is that "general processing tasks" tend to include lots of very
generic parts that are great to offload.

E.g. go back to the 80's and it was not uncommon for loading data to involve
having your main CPU load data a _bit_ at a time via a GPIO pin, during which
time your application code was blocked until the transfer was complete. Then a
byte at a time. Then a block at a time. Then suddenly we got DMA.

These days we expect other threads to go on executing, and application code
that wants to can expect async background transfer of data not to consume all
that much CPU.

Here's a contrived example of possible many-wimp-core offloading for you (that
would be complicated, but possible):

Consider an architecture where many wimpy cores can "hang" on reads from a
memory location, and start executing as soon as there's an update. A "smart"
version would use a hyper-threading type approach to get better utilization.

Congratulations, you now have a "spreadsheet CPU" that automatically handles
dependencies (it would need some deadlock/loop resolution mechanism) and does
recalculation as parallel as possible given the combination of data
dependencies and the number of cores/threads.

It's also incidentally an architecture where caches would be a nightmare, and
where the performance of individual cores would not be a big deal.

Of course, not many of us have spreadsheets where recalculation time is an
issue, and it's easy to handle recalculation on a single big CPU too. Where
the sweet spot is in terms of power usage, latency and throughput is hard to
say, though.

(not saying it'd necessarily be a good idea, but I now want to do a test
implementation for my 16-core Parallellas just because).
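
Something like this trivial host-side sketch, in plain Java, shows the shape I
mean (futures standing in for the wimpy cores hanging on a memory location;
purely illustrative, not how you'd actually drive the Epiphany):

    import java.util.concurrent.CompletableFuture;

    public class SpreadsheetCells {
        public static void main(String[] args) {
            // "Input cells" - updated by the outside world.
            CompletableFuture<Double> a = new CompletableFuture<>();
            CompletableFuture<Double> b = new CompletableFuture<>();

            // "Dependent cells" - each one starts computing as soon as
            // everything it reads from has a value, potentially in parallel.
            CompletableFuture<Double> sum  = a.thenCombine(b, Double::sum);
            CompletableFuture<Double> prod = a.thenCombine(b, (x, y) -> x * y);
            CompletableFuture<Double> top  = sum.thenCombine(prod, (s, p) -> s / p);

            // Writes to the input cells trigger the cascade.
            a.complete(6.0);
            b.complete(3.0);

            System.out.println(top.join()); // (6+3) / (6*3) = 0.5
        }
    }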

But PC users dismissed the parallelism of the Amiga too. Until Windows 95 had
proper multi-tasking and they all had graphics and sound cards. Suddenly they
saw the value.

I don't think we'll know how far we can push offloading without trying.

~~~
ghshephard
I think the line between "offloading" and "breaking up a general-purpose task"
is clearer than you are making it out to be.

IO controllers, Graphics, Sound, are all obvious targets for offloading.

Perhaps the grey area (which is much more specific than the things you are
talking about) is things like TCP offloading (TOE) - Linux currently seems to
be opposed to the concept.

[http://www.linuxfoundation.org/collaborate/workgroups/networ...](http://www.linuxfoundation.org/collaborate/workgroups/networking/toe)

~~~
vidarh
Anything that takes time and that can be farmed out without creating lots of
contention over access to the same memory is an "obvious target for
offloading".

Consider that many people were arguing that IO, graphics and sound offloading
was totally unnecessary, even in the face of seeing what it did for
architectures like the Amiga, until costs came down and CPU speeds remained
unable to do the stuff that the offloading made possible.

IO in particular seemed pointless to many people: after all, you're still
going to wait for your file to load, aren't you? But loading data can often be
made into a massively parallel task. For starters, you can widen the amount of
data transferred with each unit of time. But secondly, you rarely just want to
dump your data into memory; you usually want to do some processing on it
(e.g. build up data structures).
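
A crude sketch of that kind of overlap (hypothetical file and record format,
plain Java threads standing in for whatever does the actual transfer): the
reader keeps pulling raw data while the "main CPU" builds the structure, and
neither waits long for the other.

    import java.io.BufferedReader;
    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    public class OverlapLoadAndParse {
        private static final String POISON = "\u0000EOF";   // end-of-stream marker

        public static void main(String[] args) throws Exception {
            Path input = Paths.get("records.txt");           // hypothetical "key=value" lines
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

            // Reader task: pulls raw lines off the disk as fast as it can.
            Thread reader = new Thread(() -> {
                try (BufferedReader in = Files.newBufferedReader(input)) {
                    String line;
                    while ((line = in.readLine()) != null) queue.put(line);
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    try { queue.put(POISON); } catch (InterruptedException ignored) { }
                }
            });
            reader.start();

            // Meanwhile, the "main CPU" builds up the data structure.
            Map<String, String> table = new HashMap<>();
            for (String line = queue.take(); !line.equals(POISON); line = queue.take()) {
                String[] kv = line.split("=", 2);
                if (kv.length == 2) table.put(kv[0], kv[1]);
            }

            reader.join();
            System.out.println(table.size() + " records loaded");
        }
    }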

AmigaOS went further, and demonstrated that there were architectural benefits
from increased OS-level parallelism via multitasking even for basic stuff like
terminal handling: one of the reasons the Amiga felt so fast for its time was
that the OS pretty consistently traded total throughput for reduced latency by
removing blocking throughout. E.g. the terminal/shell windows on the Amiga
typically involved half a dozen tasks (Amiga threads/processes - no MMU, so
not really a distinction) at a minimum: one to handle keyboard and mouse
inputs, one "cooking" the raw events into higher-level events and responding
to low-level requests for windowing updates, one handling higher-level events
and responding with things like cursor movements, the shell itself, and one
mediating cut-and-paste requests (which again would involve multiple other
tasks to store the cut/copied data to the clipboard device, which would
usually be mapped on top of a RAM disk but could be put on any filesystem -
potentially involving even more separate tasks).

Many of the "primitives" of that kind of architecture can be offloaded.

The process appears largely sequential, but it has numerous points where it's
possible to do things in parallel, and more importantly, even sequential
operations can be _interleaved_ to a large extent, so you can start processing
later events sooner. Many contemporary systems appeared laggy in comparison
despite higher _throughput_ because AmigaOS interleaved so many operations by
processing smaller subsets of the total work in small slices in many
individual tasks. While you can do that by simply taking the CPU away from a
bigger task that does everything itself, that takes control over the
"chunking" away from the developer. I did some work on the AROS (AmigaOS
reimplementation) console handling a few years back, and it was amazing how
much difference tuning the interactions between those components made to
responsiveness (running on the same single core of an x86 box).

The test is: 1) can you do things faster (reduce latency)? If you can do
things faster with offloading, it's a candidate. 2) Does your main CPU have
other stuff it can do while waiting? If so, you have a candidate.

Consider the complex font engines we run these days, for example. They're a
prime candidate for offloading, because rendering is largely a pipeline: "put
this text here rendered with this font", and we usually render a lot of text
with a small set of fonts. We treat it as a sequential task when usually we
can interleave it with other work and just need sequencing points where we say
"don't return until all the rendering tasks are complete".

We can do this with multi-core architectures, but it's hard to do it
_efficiently_ without extremely cheap context switches (which are hard to do
if you do it as a user-level task running under a memory protected general
purpose OS), and we rarely want to dedicate cores of our big expensive CPUs to
tasks like that.

Have an array of cheap, wimpy cores, and it becomes a different calculation.

------
kbart
Totally agree with Linus. These >10-core smartphones etc. look like just a
marketing trick; it's not as if the user would feel any meaningful difference.
The same goes for PCs -- usually you won't feel any difference between
running, let's say, a 4- and an 8-core CPU (of the same architecture), except
in synthetic benchmark tests that have not much to do with actual performance.
Of course, there are some corner cases where it makes sense (scientific
calculations, heavy graphics, simulations, compiling etc.), but a common user
does not benefit much.

~~~
sklogic
Compiling is not such a corner case. Even non-developers, "common users", are
often enjoying the source-based Linux distributions, things like MacPorts and
the like.

~~~
kbart
Even so, I doubt it's an everyday task for them. Also, most popular open
source projects provide ready binaries of the stable versions for such users.
Still, I can hardly imagine my grandma compiling some open source project for
her own use.

~~~
sklogic
Almost every time I install or update something from MacPorts, it ends up
compiling; binaries are very rarely cached. I doubt my setup is unusual (the
latest OS X and the latest Xcode).

And I suspect that, for example, the JS engines in browsers are going to be
more and more parallel. And that kind of compilation is something that every
user does on a daily basis. More cores => smoother web browsing experience.
Not that I approve of this whole JS craze that's going on, but it's a fact
everyone has to cope with, unfortunately.

~~~
dagw
_More cores => smoother web browsing experience._

When going from 1->4 cores, agreed. When going from 4->16->64 cores I'm not
convinced. Especially if those 64 cores are slower and have less cache.

~~~
sklogic
My point is that because of the stagnation of ILP and cache size, the number
of cores will increase anyway, so we'll have to find a way to utilise this
resource. Of course, trading single-core performance for the number of cores
is pointless in most desktop use cases.

------
jokoon
I think what people seem not to understand about neural networks is memory
locality. Neurons pass messages directly between each other, so they always
retain their own memory state, which is very different from a CPU or a GPU.
CPUs and GPUs have a shared-memory architecture, and thus memory access is
generally slower overall, because of how they are designed for programming
simplicity. Don't ever forget that computer speed is never about calculation,
but always about memory access. A computer's speed is always limited by its L1
and L2 cache size.

If you want to draw one image, GPUs are fine, but they are not really able to
simulate neural networks because they don't have as much parallelism or memory
locality as a neural network needs. GPUs are more parallelized than a CPU,
yes, thanks to OpenCL. But they're still specialized towards image blitting,
not towards massive parallelism.

Neural networks are not very hard to understand, but if you think about
simulating one, you quickly understand that most hardware is just not adequate
and is unable to run a neural network properly; it will just be too slow.
Computers were never designed and intended to simulate something like a brain.

Also, don't forget that most algorithms we use today are not parallelizable;
most are sequential. We could rethink many algorithms and adapt them, but
parallelism is often a huge constraint. Sequentiality is a specialized case of
computability, if you think about it.

~~~
vidarh
Have you looked at the Parallella / Epiphany from Adapteva?

Their current top model is 64 cores, but their roadmap is targeting 4K cores
or more per chip.

They have local (on-core) memory, and a routing mesh that lets all cores
access each other's memory - either for communication or for extra storage -
and optionally access the host system's RAM. The chips also expose multiple
10Gbps links that can be used to connect the chips themselves into a bigger
mesh.

You "pay" for accessing remote memory with additional cycles of latency based
on the distance to the node you want to address, so it massively favours
focusing on memory locality.

I have two 16-core Parallellas sitting around, but have had less time than I'd
hoped to play with them.

~~~
jokoon
That seems really interesting. I still wonder about the latency of the board
connector, but 6GB/s is really great.

Although the real issue with this board is being able to program it
effectively.

~~~
vidarh
Yeah, the current boards are very much a means for people to experiment with
the architecture first and foremost. I'm very curious to see what will start
to happen once they get one step further up from the 64 core chips and start
getting more per-core memory... Even more so if they get it onto a PCIe card
so I can stick one in my home server to play with instead of yet another small
ARM machine (I have a drawer full of ARM computers at this point).

~~~
jokoon
I'm more interested in inter-card latency...

------
gamesbrainiac
I can't seem to get access to the site, "Database Error". Did we just cause
the site to crash?

~~~
thaumaturgy
Naw. Wordpress on a budget VPS caused it to crash.

~~~
raverbashing
Maybe if his database had some kind of parallelism...

~~~
corford
Or caching...

~~~
raverbashing
Or load balancing

------
albertzeyer
I found the last paragraph the most interesting one:

"[Parallelism] does not necessarily make sense elsewhere. Even in completely
new areas that we don't do today because you cant' afford it. If you want to
do low-power ubiquotous computer vision etc, I can pretty much guarantee that
you're not going to do it with code on a GP CPU. You're likely not even going
to do it on a GPU because even that is too expensive (power wise), but with
specialized hardware, probably based on some neural network model."

Esp the last bit. I wonder what he means by "specialized hardware, probably
based on some neural network model". These ones?
[http://www.research.ibm.com/cognitive-computing/neurosynapti...](http://www.research.ibm.com/cognitive-computing/neurosynaptic-chips.shtml)

I'm doing Machine Learning / Deep Neural Network research, and of course
parallelism is very important for us. Our chair mostly trains on GPUs at the
moment, but the big companies use far more parallelism and much bigger sizes;
e.g. look at DistBelief from Google.

------
sz4kerto
No, it's not a huge waste. Maybe it is in the short run (<5 years), but there
isn't any other way forward (quantum computing is not something we can count
on for a while). Yes, with the current architectures and state of compilers it
is kind of hopeless, but that doesn't mean we should give up on it.

~~~
rtpg
We can scale horizontally (more machines) as well as vertically (faster
machines). If reasoning about more machines is easier than imagining faster
machines (implicitly here: ones with more cores), then going the other route
could be better

~~~
sz4kerto
The problem is exactly that we can't scale vertically. Look at single-threaded
performance charts -- they're already far below Moore's law. STP/watt is
increasing quickly, but peak performance per core isn't really. (It is, but
the second derivative of performance vs time is already negative.)

------
BuckRogers
"Four cores ought to be enough for anybody!"

~~~
rpwverheij
My thought exactly. Unless we see some big breakthrough in GHz for single
cores, more cores will come, and be properly used, in time.

~~~
willvarfar
> more cores will come, and be properly used

This is usually not possible; see Amdahl's law :(
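
A quick back-of-the-envelope (a minimal sketch assuming an optimistic 95%
parallel fraction):

    public class Amdahl {
        public static void main(String[] args) {
            double p = 0.95;                         // parallel fraction (assumed)
            for (int cores : new int[]{2, 4, 8, 16, 64, 1024}) {
                double speedup = 1.0 / ((1.0 - p) + p / cores);
                System.out.printf("%4d cores -> %.1fx%n", cores, speedup);
            }
            // Even with infinitely many cores the ceiling is 1/(1-p) = 20x.
        }
    }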

~~~
marcosdumay
Amdahl's law is almost useless in practice. If a resource comes, people will
consume it. If they can't make their code run faster, they'll use it to run
more code at the same time.

Yet all that seems to miss Linus's point.

------
koliber
I cannot access the article because of an error "Error establishing a database
connection". Perhaps if there were parallel computing resources set up to
handle increased load, I could read Linus's argument against parallel
computing.

------
im2w1l
> Where the hell do you envision that those magical parallel algorithms would
> be used?

Parallel sort. Parallel n-ary search. Parallel linear search. A stupid core
for the zero-page thread. Hardware BLAS. Time-stepping multi-object systems.
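
Parallel sort, at least, is already a one-liner in the standard library - e.g.
in Java (rough sketch, array size picked arbitrarily):

    import java.util.Arrays;
    import java.util.Random;

    public class ParallelSortDemo {
        public static void main(String[] args) {
            long[] data = new Random(42).longs(20_000_000).toArray();
            long[] copy = data.clone();

            Arrays.sort(data);          // single-threaded dual-pivot quicksort
            Arrays.parallelSort(copy);  // fork/join merge sort across available cores

            System.out.println(Arrays.equals(data, copy)); // true
        }
    }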

------
jetm9
I really can't get where his comments are directed. Is it at 6-core or 8-core
_marketing_ ARMs? At very, very high core count niche processors? Then he says
on the server side we already have it. What is it that we have?

I don't know the optimum number of cores in mobile CPUs. When the phone is in
your pocket and not being used, a very low-power core can keep the phone going
even with background notifications.

------
eccstartup
What Linus means is that "You guys are so stupid to write parallel code."
Learn physics first!

------
noonespecial
Site seems down. Umm... should have run some parallel instances and
loadbalanced?

------
Beltiras
He simply _has to be_ talking about the kernel, not parallelism in general.

------
edem
It says "Error establishing a database connection".

~~~
touristtam
same here.

------
eleitl
Of course Linux wouldn't run on a parallel system, so "four cores ought to be
enough for everybody" Linus is correct, as far as Linux goes.

------
0xFFC
Holy crap, seems science is going in the wrong direction!

~~~
noobermin
Reading his comment, he seems to be referring to normal users, not things like
scientific cluster jobs.

Still, I'm kind of upset with his comment.

~~~
r00fus
He also excluded ASICs and the like and focused on GP CPUs. Not sure why
that'd upset you - what would you use a desktop (assuming x86 or x64) CPU to
do that actually requires explicit parallelism?

~~~
rasz_pl
We don't need parallelism, we need performance. We have been sitting at 4GHz
for the last 4 years now. Things like games require more and more CPU cycles,
and the only way to add more is additional cores.

Linus's post reads like he thinks there is a choice between fast fat cores and
lots of thin ones. There is no choice; there are no 6-8GHz CPUs around the
corner. We are where we are and there is only one way to "progress".

~~~
szatkus
I'm not sure what you mean. Intel's Core line gained about 30-50% per-clock
performance during the last 4 years (Nehalem -> Haswell).

------
BuckRogers
Hmm site is down, must have needed moar coars.

