
The problem with parallelisation - 00_NOP
https://cartesianproduct.wordpress.com/2020/01/03/the-problem-with-parallelisation/
======
darknoon
This guy really misses the point. You don't put 1024 x86 cores on a chip
because the programming model of individual threads being scheduled ad hoc
doesn't scale that well (and imposes a lot of overhead on the control
apparatus).

But on a GPU, it is typical to have well over 1000 cores using a different
programming model that accounts for memory access and grouping threads
together to accomplish shared work and shared memory accesses.

So, the same work is now being done with slightly different algorithms that
exploit this programming model.

~~~
zozbot234
CUDA/SPIR-V "cores" are not true cores - they're the equivalent of CPU
execution units (ALUs). You don't really "group threads together" on a GPU
so much as use a single real "thread" to execute a whole bunch of computations
in lockstep, using a lightly extended SIMD model. That's the only way you can
make such a huge number of execution units/ALUs useful, and not all domains
are well suited to this approach. (Which is one reason why a _lot_ of code
still runs on plain multicore CPUs.)
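
A minimal NumPy sketch of what that lockstep model means in practice (not
actual GPU code, just an illustration): a branch over a vector isn't a
per-lane jump, it's both sides computed for every lane with a mask selecting
the result - which is exactly why divergent branches waste work.

```python
import numpy as np

# 32 "lanes" executing in lockstep, as in a GPU warp or a SIMD unit.
x = np.arange(32, dtype=np.float64)

# A scalar thread would branch: if x > 15 take sqrt, else square.
# In lockstep execution there is no per-lane branch: both sides are
# evaluated for every lane, and a mask selects each lane's result.
mask = x > 15
result = np.where(mask, np.sqrt(x), x * x)

print(result[:4])    # lanes 0..3: squared values
print(result[16:20]) # lanes 16..19: square roots
```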

~~~
sdenton4
Yah, though I think there is more to be gained than is often realized. Many
algorithms can be rewritten to run much faster on GPUs, though it takes some
effort (and a different perspective, as a designer and programmer).

I rewrote a Toeplitz matrix solver not so long ago; it was pretty fun! Treat
conditional statements as plagues, and don't worry as much about early
stopping... Every example takes the worst-case time to handle, but you can do
thousands of them at the same time. I ended up getting about a 1000x speedup
against the baseline.
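
Not the Toeplitz solver itself, but a toy sketch of the same style: a batched
Newton iteration for square roots (my example, not from the post) that runs a
fixed worst-case number of steps for every element instead of testing each one
for convergence and exiting early.

```python
import numpy as np

def batched_sqrt(a, iters=30):
    """Newton's method on every element of the batch at once.

    No per-element convergence test or early exit: every element runs
    the worst-case number of iterations, which keeps the inner loop
    branch-free and trivially data-parallel.
    """
    x = np.where(a > 1, a / 2, 1.0)  # rough initial guess, always >= ~0.5
    for _ in range(iters):
        x = 0.5 * (x + a / x)        # Newton step for x^2 - a = 0
    return x

a = np.linspace(1.0, 10000.0, 100000)
err = np.max(np.abs(batched_sqrt(a) - np.sqrt(a)))
print(err)  # worst-case error over the whole batch
```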

------
timerol
Counterpoint: The NVIDIA Titan RTX has 4608 CUDA cores. We won't see 1024-core
CPUs because GPUs cover that use case already.

~~~
jcranmer
If you're treating that number as accurate, then I'm running on an 896-core
x86 computer.

As a comparison between GPUs and CPUs, each "CUDA core" is approximately equal
to a SIMD lane on an x86 core. So 56 cores, times 16 lanes (AVX-512) per core,
is 896 cores.
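
Spelling out that arithmetic (AVX-512 registers are 512 bits wide, so 16
lanes of 32-bit single-precision floats per core):

```python
# 512-bit registers / 32-bit single-precision floats = 16 lanes per core
lanes_per_core = 512 // 32
cores = 56
print(cores * lanes_per_core)  # 896 "cores" by CUDA-core accounting
```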

~~~
chrchang523
(initial comment deleted since one part addressed integer 'lanes' instead of
the 32-bit FP that's more directly relevant in a comparison with a GPU, and
the other part was incorrect.)

~~~
jcranmer
I'm counting the number of single-precision FP units. Although, strictly
speaking, AVX-512 hardware actually has two FMA units per core.

~~~
vardump
I thought AVX-512 ALU resources varied between models?

~~~
jcranmer
This is a Skylake server part (which should be implied by the high core
count), which is using this block diagram:
[https://www.researchgate.net/figure/Skylake-Microarchitectur...](https://www.researchgate.net/figure/Skylake-Microarchitecture-CPU-Core-Block-Diagram-6_fig5_332543387)

------
mattnewport
"For code that is 99.9% parallel then using 1000 processors (each of which is
about 250 times slower than the one faster chip they collectively replace) we
can double the speed, more or less."

Where does that 250x slower come from? That doesn't seem a very reasonable
assumption.
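
Wherever the 250x comes from, the "double the speed" part at least is just
Amdahl's law with those numbers plugged in (a quick check, normalizing the
fast chip's time to 1):

```python
def amdahl_speedup(serial_frac, n_cores, slowdown):
    """Speedup of n slow cores vs one fast core, via Amdahl's law.

    Time on the fast core is 1. Each slow core is `slowdown` times
    slower; the serial fraction runs on one slow core, the parallel
    fraction is spread evenly across all of them.
    """
    parallel_frac = 1.0 - serial_frac
    slow_time = slowdown * (serial_frac + parallel_frac / n_cores)
    return 1.0 / slow_time

print(amdahl_speedup(0.001, 1000, 250))  # ~2.0, "double the speed"
```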

~~~
theandrewbailey
It's almost as if the author took a conventional quad-core CPU and imagined it
divided up into 1000 cores instead of 4.

~~~
00_NOP
If you read the paper referenced in the article, you will see that it comes
from the formula used there to define the Pareto frontier of CPU performance.

------
QuadmasterXLII
It's true that you can't do the same work 1000 times faster with 1000 cores,
but what analysis like this misses is that often you can do 1000 times more
work, work that you wouldn't even attempt with a single core system.

~~~
sgt101
Yes - so if you put 1000 cores with 1000 caches on a chip you could run 1000
VMs, or perhaps even 10,000. Why aren't there chips like this? I think it's
because it would blow apart the server manufacturers' business models, and
Intel's business model as well.

~~~
eesmith
Those caches take up space. Or rather, if you could have 1000 cores each with
16 MiB L3 on-die cache, then why don't we see machines now with 10 cores and
1.6 GiB on-die L3 cache?

After all, there already exists plenty of software (including some I wrote)
which is limited by memory bandwidth, or by latency.

Also, all of those 1000 cores would compete for the same main-memory
bandwidth, so are much more likely to be bandwidth starved.

------
jupp0r
OP does not talk about data parallelism vs task parallelism at all.

SIMD instructions and GPGPU have come a long way, and while clock rates have
somewhat stagnated, we can solve many problems on consumer hardware today that
were impossible in Pentium 3 days (big ML models, augmented reality, live ray
tracing, ...). It was a painful transition from existing computation models,
but it has been a gigantic success imho.

------
proc0
This is interesting, and I would enjoy reading more about it in the white
paper when it's shared. Could it be that the next paradigm shift in computing
is really the mastery of parallelization techniques baked into the hardware?
I'm seeing a similar theme in recent posts about microservices as biological
systems[1] and that Cerebras chip[2], which could mean that instead of
improving monolithic performance, multiple monoliths converge into a new,
higher-level processing paradigm that is able to go above and beyond the
performance of the individual architecture improvements.

[1] [https://battlepenguin.com/tech/microservices-and-biological-...](https://battlepenguin.com/tech/microservices-and-biological-systems/)

[2] [https://spectrum.ieee.org/semiconductors/processors/cerebras...](https://spectrum.ieee.org/semiconductors/processors/cerebrass-giant-chip-will-smash-deep-learnings-speed-barrier)

------
Symmetry
I wonder, have people done research on the parallelizability of different
tasks? This is hard in that there are obviously different algorithms with
different characteristics for the same problem, where you might trade off
parallelizability for computational difficulty. But there seem to be some
domains, like graphics, where million-way parallelism isn't uncommon.

We tend to think that replacing an O(n) algorithm with an O(n ln(n)) algorithm
is a bad idea. But if it lets you spread your problem over multiple threads it
might very well be worth it. Within a certain range, the power and silicon
required to execute sequentially at a certain speed is, to simplify hugely, a
bit more than the square of that speed, so there's a lot of potential for
improvement.
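
A classic instance of that tradeoff is the prefix sum: the obvious sequential
scan does O(n) additions, while the Hillis-Steele parallel scan does
O(n log n) additions but in only O(log n) dependent steps. A NumPy sketch
(each shifted add stands in for one step in which all lanes fire at once):

```python
import numpy as np

def hillis_steele_scan(a):
    """Inclusive prefix sum: O(n log n) work, but only O(log n) depth.

    Each pass adds a shifted copy of the array to itself; on a parallel
    machine all the additions within a pass happen simultaneously, so
    the extra total work buys a logarithmic number of dependent steps.
    """
    x = a.astype(np.float64).copy()
    shift = 1
    while shift < len(x):
        x[shift:] += x[:-shift].copy()  # copy avoids in-place aliasing
        shift *= 2
    return x

a = np.arange(1, 9)
print(hillis_steele_scan(a))  # same result as np.cumsum(a)
```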

Certainly anything a human brain can do can be done in an embarrassingly
parallel way which ought to be of some comfort.

------
petermcneeley
I think Cliff Click, of "Java on 1000 cores" fame, might disagree.

[https://youtu.be/5uljtqyBLxI](https://youtu.be/5uljtqyBLxI)

~~~
eesmith
That was 54 cores per die, 864 cores max.

This essay concerns 1024-core _chips_. ("When I started the PhD as a part-time
student in 2012 the firm expectation in industry was that we would by now be
well into the era of 1024-core chips. That simply hasn’t happened because, at
least in part, there is no firm commercial reason for it")

------
gnufx
I'm not sure what the cores actually are, but the PEZY chips have 1024 and
2048 cores, and Epiphany V taped out at 1024. There are others, at least
Kalray and Sunway (TaihuLight), with 256 cores. They may not be sufficiently
general-purpose -- TaihuLight is imbalanced against memory bandwidth -- but
they can clearly do well on the right sort of problem.

~~~
00_NOP
Epiphany did indeed tape out at 1024 - which is why there is the
acknowledgement that large scale manufacture of such chips is probably
perfectly possible. But Epiphany never went to manufacture, which is sort of
the point here.

------
Google234
See:
[https://en.wikipedia.org/wiki/Amdahl%27s_law](https://en.wikipedia.org/wiki/Amdahl%27s_law)
Not sure what this PhD is adding.

