
Pessimism about parallelism - ingve
http://esr.ibiblio.org/?p=8223
======
pcwalton
This pessimism about whether software can effectively use CPU cores beyond two
is easily refuted with a simple observation: High-end games are now running
pretty close to 100% _CPU_ utilization on octa-core console hardware.

Parallelizing our software is mostly a question of investment at this point.
In my experience, there isn't a lot of software that wouldn't benefit from CPU
parallelism in some way or another. The bigger issues are (a) that a lot of
software simply doesn't prioritize performance, for both good and bad reasons,
and (b) that a lot of foundational software has been in maintenance mode for
years and there isn't enough interest or funding to improve it.

~~~
chrisseaton
> Parallelizing our software is mostly a question of investment at this point.

This just isn't true - for some algorithms we are simply not aware of any way
to effectively parallelise. No amount of investment is going to suddenly come
up with a solution to that problem.

We're just lucky that games are easy to parallelise.

~~~
gnarbarian
>This just isn't true - for some algorithms we are simply not aware of any way
to effectively parallelise.

That just means we need new algorithms, different habits, and languages which
make concurrency more accessible.

The vast majority of research into algorithms and computer science in general
over the past 75 years has focused on single-threaded, linear computation.

I think the thing we can do right now is spin up threads as often as possible.
It should be your default approach, when writing a new function, to consider
how to make it functionally pure. If you can, spin it out into a new thread
(depending on the overhead in your language). Even if you can't make it
functionally pure, there's still a good chance you can write it in a way that
prevents a potential deadlock or race condition.
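
A minimal sketch of that habit in Rust (just the standard library; the
function and data here are made up for illustration): a pure computation
takes ownership of its input and runs on its own thread while the caller
keeps working.

    use std::thread;

    // A pure function: the result depends only on its input, and no shared
    // state is touched, so it can run anywhere without locks.
    fn checksum(data: Vec<u8>) -> u64 {
        data.iter().map(|&b| b as u64).sum()
    }

    fn main() {
        let data = vec![1u8; 1_000_000];

        // Hand the pure computation to a worker thread...
        let handle = thread::spawn(move || checksum(data));

        // ...do other work here...

        // ...and collect the result when it's needed.
        let sum = handle.join().expect("worker thread panicked");
        println!("checksum = {}", sum);
    }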

My argument is basically this:

1) we will find more algorithms that are better at exploiting concurrent
resources.

2) as programmers we need to change our default behavior from writing linear
programs and then trying to make them take advantage of concurrent hardware to
writing concurrent solutions at every level of application design by default
from the ground up.

~~~
vmchale
> as programmers we need to change our default behavior from writing linear
> programs and then trying to make them take advantage of concurrent hardware
> to writing concurrent solutions at every level of application design by
> default from the ground up.

With parallelism in particular, this is quite easy. Haskell's accelerate
library allows you to write code using maps/folds which then runs on multiple
CPU cores or a GPU (J does something similar but it's CPU-bound).

It turns out that maps and folds can already handle quite a lot.

With concurrency - well, concurrency is hard, and many basic data structures
are in fact quite recent!
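
For a rough feel of that map/fold style - sketched here with Rust and the
Rayon crate rather than accelerate itself - a map followed by a fold runs
across however many cores are available:

    use rayon::prelude::*;

    fn main() {
        let xs: Vec<f64> = (0..10_000_000).map(|i| i as f64).collect();

        // A parallel map followed by a parallel fold (a sum): Rayon splits
        // the slice across worker threads and combines the partial results.
        let sum_of_squares: f64 = xs.par_iter().map(|x| x * x).sum();

        println!("{}", sum_of_squares);
    }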

------
david-gpu
> Your odds of being able to apply parallel-computing techniques to a problem
> are inversely proportional to the degree of irreducible semantic nonlocality
> in your input data.

That's not right. GPUs do pretty well at tasks that have poor locality
compared to CPUs, as long as there's a lot of parallelism to be exploited --
see for example ray tracing.

The reason for this is that CPUs depend on their large caches, which are
useless when there is little locality, while GPUs simply switch to another
warp while data is being loaded from memory, and they have much higher memory
bandwidth anyway.

Also, GPUs are distinctively _not_ systolic arrays. Hennessy's "Computer
Architecture: A Quantitative Approach" has a good primer on the Single
Instruction, Multiple Threads (SIMT) architectures typically seen in GPUs.

~~~
wtracy
SIMT architecture is why GPUs are actually not that great for ray tracing. In
a nontrivial scene, some rays will bounce more times than others, and that
means that the whole parallel execution unit is tied up until every single ray
terminates. (The other "threads" essentially execute NOP instructions in the
meantime.)

Despite what Nvidia would have you believe, a lot of VFX rendering is done on
CPUs, not GPUs, and this is one of the reasons why. (For example, last I
heard, Pixar does all their render work on Intel Xeons.)

------
pornel
I've become a parallelism optimist thanks to Rust.

The borrow checker basically doesn't even have a concept of a single-threaded
program. Everything you write in Rust is required to be thread-safe. This
makes Rust appear more difficult, but it's just front-loading the effort
required to make the whole program thread-safe (which to me felt nearly
impossible for non-trivial C programs). From there you can sprinkle Rayon all
over the place, send async work over channels, etc.
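
A small illustration of that "sprinkle Rayon" step (assuming the rayon crate;
the expensive() function is made up): the only change from the serial version
is iter() becoming par_iter(), and the compiler rejects the change if the
closure touches anything that isn't thread-safe.

    use rayon::prelude::*;

    // Stand-in for some per-item work worth parallelizing.
    fn expensive(n: u64) -> u64 {
        (0..n).fold(0, |acc, i| acc ^ i.wrapping_mul(2654435761))
    }

    fn main() {
        let inputs: Vec<u64> = (0..1_000).collect();

        // Serial version:   inputs.iter().map(|&n| expensive(n)).collect()
        // Parallel version: one word changed.
        let results: Vec<u64> = inputs.par_iter().map(|&n| expensive(n)).collect();

        println!("{} results", results.len());
    }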

The piece I'm missing is an "Amdahl's-law-aware" profiler that would show not
just where code spends time in general, but specifically where the
unparallelized bottlenecks spend time.

~~~
vmchale
Rust is nice, but it's missing the ease of parallelism you get from things
like Haskell's accelerate or repa.

I prefer it for concurrency, but it's not a panacea. In particular, laziness
is necessary for efficient immutable data structures, and immutable structures
are a nice way to avoid some of the problems inherent in writing concurrent
programs.

~~~
samatman
Is this actually true? Clojure is eager and has efficient immutable data
structures.

~~~
vmchale
Some immutable data structures must be implemented strictly, others lazily.

There are known algorithms/data structures where you either have to use
immutability or laziness in order to get an efficient implementation.

------
jandrewrogers
Assertions about the difficulty of implementing effective parallelism reflect
assumptions about software architecture that are common but not universal. The
tools most people reach for tend to be either multithreaded locking of some
type or a functional "everything is immutable" style, neither of which scales
well on real hardware. Complex applications that do use large numbers of cores
efficiently use a software architecture originally developed on supercomputers
but equally suited to ordinary machines with many cores: fine-grained latency
hiding. In practice, this means extremely cheap logical concurrency,
like proper coroutines, combined with a latency-aware scheduler (the harder
part). This may not be "parallelism" in some strict sense but it has the same
effect in terms of utilization and is sometimes used on supercomputers to
maximize throughput.
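
A very crude approximation of that idea, sketched with nothing but OS threads
and a channel as the work queue (real implementations use far cheaper
coroutines and a smarter, latency-aware scheduler): keep several logical
tasks in flight per core so that one task's stall overlaps another's compute.

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    fn main() {
        let cores = 4;            // pretend physical core count
        let tasks_per_core = 8;   // oversubscribe with cheap logical tasks
        let (tx, rx) = mpsc::channel::<u64>();
        let rx = Arc::new(Mutex::new(rx)); // one shared work queue

        let mut handles = Vec::new();
        for _ in 0..cores * tasks_per_core {
            let rx = Arc::clone(&rx);
            handles.push(thread::spawn(move || loop {
                // Pull the next item; stop when the queue is closed and drained.
                let job = rx.lock().unwrap().recv();
                let Ok(n) = job else { return };
                // Stand-in for a long-latency operation (miss, disk, network)...
                thread::sleep(Duration::from_micros(200));
                // ...followed by a little compute. While this task waits,
                // other tasks keep the core busy.
                let _ = (0..n).sum::<u64>();
            }));
        }

        for n in 0..10_000u64 {
            tx.send(n % 1_000).unwrap();
        }
        drop(tx); // close the queue so the workers exit
        for h in handles {
            h.join().unwrap();
        }
    }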

Programming languages often have some level of support for necessary
primitives, such as goroutines, but then hide them behind an unsuitable
scheduler. On the machines where latency-hiding was originally developed as an
architecture (notably Tera's MTA), optimal scheduling was done by the hardware
itself which had perfect visibility into relative latency with no overhead.
Things are not that simple on x86/ARM. If you want to implement effective
latency-hiding, you'll be implementing an equivalent scheduler in software,
and often by inference because sometimes the only way to know the true latency
of something is by doing it. I don't think it is that difficult, but scheduler
design is not something most programmers know how to do well -- it is magic
buried in operating systems and libraries. Nonetheless, you see more examples
of this kind of software design for performance reasons.

There are very few real-world applications that are so serial that latency-
hiding is not effective at using all available cores, particularly in the
server realm where the workload is inherently concurrent. The original target
application for latency-hiding architectures was scaling graph analytics,
which are often not amenable to more classical embarrassing parallelism.

~~~
jungler
My understanding of AAA game engine coding concurs with this analysis. It's
become increasingly common to get more frames of graphics rendered in the same
time span by kicking off the next frame before the first has finished, which
can be extrapolated to a preconfigured buffer of many frames latency, similar
to what is seen today with digital audio. Since game programmers already have
to breathe concurrency problems just to achieve real-time stateful systems,
and even more so to give an online multiplayer experience, this isn't a huge
leap: many issues of this type have been resolved from the beginning with a
double buffer or an intentional frame of latency.

That said, the benefits of scaling up runtime game performance are diminishing
as asset costs increasingly dominate the discussion: the extra scene detail
comes with a price to the production cycle, and that may be what breaks the
games-graphics-tech alliance before any fundamental limits to the technology
are reached.

------
lambda
> 3. Consequently, most of the processing units now deployed in 4-core-and-up
> machines are doing nothing most of the time but generating waste heat.

While it's true that most of the time a 4-core or more machine has idle cores,
I'm not convinced this is a problem.

Yes, most of the time spent on a modern computer consists of doing things that
aren't particularly computationally expensive. But our perception of the speed
of a personal computer isn't based on some kind of steady state performance on
batch jobs; I think that worst-case latency is a lot closer to how we perceive
the speed of a computer. Of course, in specialized domains like HPC the steady
state performance on batch jobs is important, but this article seems to be
discussing personal computers.

Even if they come up rarely, you still hit computational bottlenecks at times
when latency is important. Whether it's some bloated web page, some big Excel
computation, a compilation for us programmers, or the like, there are plenty
of times where you hit some kind of bottleneck in which you can make effective
use of all of your cores, so even if 99% of the time you have idle cores, that
1% of the time can shape your perception of how fast your computer is.

And the fact that most of your CPUs are idle most of the time doesn't mean you
are wasting power. Modern CPU cores conserve power when idle, via P-States
(frequency scaling) and C-States (turning off idle cores).

It is probably true that there is a lot of software out there which could take
better advantage of parallelism but doesn't: in some cases because performance
isn't prioritized, in some cases because of the essential complexity of
parallelizing the code, but also in many cases because of the accidental
complexity of parallelizing code in unsafe languages like C and C++, which can
make it quite difficult to refactor serial code into parallel code without
worrying about introducing significant bugs.

------
vmarsy
I'm more optimistic about parallelism, and multi-core in general.

Multiple cores can have advantages other than parallelism. On mobile, if
there are 8 cores, and if the device/OS is able to figure out the current
load and, based on that, turn off up to 7 of those 8 cores, the energy
savings could be significant on the CPU side. (Depending on how much energy
the CPU uses compared to the radio or screen on a smartphone.)

Another obvious advantage, as pointed out at the end of the article, is
individual processes running independent work next to each other. Such
embarrassingly parallel work can keep all the cores busy pretty well.

------
phkahler
>>We can sum this up with a heuristic: Your odds of being able to apply
parallel-computing techniques to a problem are inversely proportional to the
degree of irreducible semantic nonlocality in your input data.

That's only true at the very high end, where you have so much data that it's
not all physically close and transport time is relevant.

With the reality of stalled clock speeds and multi-core systems, people are
finding the parallelism in many applications. Scaling really appears to be
coming to an end - if not at 7nm then at 5 (or what they'll call 5). I don't
think we're going to meaningfully exhaust our ability to exploit parallelism
before then. Extra cores aren't used in every application, but they are used
by enough things that we should have them.

------
petermcneeley
The problem is the inverse: there is not enough parallelism in hardware.
Single-digit core counts really are not that significantly parallel,
especially when main memory bandwidth is shared. An AVX2 code transform will
give you more performance than an 8-core hyperthreaded code transform,
without all the latency and coordination.

~~~
misnome
On the other hand, Intel tried lots of (slightly slower) cores with the Phi
series - and part of the reason for dropping/sidelining it was that, although
it was an interesting platform for experiments, nobody could work out how to
effectively use that many parallel cores given all of the other constraints,
like memory bandwidth.

~~~
petermcneeley
I highly doubt that the scientific community "could not work out how to
effectively use that many parallel cores", given that the gaming community
effectively used the SPE cores of the PS3. Those cores are considerably more
difficult to use, as they only have small local memories and strong alignment
constraints. If the Phi does not have enough bandwidth to service its cores,
that would simply mean it is not a parallel architecture.

------
anujsharmax
> 3. Consequently, most of the processing units now deployed in 4-core-and-up
> machines are doing nothing most of the time but generating waste heat.

The cores are not being wasted because we don't know how to parallelize
computations - as evidenced by the scientific software run for high-performance
computing applications, which scales up to millions of cores.

Business decisions are responsible for this waste. For example, a lot of
engineering software is licensed per core - so if you are using it on 16
cores, it costs A LOT more than using it on 4 cores. In some cases the
licensing cost of the engineering software is higher than the cost of the
hardware. So no one cares if a few cores are wasted.

~~~
rossdavidh
This would make it a much better idea to run on a machine with only the cores
you are using, though, yes?

~~~
marcosdumay
At some point you want those 16 cores so MS Word will have an acceptable
performance, even if your auto-routing electrical CAD uses just 1 of them to
solve its own NP-hard problem.

------
jillesvangurp
I'm less pessimistic. IMHO an increasingly big factor is the amount of
single-core hardware out there that people have tended to optimize for, at
the cost of not properly utilizing multi-core machines, for the past decade
or so. As quad-core CPUs become more or less standard, this is changing
already. With 16-32 core CPUs becoming more common, the pressure is on to
utilize those as well.

The use cases are much broader than gaming. I'd argue any kind of creative
work involving graphics, audio, or video already utilizes any amount of CPU
you can throw at it. Also, Firefox is leveraging multiple cores with its
ongoing Rust rewrite. I tend to use multiple CPUs when running software
builds. There is a lot of UI code these days that is written asynchronously
from the ground up, meaning utilizing multiple CPUs is less of a big deal
than it used to be.

But yes, you need languages, frameworks, drivers, operating systems, etc. that
can facilitate this. I'd say there is quite a bit of room for improvement here
still.

------
marcosdumay
Most desktop software does not use more than 1 core because most desktop
software does not need anything near 1 full core. How much processing can it
take to show you a button?

That does not change the fact that some software does need more, and people
are wise to dimension their computers taking the largest usual load into
account, not the average one.

That said, the point about memory locality is good. It is just not as
important as the author makes it out to be - data tends to have locality;
even physics pushes it there. But the point about non-parallelizable
algorithms isn't good; who cares about the properties of your algorithm? If
the problem allows for parallelism (with local data), dump the algorithm and
get one that scales.

Also interestingly, the author does bother to link the limits to the von
Neumann architecture, but didn't think about using the extra cores to JIT-
compile the binaries into some other architecture...

~~~
mattnewport
> Most desktop software does not use more than 1 core because most desktop
> software does not need anything near 1 full core. How much processing can it
> take to show you a button?

It seems like this ought to be true but in practice it's not, though the
problem appears to be more due to developers not paying enough attention to
performance than to fundamental limitations of the hardware. I can think of
very few applications that I use on a day to day basis that do not have
multiple areas where I wish they performed better.

------
twtw
I'm not sure what the takeaway here is. If desktop users don't need (or can't
use) more cores, then don't get them - and call an end to the periodic
performance improvements, because CPUs won't get faster without more cores.
Maybe that's fine.

On the other side, high performance computing demands more computational
power, and the only promising path forward (at this time) is parallelism. I'm
not motivated by optimism, but by pragmatism.

As a sidenote, I don't understand this section:

> We look at computing for graphics, neural nets, signal processing, and
> Bitcoin mining, and we see a pattern: parallelizing algorithms work best on
> hardware that is (a) specifically designed to execute them, and (b) can’t do
> anything else!

The same hardware is used for all of these things, and it was only (arguably)
designed specifically for graphics (and even there "specific" is not certain).

~~~
nickpsecurity
"and the only promising path forward (at this time) is parallelism."

Don't forget FPGAs, especially if bundled with a CPU. They still partly work
by parallelism, but I think reconfigurable hardware deserves its own mention.

------
zzzcpan
> We know this because there is plenty of real-world evidence that debugging
> implementations of parallelizing code is worse than merely _difficult_ for
> humans. Race conditions, deadlocks, livelocks, and insidious data corruption
> due to subtly unsafe orders of operation plague all such attempts.

None of these problems have anything to do with parallelism - but everything
to do with "shared memory multithreading". There are plenty of sane
concurrency-model implementations out there. Which brings me to this part:

> But the truth is, we don’t really know what the “right” language primitives
> are for parallelism on von-Neuman-architecture computers.

We do know what the right language primitives are. Well, at least if you
follow concurrency and distributed systems, where it's not even an important
question anymore: you just stick to the actor model as a default choice and
stop caring about these things.
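
For concreteness, a bare-bones actor in that sense, sketched with Rust's
standard channels only (no actor framework): the actor alone owns its state,
and the rest of the program can interact with it only by sending messages.

    use std::sync::mpsc;
    use std::thread;

    // Messages are the only way to touch the counter's state.
    enum Msg {
        Add(u64),
        Get(mpsc::Sender<u64>),
    }

    fn spawn_counter() -> mpsc::Sender<Msg> {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            let mut count: u64 = 0; // owned by this actor alone; no locks
            for msg in rx {
                match msg {
                    Msg::Add(n) => count += n,
                    Msg::Get(reply) => { let _ = reply.send(count); }
                }
            }
        });
        tx
    }

    fn main() {
        let counter = spawn_counter();
        for i in 0..100 {
            counter.send(Msg::Add(i)).unwrap();
        }
        let (reply_tx, reply_rx) = mpsc::channel();
        counter.send(Msg::Get(reply_tx)).unwrap();
        println!("count = {}", reply_rx.recv().unwrap());
    }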

~~~
gpderetta
>Race conditions, deadlocks, livelocks

These have nothing to do with shared memory multithreading. You can easily
have all of these with shared nothing distributed systems.

>and insidious data corruption due to subtly unsafe orders of operation

I'll give you this though.

~~~
zzzcpan
That's stretching it quite a bit. You can implement shared memory
multithreading on top of actor model and have all of the same issues. If you
don't do that though, you won't have those issues. Same way that you can
implement memory unsafe primitives on top of memory safe ones.

~~~
gpderetta
I have seen deadlocks and race conditions in non-shared-memory concurrent
systems many times. They are in no way exclusive to shared-memory systems, or
even more common there. They have nothing to do with shared memory and
everything to do with message ordering.

~~~
zzzcpan
It doesn't have to be shared memory specifically, but it has to be concurrent
use of something shared even if that something is completely virtual and only
exists in your mind. Which then forces you to synchronize access to that
something and deal with all those problems. It's not about message ordering
though, not unless synchronization is involved.

------
waynecochran
Depth First Search is not parallelizable at all? Provably so? Just divvy up
subtrees.

~~~
nothrabannosir
I'm assuming they mean actually Depth First, where splitting up a tree and
traversing branches in parallel violates the "First" part.

It's a bit contrived, because obviously the name implies that it's sequential.
But yeah, sometimes that's precisely what you need.

~~~
waynecochran
Yes. I guess the divvying up is more like breadth first. In most cases where
I would use DFS, this would work.

~~~
wtracy
Nothing that processes nodes in parallel can guarantee either correct depth
first or breadth first ordering.

That said, it's easy enough to get _effectively_ correct ordering by
reassembling the results in the correct order as the workers finish.
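
A sketch of that reassembly idea, using Rust with the Rayon crate (an
assumption; any fork/join library would do): subtrees are visited in
parallel, but each node concatenates its children's results in child order,
so the output matches a sequential depth-first (preorder) traversal.

    use rayon::prelude::*;

    struct Node {
        value: u32,
        children: Vec<Node>,
    }

    // Visit subtrees in parallel, then stitch the results back together in
    // child order so the output equals a sequential preorder DFS.
    fn dfs_preorder(node: &Node) -> Vec<u32> {
        let child_results: Vec<Vec<u32>> =
            node.children.par_iter().map(dfs_preorder).collect();

        let mut out = vec![node.value];
        for mut r in child_results {
            out.append(&mut r);
        }
        out
    }

    fn main() {
        let tree = Node {
            value: 1,
            children: vec![
                Node { value: 2, children: vec![Node { value: 4, children: vec![] }] },
                Node { value: 3, children: vec![] },
            ],
        };
        println!("{:?}", dfs_preorder(&tree)); // [1, 2, 4, 3]
    }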

------
devxpy
Well maybe not in desktops, but don't you think that mobile devices (even
micro controllers, like ESP8266) have a lot to benefit from multi-core, even
for just regular apps that have nothing to do with number crunching/gaming
etc.?

~~~
foldr
There's probably still plenty of scope for increasing single-core performance
on devices like that.

~~~
devxpy
Well here’s the thing though, I don’t feel like parallelism is just about
speed. I for one like the idea of being able to do 10 things at once, with
all tasks running at the same speed. It’s hard to do with current software,
yes. But there’s a lot of room for improvement, no?

~~~
foldr
If you make the processor 10x faster, then you can effectively do 10 things at
once without having multiple cores.

~~~
ben-schaaf
This isn't quite true. There's a lot more to CPU performance than how fast it
can process instructions. Someone configured Intel's 9900K as 8 cores at
500 MHz and compared it to 1 core at 4 GHz; the actual performance
characteristics are vastly different. If I had to choose between 10x more
cores or 10x faster cores, I'd choose more cores.

~~~
wtracy
I'm having trouble imagining a performance benefit to two (otherwise
identical) 1 GHz cores over one 2 GHz core, unless:

1. The two processes cause frequent evictions of the CPU cache in some way
that two cores can mitigate by having separate caches.

2. You are able to pin a process to a CPU core, and eliminate all the
overhead of running the process scheduler.

I would be curious to see the benchmark you're referring to. My gut tells me
that the performance differences come from comparing different processor
families.

~~~
MaulingMonkey
The faster your core, the more sensitive you are to latency, even if you have
infinite throughput. Say your DRAM has a latency of 65ns - on a single 2ghz
core, an L3 cache miss is going to be "twice" as expensive in terms of clock
cycles (130) as it would be on a 1ghz core (65).

So, to take a super contrived example, say you have a perfectly parallel
program that needs to run 2B cycles worth of instructions with 1M L3 cache
misses. On a single 2 GHz core that might take:

(2B cycles + 130 cycles / cache miss * 1M cache misses) = 2130 M cycles (1.065
seconds @ 2 GHz)

On a dual 1 GHz core, both _sharing_ a single L3 cache of the same size (so
the same number of L3 cache misses), that might take:

(2B cycles + 65 cycles / cache miss * 1M cache misses) = 2065 M cycles (1.0325
seconds @ 2x1 GHz)

Which is slightly faster. As long as you're bound by latency rather than
throughput, and can perfectly parallelize the problem, more cores instead of
more raw clock speed will win out, even with identical hardware and the two
cores actually sharing some stuff (the L3 cache) and ignoring the bonuses of
fewer L1/L2 cache misses (because they aren't shared, "doubling" your L1/L2
cache.) The machine I'm typing this from has 4MB of L3 cache and I frequently
deal with hundreds of gigs of I/O. Suffice it to say, there are a _lot_ of L3
cache misses.

Moving away from identical hardware - slower cores can get away with less
speculative execution / shorter instruction pipelines, which make things like
branch mispredictions cheaper in terms of cycle counts as well. This is why
modern GPUs end up with hundreds of cores getting up into the 1 GHz range or
so. They deal with embarrassingly parallel workloads, and can get better
performance by upping core counts than they can by upping clock speeds.

~~~
wtracy
It seems to me that the speed increase in your example comes from having twice
the cache and therefore half the misses. You can do that without adding cores.
(Barring edge cases where two threads use different pages that happen to share
the same cache line.)

As for reducing the need for speculative execution: You're just replacing
implicit parallelism with explicit parallelism. Whether or not that is a win
depends on the workload and the competency of your developers. :-)

~~~
MaulingMonkey
> It seems to me that the speed increase in your example comes from having
> twice the cache and therefore half the misses.

My math assumed a shared L3 cache of the same size, no "twice the cache", no
"half the misses". _Same number_ of misses, but each miss is effectively half
as costly, because it's only stalling half your processing power (one of your
two 1 Ghz cores) for X number of nanoseconds instead of all of it (your single
2 Ghz core).

> As for reducing the need for speculative execution: You're just replacing
> implicit parallelism with explicit parallelism.

Yes and no. If you're already explicitly parallel (more and more common),
you're just eliminating a redundant implicit mechanism that's based on
frequently wrong branch prediction heuristics. One can optimize for those, but
then one can argue just how "implicit" it really is...

> Whether or not that is a win depends on the workload and the competency of
> your developers. :-)

Of course. But there's a lot of embarrassingly parallel work out there, and
improving building blocks, that doesn't take a genius to operate. Pixel
shaders, farming video frames out to render farms, compiling separate C++
TUs... these are all things already made explicitly parallel on my behalf.
Really maxing out the performance of a single modern ~4 GHz core is no simple
feat either.

~~~
wtracy
>Same number of misses, but each miss is effectively half as costly

Wow, I'm still trying to wrap my head around that.

I was going to argue that in cases where there's no way to split an algorithm
across multiple threads without extremely frequent IPC, pipelining can
probably get better performance than multiple cores, but then I realized that
I have absolutely nothing to base that claim on.

I still wouldn't opt for more cores rather than higher clock speed without
first trying to benchmark my expected workloads, but you have given me some
very interesting things to think about.

------
riskable
Seems like the solution is "the Unix way": Use a lot of little, single-purpose
applications for completing larger tasks. That way you can let the OS worry
about the parallelization.

Even if you're making a large, multi-purpose app like, say, a spreadsheet, you
could still subdivide zillions of little tasks into their own applications. On
some OSes this makes little sense because of the overhead associated with
spawning new processes, but on Linux it'd be great.
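
A sketch of that process-per-task shape in Rust ("./worker" is a hypothetical
single-purpose helper, not a real tool): the parent only spawns the little
programs and lets the OS schedule them across cores.

    use std::process::Command;

    fn main() {
        // Each invocation of the hypothetical helper handles one small task;
        // the OS runs the processes in parallel on whatever cores are free.
        let mut children = Vec::new();
        for i in 0..8 {
            let child = Command::new("./worker")
                .arg(i.to_string())
                .spawn()
                .expect("failed to start worker");
            children.push(child);
        }

        // Wait for every worker; the heavy lifting happened in other processes.
        for mut child in children {
            let status = child.wait().expect("failed to wait on worker");
            assert!(status.success());
        }
    }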

~~~
zozbot123
> you could still subdivide zillions of little tasks into their own
> applications.

This is the approach Rust+Rayon are taking. Rust is in a good position to
enable this because of language-level guarantees. Using "a lot of little,
single-purpose" tasks means that each task has to own its memory; Unix of
course enforces this separation, but this comes with some overhead. Language-
level discipline can achieve much the same thing, more efficiently.

------
sevensor
I've used satisfiability solvers, but I don't know enough about their
internals to evaluate ESR's claim that satisfiability is _intrinsically_
serial. ESR's argument seems to hinge on this idea, that regardless of how
well you handle parallelism, it's not going to help with certain tasks.

~~~
alan-crowe
If I had to write a SAT solver to use lots of cores on an arbitrary boolean
formula, I would farm out the work like this:

First, walk the tree to find the variable, X, that occurs most often. There
is a weighting issue: occurrences higher up the tree count more.

Then I would create two expressions by substituting X = true and X = false.

Now two cores can race to see which can find a satisfying assignment first.
Clearly this division can be applied recursively until all the cores are busy.
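
A toy version of that splitting scheme, sketched with Rayon's join and a
brute-force satisfiability check (variable selection here is just "next
unassigned variable" rather than the weighted choice described above, and
join waits for both branches instead of cancelling the loser):

    use rayon::join;

    type Clause = Vec<i32>;    // e.g. [1, -2] means (x1 OR NOT x2)
    type Formula = Vec<Clause>;

    // Check a complete assignment against every clause.
    fn satisfied(f: &Formula, assign: &[bool]) -> bool {
        f.iter().all(|clause| {
            clause.iter().any(|&lit| {
                let value = assign[(lit.abs() - 1) as usize];
                if lit > 0 { value } else { !value }
            })
        })
    }

    // Substitute var = true in one branch and var = false in the other;
    // rayon::join runs the branches on different cores when any are free.
    fn solve(f: &Formula, assign: &[bool], var: usize) -> Option<Vec<bool>> {
        if var == assign.len() {
            return if satisfied(f, assign) { Some(assign.to_vec()) } else { None };
        }
        let mut with_true = assign.to_vec();
        let mut with_false = assign.to_vec();
        with_true[var] = true;
        with_false[var] = false;
        let (a, b) = join(
            || solve(f, &with_true, var + 1),
            || solve(f, &with_false, var + 1),
        );
        a.or(b)
    }

    fn main() {
        // (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
        let f: Formula = vec![vec![1, 2], vec![-1, 3], vec![-2, -3]];
        println!("{:?}", solve(&f, &vec![false; 3], 0));
    }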

~~~
sevensor
This is appealing, but I think you're only going to have full utilization for
a short time? Many of your partitions of the problem are going to come back
unsatisfiable, and then those cores go idle.

~~~
Const-me
I have recently parallelized a 3D flood fill algorithm; it’s essentially a
graph search on a 3D grid of nodes. More precisely, the problem is known as
connected-component labeling: [https://en.wikipedia.org/wiki/Connected-
component_labeling](https://en.wikipedia.org/wiki/Connected-
component_labeling)

The solution is to keep partitioning until you saturate all cores. As soon as
all are saturated, stop partitioning and compute. As soon as any task
finishes, resume partitioning.

On the lower level, I used a thread pool. BTW, the infrastructure for that is
built into Windows; see the SubmitThreadpoolWork API:
[https://docs.microsoft.com/en-
us/windows/desktop/api/threadp...](https://docs.microsoft.com/en-
us/windows/desktop/api/threadpoolapiset/nf-threadpoolapiset-
submitthreadpoolwork)
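
The general shape of that "partition until saturated, then compute" strategy,
sketched with Rayon's join instead of the Windows thread pool (process_region
is a hypothetical stand-in for the real per-region labeling work, and Rayon's
work stealing covers the "resume partitioning when a task finishes" part):

    use rayon::join;

    // Hypothetical stand-in for the real work, e.g. labeling one block of
    // the 3D grid during a connected-component pass.
    fn process_region(region: &[u32]) -> u64 {
        region.iter().map(|&v| v as u64).sum()
    }

    // Keep splitting while there is still parallelism worth feeding; below
    // the budget, fall back to plain sequential work on the sub-region.
    fn process(region: &[u32], depth_budget: u32) -> u64 {
        if depth_budget == 0 || region.len() < 2 {
            return process_region(region);
        }
        let (left, right) = region.split_at(region.len() / 2);
        let (a, b) = join(
            || process(left, depth_budget - 1),
            || process(right, depth_budget - 1),
        );
        a + b
    }

    fn main() {
        let data: Vec<u32> = (0..1_000_000).collect();
        // Roughly log2(thread count) levels of splitting saturates the
        // cores; a couple of extra levels leaves slack for load balancing.
        let levels = (rayon::current_num_threads() as f64).log2().ceil() as u32 + 2;
        println!("{}", process(&data, levels));
    }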

------
rossdavidh
All interesting, but if we're talking about typical laptop/desktop users,
beside the point, I think. The reality is that the entire computer's
performance is not the bottleneck, for most of what a typical user is doing,
most of the time. The limitation on performance is either:

1) in the user's head (e.g. a word processor and an author)

2) in the network the computer is attached to (much of what is done on the
web)

Getting faster/better performance out of your laptop/desktop won't help any,
for the reason that your computer's performance is already not the limiting
factor, for most of what a typical user is doing. Although, this does
reinforce the point that adding even more cores to a typical user's
laptop/desktop isn't really accomplishing much, but for much simpler reasons
than the issues discussed in the article.

~~~
pjc50
3) in the storage

4) in the antivirus insisting on chewing each piece of IO very slowly in case
it might have a virus

------
IshKebab
> I coined the term “SICK algorithm”, with the SICK expanded to “Serial,
> Intrinsically – Cope, Kiddo!”

Not only is that awful, but something is only "coined" if other people
actually use it. You can't just say "I will coin this!".

------
mcguire
" _My regulars include a lot of people who are likely to be able to comment
intelligently on this suspicion. It will be interesting to see what they have
to say._ "

------
smallstepforman
ESR doesn’t know of the Actor programming model. Interesting. In all honesty,
neither did I until 16 months ago. Now I’m transitioning our parallel code to
an Actor model.

~~~
nickpsecurity
Check out Pony language. It's like Rust for Actor-based programming. Wallaroo
uses it in their production database. So, it's probably decent quality for a
new language.

[https://www.ponylang.io/discover/#what-is-
pony](https://www.ponylang.io/discover/#what-is-pony)

------
avmich
What does he mean by "bad locality"?

Why can't depth-first search be parallelized?

------
crimsonalucard
DFS is not parallelizable? Why not fork at every recursive call?

------
glenrivard
One thing that is not talked about a lot is parallelism in terms of the
kernel and I/O.

There are the application-level aspects, but what about system code?

If you read the FlexSC paper, you can see that some pretty incredible
improvements are possible by using multiple cores.

[https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Soares.pdf)

We should also see some of this with Zircon as it gets further along and looks
to leverage cores in a type of pipelining.

