
More efficient memory management could enable chips with thousands of cores - Libertatea
http://news.mit.edu/2015/first-new-cache-coherence-mechanism-30-years-0910
======
Pyxl101
I like the factual tone of the article, and specifically that it honestly
mentions, in a quote from the researchers, that there has been no practical
performance benefit in benchmarks yet. I wish more journalism had this
balance, and I try to praise it when I see it. It's exciting while avoiding
hype.

~~~
langarto
Well, the title is neither factual nor honest. It is completely hyperbolic.

The idea looks interesting (I have not yet read the actual paper), but calling
it "the first new coherence mechanism in 30 years" is ridiculous.

~~~
Anderkent
What other innovations in cache coherency handling have been created in the
last 30 years?

~~~
langarto
As scott_s says, you could start by looking at the papers presented at PACT
in recent years (not all are about coherence, but there are a few almost
every year). You should also look at ISCA and HPCA.

In fact, you could start by looking at the Related Work section of this paper
itself. The version at
[https://people.csail.mit.edu/devadas/pubs/tardis.pdf](https://people.csail.mit.edu/devadas/pubs/tardis.pdf)
is better. It is quite telling that the paper does not make the outrageous
claim in the title of MIT's press release.

O(log N) memory overhead per block is nothing new. There were commercial
systems in the 1990s that achieved that (search for SCI coherence). Note that
there are other overheads to consider (notably latency and traffic).

This paper is very interesting and looks sound, but MIT's press release makes
it look silly.

Excuse me for not even trying to summarize the last 30 years of research in
this field.

------
danbruc
Sounds pretty much like a Lamport clock [1] in hardware.

[1]
[https://en.wikipedia.org/wiki/Lamport_timestamps](https://en.wikipedia.org/wiki/Lamport_timestamps)
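
For anyone who hasn't seen one, a minimal Lamport clock sketch in C
(illustrative only; in Tardis the timestamp logic would live in the cache
hierarchy, not in software):

    #include <stdint.h>

    /* Toy Lamport clock: tick on local events and sends; on receive,
       take max(local, message) + 1. Purely logical -- no wall clock. */
    typedef struct { uint64_t time; } lamport_clock;

    static uint64_t lamport_tick(lamport_clock *c) {
        return ++c->time;                 /* local event or send */
    }

    static void lamport_recv(lamport_clock *c, uint64_t msg_time) {
        c->time = (msg_time > c->time ? msg_time : c->time) + 1;
    }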

~~~
assface
Lamport clocks are logical. You can think of these as physiological.

------
robmccoll
This seems kind of hyperbolic to me (not unlike most MIT news releases). There
have been plenty of new cache coherence mechanisms in the last 30 years. This
may be the greatest departure from the classic MOESI and friends, but it's
certainly not like the research community has been sitting on its hands all
this time.

------
tkinom
I implemented a distributed message system/API a long time (10+ years) ago on
SMP, AMP, and x86 CPUs that was completely lock-free and non-blocking. The
APIs/system ran in both userspace and Linux kernel space.

One thing the APIs depended on was atomic add. I tried to reach 10 million
msg/second between processes/threads within an SMP CPU group at that time. At
10 million msg/s, the APIs had 100ns to route and distribute each message. The
main issue was uncached memory access latency, especially for the atomic-add
variables. The uncached memory latency was 50+ns on DDR2 back then, measured
on a 1.2GHz Xeon. It was hard to get that performance.

I even considered adding an FPGA on PCI/PCIe that could be mmapped to a
physical/virtual address and would auto-increment on every read access, to
get a very high performance atomic add.

If that same FPGA were mapped into 128, 256, or 1024 cores, one could easily
build a very high speed distributed sync message system. Hopefully 10+
million msg/second across 1024 cores.

That would be cool!
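
Roughly, the atomic-add trick looks like this in modern C11 atomics (a
stand-in sketch, not the original code; the consumer side, ready flags, and
wraparound back-pressure are all omitted): one fetch-and-add claims a unique
slot in a shared ring, so producers never lock or block.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SIZE 4096                /* power of two */

    typedef struct {
        _Atomic uint64_t next_slot;       /* the hot atomic-add counter */
        void *slots[RING_SIZE];           /* message pointers */
    } msg_ring;

    /* Claim a unique slot with a single atomic add: lock-free and
       non-blocking, but every contended increment still pays the
       coherence/memory round trip -- the 50+ns measured above. */
    static void ring_publish(msg_ring *r, void *msg) {
        uint64_t slot = atomic_fetch_add_explicit(&r->next_slot, 1,
                                                  memory_order_relaxed);
        r->slots[slot & (RING_SIZE - 1)] = msg;
    }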

------
lorenzhs
The paper is available at
[http://arxiv.org/pdf/1501.04504.pdf](http://arxiv.org/pdf/1501.04504.pdf)
[pdf]

~~~
sanxiyn
A proof of the protocol is in a separate paper, available at
[http://arxiv.org/abs/1505.06459](http://arxiv.org/abs/1505.06459)

------
omouse
We're having enough trouble utilizing just a few cores and dealing with
parallelism and concurrency; I'm not sure how a hardware improvement is going
to help.

~~~
smrtinsert
Who is "we"?

~~~
TodPunk
The mainstream programming industry at large, mostly web, desktop, or mobile
high-level programming. Not an entirely useful generalization, but one common
enough to assume when "we" is used in such a context.

~~~
stefantalpalaru
> largely in web

No, here we simply increase the number of upstream servers managed by a
frontend like nginx.

------
sahaj
N00b question: with more cores, wouldn't there also be more overhead in
"managing" process execution? There are probably some specific applications
where it makes sense to have more cores, but does the everyday mobile user
benefit from having thousands of cores?

~~~
jeffreyrogers
I think the consensus is that when you get to more than about 4 cores, the
performance benefits turn negative for general-purpose computing.

~~~
hyperpape
I was surprised to hear this (though not because I know otherwise). What does
general purpose computing mean in this context?

~~~
jeffreyrogers
Basically the workload that an average person is going to have: a bunch of
processes running simultaneously with varying degrees of CPU and memory
intensity.

In special cases you can get better performance from more, but less beefy,
cores (graphics is a prime example); in general, though, a few powerful cores
perform better. The main reason is that communication among cores is hard and
inefficient, so only embarrassingly parallel programs work well when divided
among many cores. Plus, the speedup you get from parallelizing a program is
minor in most cases. See Amdahl's law[1] for more on this topic.

Also, I'm not an expert in this area, but I have some familiarity with it. So
hopefully someone with a bit more experience can come and confirm (or refute)
what I've written.

[1]:
[https://en.wikipedia.org/wiki/Amdahl's_law](https://en.wikipedia.org/wiki/Amdahl's_law)
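
To make that concrete, here is a tiny Amdahl's-law calculation (plain C,
numbers purely illustrative): even a program that is 95% parallelizable can
never exceed a 20x speedup, no matter how many cores you add.

    #include <stdio.h>

    /* Amdahl's law: speedup(n) = 1 / ((1 - p) + p/n), where p is the
       parallel fraction of the program and n is the core count. */
    static double amdahl(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        const int cores[] = {4, 64, 1024};
        for (int i = 0; i < 3; i++)       /* p = 0.95: 95% parallel */
            printf("%4d cores -> %.1fx\n", cores[i], amdahl(0.95, cores[i]));
        /* prints ~3.5x, ~15.4x, ~19.6x; the ceiling is 1/(1-p) = 20x */
        return 0;
    }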

------
nly
I remember watching a talk on the C++ memory model, memory ordering, and
atomics where it was claimed that the CPU industry was moving toward
sequential consistency because cache-coherency protocols aren't expected to
become a practical bottleneck in the near future.

This is good news for actually scaling caches with cores, though, given how
much die space is used for cache compared to cores.

------
acd
Sounds like a nature-inspired design modeled on the brain: thousands of cores
at slower speeds.

Most likely we will have a few speedy main cores for non-parallel code, like
mobile ARM big.LITTLE, and then thousands of slower cores running in
parallel. Will there be transparent cloud execution where workloads can
migrate from the local CPU out to a massive cloud CPU and back?

Kind of like a GPU, but for the main CPU?

------
stcredzero
Apple's shown that pro and consumer-level workstations and laptops have
reached the maturity level where integration matters much more than
optimization of components. Is it about time for integration between hardware
designers and language designers? Languages with message-passing semantics
could benefit from a hardware architecture built to support them. Functional
languages with persistent collections might likewise benefit from specific
hardware designs.

I suspect that an Erlang-like language running on hardware specifically
designed to support it could achieve tremendous scalability.

------
dazam
This seems very much like the Disruptor[0] pattern implemented in hardware.

[0] [https://lmax-exchange.github.io/disruptor/](https://lmax-exchange.github.io/disruptor/)
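
The resemblance makes sense: the Disruptor's core idea is a preallocated ring
buffer where producers and consumers coordinate only through monotonically
increasing sequence numbers. A stripped-down single-producer sketch in C
(illustrative; the real LMAX library is Java and far more involved):

    #include <stdatomic.h>
    #include <stdint.h>

    #define BUF_SIZE 1024                 /* power of two */

    typedef struct {
        long data[BUF_SIZE];
        _Atomic int64_t published;        /* last visible sequence, init -1 */
    } ring;

    /* Single producer: write the entry, then publish its sequence. */
    static void publish(ring *r, int64_t seq, long value) {
        r->data[seq & (BUF_SIZE - 1)] = value;
        atomic_store_explicit(&r->published, seq, memory_order_release);
    }

    /* Consumer: spin until the producer has published up to seq;
       coordination is only through sequence numbers, never locks. */
    static long consume(ring *r, int64_t seq) {
        while (atomic_load_explicit(&r->published, memory_order_acquire) < seq)
            ;                             /* real code would back off */
        return r->data[seq & (BUF_SIZE - 1)];
    }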

------
jhallenworld
Cache coherency is kind of the same problem as database locking. Use MVCC.
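
The analogy holds up in spirit: MVCC tags each version of a value with a
timestamp so readers get a consistent snapshot without blocking writers,
which is loosely what Tardis's timestamped cache lines achieve. A toy sketch
in C (hypothetical names, single writer, no eviction or bounds handling):

    #include <stdatomic.h>
    #include <stdint.h>

    #define MAX_VERSIONS 8

    /* Toy MVCC cell: writes append a timestamped version; a reader at
       snapshot time T sees the newest version with ts <= T and never
       blocks the writer. */
    typedef struct {
        uint64_t ts[MAX_VERSIONS];        /* ascending timestamps */
        long     val[MAX_VERSIONS];
        _Atomic uint32_t count;
    } mvcc_cell;

    static void mvcc_write(mvcc_cell *c, uint64_t ts, long v) {
        uint32_t i = atomic_load_explicit(&c->count, memory_order_relaxed);
        c->ts[i] = ts;                    /* single writer assumed */
        c->val[i] = v;
        atomic_store_explicit(&c->count, i + 1, memory_order_release);
    }

    static long mvcc_read(mvcc_cell *c, uint64_t snapshot) {
        uint32_t n = atomic_load_explicit(&c->count, memory_order_acquire);
        long v = 0;                       /* 0 = "no version yet" */
        for (uint32_t i = 0; i < n && c->ts[i] <= snapshot; i++)
            v = c->val[i];
        return v;
    }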

