
Distributed systems in an alternative universe - jsnell
https://github.com/lukego/blog/issues/15
======
jo909
Everything is a network of computers. Some networks are very fast (QPI between
sockets, PCIe), others only somewhat fast (SATA/SAS, USB, Ethernet). At both
ends of the network there are discrete computers (hard disk controller,
network card, keyboard, monitor, ...) talking to each other.

Essentially the difference between reading data from local memory and reading
it from say Amazon S3 is just the speed/latency of the various networks, and
how many computers had to store, transform and shuffle the data onto another
network.

It only gets more amazing when you think about all the analog stuff that
happens at the end of those networks, the button presses of people on their
keyboard, light hitting a sensor in a camera, spinning particles of magnetized
rust, stored electrons in a flash chip, light travelling through fibers of
glass, crystals moving in the pixels of your monitor etc. In the end those
analog actions are what make it "real" and what the whole network of networks
is built for.

~~~
scribu
It seems to me like a big difference between multi-core CPUs and conventional
networked systems is the expectation of reliability. For example, when a CPU
core randomly fails, you don't expect the CPU as a whole to continue
operating, do you?

~~~
al2o3cr
Most of the time, presumably not. But then there's System z...

------
ingenter
I've been thinking about a hypothetical CPU that has explicit cache and RAM
access. It's pretty close to OP's alternative universe.

What would happen is people would write wrappers for such a system so that you
could think about your RAM in a linear way. And you're now doing what a
Haswell CPU does.

Of course, this approach has advantages for specific tasks. Let's say we want
to implement a constant-time crypto primitive and put all of the required data
in cache. We can now do it.
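
To illustrate: on today's CPUs, where you can't pin a table in cache, the
usual workaround for a secret-dependent table lookup is to touch every entry
so neither the timing nor the cache state depends on the secret index. A
minimal C sketch of that workaround (the function name and shape are mine,
for illustration, not from this comment):

    #include <stdint.h>
    #include <stddef.h>

    uint8_t ct_lookup(const uint8_t table[256], uint8_t secret_index)
    {
        uint8_t result = 0;
        for (size_t i = 0; i < 256; i++) {
            /* mask is 0xFF only when i == secret_index, 0x00 otherwise */
            uint8_t mask = (uint8_t)-(uint8_t)(i == secret_index);
            result |= table[i] & mask;   /* every entry is read every time */
        }
        return result;
    }

With explicit cache control you could pin the table and do a plain indexed
load instead.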

This would also help with implementing an efficient VM, OS primitives, or many
other interesting things. But it would not change the way people write
C/Rust/JS/Python.

Of course, it's a useful _model_ for your performance, and you can now think
about cache misses as "internet download", and synchronization primitives as
"LAN share", etc, etc.

And, of course:

[http://blog.memsql.com/cache-is-the-new-ram/](http://blog.memsql.com/cache-is-the-new-ram/)

~~~
nickpsecurity
Look up scratchpad memory. It has better performance, lower cost, and more
predictable timing, and it can help with covert channels. The thing is that
most programmers didn't want to manage memory. Compiler vendors also didn't
want to manage many different scratchpads. Cache was useful and productive.
So, we all have caches, outside of a few products here and there.

------
jkadlec
As far as I know, the only way to have something close to Wireshark for CPUs
is to do a simulation. Then you can measure network latencies, cache hits and
so on, but the simulation will be extremely slow. There's GEM5
([http://www.gem5.org](http://www.gem5.org)) - a simulator with a couple of DSLs
designed to do just that, but I'd bet that Intel has their own thing. I did my
thesis on simulating MESIF in GEM5, here's a link:
[https://dip.felk.cvut.cz/browse/pdfcache/kadlej16_2013dipl.p...](https://dip.felk.cvut.cz/browse/pdfcache/kadlej16_2013dipl.pdf)
, that could get people started if they are interested (it also might be a
good introduction into the subject of cache coherency protocols). I have the
code lying around somewhere; I've been wanting to make it public, but it
needs cleanup first.

~~~
irishjohnnie
I think the closest thing to a Wireshark for CPUs is Intel VTune
([https://software.intel.com/en-us/intel-vtune-amplifier-xe](https://software.intel.com/en-us/intel-vtune-amplifier-xe))

~~~
hendzen
Actually, you can buy a JTAG debugger [0] for Intel CPUs. It's going to cost
you though.

[0] - [https://software.intel.com/en-us/articles/intel-system-studi...](https://software.intel.com/en-us/articles/intel-system-studio-2014-jtag-debugger-supports-more-processor-families)

~~~
irishjohnnie
Wouldn't the equivalent of that be a JTAG debugger for the NIC?

------
vdnkh
Finally, a post that validates the usefulness of my BS in electrical and
computer engineering (despite being a JS guy now)! But seriously, I believe
that hardware understanding is extremely useful for all sorts of tangential
tasks related to modern programming. I've found that having intimate knowledge
of how a computer works, especially around memory and disk access, helps me to
understand how to solve problems from the standpoint of a rote-CPU rather than
my high-level brain. It's also really useful for deciphering kernel-level
errors.

------
phamilton
The key difference I see is not just likelihood of failure (an L3 cache is
pretty much always going to be available), but characteristics of failure. A
distributed system experiences partitions which it might recover from. That
uncertainty around recovery actually makes it a whole lot more difficult to
reason about.

------
scott_s
It applies at the socket level as well: modern systems with many sockets
(physical processor chips) are likely to be NUMA
([https://en.wikipedia.org/wiki/Non-uniform_memory_access](https://en.wikipedia.org/wiki/Non-uniform_memory_access)),
or non-uniform memory access, systems. That means there is a performance
benefit if you access memory on your local NUMA node.
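
A minimal sketch of what "access memory on your local NUMA node" can look
like with libnuma on Linux (the node choice and buffer size here are made up
for illustration):

    #include <numa.h>    /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        int node = 0;                              /* illustrative node choice */
        numa_run_on_node(node);                    /* keep this thread on that node */
        size_t len = 1 << 20;
        char *buf = numa_alloc_onnode(len, node);  /* memory comes from the same node */
        for (size_t i = 0; i < len; i++)
            buf[i] = 0;                            /* all accesses stay node-local */
        numa_free(buf, len);
        return 0;
    }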

And irrespective of NUMA, it also applies to data sharing between threads: if
you have a system with 100s of threads, but you have a multithreaded
application written like this:

    
    
        while (globalFlag) {
            doSomeLightWork();
        }
    

Your application will not scale well. The cost of 100s of threads all
accessing globalFlag will dominate performance, as the cache line for that
variable bounces around the entire system.
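
A minimal sketch of two common mitigations, assuming the loop above: give the
flag its own cache line so unrelated writes can't invalidate it via false
sharing, and batch work between checks so the flag is read far less often.
Only doSomeLightWork() comes from the snippet above; the rest is illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>

    void doSomeLightWork(void);              /* from the original example */

    static struct {
        _Alignas(64) atomic_bool value;      /* alignment pads this out to a full cache line */
    } globalFlag = { true };

    void worker(void)
    {
        while (atomic_load_explicit(&globalFlag.value, memory_order_relaxed)) {
            for (int i = 0; i < 1024; i++)   /* amortize the flag check over a batch of work */
                doSomeLightWork();
        }
    }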

All systems are becoming "distributed systems". Even though a single machine
will get more and more beefy, we will still have to apply distributed systems
thinking in order to take full advantage of that system.

~~~
Rusky
Wouldn't globalFlag just be shared in a duplicated cache line between all
cores until one of them modifies it?

~~~
scott_s
If the code was:

    
    
        while (globalFlag) {
            ++someValue;
        }
    

Then you'd probably be okay. But calling a function which does some
lightweight but non-trivial work will interfere with your cache. When you come
back from the stack frame for doSomeLightWork(), you're more likely to need to
load globalFlag from memory.

In a multithreaded application where you want to scale to dozens or even
hundreds of threads, you have to look at even global memory accesses as
communication.

~~~
Rusky
That still doesn't sound like the cache line "bouncing around" any more than
it would if a single-core app did the same work. There's no contention on it
the way there would be on a shared mutable value, at least.

~~~
scott_s
There is still contention, but not as much. With, say, 8 threads it probably
doesn't matter much. But with 128, it starts to matter.

------
fridek
A helpful Googler once told me to always think as if you had just one server -
with a hierarchy of computation, access and storage costs and different
characteristics of failure.

------
julie1
Nice. Someone actually saying out loud that modern x86 is a chaotic system
that behaves chaotically in important cases.

Modern computers are multi-modal for a lot of resource accesses:
IO/CPU/locks/context switches/IRQ/memory/L1/L2/L3...

What does a modal change mean? Frequency changes. Basically, the system has
three main modes: fast, intermittent, and slow access.

Ex: when the cache is full, pointer_p = x costs roughly 400 cycles, and it
can go up to 15,000 cycles.

The problem is that the modal changes between resource accesses come with
sharp, unpredictable transitions, and that they are tightly coupled together.

Memory overuse overflows L1/L2/L3 into swap, which triggers another mode in
the IO pattern. And slow memory access impacts the access to the data
structures needed to process the HW data stream...

Under congestion (when computers are heavily loaded), tightly coupled
interactions coming from code written in an uncontrolled way tend to
snowball, creating an avalanche of bad behaviours that makes important
resource access patterns switch between slow and fast access in ways that
parasitize other resources. Chaotically.

Why is the code uncontrolled? Almost all optimizations are made by complex
logic that Intel puts into the CPU, in the silicon. So your code is
uncontrolled whether you like it or not. And that is what this guy complains
about: not the complexity of Intel CPUs, but the complication.

Our unit tests are at best one-dimensional stress tests, while the real world
has between n and n * (n - 1) potential coupling behaviours at most, where n
is the number of resources that can couple. (We have quite a lot of
resources: HPET1/2/3, IRQs, context switches.)

The number of mathematical combinations of behaviour that
netflix/google/amazon/intel pretend to control is just NOT coverable.

What does it mean? It means no one should put their trust in modern computers
for heavy-load critical tasks.

Unlike old computers, we cannot use them at full capacity, and we don't know
what the safe load is, so we have added even more coupling, complexity, and
chaos to the architecture.

The problem is that this means increasingly fast-diminishing operations/watt efficiency.

The throughput of energy that humanity can collect per unit of time is
bounded.

Old devs who did hardware know modern IT is heading in the wrong direction.
But businessmen don't care, and journalists interview the successful
businessmen, not the coders who are doing the actual job and know how the
machinery behaves.

~~~
nickpsecurity
Good writeup. Except this part:

"Old devs that did Hardware know modern IT are heading the wrong direction.
But, business men don't care, journalists interview the successful business
men, not the coders that are doing the actual job and know how the machinery
behaves."

The majority was going in that direction. However, there's been steady uptake
of very different models that fight some of the problems you mention. Here's a
few of them:

1. Simple SIMD/MIMD accelerators like DSPs, vector processors, and so on. GP
computing on GPUs started dominating this space, but there are still simple
accelerators on the market.

2. RISC multicore CPUs plus accelerators. My favorite of this bunch is
Cavium's Octeon II/III, just due to the combo of straightforward cores, good
I/O, and accelerators for stuff we use all over the place. Cranking out a
mass-market, affordable version of that is a great start at countering
Intel/AMD's mess. The fact that IBM, Oracle, and more recently Google have
maintained RISC components will help a transition.

3\. "Semi-custom." AMD led the way w/ Intel following on modifications to
their CPU's for customers willing to pay. All we know is there's a ton of
money going into this. Who knows what improvements have been made. Potential
here is to roll back some of the chaos, keep ISA compatibility with apps, and
add in accelerators.

4. FPGAs with HLS. Aside from usable HLS, I always pushed for FPGAs to be
integrated as deeply with the CPU as possible for low latency and high
performance. SGI led by putting it into their NUMAlink system as RASC. My
idea was to have it in the NoC on the CPU. Intel just bought Altera. Now I'm
simply waiting to see my long-term vision happen. :)

5. Language-oriented machines with an understandable, hardware-accelerated
model for SW. The Burroughs B5000 (an ALGOL machine) and LISP machines got
this started. The only one still in the server space is Azul Systems' Vega3
chips: many-core Java CPUs with pauseless GC for enterprise Java apps.
There's a steady stream of CompSci prototypes for this sort of thing, with
lots of work on prerequisite infrastructure like LLVM. Embedded keeps selling
Forth, Java, and BASIC CPUs. Sandia's Score processor was a high-assurance
Java CPU that did first-pass in silicon. There's plenty of potential here to
implement a clean model and unikernel-style OS while reusing silicon building
blocks like ALUs from RISC or x86 systems for high performance. Intel won't
do it because of what happened to the iAPX 432 and i960. They're done with
that risk, but would probably sell such mods to a paying customer of the
semi-custom business.

6. Last but not least: all the DARPA etc. work on exascale that basically
removes all kinds of bottlenecks and bad craziness while adding good
craziness. Rex Computing is one whose founder posts here often. There are
others. Lots of focus on CPU vs memory accesses vs energy used.

So, these six counter the main trends in the biggest CPUs, with varying
success in the business sector. Language-oriented stuff barely exists, while
RISC + accelerators, semi-custom, and SIMD/MIMD via GPUs say business is
a-boomin'. The old guard's voices weren't entirely drowned out. There's still
hope. Hell, there are even products up for sale, and Octeon II cards are
going for $200-300 on eBay with Gbps accelerators on them. After I knock out
some bills, I plan to try another model to get away from the horrors of
Intel. :)

~~~
hga
I think the Azul Vega line has been beaten by commodity x86-64 hardware and
is now a legacy system; this bit I just noticed at the end of the company's
page on the Vega3 reinforces that
([https://www.azul.com/products/vega/](https://www.azul.com/products/vega/)):

" _Choose Vega 3 for high-capacity JDK 1.4 and 1.5-based workloads. For
applications built in Java SE 8, 7 or 6, check out Zing, a highly cost-
effective 100% software-based solution containing Azul’s C4 and ReadyNow!
technology, optimized for commodity Linux servers._ "

Much the same fate as custom Lisp processors.

~~~
alblue
I interviewed Gil Tene at QConLondon this week (founder of Azul and designer
of the Vega). One of the advantages was that hardware transactional memory
support was built in, which meant that it could perform optimistic locking
when acquiring synchronised methods, leading to reduced contention in a
massively multi-core/multi-threaded setting. It worked by tracking which
memory areas were loaded into the cache and only permitting write-back once
the transaction was committed, using the chip's existing cache coherency
protocols along the way.

When they pivoted to commodity x86-64 chips, the HTM code couldn't be ported
over due to lack of support in the chips. However, Moore's law meant that the
Intel processors were faster than the Vega, in the same way that an iPhone is
more powerful than the Cray-1 was. So it's still a win.

He was particularly excited about Intel's latest generation of server CPUs,
which now have this (in the form of TSX, if I remember correctly). He
predicted that as these became available, HTM support might be re-integrated
back into Zing - though of course this was speculation rather than a promise.
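
For a feel of what that looks like at the instruction level, here's a minimal
sketch of lock elision with Intel's RTM intrinsics (my own example, not
Azul's or Zing's code; it assumes a TSX-capable CPU and compiling with
-mrtm): try the critical section as a hardware transaction first, and fall
back to a real lock if it aborts.

    #include <immintrin.h>   /* _xbegin/_xend/_xabort; needs -mrtm and TSX hardware */
    #include <stdatomic.h>

    static atomic_int lock_taken;    /* plain spinlock used as the fallback path */
    static long counter;

    void increment(void)
    {
        if (_xbegin() == _XBEGIN_STARTED) {
            if (atomic_load(&lock_taken))    /* bring the lock into our read set */
                _xabort(0xff);               /* someone holds the real lock: bail out */
            counter++;                       /* optimistic update, no lock acquired */
            _xend();                         /* commit; a conflict aborts us instead */
            return;
        }
        /* Transaction aborted (or not supported): take the spinlock for real. */
        while (atomic_exchange(&lock_taken, 1))
            ;
        counter++;
        atomic_store(&lock_taken, 0);
    }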

His talk on hardware transactional memory is here, and the slides/video will
be available later on InfoQ:

[https://qconlondon.com/presentation/understanding-hardware-t...](https://qconlondon.com/presentation/understanding-hardware-transactional-memory)

Obviously the talk/interview isn't there yet (was only done Mon:Tue this week)
but this will be where it is in the coming months:

[http://www.infoq.com/author/Gil-Tene](http://www.infoq.com/author/Gil-Tene)

Disclaimer: I believe that Azul was a sponsor of QCon, but I am writing this
and interviewed Gil because I'm a technology nerd and write for InfoQ because
I want to share. Apologies if this comes over as an advert, which is not the
intent.

------
stcredzero
And I think this is insane. We've gone through a series of "obvious,
pragmatic" next steps to arrive at a model that seems sane and normal on the
surface, but which results in... _wat!?_

[http://mechanical-sympathy.blogspot.com/2011/07/false-sharin...](http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html)
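
A tiny sketch of the false sharing the linked post describes (my own,
hypothetical example): two threads hammering adjacent counters that share one
64-byte cache line, so the line ping-pongs between cores.

    #include <pthread.h>
    #include <stdio.h>

    struct { long a, b; } counters;   /* a and b share one cache line */

    static void *bump_a(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++)
            counters.a++;             /* every store invalidates the other core's copy */
        return NULL;
    }

    static void *bump_b(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 100000000; i++)
            counters.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Padding each counter to its own line, e.g. with _Alignas(64),
           removes the ping-pong and typically speeds this up severalfold. */
        printf("%ld %ld\n", counters.a, counters.b);
        return 0;
    }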

Writing highly concurrent and parallel stuff in Erlang to run on top of
multi-core systems seems sane to me. In most other languages? Maybe half-sane
at best.

------
hoosein
The biggest difference is event and time management.

~~~
im_down_w_otp
Not fault detection and probability of failure?

------
afsafafaf
How does this compare to IncrediBuild or Electric Accelerator?

