
Attack of the Killer Microseconds - jgrahamc
http://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext
======
angry_octet
This is a timely analysis. The virtual memory system, with its concept of
paging to disk, is obsolete in the sense that hardly anybody who does bigger-
than-RAM computation relies on the kernel's algorithms to manage it
([https://scholar.google.com.au/scholar?q=out+of+core+algorith...](https://scholar.google.com.au/scholar?q=out+of+core+algorithms)).

The current paging system doesn't have a sensible mechanism for flash-as-core
memory (about 10x RAM latency; e.g. DDR4 is ~12ns to first word, so ~120ns),
for persistent memory in general, or for using SSDs as an intermediate cache
for data on disk. ZFS has some SSD caching, but it doesn't really take
advantage of the very large and very fast devices now available.

So we do need new paradigms to use this effectively. I'd like to be able to
reboot and keep running a program from its previous state, because it all sits
in flash-core.

Also, there is huge potential to move to more garbage-collected memory storage
systems. This goes hand in hand with systems that can progress concurrently,
without the overhead of difficult multi-threaded code, such as parallel
Haskell.

On the negative side, I find the use of the term 'warehouse scale computing'
to be stupidly buzzwordy.

From
[https://gist.github.com/jboner/2841832](https://gist.github.com/jboner/2841832)

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns 14x L1 cache

Mutex lock/unlock 25 ns

Main memory reference 100 ns 20x L2 cache, 200x L1 cache

Compress 1K bytes with Zippy 3,000 ns 3 us

Send 1K bytes over 1 Gbps network 10,000 ns 10 us

Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD

Read 1 MB sequentially from memory 250,000 ns 250 us

Round trip within same datacenter 500,000 ns 500 us

Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory

Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip

Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD

Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms

~~~
YZF
IMO part of the reason is that DRAM is cheap and you can get a lot of it. How
many applications have a working set that falls in that relatively small
region between DRAM and SSD?

I close the lid of my laptop and my memory is saved to SSD; I open it and it
comes back pretty much immediately. What more do I need from that perspective?

One thing that tends to happen with caches is that the returns diminish as
they grow larger. You're saying my 64GB of RAM can be another level of cache
for my 500GB SSD, but I'm not quite sure what we'd do with that, or why we
need more than what we can already do with this SSD at the application layer.
I agree that SSD paging can probably be improved. Maybe support can be moved
out of the OS into hardware to get better latency. I'd still think that if
you're thrashing the SSD you're likely not getting good performance, just like
if you're thrashing DRAM you're not doing as well as you could be.

~~~
angry_octet
> How many applications have a working set that falls in that relatively small
> region between DRAM and SSD?

Actually, I'd say quite a few, because not everyone has a server with 40 cores
and 1TB of RAM; that is quite expensive. But many people have 8 cores and 32GB
of RAM, and could conceivably add 512GB of fast flash-core (by which I mean
fast flash memory accessible via the memory bus, rather than PCIe, although
that may be fast enough). So your laptop could search a ~fast-as-RAM key-value
store with 200GB of data, even with only 8GB of RAM.
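
A minimal sketch of the kind of access this enables, assuming a hypothetical
store of fixed 16-byte (key, value) records, sorted by key, sitting on flash:
memory-map the file and binary-search it as if it were RAM, so only the few
pages actually touched per lookup ever get faulted in.

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Hypothetical layout: fixed-size records of (8-byte key, 8-byte value), sorted by key.
    public class FlashKV {
        private static final int RECORD = 16;
        private final MappedByteBuffer map;
        private final long records;

        FlashKV(Path file) throws IOException {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // A single MappedByteBuffer tops out at 2 GB; a real 200GB store
                // would need an array of mappings, omitted here for brevity.
                map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                records = ch.size() / RECORD;
            }
        }

        // Binary search over the mapped file; only ~log2(n) pages are faulted in per lookup.
        Long get(long key) {
            long lo = 0, hi = records - 1;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                long k = map.getLong((int) (mid * RECORD));
                if (k == key) return map.getLong((int) (mid * RECORD + 8));
                if (k < key) lo = mid + 1; else hi = mid - 1;
            }
            return null;
        }
    }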

But I don't think any of this is particularly relevant to desktops/laptops as
such. This is more of a programming paradigm change. Main memory is still
going to be unbearably slow (many clocks to fill a cache line), but next level
storage will only be 10 times slower than main memory, instead of 1000 times
slower. What do we do with that? How do we orchestrate inter-processor and
inter-chassis cooperation on solving problems? (For example, if inter-node
flash-core IPC is about the same speed as intra-node. Distributed flash core
could be hundreds of TB.) What can we do if memory is persistent? How will we
adapt algorithms to reduce flash wear problems?

[https://www.usenix.org/conference/inflow16](https://www.usenix.org/conference/inflow16)

------
londons_explore
One might imagine this being solved by forcing all "malloc" operations to
specify which thread or request the memory will be used by.

The malloc library or OS can then mark those pages with specific thread
numbers. When that thread blocks on millisecond-scale IO (such as a remote
network request), allow that memory to be moved via RDMA to a remote store (a
microsecond-level delay). When the thread unblocks, move the memory back
again.

The key benefit here over regular paging is that it's on a per-thread basis
rather than a per-page basis, so latency isn't so critical. It's also RDMA
rather than local disk, so the memory can go anywhere in the warehouse,
allowing perfect RAM utilization.

Now, for warehouse-level computing you no longer need to worry about where
your CPUs or RAM are: you can efficiently move RAM about between machines
(effectively paging), as long as at any point in time enough threads across
your warehouse have reasonably long blocking times (milliseconds or more), and
you have enough thread/request-specific memory usage, which can be moved about
far more easily than shared memory.
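
A conceptual sketch of that idea (the RequestArena and RemoteStore names are
invented for illustration; a real implementation would tag raw pages and move
them over an RDMA transport rather than a Java buffer): every allocation for a
request goes into one arena, and the arena is shipped out while the owning
thread is blocked on millisecond-scale IO.

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentHashMap;

    // Conceptual only: a per-request arena whose backing memory can be shipped
    // to a remote store while the request is blocked on slow IO.
    public class RequestArena {
        interface RemoteStore {
            void push(long requestId, ByteBuffer pages);   // stand-in for a microsecond-scale RDMA write
            ByteBuffer pull(long requestId);               // stand-in for a microsecond-scale RDMA read
        }

        private final ConcurrentHashMap<Long, ByteBuffer> local = new ConcurrentHashMap<>();
        private final RemoteStore remote;

        RequestArena(RemoteStore remote) { this.remote = remote; }

        // "malloc tagged with a request id": all allocations for one request share an arena.
        ByteBuffer allocate(long requestId, int bytes) {
            return local.computeIfAbsent(requestId, id -> ByteBuffer.allocateDirect(bytes));
        }

        // Called when the owning thread blocks on millisecond-scale IO.
        void onBlock(long requestId) {
            ByteBuffer pages = local.remove(requestId);
            if (pages != null) remote.push(requestId, pages);
        }

        // Called when the thread unblocks: bring the arena back before it runs again.
        void onUnblock(long requestId) {
            local.putIfAbsent(requestId, remote.pull(requestId));
        }
    }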

------
bullen
This is bad for Google:

1) They don't understand IO-wait. Async-to-async HTTP is going to be very
important once you manage to make it work:
[https://github.com/tinspin/rupy/wiki/Fuse](https://github.com/tinspin/rupy/wiki/Fuse)

2) They don't understand the end of Moore's law combined with Peak
Hydrocarbons. CPU is the bottleneck, not network, memory, or disk.

Tip to YC: force comment on downvote.

~~~
icebraining
I didn't downvote you, but I have to say your comment neither explains what
async-to-async HTTP means (and the link doesn't help either), nor why it's
important, nor why the article shows that Google doesn't understand it.

Frankly, if you want to claim that Google research engineers don't understand
something regarding computing, you need to make a good case for it, rather
than a short comment.

~~~
bullen
Watch the video and you will understand that/why google chooses sync.
development.

As to why it's important, you will see eventually, or you can try to figure it
out by reading the link again in a couple of months. The brain needs time to
understand things that are hidden by self preservation, like how money is
created f.ex.

~~~
icebraining
I understand why Google chooses a synchronous model of development; that
wasn't what I said you should explain. If you believe your explanation is
crystal clear and complete, and people just need time, then good luck. But I
suggest you try reformulating it.

By the way, _"Money creation in the modern economy"_ wasn't hard to
understand. In my experience, most things aren't, if they're well explained
and you have the foundation knowledge. "The brain needs time to adjust" tends
to be an excuse used by people selling perpetual motion machines.

~~~
bullen
Ok, so if you understand the creation of money and you understand that google
is super good at engineering, can you see the link between those two things?

Below is the answer, tip no. 2 to YC: add spoiler tag.

They are using their monopoly and benefits (their clients borrowed money at
low interest) to build data centers that use lots of cheap finite energy
instead of actually trying to innovate.

The only energy added to the planet is sunlight; the only way that energy is
stored is by photosynthesis. We have consumed millions of years of sunshine in
200 years.

If all internet systems were async-to-async you could probably close 10
nuclear power plants immediately. (BTW, we only have 50 years of uranium left
at this rate.)

Solar and wind to electricity are net energy negative and require more fossil
energy to manufacture than they deliver in their lifetime.

Talk about "doing no evil".

~~~
icebraining
Like I said, you have yet to explain what "async-to-async" actually means in
this context. I say this because I don't think you actually understand what
Google means when you write that they're using a synchronous model of
development. What they're doing is, quoting from the article, "shifting the
burden of managing asynchronous events away from the programmer to the
operating system or the thread library". The programs themselves are still
asynchronous; it's just that the programmer doesn't have to care.

So if you're going to claim that using a synchronous programming model over an
asynchronous runtime wastes a lot of energy, you're going to have to (1)
explain very well the differences between that model and the one you're
proposing and (2) explain how your model wastes less energy.

You're also going to have to support the rather extraordinary claim that
"[s]olar and wind to electricity are net energy negative". That myth was
already false back in the late 90s; it's quite absurd nowadays. The energy
payback time of a current solar panel is just a couple of years, depending on
how sunny the place you put it is.

~~~
bullen
You can't move the async away without the developer touching it, if you want
the async to save you from IO-wait, i.e. to be on the socket. You need the
async to be in the network.

Async-to-async means your CPU only touches your work when it has to; while the
instance is waiting, the work is in transit on the network, so the CPU can do
other things.

> In practice, you need two small pools of threads, each with the same number
> of threads as the CPU has cores. Those pass the "work" back and forth:
> incoming request -> outgoing request, then incoming response -> outgoing
> response, both arrows indicating a thread-to-thread handover that will make
> most programmers faint. (But once the platform is built, developers just
> need to wrap their heads around the philosophy, not make that handover work
> themselves. A rough sketch follows below.)
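
Roughly what that handover could look like, as a minimal sketch (the Work and
Backend types are invented for illustration, and real code would also need
non-blocking sockets on both sides):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class AsyncToAsync {
        static final int CORES = Runtime.getRuntime().availableProcessors();
        // Two small pools, each sized to the core count, as described above.
        static final ExecutorService frontPool = Executors.newFixedThreadPool(CORES);
        static final ExecutorService backPool  = Executors.newFixedThreadPool(CORES);

        interface Backend { void send(Work w, Runnable onResponse); } // non-blocking outgoing request

        record Work(String request, StringBuilder response) {}

        static void onIncomingRequest(Work w, Backend backend) {
            // Handover 1: the front pool takes the incoming request and hands it to the back pool.
            frontPool.execute(() -> backPool.execute(() ->
                // Handover 2: the back pool fires the outgoing request and registers a
                // completion that hands the work back; no thread ever sits in IO-wait.
                backend.send(w, () -> frontPool.execute(() -> writeResponse(w)))));
        }

        static void writeResponse(Work w) {
            // Handover 3: back on the front pool, write the response to the client socket.
            w.response.append("HTTP/1.1 200 OK\r\n\r\n");
        }
    }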

Async-to-async is the solution for zero IO-wait without context-switching
problems, and it scales to millions of concurrent sockets per machine. Not
that that's what you want; what you want is a distributed model where many
small machines co-operate on a larger task, but that requires async-to-async
too. And here you have to try to use any SQL database to see why that fails in
a distributed system, etc. I need to write a book, but I'd rather write
software to prove my point instead. (Third tip to YC: if edits were allowed
for a longer duration, I would explain this here later too.)

When a thread is stuck in IO-wait, that is pure loss; eventually all sync
systems break due to IO-wait, and that means you need overcapacity. The
advantage of async is that 100% CPU is little to no problem, and you can have
the system appropriately powered at all times by sharing resources in a
distributed sandbox instead of sharding them into virtualboxes/containers.

> Side note: you need to be able to share memory across threads, so forget
> about programming languages that don't have "real" threads (PHP, JavaScript,
> Ruby, Python, etc.)

Overloads are fun in an async system, because the CPU just keeps working on
the queue that it fundamentally needs (the one that sync systems put between
everything to avoid crashing). Watching the latency go up a bit when the
system is overloaded is so much fun, because you see that the system does not
grind to a halt but just churns right through the overload.

> Async. is inherently anti-fragile.

About solar: you need to remove money and add all the energy required to build
everything around the solar panel, including the energy required for the
person assembling it and even that person's cat food, transport, mining, etc.
Otherwise you're only seeing "the tree and not the forest". Energy has no
dollar price; you should not be able to buy energy with "nothing".

> The dollar does not exist.

As an exercise you can read the source, and then when you feel confident
enough google: async. processor.

My next and last project is to apply async to a 3D engine, where the
rendering/physics are done in two C++ threads, and the scripting in Java
asynchronously modifies the state for those threads from a third thread.

~~~
icebraining
The way you keep extolling the virtues of asynchronous execution leads me to
conclude that you don't understand Google's approach, since it _is_
asynchronous as well. If you want to explain the actual concrete differences
between your approach and theirs, feel free; otherwise there's little point in
continuing this discussion.

~~~
bullen
Is it async with one thread per "work"? If so, you are in trouble. I tried to
see if you can abstract async into a sync model, but you need the callback and
wakeup syntax, which is a dealbreaker. Google is not async in the way you need
async to work for it to make sense for IO-wait.

What this discussion should lead to is you reading the code and learning how
async can work, not just blindly following the majority. Maybe, if you like to
have someone explain things, go to school; but don't take a student loan.

But you are right, there are a million things I'm failing to communicate,
because of time, and frankly I forget how it all works. Intuition is very
important for progress; if you had to explain everything all the time, nothing
would get made.

~~~
icebraining
_Is it async with one thread per "work"? If so, you are in trouble. I tried to
see if you can abstract async into a sync model, but you need the callback and
wakeup syntax, which is a dealbreaker._

No, you don't need any callback syntax. You can use CSP or Actors, and there
are probably other models. See Go, Limbo, Erlang, Akka, and a bunch of other
languages and frameworks.
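
For instance, a minimal CSP-flavoured sketch (plain Java BlockingQueues stand
in for channels here; Go or Erlang would run the same shape on lightweight
processes rather than OS threads): the worker is ordinary straight-line code
with no callback or wakeup syntax anywhere.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class Channels {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> requests  = new ArrayBlockingQueue<>(1024);
            BlockingQueue<String> responses = new ArrayBlockingQueue<>(1024);

            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        String req = requests.take();    // blocks, no callback needed
                        responses.put("echo: " + req);   // synchronous-looking handoff
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.setDaemon(true);
            worker.start();

            requests.put("hello");
            System.out.println(responses.take());        // prints "echo: hello"
        }
    }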

~~~
bullen
Ok, I'll tell you right now: this abstraction (Akka) is overengineered.

My Async.java class replaces the whole framework in 750 lines of Java:
[http://root.rupy.se/code](http://root.rupy.se/code)

Erlang is not a good language because it can't share memory across threads.

As for Go, it doesn't have real threads that can share memory either, you
can't hot-deploy it, and coding in non-VM languages should only be done if you
need the performance (3D engine).

I don't know anything about Limbo except that it is probably crummy as hell
and probably suffers from the same "no real threads" syndrome.

The reason you don't need callbacks is that these languages ARE "callback"
languages. Just like JavaScript.

There are only two families of real usable programming languages: native: C
(C++, Rust, etc.) and VM: Java (C#, F#, etc.).

------
YZF
Really? We are good at nanoseconds and milliseconds but not at microseconds?
Last I checked a microsecond was 1000 nanoseconds so you can't really be good
at nanoseconds but somehow bad at microseconds.

This ties in a little to the recent HN discussions about the cost of a context
switch. I think what they're trying to say, and not very well, is that there
is somewhat of a discontinuity when you move between different levels of
abstraction. There are examples of this phenomenon in operating systems, where
the overhead of making a system call can be so high that you can't get the
latency down, or in programming languages, e.g. running over a virtual machine
or an interpreter. But this is far from new, and there's a continuum of
solutions, from hardware like DSPs through real-time operating systems,
lightweight threads, lower-level languages, and kernel bypass. Abstractions
have a cost: if you want protected/virtual memory, there's a cost, and you pay
that cost in your context switches. Not sure you can have your cake and eat it
here, but there are plenty of different choices on the menu for different
situations.

~~~
dom0
As I've said before, a context switch is really cheap on modern hardware. What
kills it is if you need to drag all sorts of dirt through all the caches, i.e.
cold caches kill performance. Hardly news.

~~~
jacquesm
A context switch is only cheap if the task you switch to doesn't do anything.
Even 2 decades ago you could to upwards of 100K task switches _per second_ on
relatively anemic hardware. That's never been the problem.

~~~
dom0
In terms of cycles the switch has become more expensive since a lot of
registers were added that need saving/restoring. (But since frequencies have
increased a lot the absolute time shrank when we reached the "3 GHz era"; I
think it's been pretty much constant for the last ~10 years or so). But you
are right that the switch itself on modern-era (=32+ bit few-chip systems with
MMU) hardware was never really that expensive (on the order of 10-20 us).
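
A crude way to put a number on this yourself (this measures a voluntary
handover between two Java threads, so it includes scheduler and JVM overhead
on top of the bare kernel switch, which is rather the point about caches made
above):

    import java.util.concurrent.SynchronousQueue;

    public class SwitchCost {
        static final int ROUNDS = 100_000;

        public static void main(String[] args) throws InterruptedException {
            SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            SynchronousQueue<Integer> pong = new SynchronousQueue<>();

            Thread other = new Thread(() -> {
                try {
                    // Echo the token back; every round trip forces (at least) two switches.
                    for (int i = 0; i < ROUNDS; i++) pong.put(ping.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            other.start();

            long start = System.nanoTime();
            for (int i = 0; i < ROUNDS; i++) {
                ping.put(i);
                pong.take();
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("~%.0f ns per handover%n", elapsed / (2.0 * ROUNDS));
        }
    }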

