Hacker News new | comments | show | ask | jobs | submit login
Attack of the Killer Microseconds (acm.org)
89 points by jgrahamc 264 days ago | hide | past | web | favorite | 27 comments



This is a timely analysis. The virtual memory system, with its concept of paging to disk, is obsolete in the sense that hardly anybody that does bigger-than-ram computations rely on the kernel's algorithms to manage it (https://scholar.google.com.au/scholar?q=out+of+core+algorith...).

The current paging system doesn't have a sensible mechanism for flash-as-core memory (10x RAM latency, e.g. DDR4 12ns for first word, so 120ns), persistent memory in general, or using SSDs as an intermediate cache for data on disk. ZFS has some SSD caching but it is not really taking advantage of the very large and very fast devices now available.

So we do need new paradigms to use this effectively. I'd like to be able to reboot and keep running a program from its previous state, because it all sits in flash-core.

Also there is huge potential to move to more garbage collected memory storage systems. This goes hand in hand with systems which can progress concurrently, without the overhead of difficult multi-threaded code, such as parallel Haskell.

On the negative side, I find the use of the term 'warehouse scale computing' to be stupidly buzzwordy.

From https://gist.github.com/jboner/2841832

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns 14x L1 cache

Mutex lock/unlock 25 ns

Main memory reference 100 ns 20x L2 cache, 200x L1 cache

Compress 1K bytes with Zippy 3,000 ns 3 us

Send 1K bytes over 1 Gbps network 10,000 ns 10 us

Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD Read 1 MB sequentially from memory 250,000 ns 250 us

Round trip within same datacenter 500,000 ns 500 us

Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory

Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip

Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms


IMO part of the reason is that DRAM is cheap and you can get a lot of it. How many applications have a working set that is in that relatively small region between DRAM and SSD.

I close the lid of my laptop and my memory is saved to SSD, I open it and it comes back pretty much immediately, what more do I need from that perspective?

One thing that tends to happen with caches is that you tend to get smaller returns as they grow larger. You're saying my 64GB RAM can be another level of cache for my 500GB SSD but I'm not quite sure what we'd do with that and why we need more than what we can already do with this SSD at the application layer. I agree that SSD paging can probably be improved. Maybe support can be moved out of the OS into hardware to get better latency. I'd still think that if you're thrashing the SSD you're likely not getting good performance just like if you're thrashing DRAM you're not doing as good as you could be doing.


> How many applications have a working set that is in that relatively small region between DRAM and SSD.

Actually I'd say quite a few. Because not everyone has a server with 40 cores and 1TB RAM because that is quite expensive. But many people have 8 cores and 32GB RAM, and could conceivably add 512GB of fast flash-core (by which I mean fast flash memory accessibly via the memory bus, rather than PCIe, although that may be fast enough). So, your laptop could search a ~fast-as-RAM key value store with 200GB of data, even with only 8GB of RAM.

But I don't think any of this is particularly relevant to desktops/laptops as such. This is more of a programming paradigm change. Main memory is still going to be unbearably slow (many clocks to fill a cache line), but next level storage will only be 10 times slower than main memory, instead of 1000 times slower. What do we do with that? How do we orchestrate inter-processor and inter-chassis cooperation on solving problems? (For example, if inter-node flash-core IPC is about the same speed as intra-node. Distributed flash core could be hundreds of TB.) What can we do if memory is persistent? How will we adapt algorithms to reduce flash wear problems?

https://www.usenix.org/conference/inflow16


There are systems like Druid that memory map files and rely on the OS for paging in and out segments: http://druid.io/docs/latest/operations/performance-faq.html


One might imagine this being solved by forcing all "malloc" operations to specify for which thread or request that memory will be used by.

The malloc library or OS can then mark those pages with specific thread numbers. When that thread then blocks on millisecond IO (such as a remote network request), allow that memory to be moved via RDMA to a remote store (a microsecond level delay). When the thread unblocks, move the memory back again.

The key benefit here over regular paging is that it's on a per-thread basis rather than a per-page basis, so latency isn't so critical. It's also RDMA rather than to local disk, so it can go anywhere in the warehouse, allowing perfect RAM utilization.

Now, for warehouse level computing you no longer need to worry about where your CPU's or RAM are - you can efficiently move RAM about between machines (effectively paging), as long as at any point in time enough threads across your warehouse have reasonably long blocking times (milliseconds or more), and you have enough thread/request specific memory usage, which can be moved about far more easily than shared memory.


This is bad for Google:

1) They don't understand IO-wait. Async-to-async HTTP is going to be very important once you manage to make it work: https://github.com/tinspin/rupy/wiki/Fuse

2) They don't understand the end of Moore's law combined with Peak Hydrocarbons. CPU is the bottleneck not network, memory or disk.

Tip to YC: force comment on downvote.


I didn't downvote you, but I have to say your comment neither explains what async-to-async HTTP means (and the link doesn't help either), nor why it's important, nor why the article shows that Google doesn't understand it.

Frankly, if you want to claim that Google research engineers don't understand something regarding computing, you need to make a good case for it, rather than a short comment.


Watch the video and you will understand that/why google chooses sync. development.

As to why it's important, you will see eventually, or you can try to figure it out by reading the link again in a couple of months. The brain needs time to understand things that are hidden by self preservation, like how money is created f.ex.


I understand why Google chooses a synchronous model of development, that wasn't what I said you should explain. If you believe your explanation is crystal clear and complete, and people just need time, then good luck. But I suggest you try reformulating it.

By the way, "Money creation in the modern economy" wasn't hard to understand. In my experience, most things aren't, if they're well explained and you have the foundation knowledge. "The brain needs time to adjust" tends to be an excuse used by people selling perpetual motion machines.


Ok, so if you understand the creation of money and you understand that google is super good at engineering, can you see the link between those two things?

Below is the answer, tip no. 2 to YC: add spoiler tag.

They are using their monopoly and benefits (their clients borrowed money at low interest) to build data centers that use lots of cheap finite energy instead of actually trying to innovate.

The only energy added to the planet is sunrays, the only way that energy is stored is by photosyntesis, we have consumed millions of years of sunshine in 200 years.

If all internet systems where async-to-async you could probably close 10 nuclear power plants immediately. (btw we only have 50 years of uranium left at this rate)

Solar and wind to electrivity are net energy negative and require more fossil energy to manufacture than they deliver in their lifetime.

Talk about "doing no evil".


Like I said, you're yet to explain what "async-to-async" actually means in this context. I say this, because I don't think you actually understand what Google means when you write that they're using a synchronous model of development. What they're doing is, quoting from the article, "shifting the burden of managing asynchronous events away from the programmer to the operating system or the thread library". The programs themselves are still asynchronous, it's just that the programmer doesn't have to care.

So if you're going to claim that using a synchronous programming model over an asynchronous runtime wastes a lot of energy, you're going to have to (1) explain very well the differences between that model and the one you're proposing and (2) explain how your model wastes less energy.

You're also going to have to support the rather extraordinary claim that "[s]olar and wind to electrivity are net energy negative". That myth was already false back in the late 90s, it's quite absurd nowadays. The Energy Payback time of a current solar panel is just a couple of years, depending on how sunny the place you put it is.


You can't move the async without the developer touching it if you want the async to save you IO-wait = be on the socket. You need the async. to be in the network.

Async-to-async means your CPU only touches your work when it has to and when the instance is waiting the work is in transit on the network so the CPU can do other things.

> In practice, you need two small pools of threads with the same amount of threads as the CPU has cores. And those will pass the "work" back and forth between incoming requests -> outgoing requests and then incoming response -> outgoing response, both arrows indicating a thread to thread handover that will make most programmers faint. (but once the platform is built developers just need to wrap their head around the philosophy and not making that handover work)

Async-to-async is the solution for zero IO-wait without context switching problems and scalable to millions of concurrent sockets per machine, not that that's what you want, what you want is a distributed model where many small machines co-operate on a larger task, but that requires async-to-async too. An here you have to try and use any SQL database to see why that fails in a distributed system. etc. etc. I need to write a book and I rather write software to prove my point instead. (if yc had edit after longer duration I would explain this here too later 3rd tip)

When a thread is stuck in IO-wait that is pure loss, eventually all sync. systems break due to IO-wait and that means you need overcapacity. The advantage with Async. is that 100% CPU means little to no problem and you can have the system appropriately powered at all times by sharing resources in a distributed sandbox instead of sharding them in virtualboxes/containers.

> Sidenote: you need to be able to share memory across threads so forget about programming languages that don't have "real" threads (PHP, Javascript, ruby, python, etc.)

Overloads are fun in a async. system, because the CPU just keeps working on the queue that it fundamentally needs (that sync systems have between everything to avoid crashing) and watching the latency go up a bit when the system is overloaded is so much fun when you see the system does not grind to a halt but just churns right through the overload.

> Async. is inherently anti-fragile.

About solar, you need to remove money and add all energy required to build everything around the solar panel, including the energy required for the person assembling it and even that persons cat food, transport, mining, etc. Otherwise you're only seeing "the tree and not the forest". Energy has no dollar price, you should not be able to buy energy with "nothing".

> The dollar does not exist.

As an exercise you can read the source, and then when you feel confident enough google: async. processor.

My next and last project is to apply async. to a 3D engine, where the rendering/physics is done in two c++ threads, and the scripting in java asynchronously modifies the state for those threads from a third thread.


The way you keep extolling the virtues of asynchronous execution leads me to conclude that you don't understand Google's approach, since it is asynchronous as well. If you want to explain the actual concrete differences between your approach and theirs, feel free, otherwise there's little point is continuing this discussion.


Is it async. with one thread per "work"? If so you are in trouble. I tried to see if you can abstract async. into a sync model but you need the callback and wakeup syntax which is a dealbreaker. Google is not async the way you need async to work for it to make sense for IO-wait.

What this discussion should lead to is for you to read the code and learn how async. can work. Not just to blindly follow the majority. Maybe if you like to have someone explain things; go to school, but don't take a student loan.

But you are right there are a million things I'm missing to communicate, because of time and frankly I forget how it all works, intuition is very important for progress, if you had to explain everything all the time nothing would get made.


Is it async. with one thread per "work"? If so you are in trouble. I tried to see if you can abstract async. into a sync model but you need the callback and wakeup syntax which is a dealbreaker.

No, you don't need any callback syntax. You can use CSP or Actors, and there are probably other models. See Go, Limbo, Erlang, Akka, and a bunch of other languages and frameworks.


Ok, tell you right now, this abstraction is overengineered (akka).

My Async.java class replaces the whole framework in 750 lines of Java: http://root.rupy.se/code

Erlang is not a good language because it can't share memory across threads.

As for Go, it doesn't have real threads that can share memory either, you can't hot-deploy it and coding non-VM languages should only be done if you need the performance (3D engine).

Don't know anything about Limbo except that it probably is crummy as hell and probably suffer from the same "no real threads" syndrome.

The reason you don't need callbacks is because these language ARE "callback" languages. Just like Javascript.

There are only two families of real usable programming languages: native: C (C++, Rust, etc.) and VM: Java (C#, F#, etc.).


Really? We are good at nanoseconds and milliseconds but not at microseconds? Last I checked a microsecond was 1000 nanoseconds so you can't really be good at nanoseconds but somehow bad at microseconds.

This ties in a little to the recent HN discussions about the cost of a context switch. I think what they're trying to say, and not very well, is that there is somewhat of a discontinuity when you move between different levels of abstraction. There's examples of this phenomena in operating systems where the overhead of making a system call can be so high you can't get the latency down or in programming languages, e.g. running over a virtual machine or an interpreter. But this is far from new and there's a continuum of solutions from hardware like DSPs through real time operating systems, lightweight threads, lower lever languages, kernel bypass. Abstractions have cost, you want protected/virtual memory there's a cost and you pay that cost in your context switches. Not sure you can have your cake and eat it here but there's plenty of different choices on the menu for different situations.


This is probably the best summary paragraph from the article that answers your question:

Techniques optimized for nanosecond or millisecond time scales do not scale well for this microsecond regime. Superscalar out-of-order execution, branch prediction, prefetching, simultaneous multithreading, and other techniques for nanosecond time scales do not scale well to the microsecond regime; system designers do not have enough instruction-level parallelism or hardware-managed thread contexts to hide the longer latencies. Likewise, software techniques to tolerate millisecond-scale latencies (such as software-directed context switching) scale poorly down to microseconds; the overheads in these techniques often equal or exceed the latency of the I/O device itself.

To expand a bit: the reason you can be good at nanoseconds but bad at microseconds is that you need to throw a thousand times more resources at the problem to solve it using the same techniques. You can get good instruction-level parallelism with current memory latencies using a reorder buffer with a hundred or so entries, but can you really just scale that up to a ROB with 100,000 entries?


What's the problem that you're trying to solve?

OOE, branch prediction, prefetching, are to allow a CPU to execute multiple instructions per cycle on a GHz clock. I.e. within a nanosecond. Memory latency all the way out to DRAM is more than enough to do things in microseconds and the caches are big enough to get you a lot of concurrency within that.

It's true that when you're coding over a modern OS in a high level language you can't get e.g. a GPIO pin to update in a microsecond. But that's life. Use a real-time OS you can do better. Get rid of the OS and you'll do even better. Use a different CPU or an FGPA and you may do better. The various techniques used to get a CPU to execute 20 instructions in a nanosecond aren't the limiting factor in that stack.

EDIT: Once upon a time there was a UART called the 8250. You could either poll it or if you wanted to do something else concurrently you'd use interrupts. Only interrupts have to do a context switch (in this case pushing a few things onto the stack) so they cost a little more. Once serial rates became high enough interrupts started consuming more and more cycles. So they added a buffer, and we got the 16550 (or one of its cousins). Now the thing would only interrupt above a certain watermark and we can cut the number of interrupts by 10 so we can get higher throughput at the expense of higher latency. Same thing happens with network interfaces. You can drive a 10Gbps network because a lot of stuff is buffered and/or hardware offloaded. It doesn't make sense to bitbang 10G Ethernet out of a general purpose CPU. These sorts of tradeoffs in different parts of the system are unavoidable but yet the tools to solve problems down to GHz clocks are out there both for software and hardware.


The problem being posed here is: we have a thread that is computing something, and as part of that it needs to make a request that will be satisfied with millisecond-range latency (it might be an RDMA request for a memory page on another machine in the same data centre, or a maybe a largish vector of data to be processed on a GPU, etc.).

If we suspend the thread at the OS level and re-schedule it when the result is available (as is done when waiting for, eg, data from a local disk) then the latency of the rescheduling swamps that of the actual operation. If we spin and wait for the result, we quickly run out of instruction-level parallelism and end up underutilising our CPUs.


In the general case what you would do is one of these:

1. Increase the size of the data you're operating on so you get more compute per context switch. I.e. you amortize the latency over more data or in other words you trade off latency for throughput. If you're dealing with your local disk e.g. you fread() 100MB instead of 1MB.

2. Rearrange your software. If you have many threads each reading small chunks and operating on them you rearrange the data and your threads such that you perhaps have some threads doing I/O while other threads doing computation. Other rearrangements include moving some of your code into the kernel or alternatively moving pieces from the kernel into user mode to reduce the "depth" of the stack. (RTOS or getting rid of the OS are more extreme examples of "rearranging")

3. Use different hardware, e.g. historically DSP vs. general purpose CPUs.

What I'm trying to say, apparently not very clearly, is that I see this as an engineering problem, not a fundamental problem, and an old problem, not a new problem. It's hard to talk about this though without a concrete example.

Instruction level parallelism is in nanoseconds, not microseconds, a 2GHz CPU can execute 10000-ish ops across all its execution units in a microsecond. So that's not a bottleneck in processing things that are on the order of microseconds. As long as what you're switching every microsecond is "small" you can perform many concurrent tasks at microsecond blocks. If you need to flush all your caches and remap all your memory then that will be a problem. But caches have a certain latency because they are very close to the CPU and they are sized under similar constraints. This is where it's up to us to engineer the right solution under the constraints.

If you need microsecond latency on your processing, let's say you want to do real-time processing on 1000 video conference channels and you never want to introduce more than 1ms delay, then you need to engineer your system differently than a system that doesn't have that sort of requirement. That's always been the case. If the expectation is that you'll run 1000 Linux processes that read and write the video from disk and this will just somehow work then that's not going to happen. It seems we're trying to "backsplain" why something that wasn't designed to be a real time system isn't behaving like a real time system. Linux is a general purpose OS built for certain tasks and the CPUs we tend to run it on are also built for certain tasks. But there are other OSes and other CPUs (or FPGA or ASICs or GPUs). Not to say there isn't a lot of flexibility in Linux on x86 for different situations.


They mention point 1 in the article - essentially this is what the HPC crowd are doing with their static environment that's easier to batch, but they're saying it doesn't work in the Google-type dynamic environment because they have a lots of heterogeneous threads doing slightly different things for different users.

Instruction level parallelism is in nanoseconds, not microseconds, a 2GHz CPU can execute 10000-ish ops across all its execution units in a microsecond. So that's not a bottleneck in processing things that are on the order of microseconds.

It's not about the CPU being a bottleneck at all. It's about not having your CPU either sit idle for that time in which it could have done 10,000-ish ops, or spend that amount of time running scheduler code that is pure overhead.

The idea is millions of threads (across your whole compute environment, not one machine!) that are each doing compute/usec-latency-io/compute/usec-latency-io/... and scheduling those to make efficient use of your CPU resources. Coupling each usec-latency-io request with a heavyweight reschedule that's on the same order of magnitude is clearly not efficient, and neither is spinning for that time in which your CPU could have been doing a quantum of useful work.

It seems we're trying to "backsplain" why something that wasn't designed to be a real time system isn't behaving like a real time system.

This is going down the wrong path - the issue at hand isn't about real-time latency at all. The reason rescheduling to wait for this kind of IO isn't a good solution isn't because of latency introduced, it's because you might end up wasting 50% or more of your CPU time just running scheduler code - it's purely efficiency.

Use different hardware, e.g. historically DSP vs. general purpose CPUs.

This is also covered in the article, the whole idea is for this to be workable by mere-mortal programmers on general purpose platforms.


Let's talk actual examples. I've read the article. Can you give me a concrete example of the "attack of the killer microseconds"? Something that doesn't boil down to "the article said so" or "we analyzed Google's data center and we found we're running poorly written software on it?". Something that is fundamental. I've worked on software where the amount of context switching was so high that it consumed a significant amount of cycles (on hypervisors this used to be a bigger problem btw, and maybe still is) and we analyzed the problem and fixed it reducing the number of context switches by a factor of 10. The bottom line there was poorly designed software.

I understand the author wants to have his cake and eat it or in his words "code simplicity and programmer productivity". He wants some magical architectures where software engineers don't have to worry about tradeoffs, the OS or the hardware. They should be able to write 500 lines Python and the whole thing will run in a single cycle and do I/O and compute at the same time. They should be able to write 1000's of threads with a working set of gigabytes and not worry about the size of the L1 cache. They should be able to write a 10GB file 1 byte at a time. That's unlikely to happen. In the meantime specific engineering problems can be solved with specific techniques.

The author's conclusion:

System designers can no longer ignore efficient support for microsecond-scale I/O, as the most useful new warehouse-scale computing technologies start running at that time scale. Today's hardware and system software make an inadequate platform, particularly given support for synchronous programming models is deemed critical for software productivity.

Micro-second scale I/O isn't new. Low level I/O devices always operated at these time scales. Writes to disks with memory buffers. Sequential disk reads. Networking. Video cards. From a software perspective very low overhead concurrency goes back to coroutines in the 1950's through various forms of lightweight threads. The ASICs inside your typical rack-top switch are running software that is absolutely capable of handling this sort of workload efficiently.

I think what is a relatively new phenomena is that there are a lot of developers who don't understand anything about the underlying hardware. The other thing that has happened is that systems have become more complex. It's not easy for someone to go from "wasted cycles" in the data center to the design issues that are the cause of these inefficiencies.

People have tried many different permutations of software and hardware architecture and I think it's unlikely that they completely missed the magical one that lets you have your cake and eat it. The primary reason being that the constraints are physical ones. Can we do things a little more cleverly? Sure, we do that all the time.

I keep coming back to real-time OSes not because this is necessarily a hard real-time situation but because real-time OSes have forever had very lightweight threading, the ability to handle interrupts in lower layers of the stack, zero copy network stacks etc. They are an off-the-shelf solution for some of the problems discussed. Obviously (or not so obviously) there is a price to pay for giving up a lot of the "luxuries" of high level OSes but mere-mortals are absolutely capable of working in this environment.

If it's not clear ;) I didn't like this article.

EDIT: Another IMO interesting observation is that a lot of the techniques for dealing with storage I/O are from times where CPUs were a lot slower. For the longest time ~10ms seek times were relatively constant as CPUs got faster by orders of magnitude. So in some sense a microsecond latency SSD on a 3GHz CPU is already benefiting from a lot of work that went into shaving cycles from the millisecond I/O on the 8Mhz CPU of past.


As I've said before, a context switch is really cheap on modern hardware. What kills it if you need to drag all sorts of dirt through all the caches. I.e. cold caches kill performance. Hardly news.


A context switch is only cheap if the task you switch to doesn't do anything. Even 2 decades ago you could to upwards of 100K task switches per second on relatively anemic hardware. That's never been the problem.


In terms of cycles the switch has become more expensive since a lot of registers were added that need saving/restoring. (But since frequencies have increased a lot the absolute time shrank when we reached the "3 GHz era"; I think it's been pretty much constant for the last ~10 years or so). But you are right that the switch itself on modern-era (=32+ bit few-chip systems with MMU) hardware was never really that expensive (on the order of 10-20 us).


Thousands of instructions is not "really cheap". You can cause a lot of slowdown by accidentally calling into the kernel in a tight loop, even if L1 is big enough for both.

A different CPU design could make context switches essentially free.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact

Search: