What does I/O bound mean? (2008) (erikengbrecht.blogspot.com)
79 points by higerordermap 20 days ago | 55 comments

Please stop using bonnie (I know, it's a 2008 post, but in case it inspires anyone). It was useful at the time, but written in a different era, and nowadays produces misleading results on multi-threaded I/O stacks (along with other problems). I spoke to Tim about this issue back at Sun, and he did write about it.

Today I usually use fio for disk/FS benchmarks, as it's frequently updated by Jens Axboe, who is also responsible for the modern Linux multiqueue block I/O implementation.

fio is definitely the way to go! Are you looking at what Jens has done with io_uring as well for async IO?

Yes, io_uring is great, and not just for I/O: it's a generic asynchronous interface for other kernel calls as well.

In order for something to be "bound" by something else there must be an element of concurrency. People often forget this and I think it confuses the definition.

At this point people say "bound by X" when what they really mean is "most of the time is spent doing X".

If you spend 5 seconds doing I/O and then afterwards 1 second doing CPU heavy work, you are not I/O bound. You can tell this because making the CPU work take 0.5 seconds less time still saves you 0.5 seconds. It's not a bound at all.

In order for this to be I/O bound you would need to do the CPU heavy work during the I/O. Once the task is done that way, any optimisation on the CPU heavy work has precisely zero effect on runtime. This is how you know it's bound.
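To make that concrete, here's a minimal Python sketch (simulated timings, not from the article) where the CPU-heavy step runs concurrently with the I/O wait, so the total is governed by the I/O alone:

```python
import threading
import time

def simulated_io(seconds):
    # Stand-in for a blocking read (network, disk, etc.)
    time.sleep(seconds)

def cpu_work(n):
    # Stand-in for the CPU-heavy step
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.monotonic()
io = threading.Thread(target=simulated_io, args=(1.0,))
io.start()
cpu_work(1_000_000)  # runs while the I/O is "in flight"
io.join()
elapsed = time.monotonic() - start
# elapsed is roughly the 1.0s I/O wait: the CPU work is hidden behind it,
# so halving the CPU time would barely change the total. That is the
# sense in which this task is I/O bound.
print(f"{elapsed:.2f}s")
```

If instead the CPU work ran after `io.join()`, any CPU optimisation would shave the total directly, which is the non-bound case described above.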

I think the reason that web programmers have become confused by this particularly is because on a web server there is almost always inter-request concurrency. You run a bunch of workers on the same CPU and now your waiting-for-db time is concurrent with the doing-cpu-work time for a different request.

The system as a whole can be considered I/O bound here because no matter how much faster you make the CPU, the number of requests per second you can process doesn't go up, and so it seems like you might as well not care about how fast the cpu-work part is.

And that's all well and good, but it's cold comfort for any individual request that still has to wait the full amount of time for a request to be served. You could absolutely improve the situation for any given request by improving the speed of the CPU or optimising the code it runs.

While correct, I think that's a bit too pedantic. For example, if you have a program that spends 5 seconds doing I/O and 0.05 seconds on the CPU, for all intents and purposes it's I/O bound. Given the fast CPUs in today's world, this kind of workload is pretty common.

It's probably better to look at things proportionally, rather than in terms of absolute time saved. And it's definitely more useful to consider terms like "IO bound" and "CPU bound" as relative and approximate, rather than unattainable absolutes.

Using the original numbers of 5 seconds IO, 1 second CPU: If doubling the speed of the CPU portion only makes the overall process 8% faster, it's not meaningfully CPU bound. But doubling the IO speed would reduce the total job time by 41%, which is a pretty big speedup overall.

Using your example of 5 seconds IO and 0.05 seconds CPU: doubling the CPU speed gives a 0.5% overall reduction in job time, and doubling the IO performance gives a 49.5% overall reduction. So this example can be meaningfully said to be more IO bound than the first example.
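Those percentages are just Amdahl's law applied to a serial job; a quick Python sketch to check the arithmetic:

```python
def overall_reduction(io_s, cpu_s, io_factor=1.0, cpu_factor=1.0):
    """Fractional reduction in total (serial) job time after speeding
    up the I/O part and/or the CPU part by the given factors."""
    before = io_s + cpu_s
    after = io_s / io_factor + cpu_s / cpu_factor
    return 1 - after / before

# 5s I/O + 1s CPU
print(overall_reduction(5, 1, cpu_factor=2))     # ~0.083 -> ~8%
print(overall_reduction(5, 1, io_factor=2))      # ~0.417 -> ~41-42%

# 5s I/O + 0.05s CPU
print(overall_reduction(5, 0.05, cpu_factor=2))  # ~0.005 -> ~0.5%
print(overall_reduction(5, 0.05, io_factor=2))   # ~0.495 -> ~49.5%
```

The larger the skew between the two parts, the closer "doubling the fast part" gets to zero benefit, which is the proportional sense of "bound" used here.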

Well, almost. A good pedantic definition would express i/o rates as a percentage of the maximum possible. If those numbers were quite high and the CPU part was in the i/o loop (read/write, then compute, repeat), then it's indeed i/o bound: no amount of work on the CPU part will appreciably drive down absolute time. If the CPU part runs only after the i/o, CPU improvements can help only a little, but at least CPU delays aren't making the i/o artificially slower by delaying when new i/o can start.

What's a little misleading in the example is that the problem setup is already half the answer. In complex commercial apps, just arriving at an apportionment of 5 and .05 for some narrow set of use cases is already an achievement.

If the 5 and .05 come in the mixed case (CPU in the i/o loop), then there is a choice about what to fix to drive down absolute time. If the numbers were not so skewed, it might take more debugging to suss out what's going on. Here it's obvious from the givens.

If they come separately, we conclude the i/o is slow (pedantically, i/o bound) and there's no engineering need to figure out where the labor should go. Here it's pointless to wonder about CPU or memory.

In sum, the emphasis is on the management end: our system is slow, but what are we going to fix, and why?

I was only commenting on this on HN just the other day. When most people say IO bound, what they really mean is "There's a hot CPU but it's across a network", i.e.: "I wrote really inefficient SQL queries, therefore I'm IO bound, therefore I don't have to care about CPU" - and the process pushes further and further downstream, as every service talking to one of these "IO bound" services also becomes "IO bound".

This is obviously sometimes the case. But more often I've seen IO bound apps spending all their time on network round-trip latency, i.e. not a few poorly performing SQL queries, but a thousand queries which each take a millisecond or two.

Totally. I've seen similar things. I've also seen thread contention (such as on a connection pool) that can look a lot like a slow database query/ an "IO bound" workflow. I think profiling is just really hard and lots of code tends to be very inefficient at actually performing IO.

I spent my last 2 weeks optimizing a decade old SaaS to handle massive traffic spikes from one of our biggest customers. We had other customers serving similar amounts of traffic, but with smaller data sets.

- Increasing the number of servers running the app. App connections were still stacking up, but this gave us more breathing room for connections to stack and let us handle small spikes.

- The database seemed very overloaded with so many concurrent connections. I began putting everything I could into memcached (we already had a lot of data in it, but I put more).

- Now we had a cache hotspot. Some digging found an age-old bug in our cache driver where it didn't actually keep things in process memory after fetching from memcached, and we had a medium-sized key getting fetched hundreds of times per request.

- Days and days of app optimization after profiling. Our average response time improved by more than 50%. The site would still start collapsing under a little load.

- While profiling a single request, all queries would complete very quickly (<50ms). Somehow the DB was still the bottleneck. We overprovisioned it significantly and it still would collapse.

- I started collecting counts and timings for cumulative and maximum single cache/db, read/writes to our log stack.

- the bottleneck was clearly still the DB.

- at this point we were desperate. Thinking it might have been an issue in the underlying VM we live migrated the DB to a new VM.

- the database was still the bottleneck.

In the end the thing that fixed it? A simple OPTIMIZE TABLE.

Somehow ANALYZE TABLE hadn’t detected anything but rebuilding the table still fixed the issue.

If anyone is looking for a good load testing tool, Vegeta was invaluable. I highly recommend it.

These are hinted at by high system% usage when your system is busy (i.e. higher than, say, 10%). If it looks CPU bound but spends a lot of time in the kernel, then thread switching or synchronization (e.g. mutexes) is happening too much.

And worth noting that this can be missed during development if you have a good network to the server but customer is using a not so great WiFi network.

We had this at work, where one customer complained some operation was very slow, taking around 30 minutes. Couldn't pin it down, copied their database to my machine and it took only a couple of minutes. A bit of digging and I found that in this case this module caused a few million of fairly trivial SQL statements to be executed.

Each took less than a millisecond to execute locally, but round-trip time over WiFi can be 10 milliseconds or more. So suddenly 2 minutes becomes over 20 minutes.

I asked how the client connected to the LAN, and it was indeed via WiFi. As a quick fix we got the customer to use a network cable, which did indeed reduce the running time to a few minutes. The proper fix was to add a bit of caching.
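The exact figures weren't given, but the back-of-the-envelope math is easy to sketch in Python (query count and per-round-trip latencies below are illustrative assumptions, chosen to match the "2 minutes becomes 20 minutes" ratio):

```python
queries = 2_000_000  # "a few million" trivial statements (assumed count)

def total_minutes(rtt_ms):
    # Serial round trips: total time is simply queries x per-query RTT
    return queries * rtt_ms / 1000 / 60

print(total_minutes(0.06))  # ~2 min with sub-ms round trips on a wired LAN
print(total_minutes(0.6))   # ~20 min if each round trip is 10x slower
```

Because the per-query cost is pure latency, batching the statements (or caching, as above) attacks the multiplier directly, while a faster server barely helps.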

> And worth noting that this can be missed during development if you have a good network

It can be even worse if development is done using a local database, possibly on super fast local SSD. The latencies can be orders of magnitude lower, hiding performance issues that would be obvious even with only a millisecond of additional latency.

I've seen many vendors claim that they only "support" monolithic single-machine setups (sometimes even virtualization has "unacceptable overhead") when it's blatantly obvious that the application is just written with the assumption that database latency is approximately zero.

Hey Static, thanks for the thought-provoking discussion the other day!

Let me ask you this: let's say you're downstream of a very slow server that is outside of your control. You need to access it, and there's nothing you can do to speed it up. Are you IO bound in this case?

If you're on a group hike/bike/ride/run, the group goes at the speed of the slowest person.

It may be that you didn't sleep well last night, or that your stomach is bugging you, but the reason it's going to take an extra hour today is because Tim has a blister.

Whatever is holding up that server is what's holding up the entire train of communication. If that is where $5000 will fix the problem, don't talk about any other problems, you'll just confuse management.

The next place to look, if that doesn't work: if you can't fix "Tim's foot", can you offload some of the things "he" is dealing with? Dump some of his load, work stealing, etc. But these are just mitigations.

Somebody read The Goal.

But this is the summary of a large part of that book: Systems have bottlenecks, address the bottleneck. Spending time on improvements elsewhere will not help the bottleneck, you're just wasting time/money. Data/material is going into it too fast to process (work is building up). Improving before the bottleneck just makes its queue fill faster. Improving after the bottleneck just creates a segment of the system that's starved for work. So focus on the problem at hand, once it's addressed, focus on the next bottleneck.

That someone was a coworker, but a bit of this is just queuing theory.

Buffering up does improve things when there is variability in the processing time for each task, but it can also make average wait time hell. So you'd better be sure whether latency or throughput is really your primary concern.
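A toy single-server queue simulation (a rough M/M/1 sketch, not from the thread) shows the effect: throughput keeps rising with utilization, but average wait explodes as the server approaches saturation:

```python
import random

def avg_wait(arrival_rate, service_rate, n=200_000, seed=42):
    """Average time a task spends waiting in a FIFO queue before service,
    with exponentially distributed inter-arrival and service times."""
    rng = random.Random(seed)
    t = 0.0        # arrival clock
    free_at = 0.0  # when the server next becomes free
    waited = 0.0
    for _ in range(n):
        t += rng.expovariate(arrival_rate)
        start = max(t, free_at)
        waited += start - t
        free_at = start + rng.expovariate(service_rate)
    return waited / n

# M/M/1 theory: mean queueing wait = rho / (mu - lambda)
print(avg_wait(0.5, 1.0))  # ~1.0 service times at 50% utilization
print(avg_wait(0.9, 1.0))  # ~9.0 service times at 90% utilization
```

That nonlinearity is why "just add a buffer" trades latency for throughput: the queue absorbs variability, but every buffered task pays the wait.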

Honestly, this is a reasonable question. I'm not really sure, and I think this is a gap in terminology (or even tooling), as I think we both sort of stumbled into the other day.

I want to say no? If I were to reframe the question as a slow function that I had to call, I wouldn't call it IO bound, so I think not - the fact that there's IO in between the two components is more of a confounding factor in profiling than an actual performance limitation.

I've definitely seen it referred to as IO bound when an upstream server just can't get you your input fast enough, whether that's because you've maxed out the medium (e.g. you're bound by a 10/100/1000 Mbit link) or because you're bound by how fast the disks can serve up bits. To the downstream it's generally encountered as the same thing: the inability to consume IO faster than it's made available.

Good answer :)

I'm not sure I'd call this "bound" at all, though I like the sibling's "dependency bound". To me this would be "blocked". "Bound" means I'm doing something productive, and I'm "bound" on whatever is limiting my speed. If I'm waiting on an external API outside my control, I'm not "bound", I'm "blocked" (stuck sitting around doing nothing, making no progress until they deign to get back to me).

This is a nicer and more technically accurate definition, but I don't believe it's at all found in the wild, which limits its usefulness. I think the "bound" in common usage is in the sense of destination rather than limit, i.e., bound for an i/o subsystem, whether under local control or not. This then implies the local actor entering the "blocked" state that you mention, during which time the local actor is free, in principle, to do something else.

In short I think I/O bound is a concise way of saying that this work will leave my concern and will pass back into it at some later time.

IO and CPU bound are meant in the sense of upper and lower bound, not destination, but limits on the system.

Saying something is CPU bound means that the CPU is the bottleneck. For instance, data IO doesn't limit the system throughput of iteratively running a hash algorithm a million times; the CPU does. An IO bound process might be something like reading a lot of data off a disk (or over a network) with a small, quick transformation on the CPU, where you saturate the data bus but the CPU is comparatively idle.

Ok hmm, thank you for the correction.

Isn't a "bottleneck" the traditional term for this?

In this case, I call it “dependency bound”. Anything not in your direct control is a dependency, and if that’s your limiting factor, call it such.

I/O means precisely that: input / output. If I/O interactions with a remote peer are (for whatever reason) stalling work in the local node, as far as the local node is concerned, it is I/O bound.

If you own both local and remote peer, you then instrument the remote peer. In your hypothetical scenario, CPU saturation would finger the query provided, in which case you are back to correcting that code in the I/O bound node. So, your calling/local node was I/O bound, and your remote/server node was CPU bound.

Sometimes this very same I/O bound pattern -- low cpu utilization on the client node due to blocked threads waiting on I/O from remote peers -- is simply due to insufficient resources and requires changes in capacity planning.

Extending your hypothetical: let's assume that as the system in question (post-query-optimization) is used over time, the extent of data to be processed grows, with the result that the optimized query now saturates the CPUs while supporting the same number of clients as before. Do you still insist that the processing conditions of a remote peer allow for mis-characterizing the processing condition on the local nodes as "CPU bound"?

Uh, I guess somehow I wasn't clear that I agree that the situation I mentioned is not IO bound but is in fact a mistake I see a lot.

I don’t think I’ve ever encountered someone calling a slow query bottleneck “I/O bound” unless the query itself was I/O bound. I usually just see it explained as “bottlenecked by a slow query”.

Not to justify the abuse of the term, but here, the implication is that since the query is running on a database, and the database is accessible only over a network, that it is blocked on network I/O.

I have usually heard it in reference to a serialized process that’s spending most of its time waiting for hard drive / file system requests to go through.

It depends on where you draw the lines around your boxes.

The whole system together is CPU bound. But the individual part might be 'IO bound'.

Exactly this.

The individual subsystems can be I/O bound on other subsystems whereas considered as an entire system, it could be CPU-bound.

So, yes, one subsystem "blocking" on network I/O definitely is I/O bound regardless of the nature of that I/O, be it database-related or not, it's still I/O.

The system at the other end (the db subsystem) may also be I/O (or CPU) bound, but that is distinct from the clients bottleneck.

I wouldn't call this IO bound, and I'd argue anyone calling it IO bound is wrong and needs coaching. If you're waiting on a query to execute, you're bound on whatever the query is bound on (IO/memory/CPU).

Also, "I'm IO bound, so I don't care" is a weird response. It doesn't matter where I'm bound: if it needs to go faster, I need to examine what I can do to remove that bottleneck. What I'm bound on just gives me a clue where most of my tuning/troubleshooting energy should go, not license to shrug and forget the problem.

But maybe I'm just lucky to have always worked with good people.

Sure, my holistic service is possibly bound by compute in some fashion on the SQL host, but what is claimed is that the process/host that is issuing those queries to the db isn't compute bound -- it spends the majority of its time waiting for input (from the db).

I feel like you're agreeing with me?

edit: To be clear, when I say "most people" I mean "many people I've talked to, especially with regards to excuses for not caring about their service performance", and that they are incorrect.

Yeah we definitely aren't disagreeing. I think my comment was mostly driven by surprise that you've apparently encountered a large number of people who just throw up their hands at a slow database. I've worked with some people who really frustrated me ("this operator isn't doing what I expect, and I don't see what's wrong, must be a bug in sql server") but even they'd try to tune a slow query. But I that might be more an artifact of working in a shop where the devs owned the database bottom to top (no dba's, and the sysadmins certainly don't know sql). I forget that for a lot of devs writing sql might not be a daily occurrence.

Oh, gotcha, yeah maybe I'm just lucky like that haha

So true about not coding for performance "because the disks are slow". Almost all disk i/o, especially big-iron, has cache in front of it.

Maybe in the case of cloud storage - what does I/O bound mean then? It should still conceivably be the same IMO.

Here is a recent HN discussion on this topic: https://news.ycombinator.com/item?id=24519786

My current answer is... it's complicated.


I'd say this is more relevant now than in 2008. With SSDs and 100Gbit network connections, it's a lot easier to saturate a CPU with data. And a lot of cloud providers now recommend colocating computing such that it's as close to the data as possible.

It's both easier and harder. Easier in that raw storage bandwidth has increased faster than single-core performance. Harder in that to access that bandwidth you have to optimize, because latency has not decreased; in the cloud it has actually increased. I.e. cloud storage can be extremely fast, but only if it's multi-stream storage. Similarly, cloud compute can be fast, but only if it's multi-threaded.

But the main point of the (2008) tag is that a lot of the links are broken and the tools mentioned are outdated. The general concept ("you can probably optimize your storage access") is still true, but almost any concept that general usually is; it's the details around current caveats that are interesting.

This is what's really interesting to me right now. A 1 TB 980 Pro SSD (I know, not a "data center" SSD, but it's the most recent Gen4 SSD I've seen announced) can handle ~300MB/s [1] of random writes at realistic queue depths up to 32. Previously, with a 10 Gbit (1.25 GB/s) connection, 4 of those SSDs could handle that incoming write traffic, which is no sweat and can actually physically fit in a server. But as you go to 100 Gbit (12.5 GB/s) of incoming random writes to a storage server, something's gotta give. At this level of traffic, you'd need 41 of those SSDs to handle that incoming traffic without a write buffer on the server. This is a simplification of a very complex topic, but it's what I/O bound means to me, at least.

[1] https://www.anandtech.com/show/16087/the-samsung-980-pro-pci...
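That drive-count figure is easy to reproduce; a quick sketch using the ~300 MB/s random-write number from the comment above:

```python
import math

link_gbit = 100
link_bytes_per_s = link_gbit / 8 * 1e9  # 12.5 GB/s of incoming writes
ssd_write_bytes_per_s = 300e6           # ~300 MB/s random writes per drive

drives = link_bytes_per_s / ssd_write_bytes_per_s
print(drives)             # ~41.7 drives' worth of sustained write traffic
print(math.ceil(drives))  # so 42 drives to fully absorb the stream
```

The same arithmetic with a 10 Gbit link gives ~4.2 drives, matching the "4 of those SSDs" figure for the older network generation.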

Note that the numbers presented in that review are for single-threaded benchmarks, when the current best practices for a server application that really has a lot of write traffic would be to split the work over multiple threads, and preferably use io_uring for both network and disk IO. Even on those consumer-grade SSDs, that would allow for much higher throughput than what's presented in that review—though most of those consumer-grade SSDs wouldn't be able to sustain such write performance for very long.

Right now, a lot of software that wasn't written specifically with SSDs in mind is effectively bottlenecked on the CPU side with system call overhead, or is limited by the SSD's latency while using only a fraction of its potential throughput because the drive is only given one or a few requests to work on at a time.

If you do write your software with fast storage in mind, and have the fastest SSD money can't quite buy, the results can be very different: https://lore.kernel.org/io-uring/4af91b50-4a9c-8a16-9470-a51...

Wow, the author himself! Thanks for the great details. I missed the detail in the article about single threaded testing, so thanks for clarifying. I definitely see the specs for the Samsung drive show higher peak random IOPS but I'm always trying to look past the marketing hype and see what's more realistic. What allows an enterprise-grade SSD to sustain writes longer than a consumer-grade SSD? I assume it's cache related (SLC or DRAM), but I'm still learning.

Awesome results from Jens there...

Enterprise SSDs often (but not always) reserve more spare area out of the pool of flash on the drive, which helps a lot with sustained write speeds, especially random writes; the garbage collector is under less pressure to free up erase blocks for new writes.

Consumer SSDs use SLC caching which allows for much higher write speeds until that cache runs out, but then the drive is stuck with the job of cleaning up that cache while also still handling new writes. So for the same amount of raw NAND capacity and same usable capacity, an enterprise drive that doesn't do any SLC caching at all will tend to have a better long-term write speed than the post-cache write speed on a consumer drive.

Once you factor the storage latency to/through the server the IOPS at realistic queue depths fall by an enormous factor. This is usually balanced by putting tons of servers on it at which point 41 NVMe drives isn't really a lot - will usually fit in a single rack mount storage enclosure.

The issue though is that each server itself is starting to have 100 Gbit network connections, not just the rack itself.

Bandwidth isn't the problem, latency is. Each server could have 400 terabit interfaces and you'd still have garbage IOPS compared to local PCIe on a laptop at any reasonable queue depth.

Can you help me understand why latency kills IOPS? There's lots of work showing NVMe over TCP bringing similar performance and latency from a remote server as compared to a local SSD. Here's an example:


Interestingly, for most people the network is just as slow today as 10 years ago, because 10/100 GbE has stalled in the datacenters and the trickle-down that used to happen seems to have stopped. (Gigabit ethernet was already 10 years old in 2008.)

Go back to the 1960s and people were already talking about i/o vs cpu bound. The units for measuring both have changed but the tradeoff is evergreen.

Supercomputer: device used to turn compute-bound problems into i/o bound ones.

Mainframe: device used to turn i/o-bound problems into compute bound ones.

tl;dr: In 2020, I would expect most programs to be ultimately CPU bound (meaning the cost of the cpu is the dominant cost – you can upgrade other 3 resources more cheaply until cpu becomes the bottleneck). Hence cpu optimizations matter today more than it did in 2008.

A program's execution time is a function of cpu performance, memory performance, disk performance or network performance.

Performance here is a complex phenomenon – usually we measure performance in terms of 'latency' and 'throughput' of the operations we are executing – and its characteristics vary by attributes like – sequential vs random, block sizes, simple vs complex instructions, instruction offloads, queue depths, scheduling queues, context/mode switching etc.

Imagine a naive execution of a program where all operations (cpu, memory, disk, network) are sequentially scheduled and executed. We can make two observations about such a program:

1. This program's execution time can be sped up by upgrading to a faster cpu or memory or disk or network – which of these we should upgrade first depends on what operation the program spends most of its execution time vs what costs less to upgrade.

2. While the program is executing on one of the four resources, the other resources are idling. In other words resource utilization will be less.

#1 happens every few months/years as the hardware becomes faster for the same price.

#2 is addressed by modern cpus, compilers and operating system schedulers through mechanisms like pipelining, parallelizing, prefetching, offloading etc. – to increase the overall utilization of all the resources while the program is executing. These techniques turn this naive program execution into a complex program execution.

The automated optimization of the naive program in this way is not perfect/complete. A programmer will have to adjust the program to utilize the idling resources on a computer. This is the performance optimization work. Even after doing this, the program execution will be constrained by one of the 4 resources.

In this situation we can say its execution time is bound by that resource's performance. Theoretically, if that resource were to become faster (through a hardware upgrade), then the program would no longer be bound by that particular resource and would instead be bound by another. By definition, while the program's execution time is bound by one resource, the other resources are under-utilized.

In designing hardware+software systems, a purist objective is to ensure least resource underutilization occurs while ensuring the program's performance objective is met. Since resources cost differently, the under-utilization has to be weighed by its cost.

This resource optimization at a datacenter level takes on a different dimension. A common mistake I have seen is to provision $5000 servers (where the majority of the cost is cpu and memory) and skimp on the network bandwidth between the servers. To build a full Clos non-blocking inter-server network at a reasonable enterprise scale would cost less than $300 per 10GE port. I have seen people save $150 per port and deploy an oversubscribed network that results in nondeterministic network performance and costs much more in wasted utilization of those $5000 servers.

Another common mistake I have seen is to provision too much RAM (expensive in purchase cost as well as running cost - power/cooling) while not provisioning enough high-speed SSDs.

In 2020, I would expect most programs to be ultimately CPU bound (meaning the cost of the cpu is the dominant cost – you can upgrade other 3 resources more cheaply until cpu becomes the bottleneck). Hence cpu optimizations matter today more than it did in 2008.
