Today I usually use fio for disk/FS benchmarks, as it's frequently updated by Jens Axboe, who is also responsible for the modern Linux multiqueue block I/O implementation.
At this point people say "bound by X" when what they really mean is "most of the time is spent doing X".
If you spend 5 seconds doing I/O and then afterwards 1 second doing CPU heavy work, you are not I/O bound. You can tell this because making the CPU work take 0.5 seconds less time still saves you 0.5 seconds. It's not a bound at all.
In order for this to be I/O bound you would need to do the CPU heavy work during the I/O. Once the task is done that way, any optimisation on the CPU heavy work has precisely zero effect on runtime. This is how you know it's bound.
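A toy sketch of the distinction (hypothetical timings, with the I/O simulated by a sleep):

    import threading, time

    def do_io():            # stand-in for a 5 s read; real code would block on disk/network
        time.sleep(5)

    def do_cpu_work():      # stand-in for ~1 s of CPU-heavy work
        t_end = time.monotonic() + 1
        while time.monotonic() < t_end:
            pass

    # Sequential: ~6 s total; shaving the CPU part directly shaves the total.
    start = time.monotonic()
    do_io()
    do_cpu_work()
    print("sequential:", round(time.monotonic() - start, 1), "s")

    # Overlapped: ~5 s total; the CPU work hides behind the I/O,
    # so optimising it changes nothing until it exceeds the I/O time.
    start = time.monotonic()
    io_thread = threading.Thread(target=do_io)
    io_thread.start()
    do_cpu_work()
    io_thread.join()
    print("overlapped:", round(time.monotonic() - start, 1), "s")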
I think the reason that web programmers have become confused by this particularly is because on a web server there is almost always inter-request concurrency. You run a bunch of workers on the same CPU and now your waiting-for-db time is concurrent with the doing-cpu-work time for a different request.
The system as a whole can be considered I/O bound here because no matter how much faster you make the CPU, the number of requests per second you can process doesn't go up, and so it seems like you might as well not care about how fast the CPU-work part is.
And that's all well and good, but it's cold comfort for any individual request that still has to wait the full amount of time for a request to be served. You could absolutely improve the situation for any given request by improving the speed of the CPU or optimising the code it runs.
Using the original numbers of 5 seconds IO, 1 second CPU: If doubling the speed of the CPU portion only makes the overall process 8% faster, it's not meaningfully CPU bound. But doubling the IO speed would reduce the total job time by 41%, which is a pretty big speedup overall.
Using your example of 5 seconds IO and 0.05 seconds CPU: doubling the CPU speed gives a 0.5% overall reduction in job time, and doubling the IO performance gives a 49.5% overall reduction. So this example can be meaningfully said to be more IO bound than the first example.
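Spelled out, with the same assumed numbers:

    def reduction(io, cpu, io_speedup=1.0, cpu_speedup=1.0):
        """Overall % reduction in runtime for a sequential io-then-cpu job."""
        before = io + cpu
        after = io / io_speedup + cpu / cpu_speedup
        return 100 * (before - after) / before

    print(reduction(5, 1,    cpu_speedup=2))  # ~8.3%  -> barely CPU bound
    print(reduction(5, 1,    io_speedup=2))   # ~41.7%
    print(reduction(5, 0.05, cpu_speedup=2))  # ~0.5%
    print(reduction(5, 0.05, io_speedup=2))   # ~49.5%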
What's a little misleading in the example is that the problem setup is already half the answer. In complex commercial apps, just arriving at an apportionment of 5 and 0.05 for some narrow set of use cases is already an achievement.
If we see 5 and 0.05 in the mixed case (CPU work inside the I/O loop), then there is a choice about what to fix to drive down absolute time. If the numbers were not so skewed, it might take more debugging to suss out what's going on. Here it is obvious from the givens.
If the two are separate, we conclude the I/O is slow (pedantically, I/O bound) and there's no engineering need to figure out where the labor should go. Here it's pointless to wonder about CPU or memory.
The emphasis, in sum, is on the management end: our system is slow, but what are we going to fix, and why?
- Increasing the number of servers running the app. App connections were still stacking up, but this gave us more breathing room to handle small spikes before connections started piling up.
- The database seemed very overloaded with so many concurrent connections. I began putting everything I could into memcached (we already had a lot of data in it, but I put more).
- now we had a cache hotspot. Some digging found an age-old bug in our cache driver where it didn’t actually keep things in process memory after fetching from memcached, and we had a medium-sized key getting fetched hundreds of times per request (a sketch of that kind of per-request memoization is further down this comment).
- Days and days of app optimization after profiling. Our average response time improved by more than 50%. The site would still start collapsing under a little load.
- While profiling a single request, all queries would complete very quickly (<50ms). Somehow the DB was still the bottleneck. We overprovisioned it significantly and it would still collapse.
- I started collecting counts and timings for cumulative and maximum single cache/DB reads and writes into our log stack.
- the bottleneck was clearly still the DB.
- at this point we were desperate. Thinking it might have been an issue in the underlying VM, we live-migrated the DB to a new VM.
- the database was still the bottleneck.
In the end the thing that fixed it? A simple OPTIMIZE TABLE.
Somehow ANALYZE TABLE hadn’t detected anything but rebuilding the table still fixed the issue.
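Aside: the cache-driver bug above amounted to missing per-request memoization. A minimal sketch of that kind of fix, assuming some memcached-like client object (heavily simplified, not our actual driver):

    # Hypothetical wrapper; `client` stands in for whatever memcached client you use.
    class MemoizingCache:
        def __init__(self, client):
            self.client = client
            self.local = {}          # process-local cache, cleared per request

        def get(self, key):
            if key not in self.local:          # only hit memcached once per key per request
                self.local[key] = self.client.get(key)
            return self.local[key]

        def clear(self):
            self.local.clear()       # call at the end of each request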
If anyone is looking for a good load testing tool, Vegeta was invaluable. I highly recommend it.
We had this at work, where one customer complained some operation was very slow, taking around 30 minutes. We couldn't pin it down, so I copied their database to my machine and it took only a couple of minutes. A bit of digging and I found that in this case this module caused a few million fairly trivial SQL statements to be executed.
Each took less than a millisecond to execute locally, but round-trip time over WiFi can be 10 milliseconds or more. So suddenly 2 minutes becomes over 20 minutes.
I asked how the client connected to the LAN, and it was indeed via WiFi. As a quick fix we got the customer to use a network cable, which did indeed reduce the running time to a few minutes. The proper fix was to add a bit of caching.
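For anyone who hasn't hit this: the shape of the problem is one round trip per statement. A rough sketch, assuming a DB-API style cursor, a hypothetical items table, and psycopg2-style placeholders:

    # N+1 style: one round trip per id -- fine at <1 ms locally,
    # painful at ~10 ms per round trip over WiFi.
    def fetch_one_by_one(cur, ids):
        rows = []
        for item_id in ids:
            cur.execute("SELECT * FROM items WHERE id = %s", (item_id,))
            rows.append(cur.fetchone())
        return rows

    # One round trip for the whole batch (or cache the results, as we ended up doing).
    def fetch_batched(cur, ids):
        cur.execute("SELECT * FROM items WHERE id = ANY(%s)", (list(ids),))
        return cur.fetchall()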
It can be even worse if development is done using a local database, possibly on super fast local SSD. The latencies can be orders of magnitude lower, hiding performance issues that would be obvious even with only a millisecond of additional latency.
I've seen many vendors claim that they only "support" monolithic single-machine setups (sometimes even virtualization has "unacceptable overhead") when it's blatantly obvious that the application is just written with the assumption that database latency is approximately zero.
Let me ask you this: let's say you're downstream of a very slow server that is outside of your control. You need to access it, and there's nothing you can do to speed it up. Are you IO bound in this case?
It may be that you didn't sleep well last night, or that your stomach is bugging you, but the reason it's going to take an extra hour today is because Tim has a blister.
Whatever is holding up that server is what's holding up the entire train of communication. If that is where $5000 will fix the problem, don't talk about any other problems, you'll just confuse management.
The next place to stop if that doesn't work is, if you can't fix "Tim's foot" can you offload some things "he" is dealing with? Dump some of his load, work steal, etc. But these are just mitigations.
But this is the summary of a large part of that book: Systems have bottlenecks, address the bottleneck. Spending time on improvements elsewhere will not help the bottleneck, you're just wasting time/money. Data/material is going into it too fast to process (work is building up). Improving before the bottleneck just makes its queue fill faster. Improving after the bottleneck just creates a segment of the system that's starved for work. So focus on the problem at hand, once it's addressed, focus on the next bottleneck.
Buffering up does improve things when there is variability in the processing time for each task, but it can also make average wait time hell. So you better be sure whether latency or throughput is really your primary concern.
I want to say no? If I were to reframe the question as a slow function that I had to call, I wouldn't call it IO bound, so I think not - the fact that there's IO in between the two components is more of a confounding factor in profiling than an actual performance limitation.
In short I think I/O bound is a concise way of saying that this work will leave my concern and will pass back into it at some later time.
Saying something is CPU bound means that the CPU is the bottleneck. For instance, the data IO doesn’t limit the system throughput of iteratively running a hash algorithm a million times, the CPU does. An IO bound process might be something like reading a lot of data off a disk (or over a network) with a small, quick transformation on the CPU. Where you saturate the data bus, but the CPU is comparatively idle.
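Roughly, in code (illustrative only):

    import hashlib

    # CPU bound: almost no I/O, the CPU is pegged the whole time.
    def hash_chain(seed: bytes, rounds: int = 1_000_000) -> bytes:
        digest = seed
        for _ in range(rounds):
            digest = hashlib.sha256(digest).digest()
        return digest

    # I/O bound: a cheap transformation over a lot of data read from disk;
    # the disk (or network) sets the pace, the CPU is comparatively idle.
    def count_newlines(path: str, chunk_size: int = 1 << 20) -> int:
        total = 0
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                total += chunk.count(b"\n")
        return total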
If you own both local and remote peer, you then instrument the remote peer. In your hypothetical scenario, CPU saturation would finger the query provided, in which case you are back to correcting that code in the I/O bound node. So, your calling/local node was I/O bound, and your remote/server node was CPU bound.
Sometimes this very same I/O bound pattern -- low cpu utilization on the client node due to blocked threads waiting on I/O from remote peers -- is simply due to insufficient resources and requires changes in capacity planning.
Extending your hypothetical, let's assume that as the system in question (post-query-optimization) is used over time, the amount of data to be processed grows, with the result that the optimized query is now saturating the CPUs while supporting the same number of clients as before. Do you still insist on asserting that the processing conditions of a remote peer allow for mis-characterizing the processing condition of the local nodes as "CPU bound"?
The whole system together is CPU bound. But the individual part might be 'IO bound'.
The individual subsystems can be I/O bound on other subsystems whereas considered as an entire system, it could be CPU-bound.
So, yes, one subsystem "blocking" on network I/O definitely is I/O bound regardless of the nature of that I/O, be it database-related or not, it's still I/O.
The system at the other end (the db subsystem) may also be I/O (or CPU) bound, but that is distinct from the clients bottleneck.
Also, "I'm IO bound, so I don't care" is a weird response. It doesn't matter where I'm bound: if it needs to go faster, I need to examine what I can do to remove that bottleneck. What I'm bound on just gives me a clue about where most of my tuning/troubleshooting energy should go, not license to shrug and forget the problem.
But maybe I'm just lucky to have always worked with good people.
edit: To be clear, when I say "most people" I mean "many people I've talked to, especially with regards to excuses for not caring about their service performance", and that they are incorrect.
Maybe in the case of cloud storage - what does I/O bound mean then? It should still conceivably be the same IMO.
My current answer is... it's complicated.
But the main point of the (2008) tag is that a lot of the links are broken and the tools mentioned are outdated. The general concept "you can probably optimize your storage access" is still true, but such general concepts usually are; it's the details around the current caveats that are interesting.
Right now, a lot of software that wasn't written specifically with SSDs in mind is effectively bottlenecked on the CPU side with system call overhead, or is limited by the SSD's latency while using only a fraction of its potential throughput because the drive is only given one or a few requests to work on at a time.
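A crude illustration of the queue-depth point, assuming a placeholder file path and access pattern; a thread pool here stands in for what you'd really do with io_uring or libaio:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def read_blocks(path, offsets, size=4096, workers=32):
        """Keep ~`workers` requests in flight instead of one at a time."""
        fd = os.open(path, os.O_RDONLY)
        try:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(lambda off: os.pread(fd, size, off), offsets))
        finally:
            os.close(fd)

    # Queue depth ~1 (what a lot of legacy code effectively does):
    #     read_blocks("/tmp/testfile", range(0, 4096 * 1024, 4096), workers=1)
    # ~32 requests in flight, letting the drive reorder and parallelize internally:
    #     read_blocks("/tmp/testfile", range(0, 4096 * 1024, 4096), workers=32)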
If you do write your software with fast storage in mind, and have the fastest SSD money can't quite buy, the results can be very different: https://lore.kernel.org/io-uring/4af91b50-4a9c-8a16-9470-a51...
Awesome results from Jens there...
Consumer SSDs use SLC caching which allows for much higher write speeds until that cache runs out, but then the drive is stuck with the job of cleaning up that cache while also still handling new writes. So for the same amount of raw NAND capacity and same usable capacity, an enterprise drive that doesn't do any SLC caching at all will tend to have a better long-term write speed than the post-cache write speed on a consumer drive.
Supercomputer: device used to turn compute-bound problems into i/o bound ones.
Mainframe: device used to turn i/o-bound problems into compute bound ones.
A program's execution time is a function of cpu performance, memory performance, disk performance or network performance.
Performance here is a complex phenomenon – usually we measure performance in terms of 'latency' and 'throughput' of the operations we are executing – and its characteristics vary by attributes like – sequential vs random, block sizes, simple vs complex instructions, instruction offloads, queue depths, scheduling queues, context/mode switching etc.
Imagine a naive execution of a program where all operations (cpu, memory, disk, network) are sequentially scheduled and executed.
We can make two observations about such a program:
1. This program's execution time can be sped up by upgrading to a faster cpu or memory or disk or network – which of these we should upgrade first depends on what operation the program spends most of its execution time vs what costs less to upgrade.
2. While the program is executing on one of the four resources, the other resources are idling. In other words resource utilization will be less.
#1 happens every few months/years as the hardware becomes faster for the same price.
#2 is addressed by modern cpus, compilers and operating system schedulers through mechanisms like pipelining, parallelizing, prefetching, offloading etc. – to increase the overall utilization of all the resources while the program is executing. These techniques turn this naive program execution into a complex program execution.
The automated optimization of the naive program in this way is not perfect/complete. A programmer will have to adjust the program to utilize the idling resources on a computer. This is the performance optimization work. Even after doing this, the program execution will be constrained by one of the 4 resources.
In this situation we can say its execution time is bound by that resource's performance. Theoretically, if that resource were to become faster (through a hardware upgrade), then the program would no longer be bound by that particular resource and would now be bound by another resource. By definition, while the program execution time is bound by one resource, the other resources are under-utilized.
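A back-of-the-envelope sketch of that reasoning, with made-up per-resource numbers:

    # Naive sequential model: total time is just the sum of time spent on each resource.
    times = {"cpu": 3.0, "memory": 1.0, "disk": 6.0, "network": 2.0}   # seconds, hypothetical

    bound_by = max(times, key=times.get)
    print("bound by:", bound_by)                       # -> disk

    # Upgrade the bounding resource (say, a 2x faster disk) and re-check:
    times["disk"] /= 2
    print("now bound by:", max(times, key=times.get))  # -> cpu; the bound has moved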
In designing hardware+software systems, a purist objective is to ensure least resource underutilization occurs while ensuring the program's performance objective is met. Since resources cost differently, the under-utilization has to be weighed by its cost.
This resource optimization at a datacenter level takes on a different dimension – a common mistake I have seen is to provision $5000 servers (where the majority of the cost is CPU and memory) and skimp on the network bandwidth between them. Building a full Clos, non-blocking inter-server network at reasonable enterprise scale would cost less than $300 per 10GE port. I have seen people save $150 per port and deploy an oversubscribed network that results in nondeterministic network performance and costs much more in wasted utilization of those $5000 servers.
Another common mistake I have seen is to provision too much RAM (expensive in purchase cost as well as running cost - power/cooling) while not provisioning enough high-speed SSDs.
In 2020, I would expect most programs to be ultimately CPU bound (meaning the cost of the cpu is the dominant cost – you can upgrade other 3 resources more cheaply until cpu becomes the bottleneck). Hence cpu optimizations matter today more than it did in 2008.