Our current engineering target is 1 million writes/sec and > 10 million reads/sec, on a single box, on top of our fully transactional, MVCC database (writes do not block reads, and vice versa) that runs in the same process (a la SQLite). We've also merged it with our application code and our caching tier, so we're down to, literally, a single process for what would have been at least three separate tiers in a traditional setup.
As a result, we had to move to measuring request latency exclusively in microseconds. The architecture (without additional application-specific processing) supports a wire-to-wire messaging latency of 26 nanoseconds, or roughly 40 million requests per second. And that's written in Lua!
To put that in perspective, that kind of performance is about a third of what you'd need to handle Facebook's messaging load (on average, obviously; Facebook bursts well above the average at times...).
Point being, the OS is just plain out of date for solving heavy data-plane problems efficiently. The gap between what the OS can do and what the hardware is capable of delivering is a few orders of magnitude right now. It's downright ridiculous how much performance we're giving up for supposed "convenience" today.
The paper was "Network Stack Specialization for Performance", and by moving pretty much all of TCP out of the kernel and into userspace they were able to build a web server that outperformed nginx by about 3.5x. The point of this particular paper was that, as the title suggests, keeping these things generalized, as they must be in the kernel, comes at a fairly significant performance cost.
Now, obviously I don't want to have to write TCP for every network application I write, but it is interesting to think of such things as libraries instead of kernel code, so I can pick the TCP implementation that I know performs best for my particular application.
It says that (# requests) = (# requests / time) * time
There is nothing else "that might have conceivably affected the result", it's just algebraic cancellation.
Everything else that might have mattered is ruled out by stipulating a "stable system" and that all the quantities are averages.
I'm not saying it's not useful, but that's an awfully "gee whiz" tone to use to comment on r = (r/t) * t
This part is more worthy of an exclamation point, since it makes clear a practical impact:
> Because L is the minimum of all these limits, the OS scheduler suddenly dropped our capacity, L, from the high 100Ks-low millions, to well under 20,000!
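The "gee whiz" identity is just Little's Law, L = λW: (# requests in system) = (requests/second) × (seconds per request), with the time units cancelling exactly as the parent says. A quick sketch of the arithmetic (the function name and the specific rate/latency numbers are my own illustrations, not figures from the thread):

```python
# Little's Law: L = lambda * W
#   L      = average number of requests in the system
#   lambda = arrival rate (requests/second)
#   W      = average time each request spends in the system (seconds)
# Dimensionally: (requests/second) * seconds cancels to requests.

def requests_in_system(arrival_rate_per_sec: float, avg_time_in_system_sec: float) -> float:
    """Average number of in-flight requests in a stable system."""
    return arrival_rate_per_sec * avg_time_in_system_sec

# Hypothetical numbers: at 2M req/s with a 1 ms average time in system,
# about 2,000 requests are in flight; if scheduling overhead pushes the
# average time to 100 ms at the same rate, 200,000 are in flight.
print(requests_in_system(2_000_000, 0.001))  # 2000.0
print(requests_in_system(2_000_000, 0.100))  # 200000.0
```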
I think the deeper point is along the lines of Wigner's classic "The Unreasonable Effectiveness of Mathematics in the Natural Sciences".
It happens to be the case that elementary dimensional analysis is sufficient in the case at hand -- and it often is.
Caveat: impressive in terms of features; I don't have any information about the quality of the implementation.
The polyglot aspect, very impressive performance (as it's based on netty), and simple deployment model make it an interesting choice. It lets you put computationally intensive operations into worker verticles that run in their own thread pool, while non-blocking I/O operations run asynchronously on the event loop, Node.js-style.
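Vert.x's split is Java/netty-specific, but the general pattern, blocking work handed off to a dedicated worker pool while the event loop stays free for async I/O, shows up elsewhere too. A minimal sketch of the same idea using Python's asyncio (the function names and sizes here are my own, not Vert.x APIs):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for blocking/CPU-heavy work, analogous to worker verticles.
workers = ThreadPoolExecutor(max_workers=4)

def heavy_computation(n: int) -> int:
    # Stands in for a computationally intensive, blocking task.
    return sum(i * i for i in range(n))

async def handle_request(n: int) -> int:
    loop = asyncio.get_running_loop()
    # Offload the blocking call so the event loop keeps serving other I/O.
    return await loop.run_in_executor(workers, heavy_computation, n)

async def main():
    results = await asyncio.gather(*(handle_request(1000) for _ in range(3)))
    print(results)

asyncio.run(main())
```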
Fibers first remove the problem, and then let you choose the most appropriate programming style for your domain.
My proposal is to have a small exokernel between the hardware and the application. The exokernel is there to provide very simple access to the hardware (like the disk or network) and will rely on the application to do anything complicated (like handling TCP/IP).
So the increased latency of the OS scheduler has little to do with the number of layers between your application and the hardware, and a lot to do with the assumptions the OS can make about your code.
You most certainly want a general-purpose OS scheduler, it's just that many applications can benefit from user-level lightweight threads.
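A toy illustration of what "user-level lightweight threads" means: scheduling happens entirely in your process, with no kernel involvement per switch. Here generators stand in for fibers and each `yield` is a cooperative context switch (all names are mine; real fiber libraries are far more capable):

```python
from collections import deque

def run(tasks):
    """Round-robin user-level scheduler: each yield is a context switch."""
    queue = deque(tasks)
    order = []
    while queue:
        task = queue.popleft()
        try:
            order.append(next(task))   # run the task until it yields
            queue.append(task)         # reschedule it at the back
        except StopIteration:
            pass                       # task finished; drop it
    return order

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"

print(run([worker("a", 2), worker("b", 2)]))  # ['a:0', 'b:0', 'a:1', 'b:1']
```

Every switch here is an ordinary function call, which is why user-level scheduling can be orders of magnitude cheaper than going through the kernel scheduler.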
I hope I don't come off as a creeper, but I was very curious about what you are working on and dug through your comments. Is it this BareMetalOS project?
The answer is, of course, it depends on what your priorities are.
I've seen projects that write OSes from the ground up for a wide variety of reasons (security, correctness, optimization for a use case, etc.), but very rarely is it for price reasons.
How is it that you intend to do an OS startup? That's a pretty tough nut to crack.