Our current engineering target is 1 million writes/sec and over 10 million reads/sec on a single box. The architecture is built around a fully transactional MVCC database (writes do not block reads, and vice versa) that runs in the same process, a la SQLite. We've also merged in our application code and our caching tier, so we're down to literally a single process for what would have been at least three separate tiers in a traditional setup.
The result is that we now measure request latency exclusively in microseconds. The architecture (without additional application-specific processing) sustains a wire-to-wire message time of 26 nanoseconds, or approx. 40 million requests per second. And that's written in Lua!
To put that in perspective, that's about a third of the throughput you'd need to handle Facebook's messaging load (on average; Facebook obviously bursts well above the average at times).
Point being, the OS is just plain out-of-date when it comes to solving heavy data-plane problems efficiently. The gap between what the OS delivers and what the hardware is capable of is a few orders of magnitude right now. It's downright ridiculous how much performance we're giving up for supposed "convenience" today.
The paper was "Network Stack Specialization for Performance": by moving pretty much all of TCP out of the kernel and into userspace, they built a web server that outperformed nginx by about 3.5x. The point of the paper was that, as the title suggests, keeping these things as general-purpose as they must be in the kernel comes at a fairly significant performance cost.
Now, obviously I don't want to write a TCP stack for every network application I build, but it is interesting to think of these things as libraries instead of kernel subsystems, where I can pick the TCP implementation that I know performs best for my particular application.