
Little’s Law, Scalability and Fault Tolerance: The OS is your bottleneck - pron
http://blog.paralleluniverse.co/2014/02/04/littles-law/
======
callesgg
Yes, the OS will become the bottleneck if you design your application in a
way that makes the OS do a shitload of work.

~~~
erichocean
Or, you could just avoid the OS altogether:
[https://github.com/SnabbCo/snabbswitch](https://github.com/SnabbCo/snabbswitch)

Our current engineering target is 1 million writes/sec and > 10 million
reads/sec on top of an architecture similar to that, on a single box, to our
fully transactional MVCC database (writes do not block reads, and vice versa)
that runs in the same process (a la SQLite), which we've also merged with our
application code and our caching tier, so we're down to—literally—a single
process for what would have been at least three separate tiers in a
traditional setup.

The result is that we had to move to measuring request latency in microseconds
exclusively. The architecture (without additional application-specific
processing) supports a wire-to-wire messaging speed of 26 nanoseconds, or
approx. 40 million requests per second. And that's written in Lua!

To put that in perspective, that kind of performance is about 1/3 of what
you'd need to be able to do to handle Facebook's messaging load (on average,
obviously, Facebook bursts higher than the average at times...).

Point being, the OS is just plain out-of-date for solving heavy data-plane
problems efficiently. The gap between what the OS can do and what the
hardware is capable of delivering is a few orders of magnitude right now.
It's downright ridiculous how much performance we're giving up for supposed
"convenience" today.

~~~
arghnoname
I read a paper recently showing that, in some cases anyway, even having the
operating system do TCP for you can be slow.

The paper was "Network Stack Specialization for Performance"; by moving
pretty much all of TCP out of the kernel and into userspace, the authors
built a web server that outperformed nginx by about 3.5x. The point of this
particular paper was that, as the title suggests, keeping these things as
generalized as they must be in the kernel comes at a fairly significant
performance cost.

Now, obviously I don't want to have to write TCP for every network
application I write, but it is interesting to think of such things as
libraries instead of kernel facilities, so I can pick the TCP implementation
that I know performs best for my particular application.

[http://conferences.sigcomm.org/hotnets/2013/papers/hotnets-final43.pdf](http://conferences.sigcomm.org/hotnets/2013/papers/hotnets-final43.pdf)

------
DougMerritt
> What’s remarkable about this result is that it does not depend on the
> precise distribution of the requests, the order in which requests are
> processed or any other variable that might have conceivably affected the
> result.

It says that (# requests) = (# requests / time) * time

There is nothing else "that might have conceivably affected the result", it's
just algebraic cancellation.

Everything else that _might_ have mattered is ruled out by saying "stable
system" and saying that all variables are averages.

I'm not saying it's not useful, but that's an awfully "gee whiz" tone to use
to comment on r = (r/t) * t

This part is more worthy of an exclamation point, since it makes clear a
practical impact:

> Because L is the minimum of all these limits, the OS scheduler suddenly
> dropped our capacity, L, from the high 100Ks-low millions, to well under
> 20,000!
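
For concreteness, here's the arithmetic behind that kind of drop as a quick
sketch (the numbers below are illustrative, not taken from the article):

```python
# Little's Law: L = lam * W, where
#   L   = average number of requests in the system (concurrency)
#   lam = average throughput (requests/sec)
#   W   = average time a request spends in the system (latency)

def requests_in_flight(throughput_rps, latency_us):
    """How many requests must be in flight to sustain a given
    throughput at a given average latency (Little's Law)."""
    return throughput_rps * latency_us // 1_000_000

# At 1M req/s with 100 us average latency, only ~100 requests
# are in the system at any moment:
print(requests_in_flight(1_000_000, 100))     # 100

# If scheduling overhead inflates latency to 10 ms (10,000 us),
# sustaining 1M req/s needs 10,000 requests in flight -- so any
# cap on L becomes a cap on throughput instead:
print(requests_in_flight(1_000_000, 10_000))  # 10000
```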

~~~
pron
Well, it's not quite as simple as that. We're talking about a stochastic
process here, and while a "hand-wavy" proof may satisfy some, there's a Little
;) more to it:
[http://www.ece.virginia.edu/~mv/edu/715/lectures/littles-law/littles-law.pdf](http://www.ece.virginia.edu/~mv/edu/715/lectures/littles-law/littles-law.pdf)

~~~
DougMerritt
Fair enough, although even that terse lecture hand-waved (by leaving it to a
reference) about ergodic systems.

I think the deeper point is along the lines of Wigner's classic "The
Unreasonable Effectiveness of Mathematics in the Natural Sciences"

[http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_...](http://en.wikipedia.org/wiki/The_Unreasonable_Effectiveness_of_Mathematics_in_the_Natural_Sciences)

It happens to be the case that elementary dimensional analysis is sufficient
in the case at hand -- and it often is.

------
twic
Parallel Universe are churning out some seriously impressive software [0]. How
many programmer-hours have gone into Quasar and its supporting cast so far?

[0] Caveat: impressive in terms of features; I don't have any information
about the quality of the implementation

~~~
pron
Quasar developer here: I'd guess about 1 man-year on Quasar+Pulsar, a similar
amount on Galaxy, and much more than that on SpaceBase. Comsat was easy once
Quasar was in place.

------
RyanZAG
There's also another option for doing async on the JVM:
[http://vertx.io/](http://vertx.io/)

The polyglot aspect, very impressive performance (it's based on Netty), and
simple deployment model make it an interesting choice. It lets you put
computationally intensive operations into worker verticles that run in their
own thread pool, while I/O-bound operations run async on the event loop,
Node.js-style.

~~~
pron
Yep, it's just that I find that these solutions (Node.js and Vertx.io) make
you adopt a functional/callback-based style not because it's appropriate from
a design perspective, but rather to work around OS limitations.

Fibers first remove the problem, and then let you choose the most appropriate
programming style for your domain.
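
Quasar's fibers live on the JVM, but the stylistic difference is easy to
sketch in Python, with coroutines standing in for fibers (all names below
are illustrative, not from Quasar or Vert.x):

```python
import asyncio

# Callback style: control flow is inverted, and every step must
# explicitly name the next one (the "pyramid of doom").
def handle_request_cb(fetch_user, fetch_orders, respond):
    def on_user(user):
        def on_orders(orders):
            respond({"user": user, "orders": orders})
        fetch_orders(user, on_orders)
    fetch_user(on_user)

# Fiber-like style: the code reads as plain sequential blocking code,
# but the runtime suspends a lightweight task, not an OS thread.
async def handle_request(fetch_user, fetch_orders):
    user = await fetch_user()
    orders = await fetch_orders(user)
    return {"user": user, "orders": orders}

# Stub "I/O" for demonstration:
async def fetch_user():
    await asyncio.sleep(0)  # stand-in for a network call
    return "alice"

async def fetch_orders(user):
    await asyncio.sleep(0)
    return [1, 2]

result = asyncio.run(handle_request(fetch_user, fetch_orders))
print(result)  # {'user': 'alice', 'orders': [1, 2]}
```

The logic is identical in both versions; only the second lets the domain,
rather than the scheduler, dictate the code's shape.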

------
iseyler
Is it time to ditch the "normal" OS? My startup is looking to do exactly
that, but I would like to know what others' thoughts are on the matter.

My proposal is to have a small exokernel between the hardware and the
application. The exokernel is there to provide very simple access to the
hardware (like the disk or network) and will rely on the application to do
anything complicated (like handling TCP/IP).

~~~
pron
"Plain" OS threads certainly have their place and the OS does a fine job
scheduling them. It's just that the more information you have about the
threads' behavior the better they can be scheduled (you can reduce latencies
by keeping related fibers on the same core to share cache; that's one of the
things Quasar does).

So the increased latency of the OS scheduler has little to do with the
number of layers between your application and the hardware, and a lot to do
with the assumptions the OS can make about your code.

You most certainly want a general-purpose OS scheduler, it's just that many
applications can benefit from user-level lightweight threads.

