
Building A Node.js Server That Won't Melt - lloydhilaiel
https://hacks.mozilla.org/2013/01/building-a-node-js-server-that-wont-melt-a-node-js-holiday-season-part-5/
======
spartango
This is definitely neat; it's a simple, clever way to fail somewhat-
gracefully.

Interestingly, this is one piece of a broader strategy for building scalable
web applications, addressed in Matt Welsh's thesis (SEDA):

<http://www.eecs.harvard.edu/~mdw/proj/seda/>

If you haven't taken a look at the ideas in SEDA, it's definitely worth the
time. Most modern web apps incorporate at least some pieces of it to great
effect.

~~~
kenshiro_o
That's a very good technique! I did not even know about event loop lag
until now... It feels good to get a daily education from Hacker News!

------
mnarayan01

      // check if we're toobusy() - note, this call is extremely fast, and returns
      // state that is calculated asynchronously.  
      if (toobusy()) res.send(503, "I'm busy right now, sorry.");
    

Saying it is calculated "asynchronously" seems pretty confusing in this
context. Maybe "is cached at a fixed interval"?
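
For what it's worth, a rough JS sketch of what the module appears to do (the
real implementation is C++, and the interval/threshold numbers here are made
up, not its actual defaults):

    // A timer is scheduled every INTERVAL ms; any extra delay before it
    // actually fires is event loop lag. The flag is recomputed on that
    // schedule, so the toobusy() call itself just reads a cached value.
    var INTERVAL = 500;   // sampling interval, ms (illustrative)
    var THRESHOLD = 70;   // smoothed lag (ms) above which we're "too busy"
    var currentLag = 0;
    var lastSample = Date.now();

    setInterval(function () {
      var now = Date.now();
      var lag = Math.max(0, now - lastSample - INTERVAL);
      // exponentially smooth so a single spike doesn't flip the flag
      currentLag = 0.7 * currentLag + 0.3 * lag;
      lastSample = now;
    }, INTERVAL);

    function toobusy() {
      return currentLag > THRESHOLD;
    }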

~~~
firefoxman1
Side note: caching at a fixed interval is a great way to save CPU cycles. If
you're serving 100 req/s and only need the current time to within 1-second
accuracy, you can make 1/100th as many Date calls by grabbing the time once
each second instead of on each request.
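
A tiny illustrative sketch of that (the handler name is made up):

    // One Date call per second instead of one per request.
    var cachedNow = Date.now();
    setInterval(function () { cachedNow = Date.now(); }, 1000);

    // hypothetical handler: at 100 req/s this reads the cached value
    // instead of calling Date.now() on every request
    function handleRequest(req, res) {
      res.end("served at ~" + new Date(cachedNow).toISOString());
    }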

------
jeffasinger
Why would you do this, instead of putting your node server behind a more
battle-tested proxy, like Varnish or HAProxy, and limiting the number of
simultaneous connections?

~~~
lloydhilaiel
Our problem is that the number of simultaneous connections we can support
isn't static; it varies depending on the _type_ of traffic burst (i.e. new
vs. returning users).

But I agree that a higher-level proxy is useful - that way your overloaded
application doesn't even need to deal with the traffic (we hope to use
toobusy to have applications instruct the routing layer to temporarily
block / reroute traffic in times of load).

~~~
jeffasinger
The reason I asked is that most applications end up needing something at
that layer eventually anyway. But not knowing exactly what level of traffic
results in too much load is a good reason to do this.

~~~
davidw
It seems to me that letting the computer calculate when it's in trouble is
preferable to having a human guess.

------
cpeterso
If you are writing server software, I recommend Michael Nygard's book
_"Release It!: Design and Deploy Production-Ready Software"_. Measuring event
loop lag sounds like Nygard's "Circuit Breaker" pattern to avoid cascading
failures.

The book's examples and text are all Java, but the lessons are applicable
anywhere. He offers many scalability patterns (resource pools) and anti-
patterns (runaway log files) with interesting stories from his experience
debugging real systems. I especially liked his story about debugging a crash
in an Oracle DB driver that caused unexpected Java exceptions to be thrown
from java.sql.Statement.close(), which quickly blocked a DB connection pool.

<http://michaelnygard.com/>
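
For anyone who hasn't read it, a minimal callback-style sketch of the
Circuit Breaker idea (all names here are illustrative; the book's own
examples are Java):

    // After maxFailures consecutive errors the circuit "opens": calls fail
    // fast for resetMs instead of piling up on a struggling backend. Once
    // the cool-off expires, the next call is let through as a probe.
    function circuitBreaker(fn, maxFailures, resetMs) {
      var failures = 0;
      var openedAt = 0;
      return function (cb) {
        if (failures >= maxFailures && Date.now() - openedAt < resetMs) {
          return cb(new Error("circuit open - failing fast"));
        }
        fn(function (err, result) {
          if (err) {
            failures++;
            openedAt = Date.now();
          } else {
            failures = 0; // a success closes the circuit again
          }
          cb(err, result);
        });
      };
    }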

------
jconley
This is great to see! IMO it's something that should be configurable and built
into http in node.js.

IIS [1] and other mature web servers like nginx [2] or apache [3] do this kind
of thing for you and provide simple configuration on it.

[1] <http://support.microsoft.com/kb/943891>
[2] <http://serverfault.com/questions/412323/nginx-503-error-in-high-traffic>
[3] <http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxrequestsperchild>

~~~
toast0
Apache's MaxRequestsPerChild directive is not at all about limiting load; it
says after serving N requests (or N connections in a keep-alive setting), the
worker kills itself and another may be spawned in its place (subject to spare
server/thread config). This mostly helps keep slow memory (or other resource)
leaks in check by starting fresh every so often.

Did you mean to link to MaxClients[1]? MaxClients sets the maximum number of
simultaneous connections; any additional connections are queued by the OS
socket API (subject to the listen backlog, etc.).

I think waiting to accept sockets you can't handle is a better solution than
either accepting a socket just to return an error message, or (much worse)
accepting a socket that overloads your system. Unfortunately, it can be hard
to set MaxClients to a value that isn't so big that you get reduced
throughput, or so small that you don't use all your resources. (One thing
that does help you get to the right number is to set MinSpareServers to the
same value as MaxClients; that way you avoid the case where MaxClients is
too big and you start swapping during high load, but you don't notice it
because things are fine with a small number of servers.)

Sorry, that was way too much information.

[1] <http://httpd.apache.org/docs/2.2/mod/mpm_common.html#maxclients>
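
Concretely, that MinSpareServers trick amounts to something like this in
httpd.conf (numbers purely illustrative):

    # prefork MPM: pin MinSpareServers to MaxClients so the full worker
    # count is resident up front, and a MaxClients value that's too big
    # shows up immediately instead of only during a traffic spike
    <IfModule mpm_prefork_module>
        MaxClients       150
        MinSpareServers  150
    </IfModule>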

------
chetanahuja
This is a neat little technique, and the fact that the developers did this
puts node.js higher in my stack of "technologies to consider".

A harder problem is solving the same "melt-out" problem for the entire gamut
of servers usually found in large serving stacks (various web servers,
relational and nosql db servers, etc.). The users are usually not the
developers of these pieces of technology, and there's never enough dev
bandwidth available to actually implement self-tuning techniques like the
one mentioned here.

For such situations I once came up with a "little" trick to limit the damage
in sudden overload situations. It allows the admins of said servers to tune
their threadpool and request queue lengths in an intelligent way (with the
help of historical performance data that most shops _should_ have). I
mentioned it in this discussion thread:
<http://www.linkedin.com/groups/Whats-generally-used-methodology-threadpool-4145762.S.79451272>

------
j45
Forgive my ignorance, but I thought the whole point of Node.js was not to
meltdown.

Are there some recommended resources where I could learn more about
out-of-the-box scaling vs. the tactics and strategies that benefit growth? I
love learning how this problem is tackled anywhere. :)

~~~
nakkiel
There's only so much load any piece of software running on a given, finite
set of resources can handle properly... Once limits are reached, things will
misbehave.

~~~
j45
I agree completely, have lots of experience in complex hosting.

All software having its limits (hardware being relative), there's a point at
which you need to start tweaking, and a direction to tweak in. I was hoping
to read more about that.

------
scottmey
I've been looking for a solution to this; every time I used 'nodemon' it
couldn't handle the load, even when it was quite insignificant... Perhaps I
had misconfigured something, but I'm pleased to find this. Good timing!

------
ttty
Instead of returning a 503 page, it could attach a ticket (via cookies) and
say: "Your request will be served in 55 seconds, be patient". The time
estimate would be based on statistics from the last 10 minutes or hour...
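
The estimate part might look something like this (entirely hypothetical,
assuming an express-style `app` and a toobusy()-style check; the
cookie/ticket bookkeeping is left out):

    // assumes: `app` is an express app, `toobusy()` is the module's check
    var recentTimes = [];   // ms per recently completed request
    var pending = 0;        // requests currently in flight

    function estimatedWaitSec() {
      if (recentTimes.length === 0) return 60; // no data yet: guess a minute
      var sum = recentTimes.reduce(function (a, b) { return a + b; }, 0);
      return Math.ceil((pending * (sum / recentTimes.length)) / 1000);
    }

    app.use(function (req, res, next) {
      if (toobusy()) {
        var wait = estimatedWaitSec();
        res.setHeader("Retry-After", String(wait));
        return res.send(503, "Your request should go through in about " +
                             wait + " seconds; please be patient.");
      }
      pending++;
      var start = Date.now();
      res.on("finish", function () {
        pending--;
        recentTimes.push(Date.now() - start);
        if (recentTimes.length > 100) recentTimes.shift(); // bound the window
      });
      next();
    });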

~~~
Jare
The problem is not the lag before resolving the request; it's the lag before
the user finds out that the request will not be served. Underlying this is
the fact that the server simply does not have enough capacity to serve all
requests, so some must fail.

The trick of using event loop lag is pretty neat, but the general strategy
is IMHO a must-have for any service. By aborting early and somewhat
gracefully, the user knows ASAP and receives a controlled message suggesting
she should not retry immediately, and the server avoids doing even part of
the work for a request that is not going to be completed.

------
dtwwtd
This is pretty cool. It seems like 10 seconds is still a pretty long time to
wait for a response when connections over the max limit are being dropped. Is
there a reason for this or some way to tune it?

------
jcampbell1
Using event loop lag is pretty clever. I think I first learned about event
loop lag the hard way, trying to do DOM animations that relied on setTimeout
intervals without checking the current time.
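
The classic fix is to drive the animation off elapsed wall-clock time rather
than counting ticks, something like this (illustrative; assumes a #box
element exists in the page):

    // Measure real elapsed time instead of assuming each setTimeout tick
    // arrives exactly on schedule (under event loop lag, it won't).
    var start = Date.now();
    var DURATION = 1000; // ms

    (function tick() {
      var progress = Math.min((Date.now() - start) / DURATION, 1);
      document.getElementById("box").style.left = (progress * 300) + "px";
      if (progress < 1) setTimeout(tick, 16);
    })();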

------
politician
It'd be interesting to experiment with pairing this with a proof-of-work
system (e.g. Hashcash) to smooth the transition between toobusy and
!toobusy. With an additional signalling mechanic, it could help spread the
surge across peers in a cluster when one server begins to struggle.

Though you'd need a custom client, so maybe for appcache'd webapps or mobile
apps, but not typical webapps. Or browser vendor participation (not serious
(ok, half serious)).

------
EGreg
And what happens when there are too many requests to handle even this way?

------
donebizkit
Great tip! Can you explain how you figured out the 200 request/sec limit?

~~~
lloydhilaiel
That was specifically for the example server - if each request uses 5ms of
processor time, (1000 ms / s) / (5 ms / req) == (200 req / s)

<https://gist.github.com/4532177#file-application_server-js-L3-L20>

------
n0cturne
But will it blend?

~~~
lloydhilaiel
like, make you a smoothie? Or do this:
<https://github.com/lloyd/node-toobusy/blob/master/toobusy.cc#L26-L27>

~~~
creativename
A joke, I believe: melt vs. blend

------
tbatterii
I thought node.js was fast-as-hell bad ass rock star tech that would never
melt.

<http://www.youtube.com/watch?v=bzkRVzciAZg>

~~~
taf2
Well, it's still pretty bad ass and rock star that you can measure server
load by looking at "event loop lag" - that's actually really neat... how do
you do that with threads?

~~~
tbatterii
"how do you do that with threads?"

how should I know? :)

