
Scale Fail (part 1) - ableal
http://lwn.net/Articles/441790/
======
wccrawford
The bit about metrics really rang true.

At one company, we had a situation where two of our three webservers were out
of service for various reasons. I knew from past experience that even a single
web server should have a load of less than 1 with all our traffic; the others
were just there for redundancy, in case one went down. (Like they did!) In the
middle of that, the one working server suddenly started spiking to a load of 40
and misbehaving.

There was a really smart new guy there who was usually right. But this time,
he said there was no way one server could handle our traffic and that that was
the problem. He refused to listen to reason, and eventually I gave up and
started working on the problem alone. I eventually found the code that was
going crazy and fixed it... And voila. 0.4 load.

The point is that people get things in their head that 'have' to be true, but
they aren't necessarily. He assumed that the existence of 3 webservers meant
we had that much load, but it was just for redundancy. If I had had metrics to
prove my case, it would have gone a lot smoother... But instead I just had my
experience.

~~~
JonLim
I think having numbers to prove your case is generally a good way to approach
things. If you can get objective numbers that really show and prove what you
are talking about, you can swing most people your way.

If not, they are probably not the best of people to work with.

~~~
wccrawford
I want to reiterate that the guy was really good at his job. One of the best
people I've ever worked with. He was generally pretty easy to get along with.
But my lack of hard data meant I couldn't sway him on that problem, no matter
how right I was.

That company had a history of not having metrics to work with. Every new
sysadmin or DB admin came in and asked for metrics, and we'd shrug and laugh
nervously. (I mean, there was a reason why we hired so many new ones!)

~~~
JonLim
Oh no, I heard you loud and clear. I was just reinforcing your anecdote by
saying that it's applicable to a lot more than just scaling apps. :)

------
asymptotic
This is a phenomenal article. After wading through bullshit "X HTTP server
beats Y HTTP server at serving 40 bytes of static content!!!" articles this is
a breath of fresh air. This is clearly written by someone working in the real
world.

~~~
krakensden
LWN is pretty great, I'm quite happy I bought a subscription. Very few other
publications will have long articles with diagrams explaining a kernel memory
corruption bug a day after it happens.

~~~
michaelchisari
Honestly, I'd consider getting a subscription just for this:

 _"When I was at Amazon, we used a squid reverse proxy ..."_

 _"Dan, you were an ad sales manager at Amazon."_

------
johngalt
It continues to surprise me how much IT/sysadmin work is done by heuristic
rather than actual measurements.

Applications slow? -> Add RAM

Database is slow? -> must be network/load

Service failures? -> reboot

It becomes very problematic in large IT organizations. Teams will play hot
potato with issues, and everyone has an excuse. Desktop support will blame the
DB team, the DB team will blame the server team, and then they all blame the
network. All the while, no one is actually measuring anything.

~~~
Meai
Surprisingly, almost no one cares about performance. When was the last time
you saw a web framework or a database that routinely runs performance
benchmarks on each iteration? In fact, I don't know of any. I was very
impressed with the continuous measurements that PyPy does.

~~~
shimon
<http://django-speedcenter.djangozoom.net/>

~~~
hartror
The site is great, but ironically a bit slow.

------
armored
Comments > the article:

 _So to avoid these ends, let's avoid these beginnings: avoid multi-threading.
Use single-threaded programs, which are easier to design, write, and debug
than their shared-memory counterparts. Instead, use multiple processes to
extract concurrency from the hardware. Choose a communication medium that
works just as well on a single machine as it does between machines, and make
sure the individual processes comprising the system are blind to the
difference. This way, deployment becomes flexible and scaling becomes simpler.
Because communication between processes by necessity has to be explicitly
designed and specified, modularity almost happens by itself._

~~~
jshen
"Use single-threaded programs, which are easier to design, write, and debug
than their shared-memory counterparts."

Not really. Well, only true in simple cases. Let's say I have a 16-core box
and I want to crunch some data using all the cores. What's easier: Clojure
with a single shared-memory data structure that collects the stats, or a
multi-process system where we not only have to manage multiple application
processes, but also have to use something like Redis to hold the shared state?

~~~
minimax
Here is a link to the commenter's full comment:
<http://lwn.net/Articles/441867/>

He addresses precisely your 16-core case. Now think about what happens when
your dataset grows and you need more cores than you can reasonably fit in a
single machine.

~~~
jshen
It's not when, it's if, and often you know it won't happen. Being a good
engineer requires understanding when to use which model. Acting as if there
aren't a lot of cases where single-process, shared-memory concurrency is the
best choice is wrong.

------
CountHackulus
What I do appreciate about this article is that it doesn't just give
anecdotes, but actually quantifies and explains the problems. In fact it even
explains why not to use anecdotes!

Seems like most of the tips here could be applied to all sorts of problems,
not just scalability, and I think that's probably what the author was going
for: to show that scalability isn't some special problem that needs special
solutions, but that if you think hard about it, and use data to back up your
findings, it's just like any other problem you want to solve.

Can't wait for part 2.

~~~
valyala
Part 2 is already available - see <http://lwn.net/Articles/443775/>

------
gregburek
How does everyone here gather and analyze their metrics? What do you have
always deployed and what do you use when shit hits the fan?

[Edit for typo]

~~~
seiji
<rant> The "standard" ways are all very outdated, ugly, unscalable, and brain
dead in implementation. nagios, cacti, munin, ganglia, ... -- all crap.
</rant>

People end up writing their own [1]. They rarely open source their custom
monitoring infrastructure. Sometimes a private monitoring system does get open
sourced, but then you see it has complex dependencies. The complexity of
monitoring blocks wide-scale deployment, so people stick with 15-year-old,
simple, dumb solutions.

I'm working on making a new distributed monitoring/alerting/trending/stats
framework/service, but it's slow going. One weekend per month of free time
doesn't exactly yield the mindset to get into hardcore systems hacking flows
[2].

[1]: [http://www.slideshare.net/kevinweil/rainbird-realtime-
analyt...](http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-
twitter-strata-2011)

[2]: Will develop next-gen monitoring for food.

~~~
gregburek
I'm getting the feeling that with all the unique server setups in use,
monitoring and metrics systems are going to be just as unique and specific.

There are some interesting process monitoring projects out there like god,
monit and bluepill, as well as EC2/cloud-specific stuff from Ylastic,
RightScale and Librato Silverline. Have you ever used any of those tools?

Fitting all these together for my setup is trial and error, but it really does
force me to think hard about my tools and assumptions even before I get hard
data.

~~~
josephruscio
I hack on the aforementioned Silverline at <http://librato.com>, and we
provide system-level monitoring at the process/application granularity as a
service. (We also have a bunch of features around active workload management
controls, but that's out of scope here.) It actually works on any server
running one of the supported versions of Linux, not just EC2. The benefits of
going with a service-based offering are the same as in any other vertical: you
don't need to install and manage your own software/hardware for monitoring.

Here's an example of the visualizations we provide into what's going on in
your server instances: [http://support.silverline.librato.com/kb/monitoring-
tags/app...](http://support.silverline.librato.com/kb/monitoring-
tags/application-monitoring-versus-server-monitoring)

------
chaostheory
> Single-threading is the enemy of scalability.

Multithreaded programming is hard. Everyone already knows about Node.js, but
if you're on the JVM I suggest you check out Akka (<http://akka.io>) - it
makes concurrency much easier.

~~~
IgorPartola
Is it hard? Always? If it is so hard, then why is it still around? How would
you go about parallelizing even mildly CPU-heavy workloads on node.js?

Node.js and the like are awesome, but only if you have no blocking calls and
your CPU usage is tiny. Otherwise you get the performance of a single-threaded
server. You know, because that's what it is.
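
The point about blocking calls is easy to demonstrate. Here is a hypothetical
sketch using Python's asyncio as a stand-in for Node's event loop (the
mechanics are the same: one thread runs every callback, so one synchronous
call stalls all of them):

```python
import asyncio
import time

async def quick_task():
    # Wants to finish ~10 ms after it starts, like a cheap I/O callback.
    await asyncio.sleep(0.01)

async def main():
    start = time.monotonic()
    task = asyncio.ensure_future(quick_task())
    # A synchronous call on the event-loop thread: nothing else can run
    # until it returns. Stands in for heavy computation or blocking I/O.
    time.sleep(0.1)
    await task
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"10 ms task finished after {elapsed * 1000:.0f} ms")
```

The same thing happens in Node.js: a tight loop or synchronous call delays
every queued callback, so a single CPU-heavy request degrades the whole
server.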

~~~
chaostheory
>Is it hard?

Yes, multithreaded programming is difficult (for most people - like me). Hence
the popularity of stuff like node.js, and also why people highlight Erlang's
Actor model.

> How would you go about parallelizing even mildly CPU heavy work loads on
> node.js?

I'm not sure I understand your question, since (I believe) node.js has an
event-based concurrency model and not a thread-based one (at least for its
users).

------
selectnull
Part 2 has been released, for those who have a subscription.
<https://lwn.net/Articles/443775/>

------
hartror
The talk is very amusing, especially the use of Jason Fried in one of the
slides.

<http://www.youtube.com/watch?v=nPG4sK_glls>

------
chopsueyar
Well written article and funny, too. Nice job.

