

Don’t scale: 99.999% uptime is for Wal-Mart - cfontes
http://37signals.com/svn/archives2/dont_scale_99999_uptime_is_for_walmart.php

======
gyardley
I wish I could delete the first two words from this article's title. Uptime's
related to scalability, but it's not the same thing - scalability encompasses
more than that.

In my experience, you can get away with lower uptime but your users will
crucify you for poor performance. Site down for a half hour here and there?
Fine. Site responds slowly and your clients' data takes a while to update?
Big, big problem.

~~~
ZoFreX
Strongly agreed, I'm not sure why the post conflated these two very different
issues. Scaling has almost nothing to do with five 9s uptime, with the
exception that if you can't scale, once you reach a certain number of users
your uptime will be closer to 0% than 100%.

------
patio11
Relatedly, one piece of advice that 37Signals had in their Getting Real book
that really helped me is that you can delay building many systems past launch.

For example, BCC has a substantial amount of functionality in the back-end
interface so that I can handle common support tasks. AR has virtually nothing
-- a single page which lists customer email addresses, trial statuses, and
upcoming subscription renewal dates. I could have spent 2 weeks on building
out a decent amount of functionality for CS and more advanced statistical
navel gazing, but a) I might not pick the right stuff and b) it would mean
that the release of the next feature that actually sells software would be on
3/15 instead of 3/1.

BCC has organically grown its backend over the years, as I get so frustrated
with fixing the same issue manually that I make a one-button way to do it.

~~~
jasonkester
Well said. I launched S3stat as a paid service without any way of processing
credit cards. Since I offered a 30 Day Free Trial, that gave me at least 3
weeks in which to build it (and some good incentive to do so).

FairTutor will probably go live with no way to review teachers. Same reason.

------
PedroCandeias
The people at my office who actually use Highrise and have to deal with 37s'
frequent bouts of downtime would beg to differ.

~~~
jshen
98% or 99% uptime isn't good enough for the people you know? Or are you
suggesting that their uptime is significantly lower than that?

~~~
true_religion
1% of a year is 3.65242199 days.

So on the extreme end 37signals could be down for over 3 days in a single bout
and still have 99% uptime. 3 days downtime is serious.

However, this case is unlikely.

Also unlikely is a scenario where they are down for the same time each day---
14.4 minutes @ 99% uptime.

It's most likely that they're down for a few minutes here and there across the
day during working hours, which makes for an incredibly frustrating
experience. You can never just click a button and _know_ your form will be
submitted.

~~~
jshen
Yes, but there are some more nuances. I'll bet a fair amount of downtime is
scheduled, and would be scheduled for low usage periods. Second, most services
I've ever used with an SLA for n nines haven't actually lived up to that SLA.
Instead they offer you a prorated charge for that month or something similar.

What people promise in an SLA and what they deliver are very different things.
I'd be very interested in some stats about historic uptimes for similar sites.

Also, how much more are you willing to pay for an extra 9 of uptime?

------
xd
Sounds like they are trying to fluff their reliability reputation.

I develop a web application that is used by schools and just can't entertain
the notion of anything other than 100% uptime. I take the reliability of my
product very very seriously. If one of my customers had a fire at their school
and couldn't access our system for registers - that would be us and them up
the proverbial creek without a paddle.

I've built up a company (over 7 years now) with a very good reputation for
reliability and uptime. Don't assume that just because something is web based
it doesn't require 100% uptime.

~~~
rudiger
Are you sure your web application "just can't entertain the notion of anything
other than 100% uptime"? That sounds like a vacuous promise to me. Even
telecoms switches are designed with something like 99.9999999% (9 nines) of
availability; that's ~30 milliseconds of downtime a year.
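
As a rough check of that figure, here is a minimal sketch that converts a count of nines into allowed downtime per year. It assumes a 365-day year, and the function name `downtime_seconds_per_year` is made up for illustration.

```python
from fractions import Fraction

# Assumption: a plain 365-day year (no leap day).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> Fraction:
    """Allowed downtime per year for an availability with `nines` nines.

    nines=2 means 99%, nines=3 means 99.9%, and so on.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # Unavailability is 10^-nines, kept exact with Fraction.
    unavailability = Fraction(1, 10 ** nines)
    return unavailability * SECONDS_PER_YEAR


for n in (2, 3, 5, 9):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {float(seconds):.4g} seconds of downtime per year")
```

Under these assumptions nine nines works out to about 31.5 milliseconds a year, which is consistent with the ~30 ms quoted above.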

I'm not criticizing you or your product, but if reliability is critical, it's
made explicit with a realistic SLA.

~~~
xd
I didn't say we provide 100% uptime. We just don't entertain the idea that we
may have to settle for less. Put simply, we proactively monitor our servers and
have enough redundancy in place to keep problems under OUR control.

We don't pretend, promise or have an SLA that offers 100% uptime.

------
eftpotrm
98% uptime is down roughly:

* 1 minute every hour, or

* 3h20 every week, or

* 1 week every year
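
The list above can be reproduced with a short sketch. It assumes a 365-day year, and the names `PERIODS` and `downtime_budget` are made up for illustration.

```python
from datetime import timedelta

# Periods over which to express the downtime budget.
PERIODS = {
    "hour": timedelta(hours=1),
    "week": timedelta(weeks=1),
    "year": timedelta(days=365),  # assumption: 365-day year
}


def downtime_budget(uptime_percent: float) -> dict:
    """Map each period name to the downtime allowed at the given uptime."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    fraction_down = (100.0 - uptime_percent) / 100.0
    # timedelta supports multiplication by a float.
    return {name: period * fraction_down for name, period in PERIODS.items()}


for name, budget in downtime_budget(98.0).items():
    print(f"98% uptime allows {budget} of downtime per {name}")
```

At 98% this gives 72 seconds per hour, a little over 3h20 per week, and about 7.3 days per year, so the rounded figures in the list hold up.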

I know about hyperbole for making a point, but does Basecamp _really_ total
anything like a week's downtime per year? If so, why? I'm pretty sure I've
never had anything like that bad a number and equally that I wouldn't be happy
using a service that did.

The general thrust is right: high reliability is expensive and you need to
look at cost/benefit not chest-beating. But let's be honest about what we're
actually aiming at.

~~~
rudiger
This article is from 2005. It's quite likely that Basecamp's downtime _was_
around a week per year. At that time, Rails was expected to have ~400 restarts
a day[1] ;)

[1] http://www.loudthinking.com/posts/31-myth-2-rails-is-expected-to-crash-400-timesday

------
joetek
As others have pointed out, this post is from 2005. In the past 6 years, the
cost of scalability has dropped sharply, and shooting for three 9's should be
the minimum for most sites. It doesn't cost thousands to go from 98% to 99%
any more, and to 99.9% is still pretty cheap.

Sure, five and six 9's do get expensive, and that will depend on your cost
of downtime (i.e., lost sales, etc.).

------
martin_kirch
Indeed this article doesn't talk about scaling but about uptime... But
although the topic is still open for discussion in 2011, I don't think an
article written in 2005 should be posted in Hacker _News_.

------
cullenking
This doesn't take into account startups who have SLAs because they have a B2B
product. We have both B2C and B2B customers and as a result we can't be down
for our business customers or we have to credit them. Honestly, 99.9% uptime
is not hard to manage. Pick the right colo facility with a history of good
uptime. Have more than 1 machine, and have them on redundant power supplies
(on separate PDUs). Voila, unless you screw up deploys, you have 99.9%
uptime. This doesn't take a huge amount of money.
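
To see why a second machine helps so much, here is a minimal sketch of the textbook parallel-availability formula. It is not a model of any setup described in this thread: it assumes machine failures are independent and that any single machine can carry the load, and the 0.99 per-machine figure is a hypothetical value chosen only for illustration.

```python
def parallel_availability(single: float, machines: int) -> float:
    """Availability of `machines` redundant machines, any one of which can serve.

    Assumes failures are independent; shared power, shared network, or a bad
    deploy hitting every machine at once would break that assumption.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if machines < 1:
        raise ValueError("machines must be at least 1")
    # The service is down only when every machine is down at the same time.
    return 1.0 - (1.0 - single) ** machines


# 0.99 per machine is a made-up figure, not a number from the thread.
for count in (1, 2, 3):
    print(f"{count} machine(s): {parallel_availability(0.99, count):.6f}")
```

The independence assumption is the weak point, which is why separate PDUs and careful deploys matter: they remove failure modes that would take all machines down together.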

------
chopsueyar
Alistair Cockburn wrote an awesome book for small teams based on the "Crystal
Clear Method". It has some great info.

[http://www.amazon.com/exec/obidos/ASIN/0201699478/ref=ase_al...](http://www.amazon.com/exec/obidos/ASIN/0201699478/ref=ase_alistaircockburn)

------
wensing
_The criticality of your average “Web 2.0” application is one with loss of
comfort as the result of something going wrong._

Which is also why your average Web 2.0 application can't charge very much:
without it, your comfort level slightly drops. No big deal.

------
nopal
It's amazing how much the economics of uptime have changed since this article
was written because of services like AWS.

While it may not be technically or fiscally trivial yet, it's far easier and
cheaper than it's ever been, and far more so than in 2005.

------
leif
> To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9%
> tens of thousands more.

Somewhere in there is the pickle problem:

You have 1000kg of pickles in your basement. Now, pickles are mostly water. In
fact, your pickles are 99% water, the rest is cellulose. Cellulose has
negligible mass, so we can say that all the mass comes from the water. You
leave your pickles in the basement for a year, and when you come back, they've
dried out a certain amount, so they're now 98% water. What's their new mass?
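
For anyone who wants to check their answer, here is a minimal sketch of the arithmetic. It assumes the non-water 1% does have mass and stays constant while only water evaporates; the function name `mass_after_drying` is made up for illustration.

```python
from fractions import Fraction


def mass_after_drying(initial_mass_kg, water_before, water_after):
    """New total mass when only water evaporates and the solids stay constant."""
    # Solids are whatever is not water at the start.
    solids_kg = initial_mass_kg * (1 - water_before)
    # Afterwards the same solids make up (1 - water_after) of the total.
    return solids_kg / (1 - water_after)


new_mass = mass_after_drying(1000, Fraction(99, 100), Fraction(98, 100))
print(f"new mass: {float(new_mass)} kg")
```

The point of the analogy is that a one-point change in the large percentage doubles or halves the small remainder, just as going from 98% to 99% uptime means cutting downtime in half.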

------
nir
Off topic, a sentence that really stood out for me:

 _"Now what if Delicious, Feedster, or Technorati goes down for 30 minutes?"_

The article was written just six years ago, and these were the examples of
popular sites that came to the author's mind. Gives you some idea on how
transient this field is.

------
nhangen
WoW goes down for nearly a full day every single Tuesday, and they seem to do
OK.

I agree with the post, but I don't like that it encourages settling for less
than the best.

------
code_duck
The encouragement of frugality and pragmatism relating to spending to upgrade
business systems is at odds with the author's publicized, flamboyant spending
on frivolous ultra-luxuries. Apparently, purchasing $900,000 sports cars and
houses in Italy is a higher priority for David than his customers always
having access to what they're paying for.

That's his decision obviously, but if I was a 37 Signals customer ever
inconvenienced by problems with their infrastructure, I'd think of this
article.

