
Client asks for 100% uptime  - splattne
http://serverfault.com/questions/316637/100-uptime-for-web-application
======
oconnore
I don't understand what the issue is. The client wants you to plan for
disaster, and they aren't math oriented, so asking for 100% probability sounds
reasonable. The engineer, as engineers are prone to do, remembered his first
day of prob&stat 101, without considering that the client might not.

When they say this, they aren't thinking about nuclear winter, they are
thinking about Fred dumping his coffee on the office server, a disk crashing,
or an ISP going down.

Furthermore, you can accomplish this. With geographically distinct,
independent, self-monitoring servers, you will basically have no downtime.
With three servers operating at independent[1] three-nines reliability and
good failover modes, your expected downtime is under a second per year [2]. Even if
this happens all at once, you are still within a reasonable SLA for web
connections, and therefore the downtime practically does not exist.

The client still has to deal with doomsday scenarios, but Godzilla excluded,
he will have a service that is "always" up.

[1] A server in LA is reasonably independent from the server in Boston, but
yes, I understand that there is some intersection involving nuclear war,
Chinese hackers crashing the power grid, etc. I don't think your client will
be upset by this.

[2] DNS failover may add a few seconds. You are still in a scenario where the
client has to retry a request once a year, which is, again, within a
reasonable SLA, and not typically considered in the same vein as "downtime".
With an application that automatically reroutes to an available node on
failure, this can be unnoticeable.
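
A back-of-the-envelope check of that expected-downtime figure (a minimal
sketch; the three-server count, the 99.9% per-server availability, and the
independence assumption are all just the assumptions from above):

    # Expected yearly downtime when all N independent servers must be down at once.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    def expected_downtime_seconds(availability, n_servers):
        p_all_down = (1 - availability) ** n_servers
        return p_all_down * SECONDS_PER_YEAR

    print(expected_downtime_seconds(0.999, 3))  # ~0.03 seconds per year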

~~~
snorkel
Most clients don't understand the R2D2 talk. They understand money, features,
bugs, and downtime. So I've always explained it like so:

Uptime beyond 95% costs lots and lots of money. Orders of magnitude more
money. It requires redundant equipment, engineering all of the automatic
failovers at every layer, lots of monitoring, and 24/7 technical staff to
watch everything like a hawk. Not ... in ... your ... budget.

... or you could rest in the comfort of knowing that services like Twitter
have achieved mammoth success despite long and embarrassing outages. I thought
you might see it that way. Good choice.

~~~
swombat
First of all, uptime of 95% means 18 and a quarter entire days of downtime per
year. That's horrendous. I wouldn't host my dog's website on a server with
that kind of SLA - and I don't even have a dog.
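
To put numbers on the usual uptime tiers (a quick sketch, nothing more):

    # Days of downtime per year implied by a given uptime percentage.
    for uptime in (0.95, 0.99, 0.999, 0.9999):
        downtime_days = (1 - uptime) * 365
        print(f"{uptime:.2%} uptime -> {downtime_days:.2f} days of downtime per year")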

Secondly, although Twitter got away with large helpings of downtime, that
doesn't mean that every business type can. Twitter is not (or at least was
not, for most of its existence) business-critical to anyone. If Twitter goes
down, oh well. Shucks.

If you're running recently-featured-on-HN Stripe, however, where thousands or
more other businesses depend on you keeping your servers up to make their
money, I'd say even 10 minutes of downtime is unacceptable.

Finally, this doesn't have to cost a lot. Just find a host that offers the SLA
you're looking for, and have a reasonably fast failover to another similar
host somewhere else.

~~~
snorkel
Two problems:

The definition of "uptime": host SLAs only cover network uptime and
environment uptime, but clients consider "uptime" to mean application-level
uptime, which includes downtime for server maintenance, steady-state backups,
backup restores, deploying new releases, etc ... anything short of the service
being 100% fully functional = downtime in their minds.

Also, on costs: "reasonably fast failover to another similar host" implies
live redundant equipment at another host, which doubles the hosting costs.
That's a big pill to swallow, so big that most orgs would rather suffer the
downtime once they see the real cost of full redundancy.

------
sunchild
100% uptime is not an operational requirement – it's contractual. A client
that demands 100% uptime isn't being unreasonable; they're looking for a
contract remedy (most likely a termination right) if/when the site goes down.

1. "Uptime" is defined in many, many ways. In the OP's article, it's the
definition of uptime that seems unreasonable. Normally, the demarc points for
the network segments and equipment being measured for uptime are entirely
within the provider's control. In the OP article's update, the client
clarified that 100% uptime only applies when hosting is cut over to the
provider's site – something they are (theoretically) capable of controlling.

2. Remedies for failing the uptime requirement are different for nearly every
agreement. Often SLA credits are the exclusive remedy. Sometimes the customer
has a termination right (either express, or through the termination for cause
provision). The remedy is probably more important than the uptime percentage.

You'd be surprised how many big name web apps offer 100% uptime as a matter of
contract, knowing that it's a near-impossible operational goal. It's a matter
of taking on the risk of your customer leaving you or claiming SLA credits, or
whatever remedies you agree upon.

~~~
sunchild
EDIT: I've represented a whole lot of customers of web services over the years
(IAAL). The big lesson in this area for me is: (1) customers rarely invoke SLA
credits, preferring instead to "work it out" at the relationship level, and
(2) most provider off-the-shelf SLAs are so full of holes and tricky
thresholds that they are effectively useless. On this last point, beware the
100% or five nines or other unreasonably high uptime commitment. When you get
into the details of the SLA (the demarcs, the qualifications for obtaining
credits, the remedies for failure), you will almost always find that there is
no realistic remedy at all.

N.B. Meant to edit my comment above, not self-respond.

------
msy
It's always helpful when clients give unambiguous signs of unreasonable
insanity upfront instead of hiding it until you're halfway through the
project. It makes running away as far and fast as humanly possible so much
easier.

~~~
duck
There also was this comment on the SO thread: _I would personally RUN from
this client as fast as possible. I suspect this won't be the last crazy idea
they may have (from a technology standpoint)._

Why run though? They probably just don't understand what 100% means and it
just takes you explaining it to them. Or simply state that you cannot meet
that requirement and see if they still want you to bid on the project.

~~~
rickmb
You've just quoted the reason why: it won't be the last crazy idea they may
have.

That's pretty much an absolute certainty. Even if you can convince them with
reasonable arguments to accept a few points less uptime, you're going to be
having the same kind of discussion many more times on different subjects.

You have to be really, really sure you want and _need_ this kind of client.
Most of the time (around 100%...) they are more trouble than they are worth.

------
runako
All the posters are stuck on the fact that 100% availability is impossible.
But why not instead try to learn from others who offer 100% availability, like
Rackspace and SoftLayer? These (legitimate) providers know 100% availability
is not possible, but they guarantee it anyway. How can they get away with
this? Easy, they have a contractual SLA that indicates what their clients are
entitled to when their network fails for any period of time. Further, neither
is a low-cost provider, which allows them to engineer their systems to reduce
the incidents that lead clients to invoke the SLA.

Note that this doesn't mean that Rackspace is shady because they promise 100%
knowing they can't deliver it. After all, they put their money where their
mouth is! They have an incentive to actually achieve 100% uptime. I'm sure
there are other applications where the target is 100% (not 5 nines)
availability, especially in finance, medicine, and militaries.

My recommendation would be to take your engineering hat off, replace it with a
business hat, and provide them with a series of price quotes for various
uptime SLAs. And then make sure you're pricing high enough that when something
goes down for any period of time you can make good on your obligations
under the SLA without losing too much sleep. Then let the client choose the
SLA that matches their business needs and budget.

------
joelhaasnoot
Another option not mentioned in the thread is to accept it, and pay any of the
fines associated with not meeting it. This happens all the time in public
tenders and contracts, where the fines are calculated into the business risk.
It does mean that the tendering organization needs to set fines high enough to
make that approach infeasible.

~~~
eli
I've definitely seen hosting services do this. Sure, there's a 100% SLA, but
if you actually read it, it says you get back the pro-rated monthly fee for
the time it was down. So, in other words, you don't have to pay for it when it
isn't working. Not much of an SLA.

~~~
joelhaasnoot
Public transport does this too: in order to provide a robust, perfect
implementation of a schedule, you need extra buses/trains/streetcars
(expensive capital goods) and extra man-hours (generally the most expensive part of
the operation). Rather than invest too much money in making sure the schedule
can be met, it's cheaper to pay fines when there are delays.

~~~
gaius
British Rail once declared they operated on a 66-minute hour for this reason.

~~~
shabble
A (garishly coloured) reference: <http://www.lococarriage.org.uk/66minute.htm>

I've actually had quite good experiences claiming ticket credits or total
refunds for some long-distance UK train journeys. I can't find the exact
terms, and they potentially vary per-operator, but it's around 50% refund for
up to 30 minutes delay, and a full ticket cost refund if it's >60 minutes. A
couple of times, I've had a trip delayed by 50-59 minutes. I suspect this
isn't a coincidence.

------
Duff
Their craziness doesn't matter. Usually crazy customers aren't rich. So if you
build to their craziness, you'll lose the customer.

You need to build an appropriate infrastructure that will win the bid, figure
out what you can achieve (99.9%/99.99% uptime) and build in enough overhead to
cover your SLA penalties. Or negotiate a monitoring methodology that is in
your favor (i.e. exclude planned maintenance windows, use a monitoring
threshold/interval to allow you to address issues before triggering contract
"downtime", exclude external provider issues, etc.).

~~~
efsavage
> Usually crazy customers aren't rich.

On a personal level, I think individual people who've become wealthy have a
more reasonable outlook when it comes to stuff like this, but this doesn't
apply to rich _companies_.

The least reasonable clients I've ever had were employees of large companies,
and more specifically those who'd been recently empowered with the
responsibility of the project I was working on and lacked the perspective
required to realize that crazy doesn't help anyone.

It's not their money, but it's their decision, and that's a petri dish for
crazy.

------
rwmj
Come on, this is possible.

First we're going to have to get the governments of the world together to
agree to remove all nuclear weapons. Second would be getting that asteroid
tracking and deflection system working. Quantum physics does unfortunately
predict that the earth might flick out of existence with some small
probability, but by distributing the website across the universe we can reduce
this probability arbitrarily (and numbers approaching p=0.9999... are the same
as p=1). The client is going to need to budget for this.

~~~
mkup
And what about gamma-ray bursts from a nearby supernova directed at the Earth?

~~~
rwmj
The Dyson sphere surrounding the solar system is obviously going to be
expensive, but what's money when you want 100% uptime?

------
hmottestad
So 100% uptime is really difficult to achieve, hardware wise. Software wise
you'll have to prove that there are no bugs in the system that might bring it
down. That is much, much, harder.

You can have 100 servers in 100 different countries and have the client
automatically change to another server if the one they are connected to goes
down. But none of that helps if there is a software bug that crashes all your
clients on start-up, or worse, crashes all your servers (think of what happened
to Skype not long ago).

Also, never underestimate bugs in hardware (the Pentium FDIV bug). You'll need multiple
locations, multiple hardware, multiple operating systems, multiple compilers,
multiple versions of the software.... Standardizing on one of these components
may bring down your entire system!

------
illumin8
Look at F5 Networks' Global Traffic Manager. It's really just a fancy DNS
server. You set your TTL (time to live) down to just a few seconds and it
monitors your main and standby sites. If one of the sites goes down it changes
your A records to point to the new site. It can even do load balancing across
sites based on response time or number of connections.

They are expensive, but this is how large companies like Yahoo keep close to
100% uptime.

Explain to the customer that even with a hot site, the failover can take a few
minutes. Also, some ISPs don't honor TTL and cache DNS queries for longer than
they should. The Internet isn't perfect, and usually each extra 9 you add is
around 4x more cost.

~~~
marquis
This is the kind of service I'd love to attack as a side project; it's
fascinating. Though I'm sure someone out there reading this has something
like this service, but affordable for startups?

~~~
illumin8
There is a lot of room in this market for competition from open source
projects. Really, the concept is so simple that it could be done with a shell
script for simple failover:

(Pseudo code; the IP and zone-file paths are placeholders:)

    if ! curl -sf http://<ip-of-main-site>/ >/dev/null; then
        cp /path/to/alternate-zone-file /etc/bind/   # swap in the failover zone
        service named reload
    fi

You get the idea... F5 Networks is really just a fancy DNS server running on a
BSD-based OS on an x86 appliance. Zeus, which another commenter mentioned, has
an AWS version and will let you run it on your own hardware if you like.

I'd love to see some open source competition for this space, or even low price
competition.

~~~
marquis
>I'd love to see some open source competition for this space, or even low
price competition.

Yes, that's exactly what I meant. Running our own failover system is not just
expensive, but time-consuming - just another thing to do when you are trying
to scale and time is already short within a small team.

------
_corbett
I was at an Akamai presentation the other day in which the salesperson claimed
"100% reliability" of their services "not 99%, not 99.9% but 100%".

~~~
moe
These claims are common for all the big CDNs and ISPs, but they're always
accompanied by half a page of fine print that rids them of any liability when
an outage happens and limits compensation to a microscopic penalty (usually a
fraction of the monthly fee).

You _can_ negotiate steeper penalties with funky multipliers - but they
make you pay through the nose for such an arrangement (for obvious reasons).

------
JoachimSchipper
Seems reasonable, actually. "100%" is obviously not going to be achievable,
but "external users should be ok if our office network fails" is not
necessarily a bad requirement. There are lots of things that may make this
client happy: a VPN to an "internal" server in an external data center,
synchronous replication, etc.

------
patrickgzill
Level3 offers 100% uptime in their SLA. All that means is that if the network
goes down you get some money back.

------
dmbaggett
Look at it from a business rather than an engineering perspective. Forget the
achievability of the 100% target for a moment -- what target can you
realistically achieve? Then, what does the contract say the remedy is for
breach? As long as the remedy is not huge and -- this is very important -- is
clearly quantifiable, it may be just fine to enter into such an agreement.

You need the remedy to be clearly quantifiable (X dollars per Y minutes, for
example) because otherwise you create an opportunity for dispute when the
inevitable occurs and you breach. Resolving such a dispute could very well
cost more than the remedy itself, even in the worst case.
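
For instance, a remedy of the "X dollars per Y minutes" form leaves little
room for argument (a sketch with made-up numbers; the $500-per-15-minutes
schedule is purely hypothetical):

    def sla_credit(downtime_minutes, dollars_per_block=500, block_minutes=15):
        # Hypothetical remedy: $500 credited per full 15 minutes of downtime.
        return (downtime_minutes // block_minutes) * dollars_per_block

    print(sla_credit(95))  # 6 full blocks of 15 minutes -> $3000 credit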

From an ethical standpoint, I would only enter into such an agreement with an
understanding that "while we agree that it makes sense for you to request that
target, we think realistically that we'll be closer to 99.9% (or whatever you
truly believe)". Entering into an agreement with a 100% uptime clause is
different from setting an expectation that uptime _will actually be_ 100%.

------
DanBC
Helping your customers understand what they actually want to buy is part of
selling, surely? Things are made trickier by PHBs in the client company
claiming that everything is _mission critical_ and that they can never ever
have any downtime ever for any reason. Educating these people about, for
example, just how flaky email and DNS can be is important for your sanity.

See, for example, these couple of posts from a Microsoft public newsgroup ten
years ago:
([https://groups.google.com/group/microsoft.public.backoffice....](https://groups.google.com/group/microsoft.public.backoffice.smallbiz2000/msg/4bb5462eaa89b3b8?hl=en&dmode=source))
([https://groups.google.com/group/microsoft.public.backoffice....](https://groups.google.com/group/microsoft.public.backoffice.smallbiz2000/browse_thread/thread/90b8b85319f9b626/3e47ac766d184e4c?hl=en&lnk=gst&q=mission+critical#3e47ac766d184e4c))

Some customers are clueless, but at least they care about the data.

------
blrgeek
Looks like the client is asking for off-site failover, not really 100% uptime,
and the OP doesn't know how to achieve it over a WAN. Especially if this is a
real enterprise customer, they want Disaster Recovery (DR).

This is a solved problem, albeit not a commonly known solution. Any of F5,
Radware, and other expensive boxes can do this. This can also be done through
DNS or with HA-Proxy etc.

------
smoyer
Offer them a 100% up-time guarantee for a year if they also promise to avoid
being sick for the entire next year. If they can't avoid succumbing to a
virus, why would they expect your service to avoid it (or any other sort of
bug)?

------
nodata
99.5% uptime is 100% uptime to 0 decimal places.

~~~
sophacles
I appreciate the snark in your post :), but it also brings up a serious
question:

How does contract law handle significant figures?

~~~
Colman
I haven't read any cases on this, but I imagine most courts would
truncate/round down the achieved performance number. If the contract says
100%, that wouldn't allow for any downtime whatsoever.

It might be different for lower percentages - getting 89.8% performance where
90% is called for could be a de minimis breach and not actually count as
breaking the contract. Definitely curious as to whether anyone has more to add
on this.

~~~
a3camero
No.

They are interpreted the same way other commercial agreements are. They're
also generally very lengthy and specify what they mean (i.e. what counts, what
doesn't). They also set out what the consequences of breach are (maybe you
want $, maybe you want something else, etc.).

There's no magic to writing "SLA" or something else. What you put in the
contract is what you'll be held to...

------
Joakal
100% uptime is unrealistic for big companies because, at scale, it costs a
lot. For example, replication is expensive once you add up transmission,
storage and maintenance costs.

When the Amazon incident happened, I did an analysis and found that the cost
roughly triples if the data is also stored in an external data centre, and is
almost 6x if hosting overseas, even with the same company.

I then understood why companies like Reddit do not aim for the highest uptime
possible beyond a single data centre. How much the customer (or client) is
willing to pay determines the uptime target (I think Reddit's aim, for
example, is at least 90%).

------
yuliyp
Let's say I wanted 100% reliable music listening. To do this, I buy a million
of the original 30GB Zune media players, create a perfect failover system, so
that if the sound from one of those stops for whatever reason (hardware,
software, cosmic rays, etc), it'll switch to another one. I even spread these
Zunes all across the world, with AC power provided and multiple network links
connecting all of them, plus satellite link backups between them.

Then December 31, 2008 rolls around, and a tiny firmware bug knocks out all of
them simultaneously for 24 hours. Oops.

Not all failures are independent events.

------
babebridou
Would a heavyweight client with nothing but static data and no network at all
reach 100% uptime, from a contractual point of view? Even a wristwatch does
not exactly guarantee 100% uptime.

------
pstoneman
I can't post on serverfault, since the question's been locked, so I'll put
useful things to consider here:

* 100% SLA doesn't always mean 'It has to be up all the time'. Depending on the customer or the supplier, it can mean 'We'll aim to have it up all the time, but if it's not, we'll pay you compensation according to a predefined scale'. Clearly, in this case, you need to define quite firmly what 'up' and 'down' mean, how you measure them, how you time them, and how you decide what compensation to pay.

* DNS failover or load balancing is often nearly good enough. It won't get you instantaneous failover, since you'll need to have a finite (albeit small) TTL, and some client stub resolver libraries cache stuff anyway in violation of the TTL. But it's an easy step on the way.

* If you want true 100% uptime, ultimately, you need a single IP (or range of IPs) which will be permanently reachable. That pretty much means the IPs need to come from one AS number - in other words, one ISP or one company.

* You can choose an ISP or company which has multiple internet connections, peers with a lot of people in multiple locations, and has a well-designed network such that you feel confident they won't go offline. Amazon may be a good example, but they've had several recent high profile failures!

* You could do it yourself - in which case, you'd need to become an ISP, get your own AS number, and set up peering arrangements with multiple suppliers in multiple locations. This can be very costly, and you still have to run a network and servers yourself in a reliable way.

* You might be able to find a supplier who peers in multiple locations, and anycasts their protected IPs within their AS. That way, the same IP comes from multiple locations and should be reliable. Akamai might do something similar to this, I think.

* Ultimately, however you do it, you'll have a very difficult time making it impossible for it to fail. You're into the game of making it exponentially less and less likely that it'll fail, but you can't eliminate all risk. At the end of the day, your contract with your customer needs to define what happens if you should fail to reach 100% uptime. Is it breach of contract? Or do you need to pay a penalty fee? In either case, however you host it, you ideally need to make sure your suppliers' compensation to you, if they have a failure, will cover the losses you incur.

------
mathattack
The issue is of expectations and education.

Many clients ask for the following without knowing better:

- 100% uptime

- Zero defects

- Zero scope changes

- Zero perceptible latency in all cases

It is up to the professional specialist to educate the client in terms they
understand. Only after explaining in terms they understand can you call them
unreasonable. If the client understands and is still unreasonable, the true
professional has the obligation to walk away.

------
ck2
DNS round-robin with mirror servers that run 24/7.
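
A minimal client-side sketch of what that relies on (the hostname is a
placeholder; assumes the mirrors serve identical content behind multiple A
records):

    import socket, urllib.request

    def fetch_from_any_mirror(host, path="/", timeout=5):
        # Round-robin DNS publishes several A records for one name;
        # try each resolved address until one of the mirrors answers.
        addrs = {info[4][0] for info in socket.getaddrinfo(
            host, 80, family=socket.AF_INET, type=socket.SOCK_STREAM)}
        for addr in addrs:
            try:
                req = urllib.request.Request(f"http://{addr}{path}",
                                             headers={"Host": host})
                return urllib.request.urlopen(req, timeout=timeout).read()
            except OSError:
                continue  # this mirror is down or unreachable; try the next one
        raise RuntimeError("all mirrors unreachable")

    # fetch_from_any_mirror("www.example.com")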

~~~
brador
Yup, this is what I'd do. Mirroring would be the big issue, but not that big a
deal unless data was time critical.

~~~
ck2
Sites with large amounts of data do this in realtime all the time.

WordPress.com for example has five mirrors (maybe more by now).

------
droithomme
You can do this, but I suspect they might not like the cost quote of $10
trillion per year, and the 10 years of lead time it will take to build a
worldwide network of secure underground bunkers with their own air, water,
food and energy supplies, and a hardened shadow internet that duplicates the
function of the existing internet.

------
samuel
I would answer that Price is a function of uptime given by the following
formula:

Price(uptime) = K*(1/1 - uptime)

so I would ask for infinite dollars...

~~~
brador
I think you mean

Price = K/(1-uptime)

~~~
grhino
K*(1/1 - uptime) = K/(1-uptime)

~~~
brador
No. K*(1/1 - uptime) = K*(1-uptime)

If you are going to use brackets, use them correctly.

~~~
grhino
Good catch. I kept reading it as K*(1/(1-uptime)) and carelessly left it out.

------
gte910h
This is not a technical issue, this is a contractual issue. You want to sign a
contract that pays out reasonable amounts for every minute of downtime beyond a
reasonable (far less than 100%) reliability target.

Then you have the incentive to minimize (but not eliminate) those minutes to a
reasonable level.

------
bsiemon
It seems as though someone read half an article about 99.999% uptime and
decided to be clever.

------
Toenex
I would agree to this but charge them infinite money.

~~~
rmc
And then the client gives you a cheque that says "infinity" on it. They have
fulfilled their end of the deal; now you must fulfill yours. What, your bank
won't accept that cheque? Not our problem, get back to work.

~~~
Toenex
Now you're just being silly. Clearly I would have them pay in a series of
monthly instalments each one infinitely smaller than the total.

------
rjurney
Start looking at Erlang-driven telephone systems in Europe with 100% uptime,
and what it took to build them.

------
capdiz
I liked the dude who said "i wish i could down vote your client".

------
kahawe
I see several possible approaches, if you really want to have that client.

The easiest would be to just talk to them, try to find out what that "100%"
is actually REALLY all about and make them understand that from a technical
point of view, 100% will add a lot of things to the project budget. A "100%"
demand in a smaller project for a typical small-to-medium business will likely
mean something different than "100%" in a project for the NYSE. So, talk to
the customer and find out what it is actually all about and then plan and
quote according to their actual needs. So, this makes it more a requirements-
engineering type of problem, not necessarily a hacker problem.

Or you just say "yes, of course" and tell them how super reliable the system
is and then let the guys in legal work it out in the fine print and cover your
ass... but don't expect much happiness and continued business from that client
then once they find out what's going on.

But, in a more honest approach, maybe this is actually all they really want
and need? Maybe it actually is enough for them to have someone to blame and
pay some penalties for violating SLAs. Again, you need to find that out.

Not a typical hacker-hacker-problem but surely an issue a hacker would
typically encounter, even on a daily basis, and should learn to deal with.

