
100% uptime for a web application - leonegresima
http://serverfault.com/questions/316637/100-uptime-for-a-web-application
======
biot
This is a great lesson in how not to be a pedantic nerd toward clients. By
100%, the client is obviously not specifying an uptime requirement of:

    
    
          lim (1 - 1/x)
         x->inf
    

They're saying that downtime is unacceptable. As the consultant/contractor,
it's your job to present various scenarios and reiterate to the client that
the redundancy/failover features introduced at each new level roughly double
the hosting cost and require additional engineering time for all the planning,
prototyping, development, testing, deployment, and so on for the sync/failover
functionality. For example (successive levels include redundancies of previous
levels):

Level 1: A single server connected to a UPS device, with the domain's DNS
records served by multiple DNS hosting companies.

Level 2: Redundant servers load balanced on the firewall.

Level 3: Pool of redundant servers with redundant firewalls and redundant
network switches.

Level 4: Geographically separated data centers with traffic routed via
anycast.

Level 5: Servers in multiple datacenters from multiple distinct vendors.

Outline ballpark costs at each level and the staffing required to support it,
including site reliability engineers to monitor and maintain each site's
operation and verify correct data synchronization, 24x7 on-call engineers,
equipment to enable remote diagnostics, and so on.
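
To make the "roughly doubles" rule of thumb concrete, here is a minimal sketch. It assumes the Level 1 cost is the baseline and that each further level exactly doubles it; the function name and the 1000 figure are hypothetical, and real quotes will not follow a clean power of two.

```python
def hosting_cost(level_1_cost: float, level: int) -> float:
    """Estimate hosting cost at a redundancy level, assuming each level doubles the last."""
    if level < 1:
        raise ValueError("level must be 1 or higher")
    # Level 1 is the baseline; every level above it multiplies the cost by 2.
    return level_1_cost * 2 ** (level - 1)


if __name__ == "__main__":
    # 1000 is an arbitrary illustrative baseline, not a real price.
    for level in range(1, 6):
        print(f"Level {level}: {hosting_cost(1000, level):,.0f}")
```

This covers hosting only. Engineering time and staffing come on top and are not modelled here.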

~~~
300bps
>This is a great lesson in how not to be a pedantic nerd toward clients

I've been in IT for 20 years so I may be jaded, but I disagree with your
conclusion.

Your Levels don't seem to scratch the surface on achieving even 99.999%
uptime. First of all, if you are planning on using vendors for your
hypothetical 99.999% uptime you can stop there. Sure, they may promise that to
you. But what is the penalty if they don't?

If you are planning on hosting this in your own data centers... Are you
including diesel generators and massive diesel tanks to supply them in the
event of extended power outages? What is your DDoS plan? What is your plan in
the event of a hacker? What's your patching plan? These are just examples. The
level of planning to achieve 99.999% uptime per year is massive. That's only 5
minutes of downtime per year!!
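
The "5 minutes" figure follows from simple arithmetic, sketched below for a few availability targets. The function name is illustrative and a year is taken as 365.25 days, which is an assumption; the result shifts slightly with a 365-day year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumes an average year of 365.25 days


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} minutes/year")
```

At 99.999% the budget works out to a little over five minutes for the whole year, planned maintenance included.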

The bottom line is there are a handful of applications in the world that
require this level of uptime. Unfortunately, every single client I have ever
had was under the impression they required this level of uptime. It is my job
to explain to them what reality is. If I don't, then I am not allowing them to
prepare for the inevitable times that the system will be down no matter how
much they spend or what they do.

~~~
biot
I totally agree. I was assuming that diesel backup generators with refuelling
contracts in place are considered table stakes for a data center these days. My
example levels weren't exhaustive by any means and your points around capacity
planning, attack mitigation, change management, etc. are spot on.

------
lucaspiller
I'm a bit disappointed with these answers. Rather than explaining how it could
be done, the answers merely say why you shouldn't do it or why it is
impossible.

Assuming you have a client who asks for this, you explain it isn't possible to
get 100%, but they still want as good as they can get, and are willing to pay
big bucks, what do you do?

~~~
benjaminwootton
High level items:

- Resiliency of all hardware, software, and interconnects, i.e. no single
points of failure;

- Isolation of the various routes, such that you can pull any plug in the
system and there is always a route through;

- An application architecture that supports horizontal scaling, clustering,
load balancing, error handling, and enough asynchronicity;

- Enough capacity such that if things start failing, the rest of the system
can handle the load;

- Tooling and environments that support things such as intraday deploys and
rollbacks, including database migrations;

- QA and testing processes to ensure catastrophic bugs do not slip through!

These are all achievable, but it's when problems start interacting that stuff
slips through the cracks.
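
As a rough illustration of why removing single points of failure helps, the sketch below computes the combined availability of redundant components. It assumes failures are fully independent, which is exactly the assumption that breaks when problems start interacting; the function name is hypothetical.

```python
def parallel_availability(component_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas, assuming independent failures.

    The group is down only when every replica is down at the same time.
    """
    if not 0 <= component_availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1 - (1 - component_availability) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s) at 99% each -> {parallel_availability(0.99, n):.6f}")
```

Correlated failures (a shared power feed, a bad deploy pushed to every node) make the real number worse than this estimate.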

~~~
waps
Legal items. Alternatively: The cheat everyone else is using.

1) contractually provide a 100% SLA

2) the "benefit" you give the customer when you don't provide 100% of uptime
for a given period is 10% price reduction for that period only

3) To add insult to injury, make the period a day or even an hour.

4) Downtime is only downtime after reception of an email from the designated
customer contact, and only if it wasn't resolved at the time they actually
noticed. That customer contact must be a person, ready and able (legally
allowed) to answer questions and make choices.

This is how a lot of ISPs are doing it.

Amazing how many people consider this a 100% SLA. Then the ISP is down for
weeks, and then they get to pay $50 less on a $5000 bill ...
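
For a sense of how little such a clause pays out, here is a minimal sketch of the credit under the scheme above. It assumes the bill is spread evenly over the measurement periods and that each period with acknowledged downtime earns the 10% reduction for that period only; the function and parameter names are hypothetical.

```python
def sla_credit(bill: float, periods_in_cycle: int, periods_with_downtime: int,
               reduction: float = 0.10) -> float:
    """Credit owed when each period with downtime is discounted by `reduction`."""
    if periods_in_cycle < 1:
        raise ValueError("periods_in_cycle must be at least 1")
    if not 0 <= periods_with_downtime <= periods_in_cycle:
        raise ValueError("periods_with_downtime out of range")
    price_per_period = bill / periods_in_cycle  # assumes even spread of the bill
    return price_per_period * reduction * periods_with_downtime


if __name__ == "__main__":
    # A 5000 bill measured in hourly periods over a 30-day cycle (720 periods).
    print(f"One bad hour: {sla_credit(5000, 720, 1):.2f}")
    # Even if every single period had downtime, the credit is capped at 10%.
    print(f"Down all cycle: {sla_credit(5000, 720, 720):.2f}")
```

Under these assumptions the customer never recovers more than 10% of the bill, no matter how long the outage lasts.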

------
zalew
<https://news.ycombinator.com/item?id=3056414> linked from the answers

------
eatmyshorts
This doesn't address the asymptotic impossibility of maintaining true 100%
uptime, but...

I think you can maintain uptime even when the servers aren't available with
something like Meteor (a framework for single-page web apps built with
JavaScript; it doesn't have to be Meteor, but Meteor should be able to handle
it). It's not easy, but Meteor's events indicating availability of the server
should allow you to cache pending changes (when the server is unavailable) and
send the cached changes when the server becomes available again.

Meteor gives you full control over a client-side session, such that as long as
the user doesn't reload the page, the user can continue using the app. There
are hooks to detect when the server reconnects, so you should be able to write
some custom code that updates the server with any pending changes that
occurred while the server was unavailable.
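
The queue-and-replay idea can be sketched in a framework-neutral way. This is not Meteor's API: the class, the `send` callback, and the `on_reconnect` hook are all hypothetical stand-ins for whatever the client framework provides, and it is written in Python only to show the logic.

```python
from collections import deque
from typing import Callable, Deque


class PendingChangeQueue:
    """Buffer changes while the server is unreachable and replay them in order."""

    def __init__(self, send: Callable[[dict], None]) -> None:
        self._send = send  # hypothetical transport; raises ConnectionError on failure
        self._pending: Deque[dict] = deque()
        self.connected = True

    @property
    def pending_count(self) -> int:
        return len(self._pending)

    def submit(self, change: dict) -> None:
        """Send immediately when online, otherwise queue the change."""
        if self.connected:
            try:
                self._send(change)
                return
            except ConnectionError:
                self.connected = False
        self._pending.append(change)

    def on_reconnect(self) -> None:
        """Call from the framework's reconnect hook to flush queued changes."""
        self.connected = True
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                self.connected = False  # still down; keep the rest queued
                return
            self._pending.popleft()  # drop a change only after it was sent
```

The sketch keeps changes in memory only, so a page reload loses them, and it does nothing about changes invalidated by another user's edits.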

Reloading the page would kill the pending updates, so you'd need to have the
users use some sort of kiosk-like browser that prevents reloading the page and
navigating to other pages.

There are other issues to be considered (what if the pending changes are
invalidated by another user's changes?) that are very challenging, but can be
handled depending on the user's other requirements.

------
anuraj
Every nine after 99% is going to cost exponentially more. There are only a few
critical services that can justify that kind of outlay.

