
I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not.

When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down.

Furthermore, you can accomplish this. With geographically distinct, independent, self monitoring servers, you will basically have no downtime. With 3 servers operating at an independent[1] three 9 reliability, with good failover modes, your expected downtime is under a second per year [2]. Even if this happens all at once, you are still within a reasonable SLA for web connections, and therefore the downtime practically does not exist.
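The back-of-the-napkin math here can be sketched in a few lines (assuming failures are independent and failover is instantaneous, per the comment's own caveats; the 99.9% figure is the per-server availability):

```python
# Sketch: expected annual downtime for 3 geographically independent
# servers, each at 99.9% ("three nines") availability.
# Assumes independent failures and instantaneous failover.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

per_server_downtime = 1 - 0.999            # 0.001
all_three_down = per_server_downtime ** 3  # 1e-9, by independence
expected_seconds = all_three_down * SECONDS_PER_YEAR
print(f"{expected_seconds:.3f} seconds of expected downtime per year")
# roughly 0.03 seconds: well under a second
```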

The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.

[1] A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.

[2] DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.

Most clients don't understand the R2D2 talk. They understand money, features, bugs, and downtime. So I've always explained it like so:

Uptime beyond 95% costs lots and lots of money. Orders of magnitude more money. It requires redundant equipment, engineering all of the automatic failovers at every layer, lots of monitoring, and 24/7 technical staff to watch everything like a hawk. Not ... in ... your ... budget.

... or you could rest in the comfort of knowing that services like Twitter have achieved mammoth success despite long and embarrassing outages. I thought you might see it that way. Good choice.

First of all, uptime of 95% means 18 and a quarter entire days of downtime per year. That's horrendous. I wouldn't host my dog's website on a server with that kind of SLA - and I don't even have a dog.
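That 18-and-a-quarter-day figure is just arithmetic; a quick sketch:

```python
# Downtime implied by an uptime percentage, over a 365-day year.
def downtime_days_per_year(uptime_pct: float) -> float:
    """Days of downtime per year allowed by a given uptime percentage."""
    return 365 * (1 - uptime_pct / 100)

print(round(downtime_days_per_year(95), 2))    # 18.25
print(round(downtime_days_per_year(99.5), 2))  # under two days
```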

Secondly, although Twitter got away with large helpings of downtime, that doesn't mean that every business type can. Twitter is not (or at least was not, for most of its existence) business-critical to anyone. If Twitter goes down, oh well. Shucks.

If you're running recently-featured-on-HN Stripe, however, where thousands or more other businesses depend on you keeping your servers up to make their money, I'd say even 10 minutes of downtime is unacceptable.

Finally, this doesn't have to cost a lot. Just find a host that offers the SLA you're looking for, and have a reasonably fast failover to another similar host somewhere else.

Two problems:

The definition of "uptime": host SLAs only cover network uptime and environment uptime, but clients take "uptime" to mean application-level uptime, which includes downtime for server maintenance, steady-state backups, backup restores, deploying new releases, etc. Anything short of the service being 100% fully functional equals downtime in their minds.

Also, on costs: "reasonably fast failover to another similar host" implies live redundant equipment at a second host, which doubles the hosting costs. That's a big pill to swallow, so big that most orgs would rather suffer the downtime once they see the real cost of full redundancy.

> First of all, uptime of 95% means 18 and a quarter entire days of downtime per year. That's horrendous.

It may be acceptable to some clients depending on what other provisions are part of the SLA (though probably not with as little as 95%).

I've seen a 98% SLA which was applied both annually (about 7 and a third days) and daily (2% of a day being about half an hour), with significant remuneration if the daily SLA was not kept as well as the annual one. If I remember rightly, maintenance windows counted against the SLA except in certain circumstances specified in the contract.

Of course for many applications this would still be completely unacceptable, but for others it might be fine depending on the costs and the comeback if the SLA is broken.

I'm getting better than 95% up time on my home network. If you told me that and I was your client, I'd be going elsewhere.

Over 99% costs lots of money, yes. How much is dependent on how close to 100 you are looking to get, but that's the client's decision. 99% though is a perfectly acceptable standard.

There is a big difference between getting >95% up time and being able to actually back up an SLA for > 95% up time.

There is? What exactly are you doing that you need 20 days a year of downtime to accomplish? That's a full business day a week.

If you can't commit to that low of an SLA, you're doing it wrong.

Often SLAs aren't given per full year, though. Think less of 20 days a year, and more of X hours per week if something goes severely wrong.

Now, you are right that you get diminishing returns as you add more nines, but 95% is still in the range where there are a lot of cheap things you can do to increase uptime; RAID, a UPS, and even a low-end but business-grade connection should get you around two nines.

I have an SLA of 99.5% (over a month) on a low-end setup[1] and it's fairly rare that I don't meet it, even including planned downtime and network outages due to DoS or mistakes by my upstream.

[1] I use RAID and mostly Supermicro server-grade hardware with ECC RAM, but there is no failover across servers; I'm in a data center, though it's a low-end data center with a low-end bandwidth provider.

Maybe it would be good to write the customer an email explaining stuff like

  99%=Well run server
  99.9%=Multiple backups, will cover most hardware failures
  99.99%=Top grade commercial 
  99.999%=What companies like Google or Yahoo can achieve
  99.9999%=Hopefully the US strategic defense systems are this reliable
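Those thresholds translate into annual downtime with simple arithmetic; a quick sketch (assuming a 365-day year):

```python
# The nines ladder expressed as minutes of downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for pct in (99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = MINUTES_PER_YEAR * (1 - pct / 100)
    print(f"{pct}% uptime -> {minutes:.1f} minutes of downtime/year")
```

Five nines works out to about 5.3 minutes a year, which is why it is so hard to honor at the application level.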
Also, your assumption that the servers are entirely independent: that's a reasonable assumption in terms of fires and blackouts and floods, but not for software problems. You really can't assume independence unless the servers are all running entirely different software stacks on different operating systems.

99.999%=What companies like Google or Yahoo can achieve

Even worse. Five nines is five minutes of downtime per year. The core Google search experience blew five nines for half a decade with just one outage -- the one where they marked the entire Internet as a malware site, which took something like 40 minutes to address.

This kind of thing makes me dismiss talk of nines as fetishism or sales-speak. You can say your system is going to have five nines of uptime at the application level. You're probably lying.

P.S. Pricing-wise, a client who wants > 99.5% either wants to pay mid-six figures (and up up up) or they want something which is deeply irrational for you to offer.

If we're talking about the agreement, SLAs that are better than 99.9% are quite common, and available even on low end products. The problem is that the SLA payout is usually "we refund you for the time you were down, if you ask for the refund" - heck, with that payout, I'd be happy to give you a 100% SLA on any product I sell. (of course, I'm not going to advertise as such; the sort of people who buy from me would find that disingenuous.)

That said, I think you are about right with 99.5% being about the best you can expect while spending a reasonable amount of money. (Especially for a static site, I think a few more tenths of a point are possible for less money than you'd expect, but the cost curve goes parabolic somewhere after 99.5%.)

Or they've failed to understand exactly what the SLA says.

It may exclude all manner of things so that it says >99.5% but really doesn't mean what you as an engineer might think. These are legal agreements...

You need to add a line:

100%=Thank you for investing so much in me that my grandchildren's grandchildren are set for life.

99.999% is the standard for landline telephones, which I think is a pretty decent analogy here. Unless the customer has some crummy VoIP solution and doesn't trust their phones anymore :P.

It is? According to whom? One good rainstorm can knock phone lines out for long enough to mess up five nines for a LONG time.

I believe it's what they design for at the central switch nodes. The buildings would have entire rooms if not floors dedicated to nothing but 48 volt wet-cell batteries.

Of course the "last mile" infrastructure is not so reliable. That said, in the over 20 years I've had POTS service, I can't ever remember not having a dial tone when I lifted the handset.

Not noticing downtime is not the same as not having it.

I'll sell you 100% uptime at a much better price if you promise to only check whether you're up a few times a day...

But "three 9 reliability" is still not the same thing as 100%. The contractor has a right to be concerned about the 100% figure making its way into a contract.

Um, things don't just "make their way" into contracts. Yes, they suddenly appear in drafts, but finals? Sorry, no. Finals require approvals and signatures from the people who are going to be on the hook. (The eBay lawyers who approved the Skype purchase may beg to differ, but they're hardly unbiased.)

The draft stage is where you take snorkel's exactly-right advice, and declare that the difference between 99.999% and 100% is about a bajillion dollars, give or take. Go a step further, and sell that 0.001% hard by pointing out just how much they'd be prepared for ("multiple earthquakes plus a giant robot attack, all at once!"). They'll start rethinking fast - guaranteed.

And that's the essence of diplomacy; letting the other guy have it your way. If he thinks it's his idea, even better.

No, three 9 reliability for a single server.

1 - 0.001^3 = 0.999999999 availability, or under a second of expected downtime per year; the client will never notice it even with good monitoring tools, and therefore will never invoke the contract.

You're assuming independence to a level that does not exist. Consider a Y2K-style bug in the OS that could take down all servers for an extended period of time. Or someone could write a virus that uses a zero-day exploit, etc.

I think it's more likely that the programmer screws up in keeping the separate servers independent through database migrations.

I am making no such assumption. See [1] in my original post. I already talked about intersection. Feel free to add Y3K to the list of nuclear war, Chinese hackers, etc. The intersection is incredibly small, and not something that I am going to include in my back-of-the-napkin calculation.

Your failover code never has bugs?

See the linked discussion, only 3 out of 20 top sites had 5 nines.

For enough time and money dumped into code auditing and hiring smart people, no. Not that most companies should do that, but it is possible if you want to pay for it. Most companies (rightly) prioritize innovation, scalability and profit margins over absolute reliability.

How many of those top sites actually prioritized reliability? Is it even justifiable for their business models? I bet you can find a lot better reliability engineering in bank, credit, and stock systems. For example, when was the last time the Visa credit network crashed (as a whole, not localized outages)? Nasdaq?

Nasdaq states that their system has four nines: http://www.nasdaqtrader.com/trader.aspx?id=tradingusequities

This is the answer missing from SF. This is the true "what would it take" that could be presented back to the client.

Would the client go for the cost? Who knows, but that's the client's decision.

Add a little more data on the DNS failover and it would be a good community wiki entry there.

Aiming at 100% uptime opens up all sorts of scope issues. Consider that failover does technically cause a small amount of downtime as you restart the session. If you acknowledge that, it rules out most of the current model of fault tolerance from helping you achieve 100% uptime.

It's no different from any other system, really. Try designing a car that can drive 100% of the time, or a power grid that's up 100% of the time.

I'm flattered. I don't have a SF account but feel free to copy/paste if you think they would enjoy reading it.
