

How Seriously Does Your Cloud Hosting Provider Take Redundancy? - Sami_Lehtinen
http://en.upcloud.com/blog/how-seriously-does-your-cloud-hosting-provider-take-redundancy/

======
dkokelley
Strong SLAs are nice, but wouldn't the best measure be historical outages as a
percentage of whole infrastructure uptime? For most applications, the
compensation for an SLA breach is negligible compared to the cost of an outage
to the customer.

One hour of downtime for an application paying $20,000/month for cloud
services is worth $27.78 (assuming a 720-hour month). Even with a 100x
multiplier the customer would be due a check/credit for about $2,778. Given
the choice, I'm sure most rational businesses would prefer the uptime over the
credit. Perhaps someone running a company with similar cloud service bills
could chime in.
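
The arithmetic above can be checked with a short sketch. It assumes a 30-day
(720-hour) billing month, which is what yields the $27.78 figure; the function
name and parameters are illustrative and not taken from any provider's SLA.

```python
def sla_credit(monthly_bill, downtime_hours, multiplier=1.0,
               hours_per_month=30 * 24):
    """Return the credit for downtime, pro-rated from the monthly bill.

    Assumption: a 30-day (720-hour) month. Real SLAs define their own terms.
    """
    hourly_cost = monthly_bill / hours_per_month
    return round(hourly_cost * downtime_hours * multiplier, 2)


print(sla_credit(20_000, 1))        # 27.78
print(sla_credit(20_000, 1, 100))   # 2777.78
```

Either figure is small next to what an hour of outage can cost a
business-critical application, which is the point being made.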

~~~
vilpponen
"Given the choice, I'm sure most rational businesses would prefer the uptime
over the credit."

How would you propose to compensate with uptime (if I've understood
correctly)? Interested to learn your thoughts on this.

~~~
dkokelley
I'm not sure of the best way to approach it. The issue is that there is (or can
be) a huge imbalance between cost and impact for cloud service downtime. A
business-critical application coming down for even an hour has the potential
to cause significant heartache to the customer's business. The imbalance is so
high that I'm not sure it's possible to justifiably compensate for your
customer's business impact, based on what the market is for cloud services.

My thoughts on the matter are this:

* Nearly any compensation is "not enough" to a customer. They would rather the service had not gone down.

* It is nearly always better to spend the money that would fund _extravagant_ SLA breach compensation on higher redundancy instead. [1]

* Because 100% uptime is prohibitively expensive, transparency about historical uptime is vital for a prospect making a buying decision.

* The most fair way to compensate a customer for downtime is to take out an insurance policy against your service for the cost of downtime to your customer. (This is unique to each customer, and service should be priced accordingly.) [1]

1: Customers should be compensated for downtime, if only because they aren't
able to use the service they paid for. If additional money can't be used for
additional redundancy/reliability, it can be used to insure against the cost
of downtime.

Edit: After re-reading your question, I see that I misunderstood it. I don't
mean to say "Your customer was down for an hour, give them
an hour (or more) for free." I actually meant "Your customer was down for an
hour. That costs them $XX,XXX in headache. You can't expect to compensate them
enough to cover that, so it would be better to avoid the downtime."

~~~
Khaine
If an application is that business critical, then shouldn't the organisation
be running it across multiple cloud providers, or ensuring it can run
internally?

------
kordless
SLAs are an artifact of the current business model where you have compute
resources and I buy them from you. If we alter that model slightly where
everyone buys compute from everyone else, the need for SLAs disappears
completely. It's not surprising such a model would carry an immense amount of
trust for availability, given it would be difficult for everyone to be 'down'
all at once.

The table listing the "confidence" levels of the providers is designed to
instill a certain amount of trust in the end user. The provider is basically
saying "you can trust me, not because you can see how I operate my network,
but because I'm willing to put my money at risk through an escrow statement
that is applied to all my customers". This isn't really useful to everyone as
Sue's use case may have a high net worth and Bob's use case has a low net
worth.

~~~
vilpponen
Thanks for the comment. I work at UpCloud so I'll chip in my 2c on the issue.
Your description of "how much money providers are willing to risk" is a great
way to approach this. While it probably doesn't give you the most universal
view on redundancy, it is a good comparative measure of how much companies
are willing to risk if their services go down.

Building multi-cloud services is of course a way to try and avoid this, but I
would say that this approach isn't very well adopted as of yet.

Having said that, those utilising cloud hosting services will need to balance
their own risk levels with those of the companies they are looking at and find
the best match. Another way to approach this is to ask: how much am I willing
to risk my business for the benefit of this company offering me a cloud
hosting service?

------
earlz
I know they're not the same, or even the same target market and demographic,
but can anyone explain why a VPS host like Linode can manage to keep my VPS up
for >6 months without a reboot, while AWS or Azure can barely manage 1 month?

I know the whole "you're supposed to use availability sets for redundancy" or
equivalent speak, but there are a lot of times when this just is not
possible, not to mention potentially doubling the cost when you do this.

------
srcmap
Just suffered my first app outage, lasting a few hours, after 460+ days of
uptime. The cause was SSD read errors, which triggered ext4 journaling errors
and made Linux remount the filesystem read-only. From that point on the app
failed because the logging API failed.

Just curious, does your SLA cover such an outage? My service provider said
everything is normal from their POV, which is kind of true.

~~~
vilpponen
If I've understood your situation correctly - then yes, this would be an issue
that is covered by our SLA. Our customer data is actually hosted on two
separate storage backends for issues like failing disk drives.

Our approach on this is that when we offer a service to you, you should not be
worried about how the physical equipment functions - it should just work.
Hence all issues on that front would be our responsibility. Glad to talk more
on this if you're interested. My contact details can be found on our company
website.
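
As a rough illustration of why keeping data on two backends helps, the sketch
below computes the chance that at least one of several copies is available. It
assumes the copies fail independently; the 99.9% figure and the function name
are hypothetical and are not UpCloud's numbers. Correlated failures (shared
power, a common software bug) would make real results worse.

```python
def combined_availability(single, copies=2):
    """Probability that at least one of `copies` backends is up.

    Assumption: each backend fails independently with probability 1 - single.
    """
    return 1 - (1 - single) ** copies


# Hypothetical 99.9% availability per backend, two backends.
print(f"{combined_availability(0.999):.6f}")  # 0.999999
```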

------
switch007
There's nothing in that post about your data centers, just your "stack". Of
what does your service stack comprise?

Why did you not opt for 2N?

Do you co-locate? What tier is the DC? If you colo, especially <= T3, your
colo company has likely cut corners somewhere, and you probably don't know
about some of the weaknesses in their infrastructure.

------
protonfish
I like my cloud hosting provider to take redundancy seriously, AND take
redundancy seriously.

------
NDizzle
Cute. People guaranteeing 100% uptime. They should run for political office
with ideals so peachy and clean.

~~~
vilpponen
Thanks for the comment! A 100% SLA is actually a promise of future uptime.
Nobody can of course guarantee their uptime, no matter what the SLA
percentage. Thus, the company includes compensation in the agreement if the
service level is not met. All this of course is decided based on historical
data and what makes business sense for the company, i.e. how much money
they are willing to lose if their services go down.

~~~
NDizzle
Would you buy a car that guaranteed to never break down? Do you not understand
how ridiculous a 100% uptime claim is?

I don't even consider companies stupid enough to put out a blurb like that.

~~~
michaelbuckbee
The key here is that this is an SLA. So it's not that they'll never be
down, but rather for every minute they are down they'll pay out money to you
for the trouble.

To extend the car analogy, it's more like a lifetime warranty on a car: the
car might break down all the time, but it's still covered.

