
I build High Availability Platforms so Cloud is not for me - nicktelford
http://blog.networksaremadeofstring.co.uk/2012/07/01/i-build-high-availability-platforms-so-cloud-is-not-for-me/
======
rdl
The problem is AWS makes representations about how available certain
components will be, and the independence of various features like AZs, which
turn out to be lies in practice. When a higher level system builds on top of
AWS and relies on these representations, the higher level system fails.

Carriers lie about route diversity on specifically ordered diverse circuits
all the time. I rarely physically trace every power cable in a colo to make
sure A and B are independently fed; if someone has done it correctly 30 times
before, 31 is probably going to be ok. If it is critical, or something which
might be difficult, I may check, but since I am confident it will pass, I'll
bundle a bunch of checks together so it only passes if all the checks pass.
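The "bundle a bunch of checks" idea can be sketched as a script that runs every verification and reports success only if all of them pass. The check names and their bodies here are hypothetical stand-ins, not real probes:

```python
# Sketch of bundling independent infrastructure checks: the audit
# passes only if every individual check passes. The checks below are
# illustrative stubs; real ones would probe PDUs, traceroutes, etc.

def check_power_feed_a():
    return True  # e.g. confirm feed A is independently energized

def check_power_feed_b():
    return True  # e.g. confirm feed B is independently energized

def check_diverse_routing():
    return True  # e.g. compare traceroutes of the two circuits

CHECKS = [check_power_feed_a, check_power_feed_b, check_diverse_routing]

def audit():
    results = {check.__name__: check() for check in CHECKS}
    for name, ok in results.items():
        print(f"{name}: {'ok' if ok else 'FAIL'}")
    return all(results.values())
```

The point of bundling is that a single failing check fails the whole audit, which matches the "31 is probably going to be ok, but verify in bulk" approach.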

------
notatoad
This post seems like somebody looking for an excuse to attack "the cloud".
AWS's issues last week were caused by a whole datacenter going down, due to an
electrical storm. No matter what your configuration is, whether you use cloud
hosting or traditional hosting, losing a datacenter is going to throw a wrench
in things. Even if your failover works perfectly, the increased load on
whatever you fail over to might bring things down.

The procedure for high availability is the same for cloud hosts as traditional
hosts: multiple datacenters and multiple providers. If somebody made the
decision to host only with Amazon and only in US-East, the traditional
hosting alternative to that is not a high-availability network spanning 5
datacenters. You're comparing apples to oranges. You can host with AWS across
multiple regions, just like you can host with any other provider in multiple
datacenters.
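The multi-region approach described above can be sketched as a client-side failover loop: probe each deployment in order and use the first healthy one. The endpoint URLs and the shape of the health check are assumptions for illustration:

```python
# Minimal sketch of failing over across independently hosted regions.
# The hostnames are hypothetical; any real deployment would also put
# this logic behind DNS or a load balancer rather than in the client.
import urllib.request

ENDPOINTS = [
    "https://us-east.example.com/health",       # primary region
    "https://eu-west.example.com/health",       # second region
    "https://backup.otherhost.example/health",  # different provider
]

def healthy(url, timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_endpoint(endpoints=ENDPOINTS, probe=healthy):
    for url in endpoints:
        if probe(url):
            return url
    raise RuntimeError("all regions down")
```

Passing `probe` as a parameter keeps the failover order testable without touching the network.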

------
patrickgzill
This is an excellent cautionary article... I still don't get all the love for
AWS after they have shown that they have less uptime than many dedicated
server providers.

Reliability is the greatest feature you can offer, as a downed service can't
be used by anyone.

~~~
peterwwillis
The big trade-off with AWS is it allows poor people to afford lavish computing
resources for small amounts of time, at the expense of things like performance
and reliability. If you have slow, steady growth metrics it makes much more
sense to host yourself. Planning for peak capacity and buying only what you
need is cheaper than dynamically scaling what you need.

The product is geared to a specific kind of customer, and frankly I have no
idea why companies like Netflix use them. Hell, Netflix actually buys whole
instance blocks [servers] just to work around the shared I/O of EC2. You'd
think just buying a real dedicated server would be cheaper and easier.

~~~
ceejayoz
> The product is geared to a specific kind of customer, and frankly I have no
> idea why companies like Netflix use them.

I'd imagine Netflix sees enormous fluctuations in required server capacity.
Not many people watch movies at work, but when they come home, or when it's a
weekend, traffic probably spikes tremendously.

~~~
mechanical_fish
Yes, perhaps we should think of it this way: Netflix wants to run blocks of
dedicated servers, but to defray their costs by leasing their excess capacity
to thousands of other customers during non-peak hours.

And a cheap way to build out the technical, marketing, and billing
architecture for that is to partner with AWS, which has built all that
already.

------
jxcole
I think the real reason people like Amazon cloud is for people like me who are
decent programmers but who don't know or don't want to know how the system
works internally. This guy sounds like he has a lot of what I would call "ops"
knowledge which is great for him, but if I decided to start running my own
website today I probably wouldn't have the chance to learn all of that stuff.

~~~
seiji
_don't want to know how the system works internally._

I see this sentiment a lot. People seem to be proud of technical ignorance
("We don't even know where our servers are! lol!") because they are too busy
doing whizbang gollygee Important Work like making the world's four billionth
PHP CRUD app and setting up a DB without durability guarantees.

"But we don't have time to _learn_... we are busy being cool and popular!"
Sometimes you have to slow down and actually learn things. Sometimes you
should be an intern for a year before being a CEO. (It sounds like you've
already resigned yourself to not learning things: _I probably wouldn't have
the chance to learn all of that stuff._ )

 _what I would call "ops" knowledge which is great for him_

That's kinda impressive: dismissive condescension towards someone who knows
more than you. It provides the ego with a nice padding. "They know more than I
do, but what they know is silly and useless in these modern cloud-based times.
I feel sorry they actually had to learn how things work."

Programming means manipulating a system. Basic/beginner programmers pretend
the entire system is the API of their favorite (read: ruby) programming
language. You can get pretty far not learning more than that, but you will
always be limited. You will always be pretending to understand more than you
actually do.

If you aren't a "computer person," you'll stop at the basic level. You'll spin
your wheels cranking out things that feel the same forever. What separates a
"computer person" from people just in it for the glory or money is their
insatiable hunger for knowledge. They want to be good, better, and then
optimally best.

Intermediate programmers realize (and have a mental model of) how everything
works together including a basic CPU, bus, memory, network (from hardware up
to packet switching through routing and DNS), your language stack and heap,
calling conventions, directly attached storage, network attached storage, and
half a dozen other things depending on speciality (graphics? wireless? robots?
distributed system? embedded system? web?).

Advanced programmers do not exist.

~~~
ianterrell
> _don't want to know how the system works internally._

> _I see this sentiment a lot._

I _like_ that sentiment a lot. We have finite time and finite brain cycles.
Stand proudly on others' shoulders and build something bigger.

There's no shame in programming in high level languages to deliver more value
more quickly. Ditto with operations, ditto with everything.

~~~
seiji
I think it's the difference between a user and an engineer. As a user, you...
use. You don't care how it works. You drive a car; you don't design a car.

Things get fuzzier in software development. Programmers are supposed to be
users of everything they aren't programming. Things are supposed to be black
boxes with fully isolated layers of abstraction. Ha. Doesn't work. To be
competent, you must be aware of, at a minimum, what pieces are under you; at
an intermediate level, how they work; and at a maximum, how to create them
yourself.

I'm only talking about people who want to be great technical workers though.
If you just want to make a MVP to get funding then hire dweebs, who wasted
their life learning computers, to make you rich, feel free to do that instead.

~~~
Retric
Have you taken an electron microscope to trace each line for every CPU you use
in production to _know_ how it actually works?

No? Well perhaps, not needing to dig into how things work is a useful feature
of a well designed system.

~~~
gummadi
How many software systems are even 1% as battle-tested as a CPU? Even the
most well designed software system doesn't have the kind of test coverage
that hardware does.

~~~
Retric
CPUs are complex and _far_ from perfect, often shipping with hundreds of
'errata' at the design level, plus a wide assortment of random manufacturing
errors in each chip. So, there is software that has fewer bugs and better
coverage than a modern CPU, e.g. grep.

For example, pages 19-73, with 2 to 4 errata per page, for a single popular CPU design:
[http://download.intel.com/design/processor/specupdt/320836.p...](http://download.intel.com/design/processor/specupdt/320836.pdf)
And that's just limited to stuff Intel knows about.

------
hafiend
" _Instagram did everything right with load balancing, horizontal scaling and
lots of monitoring but they still went down._ "

"Right" means something different to you than to me. I build real-time
systems and quasi-real-time systems across RTOSes and *nixes.

They relied on a single cloud provider. If you want seamless resiliency you
_need_ redundant pairs at the very least across _all_ failure scenarios.

Can you go overboard on this? Absolutely.

But did they do _everything_ to mitigate this ahead of time? Absolutely not.

Did they do this "right" according to what their business requirements (/risk
profile) were? Who knows.

------
swiil
This article reads to me like someone who lives in an old business model
watching the internet come to town and screaming that it will never catch on!
Will there be a place for legacy architecture like this for many years to
come? Yes. But each time these cloud architectures fail, we are learning how
to deal with that failure and improving our collective ability to remain up
throughout disasters.

------
mfenniak
My thinking is that I want experts working on the low-level configurations
(power, racks, UPSes, etc). I don't have that expertise myself. Now, I could
try to hire some experts to set up highly available hardware for me... or...
I could rent some that has already been set up. The basic assumption you have
to buy into is that your cloud provider knows how to do this better than you
do.

I think if we look at AWS reliability, the interesting and sorta-
unanswerable question is: would these applications have been more, or less,
reliable, at the same cost, if they were running the infrastructure
themselves? My guess: less. It's so easy to screw this stuff up.

------
gouranga
For me, I'd rather go dedicated or colo at least, simply because _I know_ what
the process is when TSHTF and can communicate it reliably to my clients.

With black-box cloud services such as AWS, Azure, you just don't know what is
happening.

Peace of mind is also important!

~~~
ceejayoz
On the other hand, though, if half the internet is offline with an AWS
problem, chances are pretty decent a lot of folks will just go "oh, the
Internet is having troubles again".

~~~
omh
The point about high-availability platforms is that your customer _won't_ go
"Oh, the Internet is having troubles again". They'll probably call you, or
perhaps start losing money at an eye-watering rate.

------
joeblau
I like this article. I used to work on HA systems and I saw systems that were
really HA all the way down to the electrons. Right now, I would rather have
someone like you who enjoys building HA architecture and I would rather be
writing software.

The problem is that you are good (expensive), and most people can't afford to
build out and maintain a true HA system, so they go for the cheap alternative
and, in that case, you get what you pay for.

~~~
famfamfam
Not to mention that, in addition to being good (and therefore expensive), to
have an HA system you would need at least two of you.

------
imperialWicket
At some level, I can agree with this sentiment. You don't know where the
servers are, you can't see the cables, and a lot of things are just plainly
out of your control.

TL;DR: AWS can realistically provide high-enough availability for most; but I
think the article makes good points for those looking to get deep into the
nines.

However, I think the real issue is that while you can make fairly HA systems
on AWS, it quickly gets expensive. A huge pull for AWS (IMO) is how cheap it
is for an org to spin up four or five servers of various specs, and keep them
running or shut them down. In order to achieve some aspect of HA through AWS,
you need to mirror your entire setup in three places (not to mention some
archives in S3, and ideally some archives on non-AWS). The cost savings start
to quickly degrade as you configure your AWS-hosted system for real HA (or
real close to HA).
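The "mirror your setup in three places" point maps onto simple availability arithmetic: independent replicas multiply failure probabilities, but each replica also multiplies the bill. A sketch, where the 99.5% figure is an assumed number for illustration, not an actual AWS SLA:

```python
# Combined availability of n independent replicas, each available with
# probability a: the system is down only if all n are down at once.
# This assumes failures are truly independent -- which, as the AZ
# outages in this thread show, is exactly the assumption that can fail.

def combined_availability(a, n):
    return 1 - (1 - a) ** n

a = 0.995  # assumed per-replica availability; not a real SLA figure
for n in (1, 2, 3):
    print(n, f"{combined_availability(a, n):.8f}")
```

Going from one copy to three buys extra nines on paper, at roughly three times the hosting cost, which is why the savings degrade as you push toward real HA.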

IMO, many of the organizations that use AWS are in a position where they
probably wouldn't have true HA configured if they were using VPS or physical
hardware, so it's not so much of a concern. For the larger organizations who
have more funds and more dedicated Ops employees, we've only seen a couple of
issues that should have affected availability, and usually availability drops
because they cut corners. Granted - at least two of the recent outages showed
errors on the AWS-side, and decisions that Ops made for AWS regions/AZs
_should_ have been fine, but weren't. With that in mind, any HA sysadmins out
there would have shuddered to think of relying on two different mechanisms
that were potentially in the same data center (that is, backup servers or
load balanced servers in the same region).

------
njharman
Ya sure. I'd rather have some downtime than have to "run cables". It's not
worth the effort for 95% of systems to be _that_ resilient.

~~~
Draiken
As stated in the article, it's not wrong to do what you're saying, but the
article is talking about people that can do that and should do that.

Quoting the article: "When it gets to the point where you start measuring
downtime in dollars rather than 'time I would've been doing something else'
it is time to move your critical infrastructure to something you know."

:)

