
Lessons from 10 Years of Amazon Web Services - yarapavan
http://www.allthingsdistributed.com/2016/03/10-lessons-from-10-years-of-aws.html
======
yarapavan
James Hamilton listed some of the highlights over the last decade at his blog
- [http://perspectives.mvdirona.com/2016/03/a-decade-of-innovat...](http://perspectives.mvdirona.com/2016/03/a-decade-of-innovation/)

~~~
jcrites
James has put out a lot of great material. He's given a talk series and
written a conference paper called "On Designing and Deploying Internet-Scale
Services", which is a trove of valuable design insights:

[https://www.usenix.org/legacy/event/lisa07/tech/full_papers/...](https://www.usenix.org/legacy/event/lisa07/tech/full_papers/hamilton/hamilton_html/)
(somewhat condensed)

Slides from a talk:
[http://mvdirona.com/jrh/talksAndPapers/JamesRH_AmazonDev.pdf](http://mvdirona.com/jrh/talksAndPapers/JamesRH_AmazonDev.pdf)

He writes some of the more practical and useful advice I've seen about how to
run a successful business based on high-scale distributed systems.

One of his mnemonics that's stuck with me is the "Four Rs" of recovery-
oriented computing: Restart, Reboot, Reimage, Replace (slide 6 in the PDF). A
human shouldn't be engaged to troubleshoot a problem until the platform has
first tried, in succession, to restart the software, reboot the OS, re-image
the machine, and finally replace the hardware entirely. Only after these
auto-recovery steps have failed is it time to engage a human. He describes the
connection between these techniques and significantly lower operations costs.
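The escalation ladder he describes can be sketched roughly like this (the
`Host` class and action names are hypothetical placeholders for illustration,
not a real platform API):

```python
class Host:
    """Toy stand-in for a host-management API (hypothetical)."""

    def __init__(self, first_working_action=None):
        self.first_working_action = first_working_action
        self.tried = []

    def attempt(self, action):
        """Record the attempt; succeed only at the configured action."""
        self.tried.append(action)
        return action == self.first_working_action


# The Four Rs, in escalation order: Restart, Reboot, Reimage, Replace.
FOUR_RS = ("restart_service", "reboot_os", "reimage_machine", "replace_hardware")


def recover(host):
    """Try each recovery action in succession; return the one that worked,
    or None if all failed -- only then is it time to page a human."""
    for action in FOUR_RS:
        if host.attempt(action):
            return action
    return None  # automated recovery exhausted: engage an operator
```

The point of the ordering is cost: each step is more expensive than the last,
so the platform exhausts the cheap options before a person is ever involved.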

------
jimbokun
"Primitives not frameworks"

This is why Amazon beat Google in the "as a Service" space. Google provided a
framework with AppEngine, but customers wanted the flexibility of the
primitives offered by AWS.

~~~
nopzor
I think that Amazon being first in 2005/2006 helped a lot, too. Google now
provides "primitives" through GCE. I think Google realizes that the future is
GCE not GAE.

~~~
fensterblick
But Amazon wasn't the first. The Sun Grid was first (launched in March 2006).

~~~
nopzor
Sun Grid was more about batch processing across many machines, rather than
individual VMs, wasn't it?

Plenty of people in the web hosting industry offered competitive x86 linux
instances or dedicated servers, way prior to 2006.

I think AMZN deserves credit for delivering a compellingly priced option that
offered (1) elasticity (per-hour billing) and (2) automation (API-driven,
available in minutes), among other things.

------
rodionos
You have to give full credit to AWS for inventing previously unknown pricing
models. In some cases, it was actually the pricing model that created new use
cases. Glacier and S3 come to mind.

~~~
jeffbarr
That's a very important observation. When I describe AWS to audiences I always
make clear that it is a combination of a technology and a business model.

------
axelfontaine
"A good litmus test has been that if you need to SSH into a server or an
instance, you still have more to automate."

Amen. We couldn't agree more: [https://boxfuse.com/blog/no-ssh](https://boxfuse.com/blog/no-ssh)

~~~
bluecmd
So, with No SSH - how do you debug that one-off problem that is only on
machine abc12? I'm not talking about mutating, I'm talking about attaching gdb
to the process while it's handling requests. I'm talking about collecting CPU
profiling information from production. Stuff like that.

~~~
autotune
You could probably have a central logging service like graylog or logstash
that automatically collects that information and just sort through the logs
that way.
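A minimal sketch of that idea with the Python standard library, assuming a
syslog input on your Graylog/Logstash collector ("localhost" here is a
placeholder for the collector's address):

```python
import logging
import logging.handlers

# Ship application logs off the box to a central collector (e.g. a syslog
# input on Graylog or Logstash) so nobody needs to SSH in to read them.
# "localhost" stands in for your collector's hostname.
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)

# Structured-ish log line that can be searched centrally by host.
logger.info("request handled host=%s latency_ms=%d", "abc12", 37)
```

In practice most shops use a shipper agent (Filebeat, rsyslog, etc.) rather
than in-process handlers, but the principle is the same: logs leave the
machine as they are written.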

~~~
AznHisoka
What if that 1 server ran out of file handles and can no longer log to
anywhere?

~~~
autotune
Launch a copy of that server from a fresh image (which I would hope is behind
a load balancer), update the sysctl.conf template with a larger fs.file-max
limit in whatever config management system you use for that instance, and
resume logging on the freshly cloned instance.
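For reference, the template change described above might look like this in
the image's sysctl.conf (the value is illustrative, not a recommendation;
size it from observed descriptor usage):

```
# /etc/sysctl.conf fragment shipped via config management
# Raise the system-wide file-handle ceiling (illustrative value)
fs.file-max = 2097152
```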

~~~
dsp1234
How would you know to do that at all? The proposed issue is that the file
handles are exhausted, so logging doesn't actually work, and you know nothing
about the problem except "it's not working". Your solution presupposes that
you have already figured out the issue, but the question is how you find out
about the problem in the first place. Without SSH/debugger/whatever, what is
the process to inspect the system and find the problem?

~~~
devonkim
After dealing with different enterprise systems in AWS for a few years now,
I'd say this is exactly what the AWS ELB/ASG lifecycle feature for removing an
instance from an Auto Scaling group exists for. If you aren't using horizontal
scaling strategies like that, you have a disaster waiting to happen and
basically deserve an outage you can't diagnose or RCA well. If you can't SSH
in afterwards, that points to a gap in your provisioning for automation /
instrumentation, and you should still be able to isolate your attention to
that part of the instance's lifecycle.

