
Amateur hour at AWS - roncohen
http://blog.opbeat.com/posts/amateur-hour-at-aws/
======
davidu
I understand a certain level of displeasure at their lack of specificity while
they mitigated the issue. But... in this case, the time to remediate doesn't
really change your response to the threat. No matter what, you need to change
all your keys, generate new private keys, etc.

They got it fixed within 48 hours, globally, which, if you ask me, is
incredible at their scale.

I would hardly describe anything AWS does as amateur. But maybe that's just
me.

~~~
sebgeelen
The thing AWS does which can be described as amateur, was the communication.
AWS has always been really quiet when there's an issue on their services.

It's a nightmare to know whether your problem is due to your own infrastructure
or whether there's a larger-scale issue at AWS, because they never talk about it...

~~~
mkr-hn
Amazon has a status page for all AWS services and provides RSS feeds for each:
[http://status.aws.amazon.com/](http://status.aws.amazon.com/)
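
For anyone who wants to watch such a feed from a script, here is a minimal sketch of pulling entries out of an RSS 2.0 document with Python's standard library. The sample feed below is made up for illustration (titles and dates are hypothetical), and the real feeds may be laid out differently; fetching the feed over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; the real status feeds may differ in layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example service status</title>
    <item>
      <title>Service is operating normally</title>
      <pubDate>Wed, 09 Apr 2014 10:00:00 PDT</pubDate>
      <description>The issue has been resolved.</description>
    </item>
    <item>
      <title>Informational message: investigating</title>
      <pubDate>Tue, 08 Apr 2014 18:30:00 PDT</pubDate>
      <description>We are investigating.</description>
    </item>
  </channel>
</rss>
"""


def parse_status_items(feed_xml):
    """Return (pubDate, title) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        published = item.findtext("pubDate", default="").strip()
        items.append((published, title))
    return items


if __name__ == "__main__":
    for published, title in parse_status_items(SAMPLE_FEED):
        print(published, "|", title)
```

Note that this only surfaces what the provider chooses to publish, so it does not help with issues that never make it onto the status page.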

~~~
madaxe_again
Trust me, this doesn't cover all their issues. Don't get me wrong, we love our
AWS stack, but we've had all sorts of weird stuff happen over the years, from
zombie ELBs hitting the wrong hosts to invisible SGs... sometimes we see from
twitter that others are seeing the same things, but AWS don't ever confirm or
deny anything, they just stall then tell you when it's fixed. I guess it's
support in terms of something to give to your PHB, but when something goes
wrong within AWS, it's a very black box - which isn't surprising.

------
jasonkester
I read the words in your blog post and came to a completely different
conclusion.

Despite not putting time stamps on their communications (which you seem really
really upset about), they fixed everything for everybody in like a single day.
You, their customer (and better still, _me_, their customer) didn't have to
lift a finger.

This is exactly why we farm out infrastructure to companies like Amazon. They
have a whole squad of smart people standing around leaning against a post all
day every day waiting for something like this to happen so they can jump in
and fix everything for us.

My little one-player business has no such team o' dudes on standby. If it
weren't for the fact that Amazon is cleaning up for me, I'd be two days into
having a _really bad time_ and getting no productive work done.

~~~
chris_wot
Actually, they had to re-key their certificates. They did have work to do, and
they couldn't do it till the upgrades at AWS were done.

So yeah, you'd still be having a bad day. Because you have no team o' dudes on
standby, your already busy staff are chafing at the bit to get the
certificates sorted out, potentially unable to restore service, and
constantly checking to see whether Amazon has completed their patching, i.e.
little to no productive work is being done.

------
bashcoder
For future reference, AWS security bulletins can be found at:

[https://aws.amazon.com/security/security-bulletins/](https://aws.amazon.com/security/security-bulletins/)

The author suggests that Amazon only made one post, updating it throughout the
process. But at the link above you will see four posts regarding this issue,
with the first one having been updated once to add information. No, they are
not timestamped, but they are dated.

While it may be fair to criticize AWS for its customer communications during
this process, I'm fairly certain that if he had a backstage pass to such a
comprehensive process of remediating thousands of production systems with zero
downtime, he would perhaps find the team somewhat less... amateurish.

------
dm2
Obviously they were working as fast as they possibly could without risking
major outages. They probably had millions of servers to update.

I'd even argue that it's not a good idea to advertise, "these servers are
vulnerable to this attack".

AWS is massive and organizing that kind of update by an army of engineers
isn't easy.

You received a non-personalized message because AWS support probably received
tens of thousands of irate customers demanding that their systems be patched
immediately. For some reason they weren't equipped to handle that kind of
update but I'm sure they will learn from this and hopefully next time the
response will be faster, if possible.

~~~
toong
TFA gives you a clue why you need to know when all ELBs are updated:

* "However, due to nobody outside of AWS knowing exactly how ELB works, this could just mean that the machines currently responding to the requests are patched, but in the next request, it could hit an unpatched ELB machine."

* "We then wrote to support to hear if it was now safe to re-key the certificates, but did not hear from them for hours."

Summary: You re-key your certificates, thinking you are all good. Now an
attacker hits a non-patched ELB, exploits the issue and gets your new keys.
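
To make the point concrete, here is a minimal sketch of what a customer could do from the outside: probe repeatedly and refuse to re-key as soon as any answering machine looks unpatched. The `probe` callable, the node identifiers, and `make_fake_probe` are all hypothetical stand-ins; nothing here reflects how ELB actually routes requests.

```python
import itertools


def safe_to_rekey(probe, attempts=50):
    """Sample the load balancer `attempts` times.

    `probe` is a hypothetical callable returning (node_id, is_patched) for
    whichever machine happened to answer one request.

    Returns (ok, seen): ok is False as soon as one unpatched node answers;
    seen maps each observed node_id to its patched status.
    """
    seen = {}
    for _ in range(attempts):
        node_id, is_patched = probe()
        seen[node_id] = is_patched
        if not is_patched:
            return False, seen
    return True, seen


def make_fake_probe(nodes):
    """Simulate a balancer that cycles through fixed (node_id, is_patched) pairs."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


if __name__ == "__main__":
    fleet = [("a", True), ("b", True), ("c", False)]
    print(safe_to_rekey(make_fake_probe(fleet), attempts=10))
```

Even a clean run is only evidence, not proof: a machine that never happened to answer during sampling could still be unpatched, which is why an explicit "all done" from the provider matters.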

~~~
dm2
So wait until you see this message:
[https://aws.amazon.com/security/security-bulletins/heartblee...](https://aws.amazon.com/security/security-bulletins/heartbleed-bug-update/)

and then update your certs again just to be safe.

~~~
chris_wot
So you pay for revocation of certificates twice. The first time you revoked
the certificates, someone could have compromised them immediately afterwards.
How is that a good solution?

~~~
dm2
There's not always a perfect solution, gotta work with what you're given.

If you absolutely can't wait for the updated status message then the solution
I gave was the only solution.

If the price of the revocation (which my issuer doesn't charge) is too high
for your business then your only option is to wait.

------
smoyer
So ... as a professional organization, why does Opbeat continue to contract
with such an amateur organization? You could scale up your own servers, round-
robin DNS (more for ELB), etc if you're not happy with their performance.

Instead, a group of dedicated professionals updated a world-wide
infrastructure in less than two days. If you were running your own systems
would you have managed that? Yes, you would have known exactly when you were
done but could you predict ahead of time when you'd be done?

So, as engineers we make trade-offs and AWS is a pretty clear winner when you
look at the TCO of having a scalable architecture. Once you've made that
trade-off, the down-side is that you don't have the ultimate flexibility
provided by a bare-metal host.

------
eknkc
So they are amateurs because:

- They managed to create and test a deployment procedure in a couple of hours.
- Deployed this update to thousands of machines spread across multiple continents in 48 hours.
- There were no downtimes. No action required from customers.
- Everything seems to be in order now.

Yeah.. I hope they die in a fire.

~~~
sendob
I don't think you understand the nature of the problem if you think no action
by customers was required.

I think the response from Amazon was fine, and they are clearly not amateurs
technically, but communication was very poor.

~~~
eknkc
They managed to do _everything they could_ without customer involvement. The
rest (retiring keys and stuff) can not be handled by them due to nature of the
problem.

Now that the technical issue has been resolved by other smart people, I'm
gonna replace our cert keys and be just fine.

~~~
copergi
>The rest (retiring keys and stuff) can not be handled by them due to nature
of the problem.

And they actively refused to let people know when they were safe to do that.
That is the problem.

~~~
jdmichal
I would think the post stating that "All ELBs are now patched" was that
message, no? Before that post, you were not safe. After that post, you were.

~~~
copergi
Yes, and "sit there refreshing a page over and over waiting to see if they
update it because they refuse to even give you a hint of when they expect to
be done" is not acceptable.

~~~
jdmichal
Oh, so you mean like the update before that read "expect them to be updated in
the next few hours"?

[http://blog.opbeat.com/posts/amateur-hour-at-aws/few-hours.p...](http://blog.opbeat.com/posts/amateur-hour-at-aws/few-hours.png)

I want to be clear here that there are plenty of arguments to be made here,
but you are not making them.

~~~
copergi
Yes, that is not good enough, and it was already over a day into their
complete lack of communication. Just because you don't agree with it, doesn't
mean it is not an argument. I would expect a company of Amazon's size to give
customers precise and useful info. Not a vague "should be done in a few hours"
after a day of silence.

------
peterwwillis
While the author is perhaps overly critical of the response of AWS to this
issue, he has a point. When your business is suffering from an issue related
to your provider and your customers are calling every 10 minutes to ask when
_you_ are going to fix it, you need your service provider to give you as much
timely information as possible so you can relay it to your customers. The best
thing they could have done would have been to provide gratuitous, empty
information, every half hour. Even just saying "we're still working on it
folks but no new information" will keep people calm and provide needed
feedback for concerned individuals.

------
Gigablah
I just wish people would stop using "amateur" as an insult.

~~~
forgottenpass
It is an appropriate insult when the party you're conducting business with is,
or presents itself as, professional.

It's about the expectations you convey to your customers. If you don't want
"amateur" to sting, don't pretend not to be one. Make sure your customers have
an accurate understanding of the services you provide and your ability to
provide them.

~~~
lazylizard
Pros -> people who do it for the money only. Amateurs -> people who do it
because they like it. Why/when has 'amateur' become an insult?

~~~
alttab
I do my job for the money and because I like it. What does that make me?

------
jewel
Hopefully this OpenSSL issue has shown large organizations the need to have a
way to quickly roll out security patches, ideally even before the vendor has
released an updated package.

Imagine tomorrow that someone finds a remotely exploitable kernel issue,
perhaps involving UDP packet handling. If you have the right infrastructure in
place, you should be able to drop the patch file in the right directory and
run a script that builds a new system package, runs some automated testing,
and then pushes that package out immediately using whatever rolling update
strategy is normally used, but at an accelerated pace.
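
As a rough illustration of the last step, here is a minimal sketch of a batched rolling update that halts at the first unhealthy host. The `apply_patch` and `is_healthy` callables and the host names are hypothetical placeholders for whatever deployment and monitoring tooling is actually in use; building and testing the package is assumed to have happened already.

```python
def rolling_update(hosts, apply_patch, is_healthy, batch_size=2):
    """Patch hosts in batches, stopping at the first unhealthy host.

    `apply_patch` and `is_healthy` are hypothetical callables supplied by
    the caller. Returns (patched_hosts, failed_host); failed_host is None
    when every host was patched and passed its health check.
    """
    patched = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        # Push the package to every host in the current batch.
        for host in batch:
            apply_patch(host)
        # Verify the batch before touching any further hosts.
        for host in batch:
            if not is_healthy(host):
                return patched, host
            patched.append(host)
    return patched, None


if __name__ == "__main__":
    touched = []
    result = rolling_update(
        ["web1", "web2", "web3", "web4", "web5"],
        apply_patch=touched.append,          # stand-in for a real deploy step
        is_healthy=lambda h: h != "web4",    # pretend web4 fails its check
    )
    print(result, touched)
```

"Accelerated pace" in this sketch would simply mean a larger `batch_size`, trading a bigger blast radius on failure for a shorter total rollout.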

I wish I had time to build something that makes patching system packages on
Debian systems simpler, making it trivial for businesses to "fork" the
distribution as necessary to work around issues (whether they be security
critical or not). I've written more thoughts on the matter on my blog:
[http://stevenjewel.com/2013/10/hacking-open-source/](http://stevenjewel.com/2013/10/hacking-open-source/)

If you're managing a smaller set of servers, I've been pretty happy with
apticron and nullmailer as a way to make sure security updates are applied
everywhere. It'd be nice if it could receive notification of security issues
faster, perhaps via some sort of push mechanism, but it at least gets things
taken care of within 24 hours.

------
m-app
BTW, I remember from the CloudFlare blog that they were notified in advance of
the bug and had already patched it. How come big names like AWS and Heroku did
not get this prior information? Who decides on which companies hear it before
the public does?

From the Cloudflare blog: "This bug fix is a successful example of what is
called responsible disclosure. Instead of disclosing the vulnerability to the
public right away, the people notified of the problem tracked down the
appropriate stakeholders and gave them a chance to fix the vulnerability
before it went public."

-- [http://blog.cloudflare.com/staying-ahead-of-openssl-vulnerab...](http://blog.cloudflare.com/staying-ahead-of-openssl-vulnerabilities)

~~~
kcbanner
I was wondering about this myself. They must have thought AWS was too big and
it would be leaked?

------
avenger123
Don't throw out the baby with the bathwater. I think most users would be
happier that the vulnerability was patched within 48 hours than upset about
the lack of proper communication.

In this case Amazon chose to focus on fixing the problem versus communicating
every detail. The potential consequence in dollars of not fixing the issue in
a timely manner is likely in triple-digit millions. I would imagine the cost
of having some small number of users complain about how they weren't up to
date with communications isn't worth them focusing on it.

Also, I would imagine that customers such as Netflix and other major clients
likely got more in-depth communication.

I'm sure it's not going to really affect Amazon's bottom line if opbeat
decides to move to a different provider. If you are a very small fish in a big
ocean, expect to be treated that way. It's sad to say that but that's the
reality.

Personally, I'll take them getting this fixed as fast as possible, versus
getting hourly updates telling me how they're still working on it. For example, I
just want to hear that they found the MH370 plane. I'm tired of reading news
stories with the same depressing message.

------
Aqueous
This seems highly pedantic and nitpicky. It seems like you're looking for
things to criticize.

Use of Heroku as an example of communications leadership is misplaced. I have
had support requests sit around unanswered for days in Heroku. Don't get me
wrong - I love Heroku. But you have to pay them an arm and a leg monthly in
order to get quick support response time. AWS, on the other hand, seems very
responsive to all requests.

------
neom
The fact that all of this was fixed so quickly given the size of AWS
infrastructure is in and of itself very impressive. Sometimes there isn't much
to say except "We're working on it". You can argue semantics, but half the
internet was in the same boat yesterday, and sure, maybe they communicated
poorly, but "Amateur hour" isn't fair imho.

------
xerophtye
I have experienced issues with several online services, as I expect everyone
here has as well. I am impressed with how Heroku handled it with mandatory
updates every 3 hours, and handing out clear instructions to their customers
and also apologizing when the procedures caused inconvenience. Not everyone
handles these things with such care.

------
robotpony
Amateur hour at upbeat.com ... OP clearly has limited experience with large
hosting vendors.

It's frustrating to wait for updated information, but AWS delivered reasonable
details as they were available. Could it improve? Probably. Does it rate worse
than other vendors? No, not at all. Consider recent Rackspace or Azure
outages: information follows hours later, sometimes days.

------
rpowers
This seems a bit harsh. While it would have been nice to have a bit more
transparency, calling them amateur is not realistic. It takes time to patch a
bunch (read: thousands) of servers, and a one-day turnaround is not all that bad.

------
brryant
If you think about the sheer number of logical load balancers they have to
update (tens of thousands?) and the time it took them to update all of them,
I'm incredibly impressed by their quick turnaround.

------
snorkel
I'd rather they just fix the problem than tell me detailed bedtime stories.

------
developer786
Totally off topic, but programmers, I REALLY need your help...

[https://news.ycombinator.com/item?id=7559067](https://news.ycombinator.com/item?id=7559067)

