

Postmortem of Heroku's June 23 Downtime - bensedat
https://status.heroku.com/incidents/642?postmortem

======
agwa
This incident makes me think that services like Redis should support running
with two sets of credentials at once in order to facilitate credential
rolling. As it currently stands, rolling credentials is a rather big deal with
a chance of things going wrong in the process.

Aside: the text on that page is extremely difficult to read because of poor
contrast (#8584B2 on #282936) and might be impossible for people with vision
impairment. If anyone from Heroku is reading, you should change the color
scheme to be compliant with the W3C Web Content Accessibility Guidelines. See:
http://www.snook.ca/technical/colour_contrast/colour.html
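
For anyone who wants to check the two colours quoted above without a web tool, here is a minimal sketch of the WCAG 2.0 contrast-ratio calculation (relative luminance of each colour, then `(lighter + 0.05) / (darker + 0.05)`). The function names are my own and purely illustrative.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value (WCAG 2.0)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a colour given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio between two colours, from 1.0 up to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    ratio = contrast_ratio("#8584B2", "#282936")
    # WCAG level AA asks for at least 4.5:1 for normal-size body text.
    print(f"contrast ratio: {ratio:.2f}, passes AA body text: {ratio >= 4.5}")
```

WCAG level AA asks for at least 4.5:1 for normal-size text, and the pair above comes out under that threshold.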

~~~
jzwinck
I thought the exact same thing, but then I realized that we don’t need Redis
to actually support a choice of two passwords for a single account. Rather,
clients can be configured with a list of credentials to try. When rolling
credentials over, simply add the new ones to the clients’ lists, update the
service, then remove the old ones from the clients. Then you can wait hours or
days between steps for safety, and there is no time when system reliability is
degraded by a service instance being inaccessible.
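
A rough sketch of the client-side idea, assuming the client library exposes some `connect(password)` call that raises on a rejected password. Everything here (`connect_with_fallback`, `AuthenticationFailed`, `make_fake_server`) is a hypothetical stand-in and not the API of any real Redis client.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AuthenticationFailed(Exception):
    """Hypothetical error raised by `connect` when a password is rejected."""


def connect_with_fallback(
    passwords: Sequence[str],
    connect: Callable[[str], T],
) -> T:
    """Try each password in order and return the first successful connection."""
    if not passwords:
        raise ValueError("at least one password is required")
    last_error = None
    for password in passwords:
        try:
            return connect(password)
        except AuthenticationFailed as exc:
            last_error = exc  # remember the failure and try the next password
    # Every candidate was rejected: surface the most recent failure.
    raise last_error


def make_fake_server(accepted: str) -> Callable[[str], str]:
    """Stand-in for a real service that accepts exactly one password."""
    def connect(password: str) -> str:
        if password != accepted:
            raise AuthenticationFailed("wrong password")
        return f"connected with {password}"
    return connect


if __name__ == "__main__":
    # Step 1: clients list both passwords while the server still uses the old one.
    print(connect_with_fallback(["new-secret", "old-secret"], make_fake_server("old-secret")))
    # Step 2: the server is switched over; the same client config keeps working.
    print(connect_with_fallback(["new-secret", "old-secret"], make_fake_server("new-secret")))
```

The same client configuration works both before and after the server switches passwords, which is what lets the steps be spread out over hours or days.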

~~~
andrewvc
That's pushing complexity back onto app devs. Functionality like that is best
solved once in the database for everyone.

------
thaumaturgy
Because I so rarely feel compelled to say this: this is a really great post-
mortem. It's technical, it's not loaded down with sales-speak, and it's
straightforward. I really hope post-mortems like this become more of a trend.

------
theGimp
This paragraph reads like a response to the criticism they received a few days
ago for scheduling maintenance at 2pm PST:

 _On June 23rd we performed a credential roll on these Redis servers in our US
cloud during a two hour scheduled maintenance window. Because we operate a
service used globally, there is a less-than 10% difference in usage between
so-called "peak hours" and “non-peak” hours. We scheduled maintenance for this
time because it was not a peak time, but moreso because this period has high
coverage from relevant engineering teams, should issues arise. By performing
maintenance during this period, we were able to react more quickly and muster
those teams within seconds._

~~~
brianmcdonough
What's the best way to muster an engineering team?

~~~
ForHackernews
Bugle reveille.

~~~
toomuchtodo
Which is the sound of everyone's SMS alert going off within seconds of each
other.

------
gdeglin
Seems like first trying this maintenance procedure in a staging environment
would have caught the problem.

------
hunvreus
> We are reviewing our internal processes to ensure that communication between
> groups is more effective, so that we can better inform our customers when
> situations occur.

I see this as the only contentious point raised by some of their users. They
are doing an outstanding job already at dealing with a large infrastructure
running a wide range of heterogeneous applications. They likely run updates on
their infrastructure on a regular basis, without anybody noticing.

However, if you're selling me on the promise of taking care of infrastructure
for me, you can't under-deliver on communicating as soon as s**t hits the
fan.

------
ironlady
I've been developing a Node site that is currently running on Heroku. This
happened the first day after launch, and to say the least my blood pressure
was through the roof all day. I was terrified that if something went wrong, we
would be dead in the water. I don't think I would deal with them again (if I
had the chance).

------
bithive123
I am curious as to why they were relying on rolling Redis credentials at all
since they would have needed to pre-arrange a secure channel for Redis traffic
anyway.

------
saasdude
Who in the hell is stupid enough to use Heroku?

------
sneak
Ugh, the verb form of "impact" is so gross.

~~~
neurobro
So is the late 20th-century slang usage of "gross."

