
Heroku: a follow up on last week's outage - twampss
https://status.heroku.com/incidents/372
======
cagenut
Hey guys I got this, I speak cloudonaut. Here I'll translate it to sysadmin:

An admin was doing a rolling restart that triggered a bug in the loadbalancer
software. The auto restart script turned out to just make things worse by
restarting it over and over (they _always_ do), so we thought we'd just quick
throw spare capacity at it, but turns out that never works in a panicked rush
either. Also, our system designed to handle outage notifications wasn't
capacity planned, like, at all.

~~~
druiid
I know this is a joke, but from the sounds of the errors it isn't far from the
truth. This is also basically what happened with the big two-day outage at
Amazon a while back. It's always the automated processes that come back to
bite you, it seems.

I know I have had my share of server issues, but it seems to me that many
'cloud' services out there are simply adding too many layers of abstraction,
which tends to make things very, very touchy whenever any small issue occurs.
Because of this I try to keep my server stacks/frameworks as basic as possible
while still implementing performance-oriented services like NoSQL, caching,
etc.

~~~
jsprinkles
Although I have my fair share of hesitance about worshiping cloud services,
the fact that a service is "cloud" has nothing to do with the quality of its
architecture. You can make a crucial architecture mistake designing a fleet of
dedicated PHP servers talking to a MySQL cluster just as easily as you can
building atop some cloud service.

------
nemesisj
I love Heroku, but am I the only one that thinks their choice of words
describing their architecture is a bit pretentious?

"...streaming data API which connects the dyno manifold to the routing mesh."

Give me a break!

~~~
erikpukinskis
OK, I'll bite.

Instead of "dyno", they could possibly use a word like "VM". Except that
they're not really virtual machines, nor are they EC2 instances. They're more
like read-only chroot jails plus a precompiled application, libraries, and
environment (ROCJPALEs?). They also have a pretty complex set of support
structures that
provide connectivity to databases and other resources. Perhaps someone can
suggest an existing name for that, but I know of none.

Instead of "manifold", perhaps they could use the word "cluster". Except it's
not really a cluster, it's a set of distributed clusters. And nodes in a
cluster are typically machines. The nodes in the dyno manifold aren't machines
or virtual machines; they're ROCJPALEs. You could use the word "array",
but again, it's not really an array. It's a multi-layered, geographically
distributed structure of co-hosted application jails. "Manifold" seems as good
a term as any.

"Streaming" seems like a good word. It's specifically relevant to this
incident... they describe how the API is not atomic; that each message is
built on top of the previous entries, and the data structures are implicit in
the stream. That sounds like the definition of "streaming" to me.
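
A minimal sketch of what that kind of non-atomic stream might look like (the
record shapes and names here are hypothetical, not Heroku's actual API): each
record is a delta, the consumer's state exists only as a fold over everything
seen so far, and one unparseable record poisons the whole replay.

```python
# Hypothetical delta-stream consumer: state is implicit in the stream,
# rebuilt by replaying every record seen so far.

def apply_record(state, record):
    """Apply one delta record to the accumulated routing state."""
    op, dyno, addr = record
    if op == "up":
        state[dyno] = addr          # dyno came online at addr
    elif op == "down":
        state.pop(dyno, None)       # dyno went away
    else:
        # a single unusual record breaks every replay after this point
        raise ValueError(f"unparseable record: {record!r}")
    return state

def replay(records):
    """Rebuild routing state by folding over the whole stream."""
    state = {}
    for r in records:
        state = apply_record(state, r)
    return state

stream = [("up", "web.1", "10.0.0.1"),
          ("up", "web.2", "10.0.0.2"),
          ("down", "web.1", None)]
print(replay(stream))   # {'web.2': '10.0.0.2'}
```

Note there's no snapshot to fall back on: if any record in the middle is
garbage, everything downstream of it is unrecoverable without intervention.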

"API" seems like a widely accepted term. They could've described it as a
"protocol", perhaps. But neither seems more jargony than the other.

"Data"... well I suppose "streaming API" without the data would work. But it
serves to differentiate it from a streaming video protocol.

"Mesh" has a very specific meaning. It means that you have a set of nodes that
are connected peer-to-peer and that messages travel through the network by
hopping from node to node. I'm assuming that their routing layer is organized
in this way.

"Routing" is also pretty well defined. Requests come in and need to be sent to
a machine that can serve a response to them. What would you call that instead
of routing?

I feel like people who object to this kind of language are the same folks who
object to the word "cloud". People don't take the time to understand different
strategies to provisioning and application hosting APIs, and then think these
words don't mean anything. Yeah, salespeople use the word to hustle the Same
Old Shit, but it also actually means something to those of us who are building
stuff.

~~~
moe
Man, that's a long and contrived justification for what amounts to a pile of
bullshit.

We have seen very elaborate post-mortems from Google, Facebook, Twitter, and
not least from Amazon itself (you know, the playground that Heroku builds its
sandcastles in).

The aforementioned companies had no problem explaining their respective issues
in plain language that every engineer could understand.

Heroku doesn't even try to explain themselves. They just throw around fantasy
words without real explanations, seemingly overwhelmed by their own
awesomeness (in a failure report, no less).

As an engineer I feel insulted by this pamphlet. All I can gather from it is
that they screwed up, apparently somewhere in their request-routing layer.
Thanks, we knew as much _before_ reading that text.

I still have no idea what _actually_ went wrong and how they intend to prevent
it in the future. But I'll certainly advise people to avoid a company that
babbles about "control rods" when their software screws up.

~~~
erikpukinskis
Are you a Heroku customer? I am, and I understand everything they said, and I
appreciate that they went into detail about what happened.

------
Negitivefrags
"The first root cause is related to the streaming data API which connects the
dyno manifold to the routing mesh. On the dyno management side, an engineer
was performing a manual garbage collection process which created an unusual
record in the data stream. On the routing side, a bug in the subprocess of the
router which processes the incoming stream saw the record as garbage."

This is techno-babble on a scale the world has never seen!

~~~
enneff
The only unusual terms in that paragraph are "dyno manifold" and "routing
mesh", both of which are Heroku-specific technologies that Heroku users should
know of. The rest is just normal systems stuff.

~~~
tedunangst
I can understand the words individually, but I've never before seen or heard
of a manual garbage collection process creating an unusual record in a data
stream. It's like a tech jargon ad lib.

------
jyap
The problem with Heroku is that you need a certain level of tech savvy to make
use of their services.

We're expecting them to be the A-Grade tech wizards who can give us 0 down
time. They are, after all, expecting thousands of people to trust their
services and to outsource server hosting and administration duties to them.

So they tread the fine line between convenience (and related "cloud" benefits)
and "I can do this myself".

If they can't give us the assurances that they can do it better, cheaper and
more reliably than we can do it ourselves then what good are they?

If they can't capacity plan a simple System Status page (running on Rackspace)
and keep that up and running then what good are they?

And since their service appeals to a certain level of geek competence, they
also can't get away with techno-babble bullshit responses to outages.

~~~
larrys
"We're expecting them to be the A-Grade tech wizards who can give us 0 down
time."

Exactly. But something like this which they said makes them seem so ordinary:

"The improved status site allows users to subscribe to notifications when an
incident is opened. As a result, our status site experienced unprecedented
spikes of load during this incident. This high load crushed the site,"

Basically saying that whatever they set up for a status site choked on sending
out emails or SMS, as if they were hosted on a shared server and got mentioned
simultaneously on a few major sites.

------
kennystone
Quite a few Erlang gotchas in those notes. Fault tolerant systems are really
hard to design even when you know what you're doing and are using the best
language for it (Erlang). Erlang aside, it seems the higher level architecture
may need a rethink if one bad record can bring down the whole thing.

~~~
pja
It looks like the error recovery code wasn't well tested. Error recovery code
in distributed systems is some of the hardest code to test effectively, mind
you.

The thundering herd of recovery is especially difficult to cope with: your
error recovery code can work just fine for normal outages but then fail
completely when faced with just a few more components going dark.
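
The classic mitigation for that herd is to cap and jitter reconnection
attempts, so recovering components don't all hammer the system in lockstep. A
minimal sketch, assuming full-jitter exponential backoff (the delays are only
computed here, not slept, for illustration):

```python
import random

def backoff_delays(attempts, base=0.5, cap=30.0):
    """Full-jitter exponential backoff: each retry waits a random
    amount between 0 and min(cap, base * 2**attempt), so a crowd of
    recovering nodes spreads its retries out instead of stampeding."""
    return [random.uniform(0, min(cap, base * 2 ** a))
            for a in range(attempts)]

delays = backoff_delays(8)
assert all(0 <= d <= 30.0 for d in delays)
```

The jitter is the important part: plain exponential backoff keeps the herd
synchronized, just at longer intervals.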

------
andrewcooke
weird. they had a problem caused by a series of bugs, yet the word "test"
doesn't appear anywhere in that page.

~~~
pja
Testing distributed systems is much, much harder than doing so on a monolithic
codebase. The number of failure modes goes up very rapidly with the number of
nodes in the system & your code has to (in principle) cope with every possible
one.
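
To put a rough number on "very rapidly": even in the crudest model, where each
node is independently either up or down and nothing else can go wrong, the
state space is 2^n.

```python
def updown_states(n):
    """Configurations of n independent nodes, each simply up or down.
    Real failure modes (partitions, slow nodes) only make this worse."""
    return 2 ** n

for n in (3, 10, 50):
    print(n, "nodes ->", updown_states(n), "up/down combinations")
```

And that's before considering partial failures, network partitions, or the
ordering of events, none of which this toy count includes.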

~~~
andrewcooke
true, but it sounds like they (and perhaps you) have never even heard of the
chaos monkey.

~~~
pja
Randomly killing instances wouldn't have detected this particular failure mode
as far as I can see, since the error lay in the inability to resurrect a
failed process under certain circumstances.
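
Right, the missing test here is less "kill random instances" and more "kill a
process and assert the supervisor actually brings it back." A toy sketch of
that check, with an entirely hypothetical in-memory supervisor standing in for
whatever Heroku runs:

```python
class ToySupervisor:
    """Hypothetical stand-in for a process supervisor: on each
    health-check tick, it restarts its worker if the worker has died."""
    def __init__(self):
        self.worker_alive = True

    def kill_worker(self):
        self.worker_alive = False

    def tick(self):
        if not self.worker_alive:
            self.worker_alive = True   # resurrect the worker

def assert_resurrects(sup, ticks=5):
    """Fault-injection check: kill the worker, run some health-check
    ticks, and verify the supervisor actually resurrects it."""
    sup.kill_worker()
    for _ in range(ticks):
        sup.tick()
        if sup.worker_alive:
            return True
    raise AssertionError("worker was not resurrected")

assert_resurrects(ToySupervisor())
```

The interesting failures are the "certain circumstances": a real version of
this check would have to inject the surrounding conditions (load, bad stream
records) under which resurrection failed, which is exactly the hard part.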

------
ivix
Surely I'm not the only one infuriated by their choice of blue text on a blue
background?

~~~
Tyr42
I was all prepared to be angry, but it's actually ok, to be honest. I normally
dislike dark themes.

------
Trufa
I tend to feel much less angry about outages and such when businesses take the
time to explain what happened. Of course this doesn't justify the lack of
uptime, but I like the gesture!

------
wglb
Sounds about one inch away from an AFJ.

