
Details on today's Facebook outage - far33d
http://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919
======
davidu
This is known generally as the "Thundering Herd" problem:

The thundering herd problem occurs when a large number of processes waiting
for an event are awoken when that event occurs, but only one process is able
to proceed at a time. After the processes wake up, they all demand the
resource and a decision must be made as to which process can continue. After
the decision is made the remaining processes are put back to sleep, only to
wake up again to request access to the resource.

This occurs repeatedly, until there are no more processes to be woken up.
Because all the processes use system resources upon waking, it is more
efficient if only one process is woken up at a time.

This may render the computer unusable, but it can also be used as a technique
if there is no other way to decide which process should continue (for example
when programming with semaphores).

Though the phrase is mostly used in computer science, it could be an
abstraction of the observation seen when cattle are released from a shed or
when wildebeest are crossing the Mara River. In both instances, the movement
is suboptimal.

From: <http://en.wikipedia.org/wiki/Thundering_herd_problem>
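To make the cost concrete, here is a minimal Python sketch that counts wakeups when a resource is handed out one unit at a time to a group of sleeping threads. It is only an illustration of the definition above, not anything from the Facebook post; the function name `count_wakeups` and the bookkeeping fields are made up for this example.

```python
import threading
import time


def count_wakeups(num_workers: int, wake_all: bool) -> int:
    """Return how many times sleeping workers were woken before all were served."""
    cond = threading.Condition()
    # Shared state, only touched while holding `cond`.
    state = {"available": 0, "waiting": 0, "pending": 0, "wakeups": 0}

    def worker() -> None:
        with cond:
            while state["available"] == 0:
                state["waiting"] += 1
                cond.wait()                # sleep until notified
                state["waiting"] -= 1
                state["pending"] -= 1      # this notification has been handled
                state["wakeups"] += 1
            state["available"] -= 1        # exactly one woken worker wins

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    for served in range(num_workers):
        released = False
        while not released:
            with cond:
                # Release one unit only when every unserved worker is asleep
                # again, so the count does not depend on thread scheduling.
                if state["pending"] == 0 and state["waiting"] == num_workers - served:
                    state["available"] = 1
                    state["pending"] = state["waiting"] if wake_all else 1
                    if wake_all:
                        cond.notify_all()  # the herd: everyone wakes, one wins
                    else:
                        cond.notify()      # wake a single waiter
                    released = True
            if not released:
                time.sleep(0.001)

    for t in threads:
        t.join()
    return state["wakeups"]
```

With N waiters, the wake-all variant should record N + (N-1) + ... + 1 wakeups, while waking one waiter at a time records N, which is the "more efficient" case the quote describes.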

~~~
schrep
We actually encounter Thundering Herd problems on a very regular basis. The
Starbucks page has nearly 14M fans, and posts may get tens of thousands of
comments/likes with a high update rate. You have a lot of readers on a
frequently changing value, which means the cached copy is often stale and
requests can pile up on the database.

Since we encounter this on a regular basis we have built a few different
systems to gracefully handle them.

Unfortunately, the event today was not just a thundering herd because the
value never converged. All clients who fetched the value from a db thought it
was invalid and forced a re-fetch.
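For readers wondering what "gracefully handling" a cache pileup can look like: one common approach (an assumption here, the comment does not say which systems Facebook built) is to let only one reader per key go to the database while the rest wait for its result. A minimal sketch, with the class name `SingleFlightCache` and the `load` callback invented for illustration:

```python
import threading


class SingleFlightCache:
    """In-memory cache where concurrent misses on one key trigger one load."""

    def __init__(self, load):
        self._load = load                  # hypothetical "fetch from the db" callback
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()     # protects the two dicts above

    def invalidate(self, key):
        with self._guard:
            self._values.pop(key, None)

    def get(self, key):
        with self._guard:
            if key in self._values:
                return self._values[key]
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key gets past this point at a time.
        with key_lock:
            with self._guard:
                if key in self._values:    # someone else loaded it while we waited
                    return self._values[key]
            value = self._load(key)        # the single trip to the database
            with self._guard:
                self._values[key] = value
            return value
```

Note that this only helps when the loaded value is accepted as valid. If every reader decides the fresh value is invalid and invalidates it again, as described above, the loads never stop.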

~~~
datums
Would monitoring the rate of invalidation and triggering an event handler
help?
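A rough sketch of what that could look like, assuming a sliding time window and a callback; the class name, thresholds, and handler are all hypothetical and not something described in the Facebook post:

```python
import time
from collections import deque


class InvalidationRateMonitor:
    """Calls `on_trip` when more than `max_events` invalidations fall in the window."""

    def __init__(self, max_events, window_seconds, on_trip):
        self._max_events = max_events
        self._window = window_seconds
        self._on_trip = on_trip            # hypothetical event handler
        self._events = deque()             # timestamps of recent invalidations
        self._tripped = False

    def record(self, now=None):
        """Record one invalidation; `now` can be passed in for testing."""
        if now is None:
            now = time.monotonic()
        self._events.append(now)
        # Drop timestamps that have aged out of the window.
        while now - self._events[0] > self._window:
            self._events.popleft()
        over = len(self._events) > self._max_events
        if over and not self._tripped:
            self._on_trip(len(self._events))   # fire once per burst
        self._tripped = over
```

The handler could, for example, stop clients from forcing re-fetches for a while. Whether that would have caught this particular outage early enough is an open question.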

------
cageface
This certainly won't be the first time that a system designed to increase
uptime actually reduces it. I've seen a lot of "redundant" systems that are
actually less reliable than simple standalones thanks to all the extra
complexity of clustering.

I guess at Facebook's scale you have to build in fallbacks but this is a
reminder that you can easily do more harm than good.

~~~
ora600
Sometimes cluster systems are mistaken for high-availability solutions while
they are actually load-balancing solutions and can decrease availability due
to added complexity and dependencies.

~~~
olefoo
Sometimes designers and builders of a system make decisions about tradeoffs
between availability and scaling and error recovery that they don't understand
until after the system is in the field. Any sufficiently complex system
becomes vulnerable to unexpected interactions between subsystems especially
when one or more subsystems fails catastrophically.
<http://en.wikipedia.org/wiki/System_accident> has more on the topic, but you
do want to read Perrow's book Normal Accidents:
[http://books.google.com/books?id=VC5hYoMw4N0C&printsec=f...](http://books.google.com/books?id=VC5hYoMw4N0C&printsec=frontcover&dq=Charles+Perrow#v=onepage&q&f=false)

------
sethwartak
My favorite comments on fb page:

    
    
      Stick with Mysql
    
      * * * * If it aint broke, dont fix it!
    
      Me too Melissa, and it's out there in the media that a group of hackers caused the problem, is this true, Mr Robert Johnson?
    
      PLEASE !!!! WHAT CAN YOU DO TO HELP THIS FROM EVER HAPPENING AGAIN???????????????? PLEASE!!!!!!!!!!!!!!! CANDY
    
      Kip da updates comin'
    
      Did anyone get a message like I did about someone trying to access your account from another state?

~~~
swaarm
I actually came across a pretty good comment which explains it (to the layman)
pretty well:

"Marvin, the server was like a dog chasing its tail...it kept going in
circles, but never caught it. They basically had to hold the tail for the dog
so he could bite a flea on it. :) LOL"

------
jrockway
The text of the article is:

"You are using an incompatible web browser.

Sorry, we're not cool enough to support your browser. Please keep it real with
one of the following browsers:"

This is why I don't use Facebook. It's not 1990 anymore. You don't need User-
Agent sniffing.

~~~
rimantas
Whoa, who did User-Agent sniffing in 1990?

~~~
pyre
Tim Berners-Lee

------
itistoday
That was an excellent description of the problem. Too excellent, it seems, for
many of the commenters. :-p

~~~
duck
Yeah, I don't know how regular users find posts like this... but that was
_way_ over everyone's head.

~~~
Malakin
This is how: <http://twitter.com/facebook> (how ironic)

~~~
docgnome
Ironic? I don't really think so. They aren't exactly competitors.

------
ergo98
They built their own custom DNS server? Most of the failures that people
encountered were a failure to contact the nameserver itself. Perhaps in the
rush to try to fix it someone screwed that up as well.

~~~
dangrossman
Or maybe disabling DNS was how they purposely took down the site, then slowly
let people back in, as they said.

~~~
ergo98
DNS was never mentioned in that post. Further, it's a curious way of
controlling flow, though I suppose in a moment of panic...if nothing else
works...

------
kahawe
My own theory of what actually happened:
[http://www.reddit.com/r/fffffffuuuuuuuuuuuu/comments/di0v0/s...](http://www.reddit.com/r/fffffffuuuuuuuuuuuu/comments/di0v0/so_facebook_is_down_huh/)

------
praeclarum
So umm no mention of the 4chan DOS attack? I mean, not that I hang out there
or anything, but a friend told me that they organized an attack. You'know. Jus
sayin. /b/ye

~~~
ramidarigaz
DOSing Facebook? It would require a _huge_ effort to create a measurable
increase in Facebook's traffic.

~~~
staunch
Not if they found some expensive requests to use and/or if they have access to
large botnets.

~~~
whatusername
Sure, if you found some really, really expensive requests, perhaps. As has
been stated, it's pretty hard to put a bump in FB's traffic.

The more likely explanation is that someone on 4chan noticed FB was down --
and then started talking about DOSing it.

~~~
praeclarum
Completely possible. Except the posts to call for the attack were made prior
to the world-known outages. But I can't argue your point. I just found it
curious that no one mentioned the planned attack.

