
Today's Outage Post Mortem - buttscicles
http://blog.cloudflare.com/todays-outage-post-mortem-82515
======
druiid
As always I'm glad to see Cloudflare post such detailed outage reports. They
are one of the few providers I know of that is willing to go into such depth,
and that is one of the things I appreciate about them. That said, the outage
that occurred was indeed fully preventable. We don't have nearly as many
locations as they do, but for internal resources at least, not pushing
configuration changes to all devices (network included) at once is pretty
standard practice. Basically, I imagine a good routine for them might be to
script changes so that they are 'rolled out' in stages: push manual changes to
a scripted 'random' router set (one each in countries A, B, and C), wait 15
minutes, and then push to the remaining router sets. That wouldn't work for
all situations, such as if the entire network is seeing a DDoS or what have
you, but I imagine they could adapt a routine that would prevent this
particular scenario.
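
A rough sketch of the kind of rollout wrapper I have in mind (the inventory,
push function, and health check are made-up placeholders for whatever tooling
they actually use):

    import random
    import time

    # Hypothetical inventory: region -> routers (names invented for illustration).
    ROUTERS = {
        "us-west": ["sjc-r1", "lax-r1"],
        "eu":      ["ams-r1", "fra-r1"],
        "asia":    ["hkg-r1", "nrt-r1"],
    }

    def push_rule(router, rule):
        # Placeholder for whatever actually pushes config (NETCONF, Puppet, etc.).
        print("pushing %r to %s" % (rule, router))

    def healthy(router):
        # Placeholder health check: is the router still up and passing traffic?
        return True

    def staged_rollout(rule, soak_seconds=15 * 60):
        # Stage 1: one randomly chosen canary router per region.
        canaries = [random.choice(routers) for routers in ROUTERS.values()]
        for r in canaries:
            push_rule(r, rule)

        time.sleep(soak_seconds)  # let the canaries soak before going wider

        if not all(healthy(r) for r in canaries):
            raise RuntimeError("canary unhealthy; aborting rollout of %r" % rule)

        # Stage 2: everything else.
        for routers in ROUTERS.values():
            for r in routers:
                if r not in canaries:
                    push_rule(r, rule)

    staged_rollout("example filter rule", soak_seconds=5)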

With all of that said, as a Cloudflare customer who also has a call with them
already scheduled tomorrow over the WAF stuff, I find it a bit... frustrating
that this is occurring now, and that it's this kind of mistake.

Edit: As an aside, I wonder if the Puppet module for Junos will be extended to
support route statements. That would make this kind of deployment much easier.

~~~
alexchamberlain
Although I agree, it would be rather hard to fight an attack if you didn't
roll out fairly quickly...

~~~
saurik
The article, however, actually states that the expectation should be for the
rule to do nothing, as the packets in question were much larger than the
maximum packet size. You thereby have to examine this as some kind of rushed
"let's try something, anything" reaction to a situation where an engineer
didn't actually understand what was happening enough to make such a call: it
is not surprising that the result was that they ended up landing squarely in
"something even more confusing has now happened and everything is offline"
territory.

------
ryguytilidie
This is pretty impressive. Keep in mind most of the team is on the west coast
so this happened at 1am on a Sunday and they put up a post mortem within
hours. Obviously you would prefer it not happen at all, but that is a great
response imo.

~~~
larrys
But this is not impressive:

"Someone from our operations team is monitoring our network 24/7."

"Someone" seems to indicate "1 person". Not "people are monitoring" but
"someone". That's it, one person monitors the network? Like the single night
guard at the warehouse?

~~~
dandelany
Why would you assume this and judge the company on it, based entirely on an
offhanded comment? You couldn't have given them the benefit of the doubt long
enough to find one of the several comments which clearly indicate there was a
team of people working on the issue?

~~~
larrys
The question remains how many people actually are monitoring the network in
order to call the first responders. Is it one person, two or five?

And my use of "not impressive" was in reply to someone who said "impressive"
but more importantly thought it was "impressive" that they put up a post
mortem within hours. That's nice but it doesn't answer the question that I
had.

I stand behind my comment and re-ask the question, since the info is
ambiguous: we have jgrahamc saying "small team who monitor things" and we have
the blog post saying "Someone from our operations team is monitoring our
network 24/7."

I don't think it's unreasonable (in the interest of transparency) to know
exactly the structure and number of people who monitor the network at any
given time. What is the human point of failure in the system?

I don't depend on cloudflare. But if I was running a mission critical
operation and depended on them I might setup a site visit to actually get a
feel of what is going on.

As an aside, back when the .org registry got started, one of the DNS servers
sat in an open, unguarded office, under a desk accessible to the cleaning
person. I saw it when I did a site visit. And of course, if you've been around
long enough, you know there was a time when the root DNS servers sat unguarded
in university offices.

~~~
dubcanada
You seem to be looking for "problems" where none exist. There could be
15,000,000 people monitoring it. It can still go down.

~~~
seanp2k2
This.

It seems that you care more about how many people are actively waiting for
something to break vs. how long their response takes. Also, (those people |
that person) probably (is | are) the first responder. I really believe that
one "first mate" watching the automated ship sail at night ready to triage a
technical problem is better than 10 guards who will promptly fall all over
themselves when the bits hit the fan.

------
powertower
> CloudFlare currently runs 23 data centers worldwide.

Shouldn't that always say - CloudFlare currently runs _in_ 23 data centers
worldwide?

Or is that just how one would phrase that if you rent multiple racks or a cage
in a datacenter? ...because I've seen that a bunch of times before from just
about everyone.

Just curious.

~~~
tlrobinson
The distinction is a bit arbitrary. As a customer you should care that their
service is geographically distributed, not whether they own the buildings
where the servers are kept.

~~~
larrys
"As a customer you should care that their service is geographically
distributed"

Don't agree. If you own the data center you have more control over it. We had
a case where the UPS systems in a data center had bad batteries and equipment
went down because the batteries failed to kick in. Since we don't own the data
center we have no realistic way to make inspections and make sure the right
thing happens or that the batteries (or the generators) are cycled and
maintained. We just have to trust. [1]

Now this may or may not matter with the way they have their redundancy set up.
But owning a data center does give you more control over more things.

"not whether they own the buildings where the servers are kept."

Owning the data center and owning the buildings are two different things.
Owning the building is owning real estate. Owning the data center is owning
the security setup, backup systems etc. Two different things.

[1] So as not to contradict myself with other things I have said I should not
say "trust" because you can always put some things in place to verify the
right thing is happening (inspections, logs etc.) if you want. But if you own
the place it's easier. If I own my home I can decide when to replace the HVAC
so it doesn't fail in the middle of the summer. If I rent that's up to the
landlord.

~~~
acdha
The downside is that this obligates you to do a wider variety of tasks. I'd be
surprised if CloudFlare has enough profit margin to afford many redundant
electrical engineers, physical network & power techs, etc. in 23 widely
dispersed locations. It's obvious that the extra expense wouldn't have helped
in this case but it certainly would consume a big chunk of money and
management time.

Part of being a business is that you have to make tradeoffs in the real world
rather than game-theoretical perfect moves. In many cases this means carefully
writing contracts because you can't afford the certain expense and distraction
of doing it in-house in the hope that the results might be slightly better.

------
yRetsyM
I'm not very educated on this end of the spectrum - but I wonder if a process
is possible where a rule or router update of some description is applied to
one router only, testing the specific schema before pushing to the rest of the
routers, thereby failing one router and not failing the rest? I understand the
need to respond as quickly as possible - but as stated in this case this was
already a manual response.

With my limited knowledge and non-existent experience, it seems like this
would be a way to prevent this from occurring again in the future?

~~~
eastdakota
Don't sell yourself short: our ops team has been on our internal chat talking
about how to do something exactly like this for the last hour or so. It's
difficult at our scale to truly simulate traffic, but we should be able to
roll rules out to just subsets of our network. That's already how we handle
router OS upgrades. If a small handful of data centers had crashed, likely no
one would have noticed because we've designed that fault tolerance in. This
was a problem because the crashes happened system-wide. In the end, we hadn't
anticipated that a simple filtering rule like this would cause such a router
crash, which was a bad assumption on our part.

~~~
nathannecro
Similar to yRetsyM, my domain knowledge doesn't extend to this area, but if
your network is undergoing a DDoS or some other form of attack, taking the
time to test rules on a pre-production/test server seems quite dangerous. What
does the ops team think about using multiple hardware vendors?

~~~
toast0
Doing the wrong fix is at least as dangerous as not handling the DDoS (as
shown in this case). Based on their general network architecture, I would
think a prudent thing would be a quick sanity test on a pre-production system
if available, then deploy to the various colos in groups at intervals that
seem appropriate given the nature of the change. If pre-production isn't
available, then having the first group be one colo limits the production
impact.

If they did some colos as vendor J and some colos as vendor C, I think it
would be manageable, but I don't really know how much of the cross colo
traffic is actually their routers talking to their routers. Homogeneity in
networks makes things easier to manage, until a platform fault breaks
everything at the same time. In this case, at least it was related to a change
they had made and happened quickly, so it was easy to determine the cause;
other platform faults may not be as easy to determine, but if only your vendor
J colos fell over, at least you'd have your vendor C colos up and something to
look for.

------
jcr
To the CloudFlare folks: It's refreshing to see you take responsibility, but I
think you've been a bit too hard on yourselves by taking all the blame. First
of all, what you hit was an unknown bug in JunOS, and Juniper is to blame for
their part. Using some form of staging to slow the roll-out of rule changes
_might_ have saved you from a full meltdown, but when you're getting attacked,
every second counts. Slow versus fast roll-out is one of those really tough
balancing acts in your situation. You did a great job with it; by the time I
saw the "cloudflare is down" post in the newest queue, it was already back up
and running again.

~~~
GigabyteCoin
Is Juniper to blame for the bug in their OS? Or is CloudFlare to blame for not
testing JunOS enough before relying on that OS?

~~~
eastdakota
Buck stops with us. We choose the hardware and software that runs on our
network. We test and work around thousands of bugs in it. It was up to us to
check range limits before applying them. While we'll never be perfect, one of
the things I am most proud of with the CloudFlare team is how quickly we do
learn from mistakes.

~~~
cbsmith
There is more to this story than meets the eye. This had to be an IPv6
fragment attack. Why weren't you already advertising rejection of such
packets, at least for DNS? Why would your analysis software and procedures not
already be checking for memory problems with rules that would need to assemble
all the fragments before matching?

------
rainsford
That was a pretty interesting writeup and I always like it when companies are
totally (and quickly) upfront about negative events.

One thing that occurred to me though is that performing a hard reboot of the
routers required calling people to physically access the devices and took some
time to perform (as you would expect). Although I wouldn't expect it to be
needed very often, I'm sort of surprised CloudFlare doesn't have out-of-band
remote power cycle capabilities.

There may be some factor I'm not considering that would make that an
unattractive option, but it does seem like it could cut down an already quick
response time even further for any similar events in the future.

~~~
rdl
I've never seen remote power cyclers on big routers in major facilities which
have on-site remote hands, even when servers all get both IPMI/LOM board
cyclers and physical external cyclers. At most, the routers get a serial port
connected to a serial port console server or directly to a modem, and/or an
admin ethernet network.

I've seen smaller routers, CSU/DSU, etc. type devices in branch offices on
cyclers, though.

I think it's mostly that the routers usually have both good OOB management and
good watchdog (reboot on freeze) behavior, and that the PSUs in the bigger
routers tend to exceed the per-port power limits of most of the external power
cyclers.

It may be a good idea, though.

~~~
justincormack
You would need another network (not just a vlan) to run this as well, if you
are going to try to reach it when nothing else is working.

~~~
mprovost
We just hook up a DSL modem to the OOB network or plug it straight into the
OOB interface on a core router. You used to do this with actual modems but
it's cheap enough to do it with DSL these days, then you're not dependent on
any of your own network to access the device in case of failure.

~~~
macros
We've been doing this with MikroTik boxes with either wifi or USB GSM modems,
depending on what is available at the location.

~~~
rdl
Yeah, I've seen a lot of great options for OOB access:

1) At carrier hotels, wifi (heh)

2) Cellular modems (ideal for branch offices; a lot of datacenters have bad
cell coverage inside the racks/cabinets/floor though)

3) Cross-connect (in places with free/cheap cross connects) to someone you
don't use for transit. Can be mutual

4) Some facilities give you an OOB network, although this often has issues (if
you buy transit from them, it's possible your outage is due to something going
wrong with them, and it might take out your OOB access)

I'm looking at the Verizon Private-IP thing (an outsourced private network
over Verizon's cell infrastructure) for OOB management of lots of CPE; the
cost per device per month is low, and then you pay for bandwidth across all of
them. Makes initial provisioning easier, plus ongoing monitoring/maintenance.

------
dododo
if you want to build a reliable system, one useful thing to do is use
equipment from multiple vendors. sure it's inconvenient, but by doing this you
can often de-correlate failures. especially if you want to improve someone
else's reliability.

e.g., from simple things like hard drives in a raid from different vendors, to
n-version programming in safety critical systems (like airplanes).

~~~
rdl
That works when the interfaces are totally standard, but edge/core routers are
not like that. Cisco supports one set of protocols for talking to other Cisco
products; another set for talking to everything else. The "everything else"
protocols suck in a lot of ways (they're ok inter-site, but not really so
great intra-site).

Same with Juniper. (there aren't really other viable options besides those
two)

You could build the same site fully independently with all-Cisco on one, and
all Juniper on another, and potentially get some better isolation from vendor
faults, but at very high expense.

You end up with much _worse_ reliability if you have a mixed Cisco/Juniper
network without a lot of additional isolation otherwise.

~~~
windexh8er
Re: "there aren't really any viable options..."

Total misconception. BGP, OSPF, IS-IS, LISP, etc. are all non-proprietary.
Sure, the root cause of this particular problem is that CF is using something
specific to Juniper, but router interoperability is not predicated on
components like that. This example was a tool CF operationalized, and it
likely had little to do with their routing, with the exception of being a
metric they may have influenced routes with.

People who run all-Cisco or all-Juniper shops mainly do it from a cost
perspective. Sure, there are some reasons outside of that, but cost is likely
the big driver. The more you buy, the more you save. And the network sales
realm is royally messed up to begin with. I've seen Juniper give 90% discounts
on hardware just to break into a Cisco shop. But the reality of the situation
is that all of this gear is marked up well into the thousands of percent. So
if you're not getting, minimally, a 50% discount, then you're probably not
doing your due diligence.

~~~
rdl
"there aren't really any viable options" to juniper or cisco for core/edge
routers.

There are some routing protocols which interoperate (which is how different
sites on the Internet can talk to each other), but most of the protocols used
for HA or management of a given set of routers, or, more importantly, most
tested/debugged implementations of HA and device management, are Cisco or
Juniper specific.

No big deal announcing routes to your upstream if you use Juniper and they use
Cisco. Big deal if you have Cisco+Juniper and want to do HSRP (Cisco-only).

~~~
windexh8er
Well, no.

I've been in network engineering for 12+ years and I fundamentally disagree
with a lot of what is said about "networking" and interop by many programmer-
types (not casting here, but) on HN. Yes, yes, you may understand system
DevOps to a point, however I'm not sure you've spent a significant amount of
time studying Dijkstra's algorithm or truly have an idea of how to deploy a
global IPv6 overlay. I'm also not trying to be snide here but I feel that,
often times, many things that come up on HN are just fundamentally designed
wrong from PHY all the way up until the devs get a hold of the rest. I've been
in a very successful startup (think one of the top online backup services)
wherein their network was run on commodity junk hardware. They were asking me
how I'd troubleshoot this, that and the other thing - obviously with no debug
(this guy said that with a grin). First and foremost, you designed it wrong -
I can show you inefficiency in about 10 minutes of performance engineering
that I would have designed around without thinking about those things. So,
yes, I can waste time tracking down a bad NIC on your network, but if you feel
that you've earned geek cred because you fired up Wireshark and parsed through
a few simplistic ARP tables - you haven't impressed anyone but yourself.
That's when I realized I was working with professional _developers_ , and not
network architects.

Your simplistic view of FHRPs is trivial at best. Maybe if you were talking
about how you'd design fault tolerance into a virtual link, say an LSP, with
something like BFD in your design, I'd be more impressed than by conversations
about proprietary redundancy protocols, which most network engineers won't
touch for a variety of reasons other than the big "C".

</endrant>

~~~
rdl
Virtually no network engineers (by percentage) have to do anything other than
worry about what their vendor supports for a given configuration (and usually
a fairly small set of configurations, too); it's much more about policy and
operations.

Similarly very few developers have to solve open CS problems in writing a CRUD
application (or I guess more comparably to ops, come up with a novel
implementation of a complex algorithm).

This is progress, though.

~~~
windexh8er
"Virtually no network engineers (by percentage) have to do anything other than
worry about what their vendor supports for a given configuration" - this
statement puts a perspective on your thinking. And then I read your
information on the services your company offers, and I realize that it's not
worth having a discussion.

"<redacted> takes your security very seriously." - right. That's a statement,
not information regarding the thought or implementation. There's not even a
mention of technology. <sigh>

~~~
JoachimSchipper
Be nice.

------
senthilnayagam
So in that sense, the DoS attack was very successful. It took down the site it
intended to, and took the network down along with lots and lots of other sites.

Hope lessons are learned and your next generation is less prone to these
attacks.

~~~
packetbeats
If the attacker knew about the Juniper bug and thought about a way to convince
the network operators to introduce the rule of death themselves, then this is
a nice hack indeed. It won't be easy for CloudFlare to generically protect
against these types of attacks. They could either have mechanisms to revert
configurations faster or a way to test new configurations on a single router.

~~~
rurounijones
The possibility that the attacker knew about the bug is, I think, remote but
intriguing.

I wonder if Cloudflare need to do some tests along the lines of:

A) List all the types of rules we usually use to mitigate these situations.

B) Run those rules on a test router with wildly unusual input values, as was
the case in this situation.

C) Send test traffic using that wildly unexpected input to see what happens.

Basically a bit of manual fuzzing.

Time-consuming and maybe not worthwhile, but it could save against another
full system death.
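
Something along these lines, maybe (a toy sketch; the rule templates, the
"weird" values, and the lab-router push are all invented for illustration):

    # Hypothetical templates for the kinds of mitigation rules normally used.
    RULE_TEMPLATES = [
        "discard packets {lo}-{hi} bytes",
        "rate-limit source-port {port} to 10mbps",
    ]

    # Wildly unusual / borderline-impossible values, like the 99,971-byte
    # packets in this outage.
    WEIRD_SIZES = [(0, 0), (1, 65535), (65536, 99999), (99971, 99985)]
    WEIRD_PORTS = [0, 65535, 70000]

    def apply_to_lab_router(rule):
        # Placeholder: push the rule to a dedicated test router, never production,
        # then send matching test traffic and report whether the router survived.
        print("lab router <- %s" % rule)
        return True

    def fuzz_rules():
        rules = [RULE_TEMPLATES[0].format(lo=lo, hi=hi) for lo, hi in WEIRD_SIZES]
        rules += [RULE_TEMPLATES[1].format(port=p) for p in WEIRD_PORTS]
        for rule in rules:
            if not apply_to_lab_router(rule):
                print("found a rule of death: %s" % rule)

    fuzz_rules()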

------
DigitalSea
Rather unfortunate for the credibility of CloudFlare as a network provider,
but you've got to admire them for their honesty, and it'll work out better for
them in the end. It's amazing how a few lines of code managed to bring down
CloudFlare. They could have told us anything and nobody would have been able
to question it; instead they gave us the truth, and I really respect that.
They didn't blame the intern, they didn't blame their hardware or make an
excuse about a power outage. In terms of honesty, CloudFlare seems to be
leading the way regardless of their public credibility or image being tainted.
Very impressive response time and resolution of the issue as well; good job
CloudFlare!

------
rdl
Wow, that's pretty fast turnaround for a post-mortem (although it looks to
have been a simple problem, so it was easier to figure out what to write).

~~~
jgrahamc
Our customers deserve to know what happened as quickly as possible.

------
BoyWizard
Two things:

1. That video of the BGP routes disappearing is awesome, and

2. A 40 minute outage sounds bad, but consider the following timeline (based
on the writeup):

> T+0: route change made, propagates

> T+10: Response team online, attempting local fixes

> T+30: Routers across 23 data centres in 14 countries hard reset and networks
> coming back up.

------
DoubleMalt
Funny that now the post mortem is down ...

~~~
jgrahamc
Yes, ironic. Unfortunately, the CloudFlare blog is hosted on posterous and
they seem to be down.

~~~
rdl
You may want to move off posterous before it goes down for good in a month,
too :)

------
onemorepassword
> Even though some data centers came back online initially, they fell back
> over again because all the traffic across our entire network hit them and
> overloaded their resources.

I know very little of networking, but this seems to be a recurring pattern
that aggravates many major outages. What surprises me is that this so often
seems to be a scenario not accounted for.

~~~
jonknee
You can only account for it by having more hardware and then it's possible
more of your hardware will fail which puts you right back to where you
started.

~~~
Dylan16807
I don't think that's the only solution. I would be willing to bet that,
outside of heavy-DDoS conditions, even a tiny fraction of Cloudflare's network
could handle the incoming TCP connections and deny all of them. At that point
you don't have to worry about traffic collapsing anything. You can wait to
bring up more equipment. You can send a tiny error page. You can let X% of
requests get through and be fully served.

I bet that most of the time the domino effect happens to internet services in
general, it's with nodes that are _accepting_ most requests. They _allow_
themselves to be overloaded. An active HTTP session uses orders of magnitude
more resources than simply denying the initial packet and forgetting about it
forever.
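
A toy sketch of what I mean, in the simplest possible terms (an ordinary
asyncio server with an arbitrary cap, nothing CloudFlare-specific):

    import asyncio

    MAX_ACTIVE = 100   # illustrative cap: a tiny fraction of normal capacity
    active = 0

    async def handle(reader, writer):
        global active
        if active >= MAX_ACTIVE:
            # Over the cap: answer with a tiny error and drop the connection
            # immediately instead of letting full requests pile up resources.
            writer.write(b"HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n")
            await writer.drain()
            writer.close()
            return
        active += 1
        try:
            await reader.read(1024)   # read (part of) the request
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
            await writer.drain()
        finally:
            active -= 1
            writer.close()

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    asyncio.run(main())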

~~~
seanp2k2
You're vastly oversimplifying the problem here by only accounting for one
class of problems.

>". I would be willing to bet that outside of heavy-DDoS conditions that even
a tiny fraction of Cloudflare's network could handle the incoming tcp
connections and deny all of them." depends on the attack.

>"You can send a tiny error page. You can let X% of requests get through and
be fully served." Not usually that easy.

~~~
Dylan16807
I said _outside_ of attacks.

I call BS on saying it's not easy to limit the number of served connections
and RST the rest. Isn't this something every web server can do by itself it's
so easy?

------
jaequery
this is the type of reason why i stopped using cloudflare. there are just too
many eggs in one basket. it's as if their entire service becomes a SPOF to
your infrastructure.

~~~
driverdan
You could say the same thing about almost any of your service providers. Your
DNS provider goes out, everything goes out. The routers at the data center
with your servers go out, all your servers go out. Your CDN goes out, all of
the static assets on your site go out.

There will always be a potential SPOF.

~~~
saurik
While your example with the routers at your backend is truly problematic, DNS
is designed with built-in redundancy and CDNs (which CloudFlare should not
really count as) having a world-wide outage (as opposed to "people accessing
from New York are currently having issues, as we lost one PoP") is nigh-unto
unheard of... can you imagine Akamai (or CDNetworks or EdgeCast or even
Amazon) saying "doh, all of our infrastructure everywhere just disappeared"?

The core problem with CloudFlare is that they seem to have a highly-
centralized take on what is normally a massively-decentralized solution-space,
with large numbers of value-adds they encourage customers to use without
making it clear that they treat them in a haphazard manner, doing very little
testing before deploying pushing-the-envelope features while simultaneously
having very little in-house debugging expertise to handle serious issues.

(As a concrete example of that last complaint, Cydia was crippled for an
entire day due to ModMyi turning on CloudFlare's "preloader" transformation,
which apparently caused many WebKit-based browsers--including both
MobileSafari and Cydia--to entirely lock up; CloudFlare seemed to go the
entire day without noticing, which I continue to be utterly _shocked_ by, and
it was only after I told them how to fix it that they were able to acknowledge
the issue.)

<http://www.saurik.com/id/14> <- When "Dumb Pipes" Get Too Smart, an
extensive analysis of this bug

------
Flow
I think you should have investigated why you got ~90KB packets despite having
a max packet size of ~4KB, instead of putting in that rule. :)

~~~
dubcanada
I was thinking the exact same thing.

------
noselasd
> attack packets were between 99,971 and 99,985 bytes long.

This should raise a red flag, as it must be impossible. Ethernet NICs would
just bail out on packets longer than what you've set the MTU to, and Ethernet
frames would just be coming from the next hop in most cases. And IP packets
have a 16-bit length field, so a single packet can't exceed 65,535 bytes.
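
As a sketch, even the tooling that turns profiler output into filter rules
could enforce that bound before anyone pushes anything (the function and rule
text here are invented for illustration):

    # IPv4's total-length field is 16 bits, so no single IP packet can exceed
    # 65,535 bytes (IPv6 jumbograms aside). A profiler reporting 99,971-byte
    # "packets" is measuring something else (reassembled fragments?) or is buggy.
    MAX_IP_PACKET = 65535

    def make_size_filter(lo, hi, link_mtu=1500):
        if hi > MAX_IP_PACKET:
            raise ValueError(
                "reported packet size %d exceeds the IP maximum (%d); "
                "investigate the measurement before writing a rule" % (hi, MAX_IP_PACKET)
            )
        if hi > link_mtu:
            print("warning: size %d exceeds the link MTU (%d); "
                  "reassembled fragments?" % (hi, link_mtu))
        return "discard packets %d-%d bytes" % (lo, hi)

    print(make_size_filter(1400, 1500))
    # make_size_filter(99971, 99985) would raise instead of producing a rule.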

~~~
naww
<http://en.wikipedia.org/wiki/Jumbogram>

> An optional feature of IPv6, the jumbo payload option, allows the exchange
> of packets with payloads of up to one byte less than 4 GiB

~~~
mprovost
Yes but they were still seeing packets bigger than the MTU of Ethernet (or
Sonet or whatever other layer 1/2 tech they're connected to the rest of the
net with). It doesn't matter what higher level protocols can handle.

~~~
bdonlan
They could've been fragmented IPv6 packets. Or it could've been a bug in their
profiler.

~~~
pyvpx
which is precisely why it seems like lunacy to roll out such an asinine
firewall rule to _every_ router. if there was ever a time to "spot check" a
change, this was it.

they didn't. and they paid the price. good on 'em for the quick and honest
post-mortem. regardless, it was a dumb move.

------
brokentone
Impressive response. 30 minute outage for something most of the hosts I've
worked with in the past would have been mystified about for hours. Then a
quick RFO and a promise of proactive SLA adjustments? Next time I need a CDN
or attack mitigation, I'll be talking to Cloudflare.

------
tedchs
What I don't understand is why Cloudflare is making changes to their border
routers in the process of protecting their customers. I am a network engineer
and I love Juniper, but the reality is with any complex system, every change
you make has a possibility of inducing an unexpected failure. I would think
Cloudflare would have increased stability by using an architecture where the
border routers have a mostly static config, and there is a set of firewalls
(e.g. Juniper SRX 5800) behind them that are doing the actual filtering and
changing configs in response to threats.

~~~
seanp2k2
So now you have two pieces of gear to test changes on and another interaction
where stuff could break / go weird.

I don't see how that would solve anything here.

~~~
tedchs
The thing it would solve is risking all their BGP peerings going down as a
result of day-to-day service operations (i.e. every time they add a filter).

------
random42
OT: I want to pitch cloudflare for our CDN needs. Can someone estimate the
scale of cloudflare wrt. akamai (current provider), in terms of operations,
consumers etc.?

~~~
pyvpx
akamai is about 100x the size and probably 200x the price.

~~~
saurik
If you are going to go for the $3k/mo CloudFlare plan, you are already nearing
the ballpark of Akamai, and would do well to look at the many CDNs that sit in
the middle of that scale (such as CDNetworks or EdgeCast).

------
contingencies
_Developing good software comes down to consistently carrying out fundamental
practices (regardless of the technology)_ - Paul M. Duvall

In this case: Development. Versioned change. Test or staging environment.
Tests pass. Production.

~~~
rurounijones
Meanwhile your customers are getting DDoS'ed while you are faffing about.

Yes, I fully agree that for things like software and standard network
maintenance the above is good. But, as someone else mentioned in this thread,
DDoSes that require quick resolution put you between a rock and a hard place
in terms of doing things "right".

~~~
contingencies
That's true. However, look at what happens to _all_ of your customers when you
fail to test. If you haven't limited, or at least tested, the extreme ranges
of allowable input to a system that automatically pushes live configuration
out to all of your routers, there's nobody else to blame but yourself. Sorry.
Are most people this diligent? No. Should we be? Yes.

~~~
rurounijones
I actually wondered about testing with extreme ranges in another comment, but
that kind of testing is done "offline" (not on live routers and not in
response to current circumstances).

However, at least as I read it, your comment was about testing rules
consistently in dev -> staging -> prod when you create one, which I think is
not viable in this situation since you are on a very tight deadline with
immediate impact on your customers.

------
lazyjones
So what are they going to change as a consequence? It seems logical to not
rely on a single router vendor anymore, or to test new rules on a staging
setup at least for a very short time before pushing them to all routers.

~~~
rdl
Running Vendor J and Vendor C routers together means you get exposed to the
weird bugs in either's open/interoperability code, and lose out on all the
advanced features (since most of the good stuff isn't well supported in true
cross-platform vendor independent fashion).

It's probably more reasonable to split your network into a few more
independent sections and never do updates which affect everything, but unless
you're building the space shuttle (and can accept vastly higher costs and
lower performance), it's probably better to pick one hardware platform, at
least now.

~~~
eastdakota
While we'll discuss it more at length and after a bit of sleep, my hunch is
this will be closer to our approach.

~~~
rdl
The thing which annoyed me the most was losing all DNS. You really need to
have the DNS servers on separate infrastructure (ASN, netblock, even while
anycasted) so there is never a case where both of your DNS servers are out for
a customer domain. The "CNAME" product looks pretty kludgey.

~~~
ams6110
By the same token you (the customer) should not have all your DNS eggs in one
basket.

~~~
rincebrain
CloudFlare's CDN bits require you to give them DNS delegation of your stuff,
last I looked.

~~~
rdl
There is a way around it with some of the premium accounts, but it kind of
seems like a hack.

------
TranceMan
Just wondering if the source of the large packets was a [large] range of
hosts or maybe a single host?

Ouch if a single host's activity took down ~750k websites - whether deliberate
and direct or not.

~~~
dubcanada
It's probably even more than that. CloudFlare hosts cdnjs, which a ton of
people use. It could have "taken down" millions of sites. And by taken down I
mean rendered unusable.

------
ralph
Presumably Juniper's JunOS is closed-source, making investigation more
difficult? Do they provide it to some of their bigger clients under an
agreement?

------
Ecio78
I got a "Oops there was a problem" page from Postereous trying to open the
blog page/site...

------
rschmitty
Now we need a Post Mortem on the Post Mortem, as it is now down

"Oh noes! Something went wrong."

------
newman314
Posterous seems to be down.

------
graycat
Yes, case number 384,449,194 of systems management causing a system problem.
Also case number 439,224 of what looked like a localized problem quickly
causing a huge system, e.g., all 23 data centers around the world, to crash.

They have my sympathy: So, they typed in a 'rule'. At one time I was working
in 'artificial intelligence' (AI), actually 'expert systems', based on using
'rules' to implement real time management of server farms and networks. Of
course, in that work, goals included 'lights out data centers', that is, don't
need people walking around doing manual work but not the case of 'lights out'
as in the CloudFlare outage, and very high reliability.

Looking into reliability, that is, putting the causes of outages into a few
broad categories, a category causing a large fraction of the outages was
humans doing system management, or, in the words of the HAL 9000, "human
error". Yup.

And the whole thing went down? Yup: One example we worked with was system
management of a 'cluster'. Well, one of the computers in the cluster "went a
little funny, a little funny in the head" and was throwing all its incoming
work into its 'bit bucket'. So, the CPU busy metric on that computer was not
very high, and the load leveling started sending nearly all the work to that
one computer and, thus, into its bit bucket and, thus, effectively killed the
work of the whole cluster.

As one response I decided that real time monitoring of a cluster, or any
system that is supposed to be 'well balanced' via some version of 'load
leveling', should include looking for 'out of balance' situations.

So, let's see: Such monitoring can have false positives (false alarms) and
false negatives (missed detections). So, such monitoring is necessarily
essentially a case of some statistical hypothesis testing, typically with the
'null hypothesis' that the system is healthy, applied continually in near
real-time. So, for monitoring 'balancing', we will likely have to work with
multi-dimensional data. Next, our chances of knowing the probability
distribution of that data, even in the case of a healthy system, are from slim
down to none. So we need a statistical hypothesis test that is both multi-
dimensional and distribution-free.
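
A crude toy along these lines (nothing like a proper multi-dimensional,
distribution-free test, and the numbers are invented, but it shows the flavor
of flagging an 'out of balance' node with robust statistics instead of an
assumed distribution):

    import statistics

    # Per-node metrics: (CPU busy %, work items completed). node3 is the
    # machine throwing its work into the bit bucket.
    nodes = {
        "node1": (72.0, 980),
        "node2": (68.0, 1010),
        "node3": (5.0, 40),
        "node4": (75.0, 995),
    }

    def out_of_balance(samples, threshold=5.0):
        flagged = set()
        dims = len(next(iter(samples.values())))
        for d in range(dims):
            values = [v[d] for v in samples.values()]
            med = statistics.median(values)
            # Median absolute deviation: a robust spread estimate with no
            # distributional assumptions.
            mad = statistics.median(abs(x - med) for x in values) or 1e-9
            for name, v in samples.items():
                if abs(v[d] - med) / mad > threshold:
                    flagged.add(name)
        return flagged

    print(out_of_balance(nodes))   # -> {'node3'}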

So, CloudFlare's problems are not really new!

I went ahead and did some work, math, prototype software, etc. and maybe
someday it will be useful, but it wouldn't have helped CloudFlare here if only
because they needed no help noticing that all their systems around the world
were crashing.

In our work on AI, at times we visited some high end sites, and in some cases
we found some extreme, high up off the tops of the charts, concern and
discipline for who, what, or why any humans could take any system management
actions. E.g., they had learned the lesson that can't let someone just type in
a new rule in a production system. Why? Because it was explained that one
outage in a year, and the CIO would lose his bonus. Two outages and he would
lose his job. Net, we're talking very high concern. No doubt CloudFlare will
install lots of discipline around humans taking system management actions on
their production systems.

Net, I can't blame CloudFlare. If my business gets big enough to need their
services, they will be high on the list of companies I will call first!

~~~
scoot
_'in the words of the HAL 9000, "human error".'_

Except that it wasn't human error, at least not in the sense that the decision
to enter the rule, or the rule itself, was in error. The human error was the
bug in the Juniper firmware that caused this rule to crash the router, and
arguably the CloudFlare process that allowed rules to be propagated to all
routers concurrently, rather than segmenting the network and testing for
success before further deployment.

~~~
kisielk
Actually this outage report is a good example of compounding systematic
errors. Among the things that went wrong: incorrect and impossible packet
sizes were detected, the rule generator generated rules matching the
impossible packet sizes, the human operator who looked at the rules and
entered them into the router didn't notice any problems, and finally the
routers responded to the incorrect rules by starting to crash.

Had any of these steps not gone wrong there likely would not have been an
outage. It was a combination of failures that caused it.

~~~
scoot
I don't disagree with you, but I was calling into question the suggestion
that the creation of the rule was the specific human error. By definition,
every error you listed is a human error, since even if ultimately carried out
by computers (routers), they were designed by humans.

