
Why Google Went Offline Today and a Bit about How the Internet Works - ColinWright
http://blog.cloudflare.com/why-google-went-offline-today-and-a-bit-about
======
trout
There are some other ways to fix the problem.

Last time with the Youtube problem, they advertised more specific routes. If
Pakistan was advertising a /24 network (256 IP addresses), Youtube started
advertising two /25 networks (2x 128 addresses). Since they are more specific,
they are preferred over the broader routes. This protects against a lack of
cooperation, but not against malicious behavior. It also ends somewhere, because
many networks will not pass routes smaller than, say, /24 or /28.
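
A minimal sketch of why the more-specific announcements win, assuming a toy routing table and a plain longest-prefix match. The prefix `203.0.113.0/24` is a documentation range standing in for the real one, and `best_route` and the next-hop labels are made up for illustration.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, next_hop) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # A longer prefix length means a more specific route.
    return max(matches, key=lambda item: item[0].prefixlen)


# Hypothetical hijacked prefix (documentation range, not the real one).
hijacked = ipaddress.ip_network("203.0.113.0/24")

# The owner answers by announcing the two /25 halves of the same block.
more_specifics = list(hijacked.subnets(prefixlen_diff=1))

routes = [(hijacked, "hijacker")]
routes += [(net, "legitimate owner") for net in more_specifics]

print(hijacked.num_addresses)               # 256 addresses in a /24
print([str(n) for n in more_specifics])     # the two /25 halves
print(best_route(routes, "203.0.113.200"))  # the /25 beats the /24
```

The same logic is also why the trick runs out: once both sides announce the longest prefix that other networks will still accept, specificity no longer decides anything.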

Most service providers also do 'inbound route filtering' to filter out any
routes that they do not own. This isn't a simple process, which is why PCCW
does not do it. Maybe a few more of these incidents and they will.
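
Roughly what an inbound route filter does, as a sketch only: the provider keeps a per-customer list of prefixes and drops any announcement outside it. The function name, the allowlist and the /24 length limit are assumptions for illustration (IPv4 only), not any vendor's configuration.

```python
import ipaddress


def accept_announcement(prefix, allowed_prefixes, max_prefixlen=24):
    """Accept a customer announcement only if it falls inside an allowed block
    and is not more specific than max_prefixlen (assumed limit)."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > max_prefixlen:
        return False
    return any(
        net.version == allowed.version and net.subnet_of(allowed)
        for allowed in allowed_prefixes
    )


# Hypothetical customer that is registered for one documentation prefix.
customer_allowlist = [ipaddress.ip_network("198.51.100.0/24")]

print(accept_announcement("198.51.100.0/24", customer_allowlist))  # own block
print(accept_announcement("192.0.2.0/24", customer_allowlist))     # someone else's block
print(accept_announcement("198.51.100.0/25", customer_allowlist))  # too specific
```

The hard part in practice is keeping the allowlist accurate for every customer, which is the "not a simple process" bit.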

There's also AS Path filtering. This allows networks to be more granular in
which paths they trust, by inspecting which AS's a route has gone through. If
certain AS or AS path combinations become problematic, the internet at large
could blackhole them or do manual route filtering. This would be laborious,
but possible.
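
A sketch of what AS path filtering amounts to, assuming the path is a simple list of AS numbers. `accept_path` and the blocklists are hypothetical, and the AS numbers come from the range reserved for documentation.

```python
def accept_path(as_path, blocked_asns=frozenset(), blocked_sequences=()):
    """Reject a route if its AS path contains a blocked AS or a blocked
    consecutive run of ASes; accept it otherwise."""
    if any(asn in blocked_asns for asn in as_path):
        return False
    for seq in blocked_sequences:
        n = len(seq)
        if n == 0:
            continue  # an empty pattern should not match everything
        for i in range(len(as_path) - n + 1):
            if tuple(as_path[i:i + n]) == tuple(seq):
                return False
    return True


path = [64500, 64501, 64502]  # documentation AS numbers

print(accept_path(path))                                       # nothing blocked
print(accept_path(path, blocked_asns={64501}))                 # one AS blackholed
print(accept_path(path, blocked_sequences=[(64501, 64502)]))   # a specific hand-off blocked
```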

That said if someone can maliciously peer with an active BGP router, the
damage to be done is significant. I haven't seen any outage reports from this
type of attack, but I'm surprised by that.

~~~
pilom
Much more common than malicious outages is malicious creation of ghost
networks. Basically a person could say over BGP "W.X.Y.Z is at my office"
where that address isn't used by anyone anywhere else on the internet. Then
they do their bad deeds from that made up address. Lastly they remove their
route via BGP and it is as if their addresses never existed.

~~~
sounds
That might work for some unused /24's for a large organization's /8 block, but
unused IPv4 addresses are so last year!

I suppose the attack will still work for IPv6 for a long time.

~~~
wmf
There are a lot of IPv4 addresses that are assigned but not routed on the
Internet, so you can easily "borrow" them. This kind of trick does leave a
trace, though.

------
neya
Best explanation ever. Wow, seriously, this person can use the right words to
help even the non-technical people understand such a complex situation. Thanks
for posting this.

~~~
jwr
I used to manage networks and wondered, while reading the article, why it gets
so many points on HN, when it only states obvious things and doesn't really go
into detail. And then I realized most people these days have no idea about how
packets get from here to there, or even that there are packets at all. Now I
understand the appeal, but I guess this means that good introductory material
is badly needed.

------
duggan
“BGP is literally the glue of the Internet” - I think you’ll find BGP is
_figuratively_ the glue of the Internet ;)

~~~
clebio
I'm in this camp. To me, 'literally' has only one meaning. If it doesn't, the
word loses all utility. He could say 'is essentially the glue', I suppose.

~~~
glomph
You are outdated to the extent that you would have been behind the times in
the 1680s where the word was already being used to mean 'what follows must be
taken in the strongest admissible sense'.

~~~
SilasX
That's fine, as long as you have an alternative ready that takes the meaning
that the old "literally" had when I do want the statement to be taken, er,
literally.

If you don't, then it makes a lot of sense to defend literal from non-literal
usage.

What's the alternative that I can use and be understood?

~~~
nwienert
Context.

~~~
SilasX
"Hello, 911? Yeah, I've got an unconscious person here. His face is literally
purple."

Did I mean literally literally, or figuratively? And how can the previous
question have meaning?

~~~
ars
You remove the emphasizer and just say: "His face is purple."

~~~
SilasX
And then they say, "literally purple?"

~~~
jerf
The problem with this argument is that you are hypothesizing a case that, if
it were going to be happening, would be happening _now_, not in the future.
Yet it does not. There is no great epidemic of confused 911 operators because
they can't make out whether or not someone on the other end used literally
"correctly".

While one can speculate on why you might be wrong about this being a problem,
an examination of the world around us rather strongly suggests that there's no
question that there is _something_ fatally wrong with your argument.

~~~
SilasX
There's no epidemic of _any_ single problem being caused by _any_ imprecision
in grammar. But there are lots of little, similar problems -- perhaps in non-
emergency situations -- that cause predictable, avoidable confusion because
people insist on breaking the use of important words.

If your point is that "we can make it impossible to communicate the concept
'literally' until there's an epidemic of deaths over it", then your threshold
is in a very, very wrong place.

~~~
jerf
It was your implicit threshold you were setting with your argument, not mine.
While I am gratified that you so thoroughly demolished your own argument for
me, you might want to consider your arguments a bit more tactically in the
future.

The real problem being caused here is well below the noise threshold and
certainly not worth trying to play "Holier than thou" at people on the
internet.

~~~
SilasX
>It was your implicit threshold you were setting with your argument, not mine

That wasn't my threshold; that was an example of a confusion that couldn't be
disambiguated without clear terms for literal vs figurative; it's just that it
had unusually large implications for a scenario that requires fast, unambiguous
communication. (I guess we don't have to care about these scenarios?)

Your own implicit threshold of "if someone doesn't die because of it, I can
fuck up the communicative ability of a language however I feel like" is so
thoroughly stupid, I doubt you even believe it yourself, yet feel the need to
argue for it anyway.

In any case, I'm less concerned with who makes the best tactical moves than on
discerning the best idea presented. As it stands, I don't yet see any
justification for "let's get rid of this useful disambiguating feature for
literal vs figurative" -- but feel free to keep offering them; maybe your
knowledge of "tactics" could come in handy here, though I doubt it. Tactical
arguments don't make a language useful. Rather, substance does.

And any time you ever get around to telling me how to indicate the old meaning
of "literally" you just let me know. I get that it's not a real high priority
for you right now (based on how you think), and I'm not holding my breath or
anything, but it would be really cool if you could pull it off. Thanks.

~~~
glomph
The thing is it isn't a choice we are making now. It was made over 300 years
ago and it works.

------
vr000m
There is an IETF WG called SIDR, which is working on solving this problem of
invalid BGP announcements. A good summary is available here
<http://isoc.org/wp/ietfjournal/?p=2438> and technical details are in the
related proposals.

~~~
danyork
Yes, a very good group for people to get involved with if they are interested
in this problem.

------
apaprocki
If you're interested in peering (couldn't resist the pun) behind the curtain,
read the NANOG[1] mailing list. These are the real guys keeping the Internet
up and running :)

[1]: <http://mailman.nanog.org/pipermail/nanog/>

~~~
dsl
It is worth noting that the average HN reader should probably subscribe read-
only. Unless you have your own AS and enable on routers, you should probably
call your ISP with any issues. (though an unfortunate number of people
disregard this advice, which results in smaller private splinter mailing lists
_sigh_ )

------
rlpb
> When I figured out the problem, I contacted a colleague at Moratel to let
> him know what was going on. He was able to fix the problem...

I wonder how he contacted his colleague. In this case, I presume that routing
to other networks was unaffected. But in the general case, with a future of
everything over IP, what will network engineers use to communicate about
faults?

~~~
jwr
If you run a network with BGP, you always have good contact information for
your peers. "Good" meaning direct telephone contact with tech people running
the show on the other side of the link.

~~~
Peaker
Telephony might be routed over IP too.

------
ybaumes
The author (Tom Paseka) wrote near the conclusion that he himself addressed
Google's issue by contacting an engineer at Moratel. Do you have the same
feeling when reading the article? It sounds weird that Google did not
trigger a recovery procedure on its own.

Maybe I see bad things everywhere and you may call me paranoid, but could it be
some sort of ("false") advertising on the part of CloudFlare?

~~~
archangel_one
I'm not a network engineer, but it seems like the kind of thing that might be
very hard to detect when you're already inside or near to the google.com
domain. Or maybe CloudFlare just got there first.

I don't think it's necessary to call BS on Cloudflare without any kind of
evidence at all.

~~~
amalcon
This is basically correct. BGP is weird. The addresses for one of Google's
many datacenters were routed incorrectly for packets coming from some subset
of IP space. Unless Google is running active ping tests to that subset of IP
space, the way they would normally detect it is for someone to call and
complain.

In this case, the author decided to take a shortcut and call the owner of the
"problem peer" directly.

~~~
veidr
Although only a vanishingly small percentage of Google users can _call_ and
complain. Blog or tweet or post to HN and hope Matt Cutts sees it and notifies
the right team, maybe.

~~~
Matt_Cutts
A team of Googlers could have been working on this in parallel to Tom. I'm
guessing that a sudden drop of queries like that would cause people at Google
to start digging into what happened. I don't know either way, because network
ops and BGP is pretty far from my area (search quality).

------
jhull
Couldn't a rogue government easily take down the internet this way? Seems like
if one guy in Indonesia can take out Google by accident, a government entity
could do the same.

~~~
sudhirj
The moment people realize that the rogue network was being malicious, they'd
stop trusting it - ignoring all announcements it might make. It might take a
few hours for order to be restored, though.

~~~
JohnLBevan
Would it be possible to claim to own Google's IP, then on receiving the
packets intended for Google forward them on to the real IP (without
accidentally forwarding them back to yourself)? That way someone could hijack
& interrogate these packets without being spotted (at least without causing
service outage / only adding slight delay). Alternatively could they route
these requests to a clone as an advanced phishing scam?

~~~
ErikD
That's what https is for. It should prevent them from doing anything useful
with the packets.

~~~
vegardx
Not unless you manage to forge a certificate at the same time. It has been
done before, as SSL is based on more or less the same level of trust as BGP.

------
lini
And nothing will change. At least not until someone does this with malicious
intent - script kiddie A knocks out big site, or a censoring state decides
that it should block a free speech site from the entire Internet.

~~~
forgotusername
Evil routing has been employed a whole bunch of times going back decades, most
visibly a couple of years ago when IIRC Iran (?) started advertising bad
routes for a bunch of big sites, including Google

~~~
apawloski
Pakistan null routed YouTube and accidentally took a big chunk down around the
world in 2008.

------
xxcode
To be accurate, google didn't go down today -- your pathway from your computer
to google got 'poisoned'. It wasn't Google's fault.

------
stingraycharles
For what it's worth, this is quite a vulnerability in the internet's routing
system. It's also the reason Youtube went offline a few years ago, when
Pakistan deliberately announced the wrong routes because it didn't agree with
some videos being broadcast by Youtube.

[http://www.ripe.net/internet-coordination/news/industry-deve...](http://www.ripe.net/internet-coordination/news/industry-developments/youtube-hijacking-a-ripe-ncc-ris-case-study)

<http://news.cnet.com/8301-10784_3-9878655-7.html>

~~~
highace
This worries me. Am I right in saying a malicious party could actually take
down the internet with this?

~~~
sp332
Yes, but you'd have to con a lot of big players into trusting your BGP routes
first. And the effect would only last as long as it took to change some
configurations and write you back out of the internet.

------
killermonkeys
Why wouldn't PCCW preventing its customers from publishing routes outside its
whitelist work? It has been a long time since I worked on BGP but that was
common practice from back haul carriers to ISPs even at that point (2003).
Given the same back haul provider has allowed this twice, it seems like a
reasonable ask.

~~~
jauer
Some carriers are lazy. There may also be politics involved in making national
carriers "ask" for permission to advertise routes.

------
jemfinch
Isn't the title a touch sensationalist? Google did not go "offline": it was
briefly unavailable for a relatively small number of networks.

~~~
ColinWright
You can't win. If you quote the title given, people complain. If you change it
to something more accurate, the mods change it back, and then people complain
anyway.

~~~
jemfinch
Yes; to be clear, I was complaining about the author's original title, not the
submission title.

------
hayksaakian
It really makes one wonder about the fragility of the internet.

~~~
precisioncoder
I would say the resilience is what impresses me here. The fact that it's
decentralized means that anyone can fix the internet. The fact that this one
specific problem was fixed within 26 min by individuals realizing the problem
and acting to fix it gives me a warm feeling.

~~~
kami8845
I think what you mean is that _anyone_ can break the internet (in this case a
random ISP from Indonesia) and that in that case only _very specific_ people
could fix it (probably at least a senior network engineer at said ISP).

~~~
sp332
Only specific routers that you trust (or are trusted by routers you trust) can
break your internet. You can fix your internet by un-trusting those routers.

------
ChuckMcM
At some point Anonymous is going to figure out that the BGP 'hack' is actually
exploitable, unlike taking the root name servers offline, and we will see a
network routing outage for several days. I wish it weren't so, but sometimes
that is the only way these things get fixed.

~~~
dsl
First of all to pull off this "hack" you need a router, an AS number, a
transit contract with your upstream provider, BGP configured with said
upstream, and most importantly your upstream needs to be negligent enough to
not apply route filters to your session (which basically means I will only
accept routes for IPs owned by company X over company X's session).

Secondly, it is pretty easy to track down who is doing it. Assuming a rogue
employee used their employer's setup (see first point) to announce one of
Google's routes and it managed to propagate, smart people at NOCs around the
world start emailing and calling each other pretty quickly. Despite CloudFlare
trying to take credit here, I'd put money on the fact the network in question
received at least a dozen phone calls and emails. There are services like
Renesys and BGPmon that "important" companies sign up for that will scream
bloody murder and start paging people if someone unauthorized originates your
prefixes.

Third, as this is a known problem, a solution is already in the works and on
its way to being implemented. Basically when you are assigned a block of IP
addresses, you also get to publish a cryptographically signed statement of how
and where that block should show up in the global routing table. See
[http://www.nanog.org/meetings/nanog49/presentations/Tuesday/...](http://www.nanog.org/meetings/nanog49/presentations/Tuesday/bgp-origin-validation-FINAL.pdf)
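
To make the signed-statement idea concrete, here is a minimal sketch of the origin check a router could run once it holds a set of already-verified statements (the cryptographic verification itself is left out). The `Roa` record, the function name and the sample data are assumptions for illustration; the valid / invalid / not-found outcome follows the usual description of BGP prefix origin validation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """Hypothetical record: who may originate a prefix, and how specific it may be."""
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue  # this statement says nothing about the announced prefix
        covered = True
        if roa.origin_asn == origin_asn and net.prefixlen <= roa.max_length:
            return "valid"
    # Covered by a statement but matching none of them means someone
    # unauthorized (or too specific) is originating the prefix.
    return "invalid" if covered else "not-found"


roas = [Roa("203.0.113.0/24", 24, 64500)]  # documentation prefix and AS number

print(validate_origin("203.0.113.0/24", 64500, roas))   # rightful origin
print(validate_origin("203.0.113.0/24", 64511, roas))   # wrong origin AS
print(validate_origin("203.0.113.0/25", 64500, roas))   # more specific than allowed
print(validate_origin("198.51.100.0/24", 64511, roas))  # no statement published
```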

~~~
guiambros
Well said, dsl. Almost two decades ago I used to run an ISP in another
country, and remember that BGP was already reasonably safe at the time (when
v4 started to be implemented), with peers normally rejecting route updates
from blocks outside your control.

Yes, there's always the risk of a trusted peer mistakenly leaking routes
publicly (and a permissive upstream provider not rejecting it outright), but
that's a low risk attack vector.

I do remember this happening a few times, but they were quickly spotted and
corrected (true, the internet at the time was a _lot_ smaller; you could
probably fit _all_ sysadmins of a country in a room..)

I see this article as the CloudFlare guy trying to get credit for an act of
civility that many other sysadmins likely have done, silently, in parallel. Of
course I'm glad he did, but wouldn't expect anything less. That's just how the
internet works.

ps: thanks for the link. NANOG is something that I had long ago erased from my
brain. Had a chuckle looking at the archives :)

------
ninetax
While this does make sense if I abstract out what BGP is, I wish I had a
deeper knowledge of how the Internet works.

Does anyone know of a book that goes from the basics of networking up to how
it's all assembled on a large scale?

A "big book of internet" if you will.

------
clebio
Since I use DuckDuckGo for searches, I probably wouldn't notice this. Not
receiving Gmail for a while wouldn't be noteworthy (at least for the first
half hour or so).

I'm confused about the times the author gives, though. The article is dated
today (11/6) and he says this happened 'today' at 6:24pm PST / 02:24 UTC. But
unless I'm mistaken, that is a time currently in the future
(<http://time.gov/timezone.cgi?Pacific/d/-8/java>). I guess he meant
yesterday?

~~~
robk
You're counting across the dateline, so for you it was 11/5.

~~~
clebio
Am I? Not snark: if I'm misunderstanding this, I truly want to know. I'm in
central US, CST, and the article gives PST. That conversion has always just
been +2 hours.

~~~
Cushman
As I read it that was 18:24 yesterday in PST, or 02:24 today in UTC. The use
of "today" may just be sloppy dating-- or it may reflect that it was today for
most of those affected.
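
The arithmetic can be checked with fixed UTC offsets. This is only a sketch: the year 2012 is an assumption (the thread gives only 11/5 and 11/6), and PST and CST are modeled as plain UTC-8 and UTC-6 with no daylight-saving handling.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")
CST = timezone(timedelta(hours=-6), "CST")

# 6:24pm PST on 11/5 (year assumed).
event_pst = datetime(2012, 11, 5, 18, 24, tzinfo=PST)

event_utc = event_pst.astimezone(timezone.utc)
event_cst = event_pst.astimezone(CST)

print(event_utc.isoformat())  # 2012-11-06T02:24:00+00:00
print(event_cst.isoformat())  # 2012-11-05T20:24:00-06:00
```

So the UTC timestamp already falls on 11/6 while the Pacific and Central timestamps are still on 11/5.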

~~~
clebio
My guess was that he wrote the article yesterday, but didn't publish it until
today. It's not a big deal, was just curious.

~~~
eastdakota
That's correct. Tom wrote the article yesterday (11/5) but I didn't review it
and hit publish until today (11/6). Sorry for the confusion.

------
rdl
I am more curious what caused the 4 minute mid-day outage a few days ago. It
wasn't BGP, since google.com was still up, but all personalization was down,
and YouTube was down.

------
flannell
Not the first time, see here;

[http://www.theregister.co.uk/2010/04/09/china_bgp_interweb_s...](http://www.theregister.co.uk/2010/04/09/china_bgp_interweb_snafu/)

------
runn1ng
Can I ask a pretty newbie question - how is BGP connected to IP, TCP and DNS
protocols? Is it sitting "below" them, "on top" of them, or is it somewhere
else?

~~~
jemfinch
First, TCP and DNS don't come into it: they both piggyback on IP (TCP
directly; DNS via UDP in typical use), so IP is all that's really relevant.

BGP is how routers communicate with each other. Every major edge router for a
network is typically connected to many other edge routers for other networks.
Each router announces what amounts to their complete routing table: i.e., for
every IPv4/IPv6 address that they know how to route, they announce what
networks it traverses on the way to the destination.

When a router is deciding which router an IP packet should hop to next, it
looks at the packet's destination IP address and consults an in-memory data
structure that it has constructed based on the BGP announcements of the
routers to which it's connected. Modulo refining nuances (MED/PREF), it looks
for two things:

1\. It routes the packet according to the most specific network it saw
announced. If it sees a packet destined for 1.2.3.4, and one connected router
A is announcing a route for 1.2.3.0/24, and another connected router B is
announcing a route for 1.2.0.0/16, it will pass along the packet to router A,
all other things being equal.

2\. As a tiebreaker for announcements with the same network specificity, it
looks at the "AS path": the set of networks that the packet will traverse. It
picks the router with the shortest path: the least number of traversed
networks.
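
The two rules above can be sketched as a single comparison, assuming each announcement is just a prefix, an AS path and the name of the neighbor that sent it. The dictionary layout and router names are made up for illustration, and real BGP has several more tie-breakers (the MED/PREF nuances mentioned earlier) that this ignores.

```python
import ipaddress


def choose_next_hop(destination, announcements):
    """Pick the neighbor with the most specific matching prefix,
    breaking ties by the shortest AS path."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        a for a in announcements
        if addr in ipaddress.ip_network(a["prefix"])
    ]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda a: (
            -ipaddress.ip_network(a["prefix"]).prefixlen,  # rule 1: longest prefix
            len(a["as_path"]),                             # rule 2: shortest AS path
        ),
    )
    return best["router"]


announcements = [
    {"router": "A", "prefix": "1.2.3.0/24", "as_path": [64500, 64501, 64502]},
    {"router": "B", "prefix": "1.2.0.0/16", "as_path": [64510]},
    {"router": "C", "prefix": "1.2.3.0/24", "as_path": [64503, 64502]},
]

# Rule 1 alone: A's /24 beats B's /16 even though B's path is shorter.
print(choose_next_hop("1.2.3.4", announcements[:2]))  # A
# Rule 2 as tiebreaker: A and C both announce the /24, C has the shorter path.
print(choose_next_hop("1.2.3.4", announcements))      # C
```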

So the answer to your direct question is that BGP is "somewhere else": it's
what routers use to communicate to each other "How will you route this IP
packet?" and then make reasonable decisions about how they should send packets
around the network.

~~~
count
To be clear - BGP runs on TCP port 179, so it sits in TCP segments, and those
are inside IP packets.
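
As a small illustration of the layering, this sketch builds the bytes of the simplest BGP message, a KEEPALIVE, which is nothing more than the fixed 19-byte header: a 16-byte marker of all ones, a 2-byte length and a 1-byte type. Those bytes would travel as ordinary payload on a TCP connection to port 179. The helper name is made up, and nothing here opens a real session.

```python
import struct

BGP_PORT = 179             # TCP port that BGP speakers listen on
BGP_MARKER = b"\xff" * 16  # 16-byte marker, all bits set
BGP_HEADER_LEN = 19        # marker (16) + length (2) + type (1)
KEEPALIVE = 4              # message type code for KEEPALIVE


def bgp_message(msg_type, payload=b""):
    """Prefix a payload with the fixed BGP header (network byte order)."""
    length = BGP_HEADER_LEN + len(payload)
    return BGP_MARKER + struct.pack("!HB", length, msg_type) + payload


keepalive = bgp_message(KEEPALIVE)
print(len(keepalive))   # 19
print(keepalive.hex())
```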

------
tomjen3
Does this mean that in the future we should ignore all routes coming from
PCCW (since they rebroadcast all routes without filtering)?

~~~
wmf
That's a good way to effectively disconnect yourself from the Internet. A lot
of ISPs are not properly filtering.

------
ambiguator
I don't know much about the processes behind the Internet, but I found this to
be a fascinating introduction.

------
halayli
Outage in an ASN != Google went Offline. The title puts the blame on Google
which isn't true.

~~~
oldcreek
The failure has nothing to do with BGP either.

------
rurounijones
Sidenote: In the comments I saw a reply about nanog.com being a great place to
meet other networking peeps.

<http://www.nanog.com/> is currently showing a "Welcome to nginx" message

~~~
bonobo
That's because the right address is <http://www.nanog.org>

~~~
rurounijones
Ta muchly.

------
dangoldin
This is a great write up - thanks for posting. I'm slowly beginning to
understand how the internet works day by day due to posts like this.

------
sneak
"I'm a network engineer at CloudFlare and I played a small part in helping
ensure Google came back online."

Uhh, no. Without the "ensure", then maybe.

~~~
oldcreek
Even without the "ensure" ... what did this network engineer at CloudFlare do
anyway? it was a hardware failure.

~~~
saraid216
He filed a bug report.

------
louischatriot
Very interesting explanation, thank you.

------
rahasia
27 minutes, 3-5% of traffic: that could mean thousands of dollars lost for
Google, right? (Is it suable?)

~~~
jQueryIsAwesome
Under which law are they going to sue? USA laws don't apply to Indian ISPs.

~~~
bonchibuji
It was an Indonesian ISP, not Indian.

------
henrymazza
> We use Google Apps for things like email so when we can't reach their
> servers

Very professional way to do so!

------
jamesinsf
Great explanation. Good show and great job! Very smart engineers at
Cloudflare!

------
yskchu
Haha, I was in HK today and was one of those hit; I use PCCW services too.

------
tcohen
I wish I understood more of this but still really cool!

------
sunyc
Almost all BGP transit providers have prefix filtering.

------
lhnn
DAE think that the whole "BGP is broken!" argument is a bit overblown?

If you're going to have a bunch of autonomous systems/networks operating
together, with no central authority, it necessarily comes down to trust and
relationships.

Shit will occasionally happen. It's important to look at outages, figure out
the cause, and work to prevent it. Perhaps, though, this is a best practices
issue, and not some fundamental flaw in BGP.

