Why Google Went Offline Today and a Bit about How the Internet Works (cloudflare.com)
676 points by ColinWright on Nov 6, 2012 | 154 comments

There are some other ways to fix the problem.

Last time with the Youtube problem, they advertised more specific routes. If Pakistan was advertising a /24 network (256 IP addresses), Youtube started advertising two /25 networks (2x 128 addresses). Since they are more specific, they are preferred over the broader routes. This works around a lack of cooperation, but not malicious behavior. It also only goes so far, because many networks will not pass routes more specific than, say, /24 or /28.
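To make the more-specific trick concrete, here is a minimal Python sketch. The prefix 203.0.113.0/24 is a documentation range picked for illustration (not the block involved in the real incident), and `best_match` is a hypothetical helper that only models longest-prefix matching, not a full router.

```python
import ipaddress

# Hypothetical documentation prefix (RFC 5737) standing in for the hijacked /24.
hijacked = ipaddress.ip_network("203.0.113.0/24")

# The two more-specific halves the legitimate owner could announce instead.
halves = list(hijacked.subnets(prefixlen_diff=1))


def best_match(destination, announced):
    """Return the announced prefix with the longest mask that contains destination."""
    addr = ipaddress.ip_address(destination)
    candidates = [net for net in announced if addr in net]
    return max(candidates, key=lambda net: net.prefixlen)


announced = [hijacked] + halves
print(hijacked.num_addresses)                   # number of addresses in the /24
print([str(net) for net in halves])             # the two /25 halves
print(best_match("203.0.113.200", announced))   # the covering /25 beats the /24
```

The same logic is why the trick stops working once the more-specifics are longer than what other networks are willing to accept.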

Most service providers also do 'inbound route filtering' to filter out any routes that they do not own. This isn't a simple process, which is why PCCW does not do it. Maybe a few more of these incidents and they will.

There's also AS Path filtering. This allows networks to be more granular in which paths they trust, by inspecting which ASes a route has gone through. If certain AS or AS path combinations become problematic, the internet at large could blackhole them or do manual route filtering. This would be laborious, but possible.
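A rough sketch of what AS path filtering amounts to, in Python rather than real router config. The AS numbers come from the documentation range (RFC 5398), and the names `BLOCKED_ASES`, `BLOCKED_SEQUENCES` and `accept_route` are made up for this example.

```python
# Hypothetical policy: distrust one AS outright, plus one specific AS-to-AS hop.
BLOCKED_ASES = {64500}
BLOCKED_SEQUENCES = [(64501, 64502)]


def contains_sequence(path, seq):
    """True if seq appears as a contiguous run inside path."""
    n = len(seq)
    return any(tuple(path[i:i + n]) == tuple(seq) for i in range(len(path) - n + 1))


def accept_route(as_path):
    """Reject a route whose AS path crosses a blocked AS or a blocked AS sequence."""
    if any(asn in BLOCKED_ASES for asn in as_path):
        return False
    return not any(contains_sequence(as_path, seq) for seq in BLOCKED_SEQUENCES)


print(accept_route([64496, 64499]))          # clean path
print(accept_route([64496, 64500, 64499]))   # passes through a blocked AS
print(accept_route([64501, 64502, 64499]))   # contains a blocked sequence
```

Real routers express this with regular expressions over the AS path, which is part of why maintaining it by hand is laborious.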

That said if someone can maliciously peer with an active BGP router, the damage to be done is significant. I haven't seen any outage reports from this type of attack, but I'm surprised by that.

Much more common than malicious outages is malicious creation of ghost networks. Basically a person could say over BGP "W.X.Y.Z is at my office" where that address isn't used by anyone anywhere else on the internet. Then they do their bad deeds from that made up address. Lastly they remove their route via BGP and it is as if their addresses never existed.

That might work for some unused /24's for a large organization's /8 block, but unused IPv4 addresses are so last year!

I suppose the attack will still work for IPv6 for a long time.

There are a lot of IPv4 addresses that are assigned but not routed on the Internet, so you can easily "borrow" them. This kind of trick does leave a trace, though.

Best explanation ever. Wow, seriously, this person can use the right words to help even the non-technical people understand such a complex situation. Thanks for posting this.

I used to manage networks and wondered, while reading the article, why it gets so many points on HN, when it only states obvious things and doesn't really go into detail. And then I realized most people these days have no idea about how packets get from here to there, or even that there are packets at all. Now I understand the appeal, but I guess this means that good introductory material is badly needed.

“BGP is literally the glue of the Internet” - I think you’ll find BGP is figuratively the glue of the Internet ;)

It depends on how one defines virtual glue. But, hey, the whole thing might be moot.


I'm in this camp. To me, 'literally' has only one meaning. If it doesn't, the word loses all utility. He could say 'is essentially the glue', I suppose.

You are outdated to the extent that you would have been behind the times in the 1680s where the word was already being used to mean 'what follows must be taken in the strongest admissible sense'.

That's fine, as long as you have an alternative ready that takes the meaning that the old "literally" had when I do want the statement to be taken, er, literally.

If you don't, then it makes a lot of sense to defend literal from non-literal usage.

What's the alternative that I can use and be understood?


"Hello, 911? Yeah, I've got an unconscious person here. His face is literally purple."

Did I mean literally literally, or figuratively? And how can the previous question have meaning?

You remove the emphasizer and just say: "His face is purple."

And then they say, "literally purple?"

The problem with this argument is that you are hypothesizing a case that, if it were going to be happening, would be happening now, not in the future. Yet it does not. There is no great epidemic of confused 911 operators because they can't make out whether or not someone on the other end used literally "correctly".

While one can speculate on why you might be wrong about this being a problem, an examination of the world around us rather strongly suggests that there's no question that there is something fatally wrong with your argument.

There's no epidemic of any single problem being caused by any imprecision in grammar. But there are lots of little, similar problems -- perhaps in non-emergency situations -- that cause predictable, avoidable confusion because people insist on breaking the use of important words.

If your point is that "we can make it impossible to communicate the concept 'literally' until there's an epidemic of deaths over it", then your threshold is in a very, very wrong place.

It was your implicit threshold you were setting with your argument, not mine. While I am gratified that you so thoroughly demolished your own argument for me, you might want to consider your arguments a bit more tactically in the future.

The real problem being caused here is well below the noise threshold and certainly not worth trying to play "Holier than thou" at people on the internet.

>It was your implicit threshold you were setting with your argument, not mine

That wasn't my threshold; that was an example of a confusion that couldn't be disambiguated without clear terms for literal vs figurative; it's just that it had unusually large implications for a scenario that requires fast, unambiguous communication. (I guess we don't have to care about these scenarios?)

Your own implicit threshold of "if someone doesn't die because of it, I can fuck up the communicative ability of a language however I feel like" is so thoroughly stupid, I doubt you even believe it yourself, yet feel the need to argue for it anyway.

In any case, I'm less concerned with who makes the best tactical moves than with discerning the best idea presented. As it stands, I don't yet see any justification for "let's get rid of this useful disambiguating feature for literal vs figurative" -- but feel free to keep offering them; maybe your knowledge of "tactics" could come in handy here, though I doubt it. Tactical arguments don't make a language useful. Rather, substance does.

And any time you ever get around to telling me how to indicate the old meaning of "literally" you just let me know. I get that it's not a real high priority for you right now (based on how you think), and I'm not holding my breath or anything, but it would be really cool if you could pull it off. Thanks.

The thing is it isn't a choice we are making now. It was made over 400 years ago and it works.

There are actually strong reasons to think ambiguity in certain cases is an asset in a language not a hindrance. This talk goes into it in part: http://www.ted.com/talks/steven_pinker_on_language_and_thoug...

Drop the word, or use one of: actually, truly, really.

Well, they make a valid point in that a word carries with it implicit attributes that should also be taken into account.

It's what really separates bad writing from very bad writing.

Then again, it's also a perspective thing.

Curious why such a specific time period. Is there a source for 1680s? The 17th century would probably be sufficient otherwise.

OED has the source if you have access to it, which I don't or I would provide it here.


Etymonline does not reproduce the OED, but it does source from it and sometimes you get lucky.

Thanks for the link! It's fairly interesting, too. That citation specifically reads: 'Erroneously used in reference to metaphors, hyperbole, etc., even by writers like Dryden and Pope, to indicate "what follows must be taken in the strongest admissible sense" (1680s), which is opposite to the word's real meaning.'

So for one, it states clearly 'erroneously used' (and indeed has this specific wording, 'strongest admissible sense').

Further, it gives the 1680 number, but doesn't actually source that any further (general writing periods of Dryden and Pope? Perhaps, though not clear).

Anyway, that's fun. I miss the OED, but it's the sort of massive tome that's impractical to always have on hand.

You do realise there's more than one meaning for "glue" these days, presumably (including "something that binds together")?

There's also more than one meaning for "literally" these days.

With two inflection points in the phraseology, you can literally reverse your glue.

That only makes it more likely that what was said has a semantically correct meaning though (or is that what you meant?).

He meant that there is literally only one definition for literally. Any other use of the word cuts away at its meaning (like what happened in that context).

Not really. It is literally something that binds the internet together (i.e. one of the common definitions of the word glue).

Now we can get into what "bind" means.

Or we can stop pretending like we're even a little bit confused as to what was meant at any point along the way here... natural language exists to facilitate communication and understanding, not pointless arguments over the form of idiomatic expressions.

Merriam-Webster "Ask the Editor": http://www.youtube.com/watch?v=Ai_VHZq_7eU

Also worth watching, "The History of English in 10 Minutes": http://www.youtube.com/watch?v=rexKqvgPVuA

I think it would be a stretch to suggest that the use of "literally" here was for the sake of hyperbole. It seems pretty clear that it's intended as an analogy.

Interesting videos though, didn't realize I'd stepped on another unexploded grammar mine from The War. I really should know better at this stage.

It's literally the figurative glue.

'literally' has literally become an overused cliche that doesn't mean anything and isn't used properly most of the time.

you can almost always take it out of the sentence it is being used in and the sentence is easier to read and makes more sense.

a new pet-hate word for me (nothing on you OP!)

Obligatory The Oatmeal reference: http://theoatmeal.com/comics/literally

I think BGP is more of a mucus than glue

There is an IETF WG called SIDR, which is working on solving this problem of invalid BGP announcements. A good summary is available here http://isoc.org/wp/ietfjournal/?p=2438 and technical details are in the related proposals.

Yes, a very good group for people to get involved with if they are interested in this problem.

If you're interested in peering (couldn't resist the pun) behind the curtain, read the NANOG[1] mailing list. These are the real guys keeping the Internet up and running :)

[1]: http://mailman.nanog.org/pipermail/nanog/

It is worth noting that the average HN reader should probably subscribe read-only. Unless you have your own AS and enable on routers, you should probably call your ISP with any issues. (though an unfortunate number of people disregard this advice, which results in smaller private splinter mailing lists sigh)

> When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem...

I wonder how he contacted his colleague. In this case, I presume that routing to other networks was unaffected. But in the general case, with a future of everything over IP, what will network engineers use to communicate about faults?

If you run a network with BGP, you always have good contact information for your peers. "Good" meaning direct telephone contact with tech people running the show on the other side of the link.

Telephony might be routed over IP too.

Many networks engage in a practice known as peering. If the author was the peering co-ordinator for his AS, he had likely been in touch with the peering co-ordinator on Moratel's side when establishing a peering relationship, and would thus have had direct contact information for a pretty direct channel (peering co-ordinators are generally also network engineers). That direct contact information could be e-mail, phone, IM, or even IRC (yes, some network engineers still use it), although the phone (non-VOIP) would be the only option not tied in to IP.

cellphones ?

There's a trend to use VoIP on cellphones too (see LTE). So in the future this will not help at all.

Even with LTE/VoLTE, the voice packets are usually contained within the telco's private network, which just happens to be IP based, with private interconnections to the other telcos.

At some point we might see most of the VoIP being transported across the internet as well, but that'll be the far future.

Is there some way to say "ignore DNS results from this provider", such that were you to spot an issue you could block that provider's information (and anyone replicating their version of the truth) and thus find a valid path? If that were possible you wouldn't be reliant on a third party to resolve the issue to get your system working, and once your system worked, you could contact them to resolve the issues for all.

If someone is giving out bad DNS records, you can just choose to use a different DNS server.

But in this case the problem was bad routes. You can certainly force your own routers to use fixed routes instead, but that doesn't help you unless everybody else along the path also does it. So it's not easy. There are tricks one can play -- like advertising your network as a set of smaller, more-specific networks (since routers will usually favor more-specific routes over more general ones).

@JohnLBevan this was not a DNS problem, but a BGP problem

NB Cannot reply under his post

Even when there's no "reply" link next to the poster's name, I think you can usually reply if you click the "link" link in the same area.

The author (Tom Paseka) wrote near the conclusion that he himself addressed Google's issue by contacting a Moratel engineer. Do you have the same feeling when reading the article? It sounds weird that Google did not trigger a recovery procedure on its own.

Maybe I see bad things everywhere and you may call me paranoid, but could it be some sort of ("false") advertising on the side of CloudFlare?

I'm not a network engineer, but it seems like the kind of thing that might be very hard to detect when you're already inside or near to the google.com domain. Or maybe CloudFlare just got there first.

I don't think it's necessary to call BS on Cloudflare without any kind of evidence at all.

This is basically correct. BGP is weird. The addresses for one of Google's many datacenters were routed incorrectly for packets coming from some subset of IP space. Unless Google is running active ping tests to that subset of IP space, the way they would normally detect it is for someone to call and complain.

In this case, the author decided to take a shortcut and call the owner of the "problem peer" directly.

Although only a vanishingly small percentage of Google users can call and complain. Blog or tweet or post to HN and hope Matt Cutts sees it and notifies the right team, maybe.

A team of Googlers could have been working on this in parallel to Tom. I'm guessing that a sudden drop of queries like that would cause people at Google to start digging into what happened. I don't know either way, because network ops and BGP is pretty far from my area (search quality).

>Blog or tweet or post to HN and hope Matt Cutts sees it and notifies the right team, maybe.

It seems that's more or less the quasi-official support channel even for paid services from Google.

A common way to notice things like this is to subscribe to a service like Renesys or Cyclops (http://cyclops.cs.ucla.edu/) that will alert you if it sees your subnets being announced by a different AS.
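For a feel of what such an alert boils down to, here is a minimal sketch in Python. It is not how Renesys or Cyclops actually work internally; `EXPECTED_ORIGINS`, `check_announcement` and the prefixes and AS numbers (documentation ranges) are all assumptions for illustration.

```python
import ipaddress

# Hypothetical inventory: prefixes we own, mapped to the AS expected to originate them.
EXPECTED_ORIGINS = {
    ipaddress.ip_network("198.51.100.0/24"): 64496,
}


def check_announcement(prefix, as_path):
    """Return an alert string if the announcement overlaps our space with an
    unexpected origin AS, otherwise None."""
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]  # the last AS in the path is the one originating the route
    for ours, expected in EXPECTED_ORIGINS.items():
        if net.overlaps(ours) and origin != expected:
            return f"ALERT: {net} originated by AS{origin}, expected AS{expected}"
    return None


print(check_announcement("198.51.100.0/24", [64510, 64496]))  # our own announcement
print(check_announcement("198.51.100.0/25", [64510, 64499]))  # someone else's more-specific
```

A real service would feed this from many BGP vantage points, since a bad announcement may only be visible from part of the internet.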

Stop excusing yourself for not being something, either read more and then comment or trust your gut feeling. (not meant to sound harsh)

I think it's good to qualify your opinion with your level of expertise. There's no rule that says HN should only be for discussion by experts (hopefully there never will be), and if you don't know something for sure it's best to say so that others don't take your word as gospel. That said, I'm no expert ;).

The key quote seems to be "Looking at peering maps, I'd estimate the outage impacted around 3–5% of the Internet's population." So if it didn't affect Google directly, it would have to go through customer service -> network technicians, which would probably take more than the 26 min that Google was down for those customers. I'm sure they would have been right on it if it hadn't been fixed so fast.

> It sounds weird that Google did not trigger a recovery procedure on its own.

It's possible they didn't have the personal contact details for the engineer capable of fixing the problem.

We all know how hard it can be to contact a competent person at a big corporation when you have a problem [1]. Would Google find it easier than every other human being?

[1] http://xkcd.com/806/

It sounds like cloudflare simply got there first. Unless cloudflare is outright lying (highly unlikely) they saw a problem they could fix, and fixed it. What's the false advertising there?

They didn't fix anything. Multiple people noticed, all of them contacted the network in question, then they took credit publicly when another network fixed its mistake.

Couldn't a rogue government easily take down the internet this way? Seems like if one guy in Indonesia can take out Google by accident, a government entity could do the same.

The moment people realize that the rogue network was being malicious, they'd stop trusting it - ignoring all announcements it might make. It might take a few hours for order to be restored, though.

Would it be possible to claim to own Google's IP, then on receiving the packets intended for Google forward them on to the real IP (without accidentally forwarding them back to yourself)? That way someone could hijack & interrogate these packets without being spotted (at least without causing service outage / only adding slight delay). Alternatively could they route these requests to a clone as an advanced phishing scam?

Wouldn't the packets that <evil network> forwarded on to the real Google just get routed right back to them, because the rest of the world thinks they are Google?

It might work for somebody like China, where they have two network interfaces, so can make all Chinese networks think they are Google on one interface, then forward things on to the real Google on the other. There might be a good reason for them to do it, too, because they are also likely a trusted CA, so could forge SSL certs, too.

That is more or less the definition of a man in the middle attack. Hopefully if the website does something important (online banks, shopping, etc.), they have done something to mitigate that possibility.

That's what https is for. It should prevent them from doing anything useful with the packets.

Not unless you manage to forge a certificate at the same time. It has been done before, as SSL is based on more or less the same level of trust as BGP.

Maybe you can't read the data in the packets, but HTTPS doesn't do a thing about SIGINT (signals intelligence) which, on such a large scale, could give you a lot of valuable information.

Enough time for a purge or to gain an advantage in an invasion. This might even be on checklists for such actions.

This exact thing happened a few years ago. China rerouted about 10% of internet traffic, presumably by accident. http://www.theregister.co.uk/2010/04/09/china_bgp_interweb_s...

And nothing will change. At least not until someone does this with malicious intent - script kiddie A knocks out big site, or a censoring state decides that it should block a free speech site from the entire Internet.

Evil routing has been employed a whole bunch of times going back decades, most visibly a couple of years ago when IIRC Iran (?) started advertising bad routes for a bunch of big sites, including Google

Pakistan null routed YouTube and accidentally took a big chunk down around the world in 2008.

I can't find much information about Iran advertising bad routes, but China did: http://bgpmon.net/?p=282

A much more useful thing to do than take out a big site is use BGP to create your own "section of the internet" to do your malicious deeds from, then afterwards remove the BGP routes and your addresses will no longer exist on the internet. So it will be as if you sent packets from a phantom network.

That's why many people have been archiving every advertised route going back to the mid-90's.

http://archive.routeviews.org/oix-route-views/ http://www.pch.net/resources/data.php?dir=/routing-tables (link appears to be temporarily broken)

Script kiddies generally do not have access to edge routers though...

To be accurate, google didn't go down today -- your pathway from your computer to google got 'poisoned'. It wasn't Google's fault.

For what it's worth, this is quite a vulnerability in the internet's routing system. It's also the reason Youtube went offline a few years ago, after Pakistan deliberately announced the wrong routes because it didn't agree with some videos being broadcast by Youtube.



This worries me. Am I right in saying a malicious party could actually take down the internet with this?

Yes, but you'd have to con a lot of big players into trusting your BGP routes first. And the effect would only last as long as it took to change some configurations and write you back out of the internet.

Malicious party? It already happened by accident. http://www.renesys.com/blog/2009/02/longer-is-not-better.sht...

They would have to be a 'trusted' malicious party. But in theory, yes. I'm sure it would be reverted very quickly though.

Why wouldn't PCCW preventing its customers from publishing routes outside its whitelist work? It has been a long time since I worked on BGP but that was common practice from back haul carriers to ISPs even at that point (2003). Given the same back haul provider has allowed this twice, it seems like a reasonable ask.

Some carriers are lazy. There may also be politics involved in making national carriers "ask" for permission to advertise routes.

Isn't the title a touch sensationalist? Google did not go "offline": it was briefly unavailable for a relatively small number of networks.

You can't win. If you quote the title given, people complain. If you change it to something more accurate, the mods change it back, and then people complain anyway.

Yes; to be clear, I was complaining about the author's original title, not the submission title.

For the people affected, google was for all purposes "offline". If the estimate of 3–5% of the internet population is correct, that's a lot of people.

I think that if 3-5% of the Internet population suddenly stopped querying Google, Google would have noticed before CloudFlare.

There weren't any claims that Google were unaware of this. And when things happen at this level, resolving it in 27 minutes can only be done when there's direct contact between people that are able to do anything about it.

It really makes one wonder about the fragility of the internet.

I would say the resilience is what impresses me here. The fact that it's decentralized means that anyone can fix the internet. The fact that this one specific problem was fixed within 26 min by individuals realizing the problem and acting to fix it gives me a warm feeling.

I think what you mean is that anyone can break the internet (in this case a random ISP from Indonesia) and that in that case only very specific people could fix it (probably at least a senior network engineer at said ISP).

Only specific routers that you trust (or are trusted by routers you trust) can break your internet. You can fix your internet by un-trusting those routers.

At some point Anonymous is going to figure out the BGP 'hack' is actually exploitable, unlike taking the root name servers offline, and we will see a network routing outage for several days. I wish it wasn't so, but sometimes that is the only way these things get fixed.

First of all to pull off this "hack" you need a router, an AS number, a transit contract with your upstream provider, BGP configured with said upstream, and most importantly your upstream needs to be negligent enough to not apply route filters to your session (which basically means I will only accept routes for IPs owned by company X over company X's session).
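The route filter in that last point can be sketched in a few lines of Python. This is an illustration of the idea only, not any vendor's config syntax; `CUSTOMER_PREFIXES`, `MAX_PREFIXLEN` and `accept_from_customer` are hypothetical names, and the prefixes are documentation ranges.

```python
import ipaddress

# Hypothetical whitelist: the prefixes registered to the customer on this BGP session.
CUSTOMER_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]

# Assumed policy: refuse anything more specific than a /24.
MAX_PREFIXLEN = 24


def accept_from_customer(prefix):
    """Accept an announcement only if it falls inside the customer's registered
    space and is not more specific than the policy allows."""
    net = ipaddress.ip_network(prefix)
    if net.prefixlen > MAX_PREFIXLEN:
        return False
    return any(net.subnet_of(allowed) for allowed in CUSTOMER_PREFIXES)


print(accept_from_customer("192.0.2.0/24"))     # the customer's own block
print(accept_from_customer("198.51.100.0/24"))  # somebody else's block, leaked or hijacked
```

With a filter like this on the upstream side, a customer announcing someone else's prefix is dropped at the first hop instead of propagating.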

Secondly, it is pretty easy to track down who is doing it. Assuming a rogue employee used their employer's setup (see first point) to announce one of Google's routes and it managed to propagate, smart people at NOCs around the world start emailing and calling each other pretty quickly. Despite CloudFlare trying to take credit here, I'd put money on the fact that the network in question received at least a dozen phone calls and emails. There are services like Renesys and BGPmon that "important" companies sign up for that will scream bloody murder and start paging people if someone unauthorized originates your prefixes.

Third, as this is a known problem, a solution is already in the works and on its way to being implemented. Basically when you are assigned a block of IP addresses, you also get to publish a cryptographically signed statement of how and where that block should show up in the global routing table. See http://www.nanog.org/meetings/nanog49/presentations/Tuesday/...
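Assuming the signed statement works like an RPKI route origin authorization (a prefix, a maximum prefix length and an authorized origin AS), the check a router would run looks roughly like the sketch below. The cryptographic verification is left out entirely, and `Roa`, `ROAS` and `validate` are names invented for this example, with documentation prefixes and AS numbers.

```python
import ipaddress
from typing import NamedTuple


class Roa(NamedTuple):
    """One already-verified authorization: who may originate what, and how specific."""
    prefix: ipaddress.IPv4Network
    max_length: int
    origin_as: int


# Hypothetical published authorization for one block.
ROAS = [Roa(ipaddress.ip_network("198.51.100.0/24"), 24, 64496)]


def validate(prefix, origin_as, roas=ROAS):
    """Classify an announcement as valid, invalid or not-found."""
    net = ipaddress.ip_network(prefix)
    covering = [r for r in roas if net.subnet_of(r.prefix)]
    if not covering:
        return "not-found"  # nobody has published anything about this space
    for r in covering:
        if r.origin_as == origin_as and net.prefixlen <= r.max_length:
            return "valid"
    return "invalid"  # covered by an authorization, but none of them matches


print(validate("198.51.100.0/24", 64496))  # the authorized origin
print(validate("198.51.100.0/24", 64499))  # wrong origin AS
print(validate("203.0.113.0/24", 64499))   # no authorization published
```

The maximum length matters because without it, a hijacker could still win by announcing a more-specific slice of an authorized block.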

Well said, dsl. Almost two decades ago I used to run an ISP in another country, and remember that BGP was already reasonably safe at the time (when v4 started to be implemented), with peers normally rejecting route updates from blocks outside your control.

Yes, there's always the risk of a trusted peer mistakenly leaking routes publicly (and a permissive upstream provider not rejecting it outright), but that's a low risk attack vector.

I do remember this happening a few times, but they were quickly spotted and corrected (true, the internet at the time was a lot smaller; you could probably fit all sysadmins of a country in a room..)

I see this article as the CloudFlare guy trying to get credit for an act of civility that many other sysadmins likely have done, silently, in parallel. Of course I'm glad he did, but wouldn't expect anything less. That's just how the internet works.

ps: thanks for the link. NANOG is something that I had long ago erased from my brain. Had a chuckle looking at the archives :)

Correct of course, you would need to compromise the infrastructure of an ISP. Not that a few hundred dollars or a USB stick drive at the right place couldn't do it. Especially for a less well travelled part of the Internet.

While this does make sense if I abstract out what a BGP is, I wish I had a deeper knowledge of how the Internet works.

Does anyone know of a book that goes from the basics of networking up to how it's all assembled on a large scale?

A "big book of internet" if you will.

Since I use DuckDuckGo for searches, I probably wouldn't notice this. Not receiving Gmail for a while wouldn't be noteworthy (at least for the first half hour or so).

I'm confused about the times the author gives, though. The article is dated today (11/6) and he says this happened 'today' at 6:24pm PST / 02:24 UTC. But unless I'm mistaken, that is a time currently in the future (http://time.gov/timezone.cgi?Pacific/d/-8/java). I guess he meant yesterday?

You're counting across the dateline, so for you it was 11/5.

Am I? Not snark: if I'm misunderstanding this, I truly want to know. I'm in central US, CST, and the article gives PST. That conversion has always just been +2 hours.

As I read it that was 18:24 yesterday in PST, or 02:24 today in UTC. The use of "today" may just be sloppy dating-- or it may reflect that it was today for most of those affected.
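The date arithmetic is easy to check with a few lines of Python. This assumes the 11/5 date for the incident and a fixed UTC-8 offset for PST; the variable names are just for this example.

```python
from datetime import datetime, timedelta, timezone

# PST as a fixed offset of UTC-8 (daylight saving had already ended by early November).
PST = timezone(timedelta(hours=-8), "PST")

incident_pst = datetime(2012, 11, 5, 18, 24, tzinfo=PST)
incident_utc = incident_pst.astimezone(timezone.utc)

print(incident_pst.isoformat())  # evening of 11/5 on the US west coast
print(incident_utc.isoformat())  # early morning of 11/6 in UTC
```

So the same instant is "yesterday" in PST and "today" in UTC, which is where the confusion comes from.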

My guess was that he wrote the article yesterday, but didn't publish it until today. It's not a big deal, was just curious.

That's correct. Tom wrote the article yesterday (11/5) but I didn't review it and hit publish until today (11/6). Sorry for the confusion.

I am more curious what caused the 4 minute mid-day outage a few days ago. It wasn't BGP, since google.com was still up, but all personalization was down, and YouTube was down.

Can I ask a pretty newbie question - how is BGP connected to IP, TCP and DNS protocols? Is it sitting "below" them, "on top" of them, or is it somewhere else?

First, TCP and DNS don't come into it: they both piggyback on IP (TCP directly; DNS via UDP in typical use), so IP is all that's really relevant.

BGP is how routers communicate with each other. Every major edge router for a network is typically connected to many other edge routers for other networks. Each router announces what amounts to their complete routing table: i.e., for every IPv4/IPv6 address that they know how to route, they announce which networks it traverses on the way to the destination.

When a router is deciding which router an IP packet should hop to next, it looks at the packet's destination IP address and consults an in-memory data structure that it has constructed based on the BGP announcements of the routers to which it's connected. Modulo refining nuances (MED/PREF), it looks for two things:

1. It routes the packet according to the most specific network it saw announced. If it sees a packet destined for some address, and one connected router A is announcing a more specific route covering that address while another connected router B is announcing a broader one, it will pass along the packet to router A, all other things being equal.

2. As a tiebreaker for announcements with the same network specificity, it looks at the "AS path": the set of networks that the packet will traverse. It picks the router with the shortest path: the fewest traversed networks.
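The two rules above can be sketched as a single selection function in Python. This deliberately ignores local preference, MED and everything else a real BGP implementation weighs; `Route`, `select_route`, the prefixes and the AS numbers are all made up for illustration.

```python
import ipaddress
from typing import NamedTuple, Tuple


class Route(NamedTuple):
    prefix: str                # announced network, e.g. "192.0.2.0/24"
    as_path: Tuple[int, ...]   # networks the packet would traverse
    neighbor: str              # which connected router announced it


def select_route(destination, routes):
    """Pick the most specific covering prefix; break ties on shortest AS path."""
    addr = ipaddress.ip_address(destination)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen, len(r.as_path)),
    )


routes = [
    Route("192.0.2.0/25", (64500, 64501, 64502), "A"),
    Route("192.0.2.0/24", (64503,), "B"),
    Route("192.0.2.0/24", (64504, 64505), "C"),
]

print(select_route("192.0.2.10", routes).neighbor)   # A: more specific, despite the longer path
print(select_route("192.0.2.200", routes).neighbor)  # B: same specificity as C, shorter AS path
```

Note that specificity always wins over path length, which is exactly why a bogus more-specific announcement is so effective at attracting traffic.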

So the answer to your direct question is that BGP is "somewhere else": it's what routers use to communicate to each other "How will you route this IP packet?" and then make reasonable decisions about how they should send packets around the network.

To be clear - BGP runs on TCP port 179, so it sits in TCP segments, and those are inside IP packets.
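To show what "sits in TCP segments" means in practice, here is a small Python sketch that builds and parses the simplest BGP message, a KEEPALIVE. The header layout (16-byte all-ones marker, 2-byte length, 1-byte type) follows the BGP-4 spec; the function names are invented for this example, and nothing is actually sent over the network.

```python
import struct

BGP_PORT = 179          # the TCP port BGP speakers connect to
MARKER = b"\xff" * 16   # 16-byte marker field, all ones
KEEPALIVE = 4           # message type code for KEEPALIVE
HEADER_LEN = 19         # marker (16) + length (2) + type (1)


def build_keepalive():
    """A KEEPALIVE is just the 19-byte header with no body."""
    return MARKER + struct.pack("!HB", HEADER_LEN, KEEPALIVE)


def parse_header(data):
    """Return (length, type) from the first 19 bytes of a BGP message."""
    marker, length, msg_type = struct.unpack("!16sHB", data[:HEADER_LEN])
    if marker != MARKER:
        raise ValueError("bad BGP marker")
    return length, msg_type


message = build_keepalive()
print(len(message))
print(parse_header(message))
```

These bytes would be written to an ordinary TCP connection to port 179 on the neighboring router, and TCP in turn hands them to IP.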

Rather than considering them as a meaningful stack, I think it helps to know what each does.

IP is a protocol for taking a chunk of data, slapping some addressing information on it, and then having it be sent, like an electronic letter, from one computer to another by whatever route the network thinks is best. More precisely every computer sends it to a computer it is directly connected to that it thinks is closer. Eventually, hopefully, it gets to the right place.

If you just want to send chunks of data over IP and hope that they get there, you have UDP.

TCP is a more advanced protocol where one computer contacts another, and then a stream of data starts to flow between them through a connection. Under the hood the stream is broken into chunks that are put in IP packets. And there are extra packets for things like, "Hello, trying to connect here" "I got these packets" "I'm done" and so on. Obviously TCP sits on top of IP.

DNS is a protocol for turning a human readable name like news.ycombinator.com into an IP address. Under the hood DNS uses both UDP and TCP.

BGP is a protocol that is used between routers to advertise how to route packets. BGP uses TCP to work, so it is above TCP. But that routing information is used at the IP level, so bad routes can stop IP from working. Which is what happened here. Someone advertised that they were how to get to a lot of Google addresses, so routers began sending Google traffic there. When the packets arrived, they had no idea what to do with them and dropped them. The result is that the IP layer to Google stopped working for a lot of people.

Newbie question: If BGP uses TCP to work, TCP is above IP, and routers use BGP information to route IP packets, how may bootstrapping happen if needed?

Typically no bootstrapping is needed, because IP (and thus TCP) works between hosts in the same subnet without routing.

You would typically have a /30 or /31 subnet containing a pair of routers and have the routers communicate (BGP etc) using those addresses.
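A quick way to see what those tiny subnets contain is Python's standard `ipaddress` module. The prefixes below are documentation addresses chosen for illustration, not anything a real pair of routers uses.

```python
import ipaddress

# A /30 has four addresses; two are usable once network and broadcast are excluded.
link30 = ipaddress.ip_network("192.0.2.0/30")
hosts30 = [str(h) for h in link30.hosts()]

# A /31 has two addresses and both are usable on a point-to-point link.
link31 = ipaddress.ip_network("192.0.2.4/31")
hosts31 = [str(h) for h in link31.hosts()]

print(hosts30)  # ['192.0.2.1', '192.0.2.2']
print(hosts31)  # ['192.0.2.4', '192.0.2.5']

# Both ends of the link are in the same subnet, so they can reach each
# other directly and open a BGP session without any routing table entry.
same_subnet = ipaddress.ip_address("192.0.2.2") in link30
print(same_subnet)
```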

BGP runs on top of TCP, which runs on top of IP.

DNS runs on top of UDP (or sometimes TCP), which runs on top of IP.

Edited to elaborate: most computers on the internet don't need to know anything about BGP. It's not directly involved when you establish connections. Think of it as an automatic configuration system running on the various routers.

Saying BGP runs on top of IP is true, but it doesn't tell you what BGP does or why global-scale IP doesn't work without BGP.

BGP is the protocol Internet Routers (i.e. not your home router) use to figure out how to route IP addresses to particular routers.

So your home router connects to an internet router at Comcast (or your ISP). The Comcast router announces to the rest of the world "Dear world, if you want to connect to any of the IP addresses at X.X.X.X send those packets to me and I'll deal with them."

BGP is comprised of long lived TCP connections between routers. The IP of the other router is well known and hard coded in the config afaik but I'm not a network engineer.

In terms of understanding the Internet, the OSI model is less useful than the Internet model outlined in RFC 1122 and RFC 1123, because the OSI model persists in making distinctions without a difference, like the separation of the Application and Presentation layers. (Frankly, most of the time the Link layer is entirely determined by the Physical layer, so that distinction is also of marginal use in the real world.)

On a more historical note, the Internet protocol suite beat the OSI protocol suite. Practically nobody uses the OSI protocols anymore, so why bother trying to fit the Internet protocols into the OSI model?




OSI is what I learned in school an age-and-a-half ago, that's why I included that link also. :)

Does this mean that in the future we should ignore all routes coming from PCCW (since they rebroadcast all rules without filtering)?

That's a good way to effectively disconnect yourself from the Internet. A lot of ISPs are not properly filtering.

I don't know much about the processes behind the Internet, but I found this to be a fascinating introduction.

Outage in an ASN != Google went Offline. The title puts the blame on Google which isn't true.

The failure has nothing to do with BGP either.

Sidenote: In the comments I saw a reply about nanog.com being a great place to meet other networking peeps.

http://www.nanog.com/ is currently showing a "Welcome to nginx" message

That's because the right address is http://www.nanog.org

Ta muchly.

This is a great write up - thanks for posting. I'm slowly beginning to understand how the internet works day by day due to posts like this.

"I'm a network engineer at CloudFlare and I played a small part in helping ensure Google came back online."

Uhh, no. Without the "ensure", then maybe.

Even without the "ensure" ... what did this network engineer at CloudFlare do anyway? It was a hardware failure.

He filed a bug report.

Very interesting explanation, thank you.

27 minutes, 3-5% of traffic: that could mean thousands of dollars lost for Google, right? (Could they sue over it?)

Under which law are they going to sue? USA laws don't apply to Indian ISPs.

It was an Indonesian ISP, not Indian.

> We use Google Apps for things like email so when we can't reach their servers

Very professional way to do so!

Great explanation. Good show and great job! Very smart engineers at Cloudflare!

Haha, I was in HK today and was one of those hit, also using PCCW services.

I wish I understood more of this but still really cool!

Almost all BGP transit providers have prefix filtering.
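For anyone wondering what prefix filtering amounts to, here is a minimal Python sketch. It assumes the provider keeps a list of prefixes each customer is registered to announce (the `ALLOWED` list and function name are hypothetical) and rejects anything outside it. Real filters are written in router configuration and usually also cap the prefix length.

```python
import ipaddress

# Hypothetical customer record: the only address space it may announce.
ALLOWED = [ipaddress.ip_network("203.0.113.0/24")]


def accept_announcement(prefix: str) -> bool:
    """Accept a route only if it falls inside the customer's registered space."""
    announced = ipaddress.ip_network(prefix)
    return any(announced.subnet_of(allowed) for allowed in ALLOWED)


print(accept_announcement("203.0.113.0/24"))   # the customer's own block
print(accept_announcement("203.0.113.128/25")) # a more specific piece of it
print(accept_announcement("198.51.100.0/24"))  # someone else's addresses
```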

DAE think that the whole "BGP is broken!" argument is a bit overblown?

If you're going to have a bunch of autonomous systems/networks operating together, with no central authority, it necessarily comes down to trust and relationships.

Shit will occasionally happen. It's important to look at outages, figure out the cause, and work to prevent it. Perhaps, though, this is a best practices issue, and not some fundamental flaw in BGP.
