
  Ok, so then I think your question is: if my DNS request for whatsapp.net makes it to the seattle-1 PoP, will it only respond with a seattle-1 IP? That's one way to do it, but it's not necessarily the best way. Since my DNS requests could make it to any PoP, an answer that points at whichever PoP I happened to reach may not send me to the best place.

  Ideally, you want to send back an answer that is network-local to the requester and also not a PoP that is overloaded. Every fancy DNS server does it a little differently, but more or less you're integrating a bunch of information that links resolver IP to network location, plus capacity information, and doing the best you can. Sometimes that means sending users to anycast, which should end up network-local (but doesn't always); sometimes it's sending them to a specific PoP you think is local; sometimes it's sending them to another PoP because the usual best PoP has some issue (overloaded on CPU, network congestion to the datacenters, network congestion on peering/transit, utility power issue, incoming weather event, fiber cut or upcoming fiber maintenance, etc).
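
To make that concrete, here's a rough sketch of the selection logic described above. This is my own toy example, not FB's actual code; every PoP name, IP, load number, and the resolver mapping are made up.

    # Toy geo-aware DNS answer selection: map the resolver's IP to a rough
    # network location, then prefer a local PoP that still has headroom.
    # (Hypothetical data; real systems fold in latency measurements,
    # EDNS Client Subnet, BGP feeds, and much richer capacity signals.)

    POPS = {
        "seattle-1": {"ip": "203.0.113.10", "region": "us-west", "load": 0.55, "healthy": True},
        "sf-1":      {"ip": "203.0.113.20", "region": "us-west", "load": 0.92, "healthy": True},
        "nyc-1":     {"ip": "203.0.113.30", "region": "us-east", "load": 0.40, "healthy": True},
    }

    # Hypothetical resolver-to-region mapping.
    RESOLVER_REGION = {"198.51.100.7": "us-west"}

    LOAD_LIMIT = 0.85  # don't steer new users to a PoP above this utilization

    def pick_answer(resolver_ip: str) -> str:
        region = RESOLVER_REGION.get(resolver_ip)
        usable = [p for p in POPS.values() if p["healthy"] and p["load"] < LOAD_LIMIT]
        if not usable:
            raise RuntimeError("no usable PoP; fall back to anycast or serve stale")
        # Prefer a PoP in the requester's region, then the least-loaded one.
        usable.sort(key=lambda p: (p["region"] != region, p["load"]))
        return usable[0]["ip"]

    print(pick_answer("198.51.100.7"))  # seattle-1's IP; sf-1 is skipped as overloaded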
Right, I was hoping FB's DNS servers would be smarter than usual: say, when the DNS at Seattle-1 cannot reach the backbone, it would respond with the IP of perhaps NYC/SF before it starts the BGP withdrawal.

Thanks for the write-up; I enjoyed it.




> Right, I was hoping FB's DNS servers would be smarter than usual: say, when the DNS at Seattle-1 cannot reach the backbone, it would respond with the IP of perhaps NYC/SF before it starts the BGP withdrawal.

The problem there is coordination. The PoPs don't generally communicate amongst themselves (and may not have been able to after the FB backbone broke; technically they could have over transit connectivity, but it may not be configured to work that way), so when a PoP loses its connection to the FB datacenters, it also loses its source of which PoPs are available and healthy. I think this is a classic distributed systems problem: the desired behavior when an individual node becomes unhealthy is different from when all nodes become unhealthy, but the nature of distributed systems is that a node can't tell if it's the only unhealthy node or if all nodes became unhealthy together. Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.
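
Here's a toy illustration of that asymmetry (my own sketch, nothing to do with FB's real health-check code): each PoP only sees its own link to the datacenters, so it can't distinguish "I'm the one that's broken" from "the backbone is gone for everyone."

    # Each PoP acts on purely local information: if it can't reach the
    # datacenters, it withdraws its anycast route.

    def pop_should_withdraw(backbone_reachable: bool) -> bool:
        return not backbone_reachable

    # Normal partial failure: one PoP loses its link, withdraws, traffic
    # shifts to the remaining PoPs. Correct behavior.
    pops = {"seattle-1": True, "sf-1": True, "nyc-1": False}
    print([name for name, up in pops.items() if pop_should_withdraw(up)])
    # -> ['nyc-1']

    # Backbone-wide failure: every PoP sees the same local symptom and
    # withdraws, so the anycast prefix disappears entirely.
    pops = {name: False for name in pops}
    print([name for name, up in pops.items() if pop_should_withdraw(up)])
    # -> all of them: individually correct, collectively wrong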


You are to the point and precise. This is exactly the problem.

  Each individual PoP did the right thing by dropping out of the anycast, but because they all did it, it was the wrong thing.
Somehow I feel the design is flawed because it abuses DNS server status a bit. I mean, DNS server down plus a BGP withdrawal for that DNS server is a perfect combination; but connectivity between DNS and the backend being down, with DNS still up, plus a BGP withdrawal for the DNS server is not. DNS did not fail, so it should just fall back to some other operational DNS, perhaps a regional/global default one.


I think this is not necessarily a flaw of the design. It's a fundamental weakness of the real world.

Either you can take the backbone being unavailable as a signal that the PoP is broken, and kill the PoP; or you can take the backbone being unavailable as a signal that the backbone is broken and do your best.

Whichever interpretation you pick, when it turns out to be wrong you'll need humans to come around and intervene. It's much more common that only the PoP is broken, so having that case require intervention would result in more effort.

The flaw here is more that the intervention required to get the backbone back was hard to do, because internal tools to bring back the backbone relied on DNS which relied on the backbone being up. There were also reports that physical security relied on the backbone being up, and that restoring the backbone needed physical access.

This isn't the first large-scale FB outage where the root cause was a bad configuration pushed globally and quickly. It's really something they need to learn not to do. But even without that, being able to get key things running again, like the backbone, DNS, the configuration system(s), and centralized authentication, needs to be doable without those key systems already running. I suspect at least some of that will be improved on, and hopefully regularly practiced.
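
For the "don't push configs globally and quickly" part, the usual mitigation is a staged rollout with a health gate between stages. A generic sketch of the idea follows; this is not FB's deployment tooling, and the stage names, bake time, and health check are all made up.

    import time

    # Roll a config out in widening stages, checking health before expanding.
    STAGES = [
        ["canary-pop"],          # one PoP first
        ["seattle-1", "nyc-1"],  # then a small slice
        ["everything-else"],     # finally the rest of the fleet
    ]

    def apply_config(target: str, config: dict) -> None:
        print(f"applying {config} to {target}")

    def looks_healthy(target: str) -> bool:
        # Placeholder: in practice check error rates, BGP session state,
        # DNS answer success, etc., and wait long enough for problems to show.
        return True

    def staged_rollout(config: dict) -> None:
        for stage in STAGES:
            for target in stage:
                apply_config(target, config)
            time.sleep(1)  # bake time (much longer in reality)
            if not all(looks_healthy(t) for t in stage):
                raise RuntimeError(f"rollout halted at {stage}; roll back")

    staged_rollout({"backbone_route_policy": "v2"})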


I'm not going to claim to be a BGP expert, but as I understand it, the way BGP propagation works makes it a pretty global thing just in terms of how the router hardware handles it, which makes this unusually tricky to avoid.

I don't disagree about the general problem, mind; I just have a feeling that fixing "don't push configs globally" for BGP specifically is unusually complicated.


  internal tools to bring back the backbone relied on DNS which relied on the backbone being up
So are you referring to the same DNS servers sitting outside the backbone at the various PoPs? I'd imagine some internal DNS servers that stay inside the backbone are in use here, unless of course the FB engineers themselves were disconnected from those internal DNS servers.


I don't recall how internal DNS was set up (and determining that from the outside isn't really possible), but there were comments in the incident report that DNS being unavailable made it harder to recover.



