
Network problems last Friday - silenteh
https://github.com/blog/1346-network-problems-last-friday
======
xtacy
Nice writeup, but it leaves me curious about the root cause:

For some reason, our switches were unable to learn a significant percentage of
our MAC addresses and this aggregate traffic was enough to saturate all of the
links between the access and aggregation switches, causing the poor
performance we saw throughout the day.

Did you work with your vendor to understand what caused the above problem? Was
the MAC table running out of entries?

This problem aside, I am wondering why you still run a layer 2 network in a
tree-like configuration. These are known not to scale well beyond a small LAN.
An appropriate layer 3 network (with multipath routing) would prevent this
kind of flooding and let you use all the precious capacity in your switches!
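
(For context, the mechanism is the standard learning-switch behavior. Here is a
toy sketch, not anyone's real switch code: if the destination MAC was never
learned, the frame is flooded out every port instead of going to just one, and
at scale that flooded traffic is what saturates the uplinks.)

    # Toy learning switch: the MAC table maps addresses to ports.
    mac_table = {}

    def handle_frame(src_mac, dst_mac, in_port, ports):
        mac_table[src_mac] = in_port                # learn where the sender lives
        if dst_mac in mac_table:
            return [mac_table[dst_mac]]             # known: forward out one port
        return [p for p in ports if p != in_port]   # unknown: flood everywhere else

    # Frames toward an unlearned MAC are copied to every other port.
    print(handle_frame("aa:aa", "bb:bb", in_port=1, ports=[1, 2, 3, 4]))  # -> [2, 3, 4]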

~~~
imbriaco
Yeah, if you read a little further down, we worked with the vendor to provide
them extensive diagnostics and get to the root of the problem. TL;DR: Lock
contention on the CAM table.

"We have worked with our network vendor to provide diagnostic information
which led them to discover the root cause for the MAC learning issues. We
expect a final fix for this issue within the next week or so and will be
deploying a software update to our switches at that time. In the mean time we
are closely monitoring our aggregation to access layer capacity and have a
workaround process if the problem comes up again."

In terms of network scalability, a properly designed layer 2 topology can
scale quite a bit farther than our current needs. This is very much the least
disruptive change we could introduce to solve the immediate network problems
while we work on the future architecture.

~~~
xtacy
Ooh, lock contention on the _hardware_ CAM table, or in the switch software
that handles these updates? This is the first time I'm hearing about a bug
like this :-)

Thanks a lot for sharing this bug story!

~~~
imbriaco
It was a lock that was being held by software but also prevented hardware
updates to certain hash locations in the CAM. So MAC addresses that hashed to
those locations couldn't be learned by any means.
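
To illustrate the failure mode (purely a hypothetical sketch of what is
described above, not the vendor's actual implementation): if software never
releases the lock on one hash bucket, every MAC address that hashes into that
bucket can never be installed, no matter how often it is seen.

    import threading

    NUM_BUCKETS = 8
    bucket_locks = [threading.Lock() for _ in range(NUM_BUCKETS)]
    cam = [dict() for _ in range(NUM_BUCKETS)]      # bucket -> {mac: port}

    def bucket_of(mac):
        return sum(mac.encode()) % NUM_BUCKETS      # stand-in for the CAM hash

    def learn(mac, port):
        b = bucket_of(mac)
        if not bucket_locks[b].acquire(timeout=0.01):
            return False                            # bucket stuck: never learned
        try:
            cam[b][mac] = port
            return True
        finally:
            bucket_locks[b].release()

    # Simulate the bug: one bucket's lock is held by "software" and never released.
    bucket_locks[bucket_of("aa:aa:aa:aa:aa:01")].acquire()
    for mac in ("aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02"):
        print(mac, "learned" if learn(mac, port=7) else "NOT learned (gets flooded)")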

~~~
brooksbp
I wonder what hash was in use. Also, what workaround were you using to prevent
flooding while this issue was happening?

------
akoumjian
"We experienced 18 minutes of complete unavailability along with sporadic
bursts of slow responses and intermittent errors for the entire day."

Well, I can say we experienced worse than that. Our private repositories were
unavailable from 9am until 4pm PST.

~~~
imbriaco
Thanks for reminding me, my apologies. I was focused entirely on the network
components, since that's the part of the problem I personally responded to,
and I overlooked the fileserver issue. I've updated the blog post:

Note: I initially forgot to mention that we had a single fileserver pair
offline for a large part of the day affecting a small percentage of
repositories. This was a side effect of the network problems and their impact
on the high-availability clustering between the fileserver nodes. My apologies
for missing this on the initial writeup.

~~~
akoumjian
Thank you for being responsive and honest.

------
zyztem
Welcome to the club of STP meltdown survivors!

Unfortunately, large L2 Ethernet networks are not scalable and are prone to
episodic catastrophic failure. You can read more here:
[http://blog.ioshints.info/2012/05/transparent-bridging-aka-l2-switching.html](http://blog.ioshints.info/2012/05/transparent-bridging-aka-l2-switching.html)

One way to make an L2 network somewhat stable is to replace many little
switches with one big modular chassis with hundreds of ports, like a
Cat6500/BlackDiamond.

Or minimize the L2 segments and connect them at L3 (IP routing).

~~~
nixgeek
Very little of this outage can be attributed to STP issues; most of it seems
to be down to a software fault in the switch itself not learning MAC addresses
correctly. I'm not sure how having one big chassis switch helps here: I've
experienced many an IOS bug, and if anything, putting all your eggs in one big
modular basket just means that when the basket breaks, all your eggs get
smashed.

~~~
drcross
The problem sounds like it was a bug with vendor interoperability, based on
the unidirectional link detection that most people run on fiber-based uplinks.
Personally, I would have gone with a homogeneous environment and a staged
deployment. While the server guys seem smart, having a professional network
design team commissioned to do the work should have prevented this by labbing
it up properly in the first instance. That said, I completely understand
outages, because projects like these are all a game of calculated risk
management.

~~~
imbriaco
Our goal with this change was not to radically redesign our network in one
bite. We are making incremental improvements in our existing environment to
solve very specific, ongoing problems.

Given the flexibility to completely rearchitect our network you might see
different decisions. Stay tuned. :)

------
ajtaylor
This is a great writeup! I love reading these types of postmortems because I
always end up learning something new.

------
ChuckMcM
Wow, I totally empathize with that pain. Let me guess, you've got switches
from Blade Networks (aka IBM)? :-)

Large scale networking changes like this are so challenging to pull off, one
missed cable, one mis-configured switch, and blam! everything is nuts.

~~~
wmf
_Large scale networking changes like this are so challenging to pull off, one
missed cable, one mis-configured switch, and blam! everything is nuts._

Only because industry best practices are incredibly fragile. A little DRY
would go a long way towards eliminating misconfigurations.
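
For example (a minimal sketch of the idea using a generic templating library;
the hostnames, interfaces, and config lines are made up): render every
switch's config from one template plus a small per-device data set, so a fix
is made once instead of being hand-typed on each box.

    from jinja2 import Template  # any templating tool works; Jinja2 is just an example

    template = Template(
        "hostname {{ name }}\n"
        "{% for iface in uplinks %}"
        "interface {{ iface }}\n"
        " channel-group 1 mode active\n"
        "{% endfor %}"
    )

    switches = [
        {"name": "access-01", "uplinks": ["Eth1/49", "Eth1/50"]},
        {"name": "access-02", "uplinks": ["Eth1/49", "Eth1/50"]},
    ]

    # One template, many devices: a typo gets fixed once, not per switch.
    for sw in switches:
        print(template.render(**sw))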

~~~
ChuckMcM
Perhaps, but I wonder if you are dismissing the challenge of the physicality
of it too quickly. Few processes prepare you for plugging in a cable where two
of the four pairs are only marginally connected, or an SFP+ cable where it's
connected enough to see the lasers but is generating a lot of noise that the
buffers are trying valiantly to compensate for.

~~~
wmf
Sure, hardware faults and firmware bugs are always going to exist and will be
hard to resolve. But the problem is currently so bad that solving, say, half
of it would still be a huge improvement.

------
jauer
> Last week the new aggregation switches finally arrived and were installed in
> our datacenter.

It sounds like you rushed these switches into production, maybe with
insufficient testing.

There are all kinds of bugs and weird interactions in network hardware and
software that cause problems that you can't anticipate.

You have got to lab it up and do sanity checks before deploying (referring
specifically to the LACP/port-channel problems).

~~~
Diederich
We always, always, always evacuate the datacenter before we do any non-trivial
network changes. And then trickle traffic back in.

Lab testing is fine, and is a must, but it may or may not have caught this.

Datacenter agility is the only way to scale and have high availability.

~~~
nixgeek
But you also work for Facebook, and consequently have slightly larger scale
problems than most get to worry about.

True multi-datacenter availability in an active-active manner is actually
fairly difficult to achieve.

------
raides
The excuse that the application grew faster than it could be scaled is amateur
hour. This entire article makes any true sys engineer cringe.

You are a ~100 million dollar company and it seems like you drew your systems
architecture with crayons. The article is upsetting. The lack of segmentation
is embarrassing.

"Oh, it's the switch's fault, it doesn't learn MACs fast enough" - actually,
you could subnet your racks and use f*n VLANs. You might use public IPs on
everything, but this could still be educational for the company.

Your solution to all of this was to spend twice as much on a "staging"
network. Something doesn't seem right here.

It makes me cringe when I see any one sentence that has the following three
words in it: escalate, network, vendor.

This isn't a Boeing airplane; you cannot just rely on the vendor. This article
just gives me a good sense of job security in the field of sys engineering. I
really think they should sit down and really go over their network. A bridge
loop like this for a company this large is pretty amateur. GitHub, you can do
so much better.

~~~
nixgeek
Since staging environments are often scaled down but functionally equivalent,
it doesn't follow that having one means spending twice as much.

It also has nothing to do with the switch learning 'fast enough' and
everything to do with the switch having a fault that entirely prevented it
from learning certain MAC addresses. Would you care to clarify how your
solution (above) would help in this situation, and how you fix switch firmware
issues without escalating to your vendor?

------
dkhenry
[shameless plug]

Hey github, sounds like you need SevOne. You could have diagnosed this issue
with one TopN report and been done with it.

[/shameless plug]

edit: See the following thread for a full explanation.

~~~
imbriaco
Shameless is exactly right. Any inclination I may have had to even consider
looking at your product just evaporated.

~~~
dkhenry
I am curious as to why that would be? I think the product I helped develop
would provide immediate value for someone in this situation, and I think it
would be unethical to suggest it without at least a cursory acknowledgement of
my bias.

Or does the fact that I work at a company solving these problems preclude me
from participating in the conversation?

~~~
mfringel
At a guess, it's because you don't describe how it would help.

"Just do a TopN report" is not a diagnosis, analysis, or furtherance of the
discussion. If you want to do an effective shameless plug, write a few
sentences about how your product would have helped from the beginning of the
incident, through its resolution. Show how it would have saved time/effort and
given more visibility into the problem.

This was a multi-hour incident with code-level escalation to the network
hardware vendor; assume complexity.

~~~
dkhenry
So I see. I didn't take the time to define my terms or give the full sales
pitch around them. I am not going to give the sales pitch, but TopN is a style
of report that can query, in real time, every performance metric available and
not only find and alert on the ones outside of normal operating parameters,
but predict which ones will go haywire in a given time frame. This means no
going around to all your switches to figure out which ones are causing
problems.

In this specific instance, if you had an indicator on OID
.1.3.6.1.2.1.17.4.3.1.1 you could see how big the MAC table was and notice
when it dropped below the expected value. This would have alerted immediately,
letting you know not only that you had a problem but which switches had
issues. Alternatively, you could have set alerts on any number of packet-level
indicators.
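
A rough sketch of that check with plain net-snmp tooling rather than any
particular product (the hostnames and the threshold here are made up):

    import subprocess

    def mac_table_size(host, community="public"):
        # Walk dot1dTpFdbAddress (the bridge MAC table) and count the rows.
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", community, host, "1.3.6.1.2.1.17.4.3.1.1"],
            capture_output=True, text=True, check=True,
        ).stdout
        return len(out.splitlines())

    EXPECTED_MIN = 5000  # baseline taken from normal operating data

    for switch in ("access-01.example.com", "access-02.example.com"):
        size = mac_table_size(switch)
        if size < EXPECTED_MIN:
            print(f"ALERT: {switch} MAC table has only {size} entries")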

------
dkhenry
This is the kind of situation that I think screams for OpenFlow[1]. It seems
issues like this would be easier to avoid and faster to troubleshoot.

1\. <http://www.openflow.org/>

~~~
ChuckMcM
Can you say more about this? I'm quite familiar with OpenFlow but I've not
seen anything that would help with problems like "I need to increase the
cross-sectional bandwidth of this collection of 1000 machines by 50%, and keep
those machines running and serving data."

~~~
wmf
I am building an OpenFlow-based fabric where the switches have no
configuration; all decisions are made by the controller and pushed down to the
switches as soft state. STP is not used, so no ports are ever blocked and all
forwarding is shortest-path. The goal is to be fully dynamic so that adding
links or switches triggers hitless path recomputation as necessary.

This technically doesn't come out until Tuesday, but here's a peek:
[http://conferences.sigcomm.org/co-next/2012/eproceedings/conext/p49.pdf](http://conferences.sigcomm.org/co-next/2012/eproceedings/conext/p49.pdf)
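
A bare-bones sketch of the controller-side idea (topology in, shortest paths
out, flow entries pushed as soft state). The install_flow function and the
port numbers are hypothetical stand-ins; a real controller would speak
OpenFlow to the switches:

    import networkx as nx

    # Controller's view of the fabric: switches are nodes, links are edges.
    topo = nx.Graph()
    topo.add_edges_from([("agg1", "acc1"), ("agg1", "acc2"),
                         ("agg2", "acc1"), ("agg2", "acc2")])

    def install_flow(switch, dst_host, out_port):
        # Hypothetical stand-in for sending an OpenFlow flow_mod to the switch.
        print(f"{switch}: traffic for {dst_host} -> port {out_port}")

    def program_path(src_switch, dst_switch, dst_host, port_map):
        # No spanning tree: every link stays usable and each path is shortest.
        path = nx.shortest_path(topo, src_switch, dst_switch)
        for here, nxt in zip(path, path[1:]):
            install_flow(here, dst_host, out_port=port_map[(here, nxt)])

    # Example with made-up port numbering covering both possible paths.
    ports = {("acc1", "agg1"): 49, ("agg1", "acc2"): 2,
             ("acc1", "agg2"): 50, ("agg2", "acc2"): 2}
    program_path("acc1", "acc2", "10.0.0.42", ports)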

~~~
ChuckMcM
Ohh, that is very nice, thanks for the link. We happen to have G8264's at the
center of our network ...

------
cagenut
I've got a pet theory that this is going to be a trend over the next few
years. A lot of companies GitHub's age were built on the "we misinterpreted
devops as noops" attitude, which works great for a few years, but somewhere in
the year 3-5 range the entropy and technical debt compound faster than a
nonexistent or small/inexperienced ops team can keep up with.

~~~
holman
Ops has been at the core of what GitHub does pretty much since day one. I'm
not sure there's a team more celebrated within GitHub than our ops guys, to be
honest.

------
ctime
This is Cisco Nexus gear with Bridge Assurance enabled, probably 5K to 7K
uplinks, IMHO

~~~
nixgeek
What odds are you giving on that?

    https://twitter.com/markimbriaco/status/276438853257162752

------
brooksbp
Were you running LACP on the LAGs?

~~~
imbriaco
Yes.

------
hcarvalhoalves
GitHub is unresponsive today, again.

~~~
imbriaco
Have you contacted support? We are not having any known problems today and our
traffic patterns look normal.

