

GitHub: Recent Load Balancer Problems (RCA) - wfarr
https://github.com/blog/949-recent-load-balancer-problems

======
vimalg2
Reminds me of the last GitHub post I read about their load-balancer setup,
from their server guy (2009):
<http://www.anchor.com.au/blog/2009/10/load-balancing-at-github-why-ldirectord/>

Willy Tarreau (author of HAProxy) sparked a nice discussion in the comments
section at the time.

------
xtacy
The post mentions that heartbeats time out when the load spikes momentarily. I
have a few questions and would love to hear answers if it's okay to share :-)

    
    
    1. What load spiked? Is it the network/CPU load?

    2. By spiked (be it network or CPU), do you mean the load went all
       the way to 100%? Or was it some threshold, say 90% of the
       available capacity?

    3. What's the heartbeat time interval?
    

Thanks, (EDIT: spacing)

~~~
jnewland
1\. IO and CPU load spiked so much that the system was basically unresponsive
over SSH. We think it was due to another Xen VM swapping out of control.

2\. It was 10 seconds with a 10 second timeout (way too low to run `xm list` in
a loaded situation). It's now 90 seconds with a 90 second timeout.
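
To illustrate why the timeout value matters, here is a minimal sketch of a timeout-based heartbeat check. It is not the actual Heartbeat/Pacemaker logic; `HeartbeatMonitor`, `survives_stall`, and the 30 second stall are hypothetical and chosen only for illustration.

```python
import time


class HeartbeatMonitor:
    """Declares a peer dead when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the check can be simulated
        self.last_seen = clock()

    def beat(self):
        # Called whenever a heartbeat from the peer is received.
        self.last_seen = self.clock()

    def peer_is_dead(self):
        return self.clock() - self.last_seen > self.timeout


def survives_stall(timeout, stall_seconds):
    """Return True if a peer that goes silent for `stall_seconds` is NOT declared dead."""
    now = [0.0]                       # simulated clock, in seconds
    monitor = HeartbeatMonitor(timeout, clock=lambda: now[0])
    monitor.beat()
    now[0] += stall_seconds           # the peer is overloaded and sends nothing
    return not monitor.peer_is_dead()
```

Under these assumptions, a node that stalls for 30 seconds is declared dead with a 10 second timeout but not with a 90 second one. The trade-off is that a longer timeout also delays detection of a real failure.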

------
ww520
They actually have a pretty good HA setup.

One thing to remember is that an HA cluster is for handling node failure (power
loss, faulty hardware, faulty software, etc). It is not for handling
capacity-related failure. If the servers are overloaded with too many requests, they
will fail regardless of the HA setup. Capacity monitoring and capacity
planning are still needed to maintain uptime.
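
As a rough sketch of the kind of capacity check this implies, the function below asks whether the nodes left after a failure can still carry peak load. The function name and all numbers are hypothetical, and it assumes load spreads evenly across identical nodes.

```python
def can_absorb_failure(nodes, per_node_capacity, peak_load, failures=1):
    """Return True if the surviving nodes can still serve `peak_load`.

    Assumes identical nodes and perfectly even load distribution.
    """
    remaining = nodes - failures
    return remaining > 0 and remaining * per_node_capacity >= peak_load
```

For example, with two nodes that each handle 1000 requests per second, a peak of 800 survives one node failing, but a peak of 1200 does not, no matter how well failover works.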

------
seiji
I rarely see an install of Heartbeat/Pacemaker/CRM prevent more downtime
than it causes. If you add DRBD on top, you get an entire suite of
false-HA infrastructure.

~~~
stock_toaster
Just curious, but what have you seen working well?

At a previous gig we used Heartbeat with HAProxy, and it worked pretty well.
We would drop connections on cutover, but it was considered 'acceptable' for
our purposes at the time. I wanted to try Wackamole with HAProxy, but we
never got around to it.

~~~
seiji
The only IP failover I trust is CARP
(<http://www.openbsd.org/faq/pf/carp.html>) on OpenBSD/FreeBSD. Once set up
properly with syncing, you lose no state on a failover (all connection and NAT
state is gossiped between cluster nodes sharing an IP address).

The only downside is most services aren't well tested under OpenBSD/FreeBSD
these days so you may end up hitting a few edge cases in software designed and
tested only under Linux.

~~~
ww520
For most apps and servers, migrating the IP state is not enough. The app
server's connection state cannot be easily migrated. E.g., a connection to a
failed MySQL server is gone; migrating the IP connection to a new MySQL
server won't bring it back.
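
In practice this usually means the client has to notice the dead connection and reconnect. A minimal sketch of that pattern follows, using a hypothetical `FakeConnection` in place of a real database driver; the names `ConnectionLost`, `make_connector`, and `query_with_reconnect` are made up for illustration.

```python
class ConnectionLost(Exception):
    """Raised when the server behind a connection is no longer there."""


class FakeConnection:
    """Hypothetical stand-in for a database connection bound to one server."""

    def __init__(self, server):
        self.server = server

    def query(self, sql):
        if not self.server["up"]:
            raise ConnectionLost(self.server["name"])
        return f"{self.server['name']}: {sql}"


def make_connector(vip):
    # New connections go to whichever server currently owns the virtual IP.
    return lambda: FakeConnection(vip["owner"])


def query_with_reconnect(conn, connect, sql, attempts=3):
    """Run `sql`, opening a fresh connection if the current one is dead."""
    for _ in range(attempts):
        try:
            return conn, conn.query(sql)
        except ConnectionLost:
            conn = connect()          # the old connection's state is gone
    raise ConnectionLost("no server answered after reconnecting")
```

Note that this only re-establishes a connection. Anything tied to the old one (an open transaction, session variables) is still lost, which is the point being made above.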

CARP is good for firewall redundancy because IP state is all that a
firewall maintains.

