
Context aware MySQL pools via HAProxy - samlambert
http://githubengineering.com/context-aware-mysql-pools-via-haproxy/
======
sciurus
This is similar to how we direct traffic to MySQL at Eventbrite. We also have
multiple MySQL clusters with complex topologies for different services. Every
server that can access MySQL has a local HAProxy that it connects to. This
local HAProxy is checking the health of a pool of dedicated HAProxy servers.
Those HAProxy servers are hitting a script run by xinetd on the MySQL servers
to determine whether a server should be in its pool. Here is where we differ
the most from Github. Instead of having two backends per frontend, we only
have one, but the health check itself is context-aware. In addition to taking
its own lag into account, it also queries HAProxy to determine the number of
healthy servers in its backend pool. As the percentage of healthy servers goes
down, the threshold for declaring itself unhealthy based on lag goes up.
Additionally, there are some extra safeguards to make sure we never dip below
a certain number of servers in the pool. We use variations on this approach
for other services like Solr and Redis as well.
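A rough sketch of that adaptive threshold, purely illustrative (the function names, thresholds, and safeguard values are invented, not Eventbrite's actual check):

```python
# Illustrative sketch of a context-aware health check: the acceptable
# replication lag grows as the fraction of healthy servers shrinks, so
# a thinning pool is less eager to eject lagging members.

def lag_threshold(healthy, total, base=5.0, maximum=60.0):
    """Seconds of lag tolerated given the current pool health."""
    if total == 0:
        return maximum
    healthy_fraction = healthy / total
    # Full pool -> base threshold; empty pool -> maximum threshold.
    return base + (maximum - base) * (1 - healthy_fraction)

def is_healthy(current_lag, healthy, total, min_pool_size=2):
    # Safeguard: never remove ourselves if the pool is already at
    # its minimum size.
    if healthy <= min_pool_size:
        return True
    return current_lag <= lag_threshold(healthy, total)
```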

BTW, if managing this sort of infrastructure excites you, we're hiring!
[https://jobs.lever.co/eventbrite/cd7d6b95-0226-4e03-b9fa-c15...](https://jobs.lever.co/eventbrite/cd7d6b95-0226-4e03-b9fa-c157f7920b6e)

------
subway
So, I see more and more organizations using "chatops". Are they really
delegating control of their infrastructure out to Slack/HipChat? If not, how
does authentication/authorization work?

~~~
buro9
At CloudFlare we use a self-hosted HipChat that is accessible only via the
corporate VPN.

However, it's important to note that this is primarily inbound information,
not outbound instructions.

Additionally, disabling a HipChat account is near the top of the security
incident response team's checklist.

If a laptop is lost or a machine is compromised, those accounts' ability to
continue broadcasting information of any kind is immediately neutralised.

The risks are real, but one can mitigate them by:

1. Securing access to the chat environment (self-hosting, VPN only, different
auth for this than for the rest of the infrastructure)

2. Not having chatops send instructions to other systems

3. Having well-practised plans for disabling accounts in response to
incidents

4. Remembering that all of the above is no replacement for access control and
auditing on all other systems

I'm not going to speak on behalf of the security team, but the idea of
chatops sending instructions to systems is pretty horrifying, whether those
instructions originated from a machine or a person.

It's bad enough to imagine the scenario in which someone's chat account is
compromised and the attacker gains access to the VPN to use the chat: social
engineering via chat is likely to be more successful than via other mediums
(people trust chat). Hence the emphasis on disabling those accounts very
quickly.

~~~
sciurus
So you make chatops secure by not having chatops, it sounds like.

Being able to gather information in a chat room is helpful, but I don't think
it's fair to call it chatops unless you can act on that information through
chat, too.

To use examples from the blog post: both querying for the slaves in a pool
and marking one down for maintenance, all through chat.

------
chucky_z
What are the pros/cons of doing this in HAProxy natively?

Could you just use something like a health-check on a cluster system like
zookeeper/etcd/consul/keepalived/... to simply peek at the lag time, and then
mark a replica as unhealthy and regenerate your HAProxy config?

Just curious, as I have very little experience with HAProxy. I do everything
with Consul right now and it works very well (previously I used keepalived +
VRRP, or Pacemaker/Corosync, but there's no multicast "in the cloud").
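For what it's worth, the Consul version of such a check might look something like the service definition below. This is a sketch: the service name and check script path are hypothetical, script checks must be explicitly enabled on the agent, and a tool like consul-template would then regenerate the HAProxy config as members pass or fail.

```json
{
  "service": {
    "name": "mysql-replica",
    "port": 3306,
    "check": {
      "args": ["/usr/local/bin/check-replica-lag", "--max-lag-seconds", "5"],
      "interval": "10s",
      "timeout": "2s"
    }
  }
}
```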

------
bluedonuts
This is cool. I don't think it would be useful in my case, though.

99% of the time, the cause of lag I see is heavy inserts on the master. These
hit the slaves at the same time, so this wouldn't mitigate that issue.

It got me wondering: what are the common causes of slave lag for other
people?

~~~
spotman
All kinds of reasons. A few quick examples:

- Say one of the slaves has its monitoring break and runs out of disk space,
stopping its ability to write locally; the lag will go up and/or the health
check will see that its replication is failing, and it will automatically be
taken out of the pool.

- Say you run a site where a user has a list of songs they like. Some users
have larger libraries than others, but there are a few outliers with 100x the
number of songs. You're growing fast, you have some code paths that are not
optimized for this, and some features that folks barely use and that you don't
spend time on. One of these users visits your site and uses such a feature,
and now you have one slave lagging because the overly intense query landed on
it.

- Say you have a high-traffic site, or you are in a datacenter where you are
sharing networking gear with a very high-traffic neighbor, and the datacenter
has oversubscribed the networking gear, so there is no headroom for bandwidth.
Now you have slave lag whenever your replication passes through a switch at
saturation. Maybe this is only present at some times of day, on some slaves,
so those slaves are taken out of the pool and servers with better network
paths are prioritized.

------
alexandre_m
You have to be careful to tune your xinetd script, increasing the cps and
per_source limits according to the load that your pool of servers generates
with this proxy forwarding.

Assuming that all servers that need to connect to MySQL have this local TCP
3306 proxy setup, each time you grow your pool of servers the load of
incoming check requests grows as well.

Xinetd is quite nasty: by default it will stop responding to requests for
about 10 seconds once its rate limits are exceeded. This will effectively
"block" your whole database connection flow, especially when you restart
services in batches during a rolling update, and under some other
circumstances...
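Concretely, the limits in question live in the xinetd service definition; a sketch with illustrative values (the service name and script path are made up):

```
service mysqlchk
{
    disable     = no
    port        = 9200
    socket_type = stream
    protocol    = tcp
    wait        = no
    user        = nobody
    server      = /usr/local/bin/mysqlchk
    # Raise these from the defaults; once the connections-per-second
    # rate (first cps argument) is exceeded, xinetd stops listening
    # for the number of seconds given as the second cps argument.
    cps         = 1000 10
    per_source  = UNLIMITED
}
```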

------
ec109685
It would be interesting to know GitHub's policies on persistent MySQL
connections. If apps hold long-lived connections, then even if a server were
deemed unhealthy, I don't think there would be anything to force the app to
reconnect and find a new slave.

------
vbernat
In addition, they could also have used agent checks to let each MySQL server
decide its own weight. This way, busier servers could receive less traffic
until the situation improves.
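In HAProxy terms this is the agent-check server option; a sketch (addresses and the agent port are illustrative):

```
backend mysql_pool
    balance roundrobin
    # The agent is a small TCP service on each MySQL host that replies
    # with e.g. "80%\n" or "drain\n", letting each server adjust its
    # own effective weight based on local load.
    server db1 10.0.0.1:3306 check agent-check agent-port 5555 agent-inter 2s
    server db2 10.0.0.2:3306 check agent-check agent-port 5555 agent-inter 2s
```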

------
vhost-
I did something very similar on a 5 read cluster, but wrote custom NetScaler
health checks to pull laggy replicas out of the virtual server pool.

------
zzzeek
As a note, this is pretty much the only way to run Galera (a multi-master
MySQL variant) behind HAProxy: you need to provide a custom HTTP service that
checks that the cluster node in question is ready to serve requests. Galera's
flow-control feature likely accomplishes much of what the lag checks here do.

~~~
alexandre_m
Indeed, that's common practice with Galera.

Percona XtraDB Cluster provides a clustercheck service script listening on
port 9200 (HTTP).
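Wired into HAProxy, that typically looks something like the following sketch (addresses are illustrative): clustercheck answers HTTP 200 when the node reports Synced and 503 otherwise, so an httpchk against port 9200 drives pool membership.

```
backend galera
    option httpchk GET /
    # Health-check against clustercheck on 9200 rather than MySQL itself.
    server node1 10.0.0.1:3306 check port 9200 inter 2s rise 2 fall 3
    server node2 10.0.0.2:3306 check port 9200 inter 2s rise 2 fall 3 backup
    server node3 10.0.0.3:3306 check port 9200 inter 2s rise 2 fall 3 backup
```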

------
peterwwillis
Er, so, are they using a separate database application to abstract the MySQL
operations? They don't mention one in the article. Not abstracting your
database leads to major issues from unusual load and operations, and
abstraction brings a lot of performance, portability, release engineering,
etc. wins. I don't see what other companies are doing, so I'm not sure how
common it really is.

~~~
sciurus
What do you mean by "a separate database application"?

~~~
peterwwillis
Instead of:

    
    
    web app
      -> haproxy
        -> mysql
    

Doing this:

    
    
    web app
      -> backend app
        -> haproxy
          -> mysql
    

It allows a dedicated application layer to handle database
communication/changes in a more intelligent way than a proxy/load balancer
can. It helps account for scaling issues, batches transactions, retries on
failure/inconsistency, throttles, caches, applies advanced logic, smooths
schema/application changes for resiliency/consistency, segregates the
network, restricts operations, etc. Nobody seems to implement it unless
there's a wide range of applications with different needs, or until things
break.
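As a rough illustration of the retry piece of such a layer (names are hypothetical; a real gateway would also handle throttling, caching, and so on):

```python
# Hypothetical sketch of a "backend app" data-access layer: it sits
# between the web app and HAProxy/MySQL and adds retry logic that a
# plain TCP proxy cannot provide.
import time

class DatabaseGateway:
    def __init__(self, execute_query, retries=3, backoff=0.5):
        # execute_query stands in for the real driver call made
        # through the local HAProxy.
        self._execute = execute_query
        self._retries = retries
        self._backoff = backoff

    def query(self, sql, params=()):
        last_error = None
        for attempt in range(self._retries):
            try:
                return self._execute(sql, params)
            except ConnectionError as exc:
                # Back off and retry; a fresh connection will be routed
                # to a healthy replica by HAProxy.
                last_error = exc
                time.sleep(self._backoff * (2 ** attempt))
        raise last_error
```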

