BTW, if managing this sort of infrastructure excites you, we're hiring!
However, it's important to note that this is primarily information inbound, not instructions outbound.
Additionally, disabling a Hipchat account is near the top of the security incident response team's list.
If a laptop is lost or a machine is compromised, the ability for those accounts to continue broadcasting information of any kind is immediately neutralised.
The risks are real, but one can mitigate them by:
1. Securing access to the chat environment (self-hosting, VPN only, different auth to this than to the rest of the infrastructure)
2. Not having chatops send instructions to other systems
3. Having well-practised plans for disabling accounts in response to incidents
4. Remembering that all of the above is no replacement for access control and auditing on all other systems
I'm not going to speak on behalf of the security team, but the idea of chatops sending instructions to systems is pretty horrifying, regardless of whether those instructions originated from a machine or a person.
It's bad enough to imagine the scenario where someone's chat account is compromised and the attacker gains access to the VPN to then use the chat: social engineering via chat is likely to be more successful than via other mediums (people trust chat). Hence the emphasis on ensuring those accounts are disabled very quickly.
Being able to gather information in a chat room is helpful, but I don't think it's fair to call it chatops unless you can act on that information through chat, too.
To use examples from the blog post: both querying for the slaves in a pool and marking one down for maintenance, all through chat.
Maybe there are more lessons devops could learn from botnets. Why mess about with ansible and the like when you could just infect all your machines with a worm? :)
It seems like you either have to delegate authN to your chat service, with authZ handled by hubot-auth, or find a way to get your chat client to pass a token along with every chatops request.
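To make the first option concrete, here's a minimal sketch in Python (hubot and hubot-auth are actually CoffeeScript; the role store, decorator, and command below are hypothetical, not hubot-auth's API). AuthN is delegated to the chat service, so the bot trusts the username the platform asserts and only checks a role before acting:

    # Hypothetical role store; hubot-auth keeps the equivalent in the bot's brain.
    ROLES = {"alice": {"dba"}, "bob": set()}

    def requires_role(role):
        """Gate a chat command on a role; identity is whatever the chat service asserts."""
        def decorator(handler):
            def wrapped(user, *args):
                if role not in ROLES.get(user, set()):
                    return "@%s: you need the '%s' role to do that" % (user, role)
                return handler(user, *args)
            return wrapped
        return decorator

    @requires_role("dba")
    def mark_slave_down(user, host):
        # This is where the bot would call out to the load balancer's API.
        return "@%s: %s marked down for maintenance" % (user, host)

    print(mark_slave_down("alice", "db-slave-01"))  # allowed
    print(mark_slave_down("bob", "db-slave-01"))    # denied

The token-per-request option avoids trusting the chat service for identity, but few chat clients have anywhere to put such a token.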
Could you just use something like a health-check on a cluster system like zookeeper/etcd/consul/keepalived/... to simply peek at the lag time, and then mark a replica as unhealthy and regenerate your HAProxy config?
Just curious, as I have very little experience with HAProxy. I do everything with consul right now and it works very well (previously I used keepalived + VRRP, or pacemaker/corosync, but there's no multicast "in the cloud").
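For what it's worth, that is roughly how it can work with Consul: a script check probes replication lag and signals health through its exit status, and consul-template watches the service to rewrite and reload the HAProxy config. A minimal sketch of the check, assuming the pymysql package, local check credentials, and an arbitrary 30-second threshold:

    #!/usr/bin/env python3
    # Consul script check: exit status decides whether the replica stays in the pool.
    import sys
    import pymysql  # assumed dependency

    LAG_THRESHOLD = 30  # seconds; illustrative cutoff, tune for your workload

    conn = pymysql.connect(host="127.0.0.1", user="check", password="check-pw",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
    finally:
        conn.close()

    # Seconds_Behind_Master is None when the SQL thread is stopped.
    lag = row["Seconds_Behind_Master"] if row else None
    sys.exit(0 if lag is not None and lag <= LAG_THRESHOLD else 2)

Consul treats exit 0 as passing, 1 as warning, and anything else as critical, so a lagging or broken replica disappears from service queries and consul-template regenerates the HAProxy backend list without it.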
99% of the time the cause of lag I see is heavy inserts on the master. These hit all the slaves at the same time, so this approach wouldn't mitigate that issue.
It got me wondering: what is the common cause of slave lag for other people?
- Say one of the slaves has its monitoring break and it runs out of disk space, so it can no longer write locally. Lag goes up and/or the health check sees that its replication is failing, and it is automatically taken out of the pool.
- Say you run a site where a user has a list of songs they like. Some users have larger libraries than others, but there are a few outliers with 100x the number of songs. You're growing fast, you have some code paths that are not optimized for this, and some barely-used features you don't spend time on. One of these users goes to your site and uses such a feature, and now you have one slave lagging because the overly intense query landed on it.
- Say you have a high-traffic site, or you are in a datacenter where you share networking gear with a very high-traffic neighbor, and the datacenter has over-provisioned that gear, leaving no bandwidth headroom. Now you have slave lag whenever your replication passes through a saturated switch. Maybe this is only present at some times of the day, on some slaves, so they get taken out of the pool and servers with better network paths are prioritized.
Assuming that all servers that need to connect to MySQL have this local TCP 3306 proxy set up, each time you grow your pool of servers the load of incoming check requests increases as well.
xinetd is quite nasty here: with its default cps limit (50 connections per second) it will stop responding to requests for 10 seconds once that rate is exceeded. This can effectively "block" your whole database connection flow, especially when you restart services in batches during a rolling update, and under some other circumstances...
Percona XtraDB Cluster provides a clustercheck service script listening on port 9200 (HTTP), which an HAProxy backend can poll as its health check.
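Since a clustercheck-style endpoint can run as a long-lived daemon rather than something xinetd forks per connection, it also sidesteps the cps rate-limit problem above. A minimal sketch of the idea in Python (this is not Percona's script; the port, query, and threshold are illustrative) that HAProxy can poll with "option httpchk" plus "check port 9200" on each server line:

    #!/usr/bin/env python3
    # Long-running HTTP health endpoint: 200 = replica usable, 503 = take it out.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    import pymysql  # assumed dependency

    LAG_THRESHOLD = 30  # seconds; illustrative

    def replica_healthy():
        conn = pymysql.connect(host="127.0.0.1", user="check", password="check-pw",
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        # Seconds_Behind_Master is None when the SQL thread is stopped.
        lag = row["Seconds_Behind_Master"] if row else None
        return lag is not None and lag <= LAG_THRESHOLD

    class CheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ok = replica_healthy()
            self.send_response(200 if ok else 503)
            self.end_headers()
            self.wfile.write(b"replica OK\n" if ok else b"replica unhealthy\n")

        def log_message(self, *args):
            pass  # keep the check endpoint quiet in the logs

    HTTPServer(("0.0.0.0", 9200), CheckHandler).serve_forever()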