However, I do wonder why they chose such a complicated approach instead of looking into something like BGP, which would have replaced their ARP setup.
Also, they probably should have looked into K8s, since such problems do not exist inside a cluster.
(There, the problem becomes having a good LB outside your cluster.)
Just a guess, but anecdotally it seems to me that inside a modern mid-size dev org there are a lot more engineers familiar with K/V stores like etcd/consul/ZK than IP routing protocols like OSPF and BGP.
It's not necessarily the wrong decision to forgo an existing well-tested tool in favour of a hand-rolled solution, if you don't have any experts on that existing tool in your org.
It will, however, lead to other problems, especially:
- reinvention of the wheel (aka: wasted time)
- starting from scratch: technologies like OSPF, BGP, or (at higher levels) MySQL/PostgreSQL as opposed to NoSQL are sometimes decades old and battle-tested, which means that a lot more edge cases have been discovered and fixed/documented, and best practices have evolved. Same goes for security. Or even fundamental stuff like ACID criteria (which many of the NoSQL friends have had to re-discover and learn)...
There's a reason yuge enterprises and government offices still run on mainframes (or, mainframe software on modern hardware): the stability and resilience of these beasts with uptimes in year-scales is something the hipster crap crowd is enviously looking at. And there's a reason that redesign projects in these areas fail so spectacularly: there are decades worth of domain knowledge and experience encoded in these systems. It's impossible to recreate this in a year or two, especially when outsourcing the work to body shops.
The only thing I wonder is: it should, by now, be common knowledge (or: state of the art) that outsourcing yuge projects to body shops, especially foreign ones, leads to disasters. So why are managers and politicians not prosecuted or otherwise held responsible when screw-ups occur? After all, when someone e.g. designs a building that violates fire code and people die as a result, or grossly ignores security best practices, the responsible persons can get charged, too...
> run on mainframes (or, mainframe software on modern hardware): the stability and resilience of these beasts with uptimes in year-scales is something the hipster crap crowd is enviously looking at
Why so much ignorance? Distributed systems are not a hipster thing; it's an actual science that helps explain why no amount of battle testing can make this old tech resilient: it's fundamentally broken. There is simply no resilience without, at the very least, proper distributed algorithms.
You don't need application level "discovery" when the IP never changes; the network will tell you where it currently lives.
I'll take a layer 3 network over raft/consul any day and all night.
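To make that concrete, here is a toy sketch in Python of what "the network will tell you where it currently lives" means. It is not real BGP: the `Router` class, the node names and the single "shortest AS path wins" rule are all assumptions made up for illustration (real best-path selection has many more steps). The point is only that clients keep using the same IP while announcements and withdrawals move the traffic.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router (hypothetical): stores every announced path per prefix."""
    # prefix -> {next_hop: as_path_length}
    rib: dict = field(default_factory=dict)

    def announce(self, prefix: str, next_hop: str, as_path_len: int) -> None:
        # A node announces that it can serve this prefix.
        self.rib.setdefault(prefix, {})[next_hop] = as_path_len

    def withdraw(self, prefix: str, next_hop: str) -> None:
        # A node stops announcing the prefix (e.g. it failed a health check).
        self.rib.get(prefix, {}).pop(next_hop, None)

    def best_next_hop(self, prefix: str):
        paths = self.rib.get(prefix, {})
        if not paths:
            return None  # nobody announces the prefix any more
        # Simplification: shortest AS path wins, ties broken by name.
        return min(paths, key=lambda hop: (paths[hop], hop))


# Documentation-range address, purely illustrative.
SERVICE_IP = "192.0.2.10/32"

router = Router()
router.announce(SERVICE_IP, "node-a", as_path_len=1)
router.announce(SERVICE_IP, "node-b", as_path_len=2)
before = router.best_next_hop(SERVICE_IP)

router.withdraw(SERVICE_IP, "node-a")  # node-a goes away
after = router.best_next_hop(SERVICE_IP)

print(before, "->", after)
```

The client side never changes: it talks to the same service IP before and after the failover, which is why no application-level lookup is needed in this model.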
Where could I go to read up on this without being a networking expert? Are there any safe ways to play with BGP in a lab environment? How expensive would said lab be to set up?
> How expensive would said lab be to set up?
It heavily depends on what you want.
For a highly available lab you would of course need multiple nodes/routers, depending on what you want to do.
For just playing with BGP, BIRD (http://bird.network.cz/) is actually good enough.
RIPE also sometimes publishes articles about BGP.
Here is one that sets up anycast routing with BGP: https://labs.ripe.net/Members/samir_jafferali/build-your-own...
BIRD is super easy to set up, too.
Especially when there are several-year-old bugs open against Consul, where it just completely loses consensus and can't deal with losing a single node (0)
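For context on why losing a single node should normally be survivable: Consul uses Raft, which needs a majority of servers to agree. The snippet below is just the textbook majority arithmetic, not anything taken from Consul's code, and the function names are made up for illustration.

```python
def quorum(n: int) -> int:
    """Smallest majority of an n-server cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many servers can be lost while a majority remains."""
    return n - quorum(n)


for size in (1, 2, 3, 5):
    print(f"servers={size} quorum={quorum(size)} "
          f"tolerated_failures={tolerated_failures(size)}")
```

So a 3-server cluster is supposed to ride out one failure, while a 2-server cluster cannot lose any, which is why the cluster size matters when reading bug reports like that one.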
I don't have skin in the game but it makes sense that a much older, tested technology would have more people who know it well.