In addition, the quorum configuration shown here seems pretty poor: it uses only two nodes. That's a recipe for split-brain.
Which will be internally backed by exactly the kind of infra the OP is describing.
* MetalLB doesn't work in most cloud environments because the networking is weird.
* MetalLB is extremely new while VIP+Nginx/HAP has been working forever and is the standard choice for implementing a HA LB.
* MetalLB plays the exact same role as what the OP is laying out, just in a different style that has nothing to do with k8s. MetalLB could have been unicast VRRP for all it matters. The value is the code to instrument the LB from within k8s, which can work with any external LB.
In k8s load balancers are inherently external to the cluster and need to maintain their state. It doesn't really matter how this is accomplished but Pacemaker/Corosync is the "off the shelf and supported by your distro with good docs" option.
"Why don't you just use $managed_k8s?" -- Because it defeats the purpose of learning how to administer it.
Why is everybody treating k8s like it's something special and magical here that needs its own solution? Your k8s cluster is a group of app servers with a little bit of networking sprinkle so that you can reach every app from any node, and some API sprinkle to allow the app servers to instrument external resources.
The k8s cluster needs something outside itself to route traffic to all the nodes. There isn't a way around this. From an infra perspective a k8s cluster is a black box of generic app servers that want to be able to control the LB with an API integration. Nothing else.
Sure, but most cloud environments are handled by other LoadBalancer controllers specific to that environment. See: Kubernetes Cloud Controller Manager.
> MetalLB is extremely new while VIP+Nginx/HAP has been working forever and is the standard choice for implementing a HA LB.
I do not disagree with that. Hell, I've filed/fixed metallb issues myself already.
> MetalLB plays the exact same role as what the OP is laying out, just in a different style that has nothing to do with k8s. MetalLB could have been unicast VRRP for all it matters. The value is the code to instrument the LB from within k8s, which can work with any external LB.
Sure, but the author is not describing it to act as a Kube LoadBalancer, but as a separate entity with its own configuration. So to then use Kubernetes meaningfully they still need to bring something that actually fulfills LoadBalancer requests. Might as well use the same code for both cases.
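For context, "fulfilling LoadBalancer requests" just means something has to watch for Services like the sketch below and hand them an external IP (the names here are made up):

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app                # hypothetical app
    spec:
      type: LoadBalancer          # some controller (cloud provider, MetalLB, ...) has to satisfy this
      selector:
        app: my-app
      ports:
      - port: 80                  # external port
        targetPort: 8080          # container port

Until something does, the Service just sits there with its external IP pending.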
> In k8s load balancers are inherently external to the cluster and need to maintain their state. It doesn't really matter how this is accomplished but Pacemaker/Corosync is the "off the shelf and supported by your distro with good docs" option.
Why? You can easily keep all state and components within k8s, as MetalLB does.
> The k8s cluster needs something outside itself to route traffic to all the nodes. There isn't a way around this.
Sure, that's called BGP to a ToR that can then ECMP-route traffic where needed, and that's what metallb gives you.
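Rough sketch of what that looks like with MetalLB's older ConfigMap-style config (ASNs and addresses made up; newer releases moved to CRDs):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        peers:
        - peer-address: 10.0.0.1        # the ToR router (example address)
          peer-asn: 64501
          my-asn: 64500
        address-pools:
        - name: default
          protocol: bgp
          addresses:
          - 203.0.113.0/24              # pool announced from the nodes; the ToR ECMPs across them

All of that is just objects in the cluster, which is the "state inside k8s" point above.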
I guess I should have said "its own state." I mean, you can shove just about any software into a container and get k8s to run it, but the point is that k8s can't load balance by itself without managing an environment-specific external resource. Because I would count BGP as using the router in the same way it would use HAProxy.
Would "load balancers are necessarily on the outside of the cluster" be better phrasing?
Are you sure? All k8s installations I've encountered (including ones I rolled from scratch) use Ingress controllers as the way to get traffic into the cluster. The community NGINX Ingress controller is the de facto standard. There's no need to use the LoadBalancer service type, given its managed-cloud origin (metallb is a different beast and I wouldn't recommend it; if you need to load balance traffic to your ingresses on on-premises infra, it's better to do it outside of k8s). Anyway, you lose all the flexibility and observability of Ingress solutions if you use LoadBalancer service types directly as traffic routers to your backends.
Without being able to create LoadBalancer services there's no easy way to get any traffic into your cluster other than using NodePorts, and these have tons of shortcomings.
BTW, the community NGINX Ingress controller is able to ingress L4 (TCP) traffic into the cluster.
Sure, but this means that you cannot use LoadBalancers, which is painful. It means every payload has to be configured both at the k8s level and then again externally. That somewhat defeats the use of k8s as a self-service platform internally in an organization (other dev/ops teams need to go through a centralized ops channel to get traffic ingressing into the cluster if for some reason they can't use an Ingress).
> BTW, the community NGINX Ingress controller is able to ingress L4 (TCP) traffic into the cluster.
Yes, but it's configurable via a single ConfigMap (which limits self-service if you're running a multi-tenant, org-wide cluster, unless you bring your own automation), and you still only have one namespace of ports, i.e. a single external address for all ports, unless you complicate your 'external' LB system even further.
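For reference, that shared ConfigMap looks roughly like this (sketch; the controller has to be started with --tcp-services-configmap pointing at it, and the namespace/service names are made up):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tcp-services
      namespace: ingress-nginx
    data:
      "5432": "databases/postgres:5432"   # external port -> namespace/service:port
      "6379": "caching/redis:6379"        # every tenant shares this one flat port namespace

So two teams can't both have "their" port 5432 without adding another external address.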
With all these caveats, I really don't understand why not just run metallb.
Do you know whether ingress controllers like nginx-ingress-controller honor the readiness status of pods behind a service and so only send traffic to ready pods from the deployment concerned?
I was looking into this because I have a pod with 3 containers, and if 1 container wasn't running, nginx-ingress stopped serving traffic to it. I actually want to keep serving traffic as long as a certain container is still running, not necessarily all 3.
Do you know or have any documentation on how nginx-ingress actually does its readiness/liveness checks?
If any of the containers in a Pod isn't ready, the Pod as a whole is marked not ready and the endpoints controller removes its address from the Service's Endpoints object. ingress-nginx watches those Endpoints objects to determine which Pods it should send traffic to.
Edit: As to your use case, I think you should remove the readiness probe from that one container you don't care about.
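To make that concrete, a sketch of the idea (container and image names are made up): only the container whose health should gate traffic keeps a readiness probe.

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
      labels:
        app: my-app
    spec:
      containers:
      - name: web                       # this one gates Pod readiness
        image: example/my-app:1.0
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
      - name: sidecar                   # no readinessProbe: counts as ready as soon as it's running
        image: example/my-sidecar:1.0

Note this only helps while that other container is at least running; a crash-looping container still takes the Pod out of the Service's endpoints.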
Also by using the GKE load balancer for ingress, you lose out on a lot of things, like password protection, or certain rules you want with nginx.
I've tried my best to stick with the GKE LoadBalancer, but it's just an awful experience. Now I have GKE load balancing traffic to nginx-ingress; the L4 load balancer is just not flexible enough and in general annoying to configure.
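As an example of the kind of nginx-level rules mentioned above, basic auth is just a couple of annotations on the Ingress when you run the community nginx ingress controller (sketch; host, secret and service names are made up, and the Secret has to be created from an htpasswd file first):

    apiVersion: extensions/v1beta1
    kind: Ingress
    metadata:
      name: my-app
      annotations:
        kubernetes.io/ingress.class: nginx
        nginx.ingress.kubernetes.io/auth-type: basic
        nginx.ingress.kubernetes.io/auth-secret: basic-auth     # Secret holding the htpasswd data
        nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
    spec:
      rules:
      - host: app.example.com
        http:
          paths:
          - path: /
            backend:
              serviceName: my-app
              servicePort: 80

There's no equivalent knob on the plain L4 GKE LoadBalancer.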
And yes, I agree that both are extremely slow to reconfigure and that the Ingress controller is inflexible.
I'm not saying that you need to always use a LoadBalancer for all payloads. But you do need one to ingress traffic in a sensible manner if you're running your own NGINX Ingress Controller (which you have to do on bare metal, and which, as you said, you end up using even on GKE).
I'm tired of k8s being a hammer that makes every project look like a nail.
It's scary to imagine how many servers are idling with zero traffic, doing nothing. Even in-house systems without any scaling ambitions get deployed to k8s.
Of course, you have to reserve 1 CPU + 1 GB for k8s itself on each server (and you have 3 minimum), sitting there reserved and doing nothing. You can't use that capacity for rare spikes of load (compiling, occasional data migrations), so you have to buy additional resources to cover your needs while the default reservations just idle, because the system sees almost no traffic.
I'm sure it wastes more resources than all the crypto fads combined.
I'm happy to see more companies realise that and use just bare servers, or Docker Swarm / Nomad if they want to dive into DevOps and CI/CD practices.
K8s is a great tool for particular problems (you have at least tens or hundreds of servers, a DevOps team of more than 2 people, you're getting at least 1k RPS during the low-traffic part of the day, you have several SREs, and so on).
Most probably your system is over-engineered and resume-driven if you need a static k8s cluster and don't have any consumer-facing interfaces (i.e. a generally low-traffic usage pattern).
If you have a different opinion, I'll be happy to hear your points.
Not that I disagree with your post; I really dislike Kube and would like to see something else take its place, but there are situations where I could see using it just to have the same tools and workflows across the entire org.
The problem with hype is that it usually evangelizes marginal projects while better ones march along in obscurity.
And it also applies to OpenStack, which is more than alive. While its ecosystem consists of a horde of different projects, plain nova/neutron/cinder are mature and efficient solutions for rolling your own private IaaS.
It's also a lot easier to debug and see what's happening without that daemon sitting in the middle of all the traditional Linux tools.
just look at the CVEs from recent years:
* docker doomsday
* escaping like a rkt
* cryptojacking? - that didn't even exist until containers were here!
The official documentation itself addresses this.
They are stuck in the same mindset and just think Kubernetes is an automation tool.
I see it all day long as I interview people.
Because haproxy and nginx are proven technologies with limited and very well known failure modes, which means there are only 5-6 well-documented, well-understood ways haproxy and nginx can fail.
Experienced ops people understand that one does not optimize for blue skies: when everything works wonderfully, all technologies that aren't completely broken perform at approximately the same level. Rather, these people optimize for quick recovery from the "it is not working" state.
* If one is doing it in a cloud and wants to avoid potential issues with clients behind broken DNS resolvers, one simply nails the entry point instances to specific IP addresses and, in the event of an entry point failure, assigns the IP address of the failed instance to a standby instance, resulting in a nearly immediate traffic swap.
* If one is running entry points on physical hardware, then the solution is to bind the entry points to a virtual IP address floating between the instances using VRRP.
* Finally, if one wants to be really super-clever and not drop sessions during a controlled fail-over, one does VRRP + service MAC addresses similar to Fastly's faild or Google's MagLev. (But really, this is addressing 99.99999% reliability when in most business cases 99.9% would do just fine and 99.99% would be amazing.)
Using BGP on an L2 VLAN to handle a single-digit number of IP addresses allocated to the entry points is akin to using K8S to host a static HTML page.