>This way, one single cluster shares the same big, flat space of IP addresses; not only do pods have the same IP inside the container, but they can talk to other pods using the other pod's IP, too.<
Why is having a big, flat namespace important? Routers route. Clos L3 networks are no longer a fancy thing. They're commonplace now. I don't see any advantage of having a flat network.
> One part of it is the iptables proxy magic that it does to allow services to have dynamically assigned IPs, too, with simple load-balancing between them.<
Ah yes, the iptables "magic". We call this slowness and obfuscation. People who understand how to run networks don't like handwavy magic. We like simple, elegant concepts. Kubernetes networking is very far from simple and elegant. It's black-box "magic".
>You don't need more complex overlay networking stacks such as Calico, Flannel or Weave right away.<
I run a native L3 network, so I have no need for an overlay network on top of it. That said, I'd argue that the overlay junk is probably easier for non-networking-fluent developers to set up and run compared to routing in AWS.
Kubernetes networking can be summed up thusly:
Great for developers who know nothing about networking but want to run at hyperscale. Terrible for people who actually know how to run networks properly.
Kubernetes ingress is garbage. Stop apologizing for it.
IPv6 would also get rid of 99% of these overly complicated hand-wavy solutions that Kubernetes proponents constantly tout as strong points. Give each node a /64, and you're set.
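To sketch what I mean (prefixes pulled from the documentation range, purely illustrative): each node owns a /64 for its pods and the upstream router just routes it, e.g.

    ip -6 route add 2001:db8:0:a::/64 via 2001:db8::a   # node A's pod range
    ip -6 route add 2001:db8:0:b::/64 via 2001:db8::b   # node B's pod range

No overlays, no NAT, no address-space gymnastics.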
I was replying in the context of the newbie who was asking for assistance. Your reply is rather tone-deaf in that regard.
You're arguing against the value of a "big, flat namespace", yet you're also arguing for IPv6, which itself is a big, flat namespace? Do you see the contradiction, perhaps?
Dedicated CIDR for pods is important because it's simple. The symmetry is simple to explain, simple to understand; the same simplicity you'd get from IPv6.
Moreover, it's an abstraction that can be implemented however you want (custom routing on L3, SDN overlay, BGP). Not everyone has a native L3 network. If you're on Google Cloud Platform, you get a virtual L3, but with other clouds, the networking is a bit more old hat. So again, simplicity and convenience. As for "overlay junk", the entirety of the Google Cloud itself is virtualized over what is probably the world's most sophisticated SDN overlay, so, well, some people's junk is other people's ragingly successful business, I suppose.
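To illustrate the symmetry (addresses invented):

    cluster pod CIDR : 10.244.0.0/16
    node A pod range : 10.244.1.0/24
    node B pod range : 10.244.2.0/24
    pod on node B    : 10.244.2.17  (reachable at that IP from any pod on any node)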
I'm not sure why you categorize the automatic iptables rules that Kubernetes sets up as slow or obfuscated. It's only magical in the sense that Kubernetes automatically makes its cluster IPs load-balanced, a convenient system that you are in no way forced to use. If you have a better setup, feel free to use it instead.
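For what it's worth, the "magic" is just a handful of NAT rules per service. Stripped down, it looks roughly like this (chain names and addresses made up for illustration; the real rules carry a few more matches):

    # cluster IP 10.96.0.10:80 -> one of two pod endpoints, picked at random
    -A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
    -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 -j KUBE-SEP-A
    -A KUBE-SVC-EXAMPLE -j KUBE-SEP-B
    -A KUBE-SEP-A -p tcp -j DNAT --to-destination 10.244.1.5:8080
    -A KUBE-SEP-B -p tcp -j DNAT --to-destination 10.244.2.7:8080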
We use Kubernetes ingress. It works. It could be better, but it's not "garbage". I really recommend against putting everything in such categorical terms. Everything in your comment is "junk" and "garbage", and the people who designed it (Google!) are morons who don't understand networking, somehow. That kind of arrogance on HN just makes you look foolish.
A /24 per node is one option, but not the only one; it does cap you at 254 pods per node.
The simplest option is to just use routing [1]. You don't have to use an SDN. Not sure if DHCP is one of the officially supported options.
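In its simplest form that's just one route per node's pod range, on the router or on each node, something like (addresses illustrative):

    ip route add 10.244.1.0/24 via 10.1.0.11   # node A
    ip route add 10.244.2.0/24 via 10.1.0.12   # node B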
I know there are people out there who use MACvlan/IPvlan. Some people discourage these types of virtualized networks because the packet manipulation can be inefficient (unless the NIC explicitly supports it; I believe some support VXLAN?) and can hamper the kernel's scheduling.
With respect, coordinating loads of route tables when it's a flat network is nothing short of ludicrous.
Firstly, _statically_ assigning an address range to each node is utter madness: it limits the number of containers you can have. Secondly, it's terribly inflexible; it's perfectly possible for a beefy server to have more than 254 containers running.
Thirdly, it ties up huge address ranges with _no_ flexibility. If you have nodes assigned to certain duties (like DB pods) then each can only realistically run a few containers, so the rest of the address range is wasted.
What is so frustrating is that all of this is automatically taken care of using DHCP and macvlan.
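Something like this per pod, assuming the LAN already has a DHCP server (interface and namespace names are placeholders):

    # give the pod a macvlan interface bridged onto the node's NIC,
    # then let the existing DHCP server hand it an address on the flat LAN
    ip link add pod0 link eth0 type macvlan mode bridge
    ip link set pod0 netns <pod-netns>
    ip netns exec <pod-netns> dhclient pod0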
In the example that's linked, why isn't there a second adaptor on a different VLAN? That's a far simpler and more visible way of linking things together. I just don't see why you'd want to willingly fiddle with routing tables when, on a normal flat network, it's done for you automatically.
This is a config value; if you want more containers per node, use a /23 or a /22 instead. It's entirely up to the operator; there's nothing magical about the default choice of /24 (except for it being easier to perform arithmetic on).
> Thirdly it ties up a huge address ranges with _no_ flexibility.
If you're using 10/8, then you have 16 bits' worth of /24 subnets, so 65k nodes by default. It's true that there are some companies in the world that have to worry about this limit, but for almost everybody I don't think this is a real problem.
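To put numbers on it: a /24 leaves 2^8 - 2 = 254 pod addresses per node, a /23 leaves 510, a /22 leaves 1022; and carving 10.0.0.0/8 into /24s gives 2^(24-8) = 65,536 node subnets.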
Indeed, but it's something extra that _you_ have to think about after you've set up your VPC (if you're on AWS). Not only does it mean you can deploy/configure unroutable IPs by accident, it's using a mechanism that _slows down_ your VPC and adds a minefield of configuration errors. It's just madness.
It's just a LAN; why would you ever statically assign IPs? Especially at scale, especially if you have a dynamic, ever-changing workload. Deploy a pod, two network interfaces, macvlan & AWS does the rest. Put a CloudWatch alert on DHCP exhaustion, or put a resource limit in for each AZ.
Put it this way: Why do you want to have to think about subnets _after_ you've created your VPC? (unless you've reached a limit...)
>The ARP table might be bigger, but thats a different issue.
But this is the problem that most designs are trying to solve. Large L2s are notoriously fragile. 1,000 nodes, 50-100 pods/node is a lot of ARPs. And sometimes you want partitions between endpoints for security/isolation.
I agree with you about static assignment of addresses. But that's why (most) CNIs work with a controller of some kind for IPAM.
IMO, the problem complexity is hard to compress. You need to distribute/manage MAC addresses, routes, and/or state. Different designs would favor one over another.
Then you just move the routing problem to your gateway/router, and it'll end up exploding because of too many routes in the table (one per container), instead of only one per container host.
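To put rough numbers on it: 1,000 nodes at ~100 containers each means on the order of 100,000 host routes on that gateway, versus about 1,000 routes if each node owns a prefix.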