>This way, one single cluster shares the same big, flat space of IP addresses; not only do pods have the same IP inside the container, but they can talk to other pods using the other pod's IP, too.<
Why is having a big, flat namespace important? Routers route. Clos L3 networks are no longer a fancy thing. They're commonplace now. I don't see any advantage of having a flat network.
> One part of it is the iptables proxy magic that it does to allow services to have dynamically assigned IPs, too, with simple load-balancing between them.<
Ah yes, the iptables "magic". We call this slowness and obfuscation. People who understand how to run networks don't like handwavy magic. We like simple, elegant concepts. Kubernetes networking is very far from simple and elegant. It's black-box "magic".
>You don't need more complex overlay networking stacks such as Calico, Flannel or Weave right away.<
I run a native L3 network, so I have no need for an overlay network on top of it. That said, I'd argue that the overlay junk is probably easier for non-networking-fluent developers to set up and run compared to routing in AWS.
Kubernetes networking can be summed up thusly:
Great for developers who know nothing about networking but want to run at hyperscale. Terrible for people who actually know how to run networks properly.
Kubernetes ingress is garbage. Stop apologizing for it.
IPv6 would also get rid of 99% of these overly complicated hand-wavy solutions that Kubernetes proponents constantly tout as strong points. Give each node a /64, and you're set.
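To sketch what I mean (prefixes pulled from the documentation range, purely illustrative): each node owns a /64 for its pods and the upstream router just routes it, e.g.

    ip -6 route add 2001:db8:0:a::/64 via 2001:db8::a   # node A's pod range
    ip -6 route add 2001:db8:0:b::/64 via 2001:db8::b   # node B's pod range

No overlays, no NAT, no address-space gymnastics.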
I was replying in the context of the newbie who was asking for assistance. Your reply is rather tone-deaf in that regard.
You're arguing against the value of a "big, flat namespace", yet you're also arguing for IPv6, which itself is a big, flat namespace? Do you see the contradiction, perhaps?
Dedicated CIDR for pods is important because it's simple. The symmetry is simple to explain, simple to understand; the same simplicity you'd get from IPv6.
Moreover, it's an abstraction that can be implemented however you want (custom routing on L3, SDN overlay, BGP). Not everyone has a native L3 network. If you're on Google Cloud Platform, you get a virtual L3, but with other clouds, the networking is a bit more old hat. So again, simplicity and convenience. As for "overlay junk", the entirety of the Google Cloud itself is virtualized over what is probably the world's most sophisticated SDN overlay, so, well, some people's junk is other people's ragingly successful business, I suppose.
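To illustrate the symmetry (addresses invented):

    cluster pod CIDR : 10.244.0.0/16
    node A pod range : 10.244.1.0/24
    node B pod range : 10.244.2.0/24
    pod on node B    : 10.244.2.17  (reachable at that IP from any pod on any node)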
I'm not sure why you categorize the automatic iptables rules that Kubernetes sets up as slow or obfuscated. It's only magical in the sense that Kubernetes automatically makes its cluster IPs load-balanced, a convenient system that you are in no way forced to use. If you have a better setup, feel free to use it instead.
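For what it's worth, the "magic" is just a handful of NAT rules per service. Stripped down, it looks roughly like this (chain names and addresses made up for illustration; the real rules carry a few more matches):

    # cluster IP 10.96.0.10:80 -> one of two pod endpoints, picked at random
    -A KUBE-SERVICES -d 10.96.0.10/32 -p tcp --dport 80 -j KUBE-SVC-EXAMPLE
    -A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.5 -j KUBE-SEP-A
    -A KUBE-SVC-EXAMPLE -j KUBE-SEP-B
    -A KUBE-SEP-A -p tcp -j DNAT --to-destination 10.244.1.5:8080
    -A KUBE-SEP-B -p tcp -j DNAT --to-destination 10.244.2.7:8080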
We use Kubernetes ingress. It works. It could be better, but it's not "garbage". I really recommend against putting everything in such categorical terms. Everything in your comment is "junk" and "garbage", and the people who designed it (Google!) are morons who don't understand networking, somehow. That kind of arrogance on HN just makes you look foolish.
A /24 per node is one option, but not the only one; it does cap you at 254 pods per node.
The simplest option is to just use routing [1]. You don't have to use an SDN. Not sure if DHCP is one of the officially supported options.
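In its simplest form that's just one route per node's pod range, on the router or on each node, something like (addresses illustrative):

    ip route add 10.244.1.0/24 via 10.1.0.11   # node A
    ip route add 10.244.2.0/24 via 10.1.0.12   # node B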
I know there are people out there who use MACvlan/IPvlan. Some people discourage these types of virtualized networks because the packet manipulation can be inefficient (unless the NIC explicitly supports it; I believe some support VXLAN?) and can hamper the kernel's scheduling.
With respect, coordinating loads of route tables when it's a flat network is nothing short of ludicrous.
Firstly, _statically_ assigning an address range to each node is utter madness: it limits the number of containers you can have. Secondly, it's terribly inflexible; it's perfectly possible for a beefy server to have more than 254 containers running.
Thirdly, it ties up huge address ranges with _no_ flexibility. If you have nodes assigned to certain duties (like DB pods) then each can only realistically run a few containers, so the rest of the address range is wasted.
What is so frustrating is that all of this is automatically taken care of using DHCP and macvlan.
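Something like this per pod, assuming the LAN already has a DHCP server (interface and namespace names are placeholders):

    # give the pod a macvlan interface bridged onto the node's NIC,
    # then let the existing DHCP server hand it an address on the flat LAN
    ip link add pod0 link eth0 type macvlan mode bridge
    ip link set pod0 netns <pod-netns>
    ip netns exec <pod-netns> dhclient pod0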
In the example that's linked, why isn't there a second adaptor on a different VLAN? That's a far simpler and more visible way of linking things together. I just don't see why you'd want to willingly fiddle with routing tables when, on a normal flat network, it's done for you automatically.
This is a config value; if you want more containers per node, use a /23 or a /22 instead. It's entirely up to the operator; there's nothing magical about the default choice of /24 (except for it being easier to perform arithmetic on).
> Thirdly it ties up a huge address ranges with _no_ flexibility.
If you're using 10/8, then you have 16 bits' worth of /24 subnets, so 65k nodes by default. It's true that there are some companies in the world that have to worry about this limit, but for almost everybody I don't think this is a real problem.
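To put numbers on it: a /24 leaves 2^8 - 2 = 254 pod addresses per node, a /23 leaves 510, a /22 leaves 1022; and carving 10.0.0.0/8 into /24s gives 2^(24-8) = 65,536 node subnets.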
Indeed, but it's something extra that _you_ have to think about after you've set up your VPC (if you're on AWS). Not only does it mean you can deploy/configure unroutable IPs by accident, it's using a mechanism that _slows down_ your VPC and adds a minefield of configuration errors. It's just madness.
It's just a LAN; why would you ever statically assign IPs? Especially at scale, especially if you have a dynamic, ever-changing workload. Deploy a pod, two network interfaces, macvlan & AWS does the rest. Put a CloudWatch alert on DHCP exhaustion, or put a resource limit in for each AZ.
Put it this way: Why do you want to have to think about subnets _after_ you've created your VPC? (unless you've reached a limit...)
>The ARP table might be bigger, but thats a different issue.
But this is the problem that most designs are trying to solve. Large L2s are notoriously fragile. 1,000 nodes, 50-100 pods/node is a lot of ARPs. And sometimes you want partitions between endpoints for security/isolation.
I agree with you about static assignment of addresses. But that's why (most) CNIs work with a controller of some kind for IPAM.
IMO, the problem complexity is hard to compress. You need to distribute/manage MAC addresses, routes, and/or state. Different designs would favor one over another.
Then you just move the routing problem to your gateway/router, and it'll end up exploding because of too many routes in the table (one per container), instead of only one per container host.
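To put rough numbers on it: 1,000 nodes at ~100 containers each means on the order of 100,000 host routes on that gateway, versus about 1,000 routes if each node owns a prefix.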