
IPvlan overlay-free Kubernetes Networking in AWS - theatrus2
https://eng.lyft.com/announcing-cni-ipvlan-vpc-k8s-ipvlan-overlay-free-kubernetes-networking-in-aws-95191201476e
======
ggm
I know this invites some eye-rolling, but can somebody explain to me why the
k8s people insist on ignoring IPv6 and the possibilities of large address
fields?

Down at the bottom, under 'things we probably will never do', is where IPv6
comes in the door.

Azure (for instance) is a fully IPv6 enabled fabric. Microsoft "get" IPv6.
They are all over it. They understand it, it's baked into the DNA. So how come
K8s people just kind of think "yea.. nah.. not right now"?

Because proxy IPv6 at the edge is really sucky. We should be using native
IPv6, preserve e2e under whatever routing model we need for reliability, and
gateway the V4 through proxies in the longer term.

(serious Q btw)

~~~
andrewstuart2
They're not ignoring it. It's being actively worked on, and is expected to be
in alpha for the 1.9 release [1].

The issue [2] has existed for over 3 years, so it's not a new suggestion.

[1]
[https://github.com/kubernetes/features/issues/508](https://github.com/kubernetes/features/issues/508)

[2]
[https://github.com/kubernetes/kubernetes/issues/1443](https://github.com/kubernetes/kubernetes/issues/1443)

~~~
jamiesonbecker
> "not ignoring it" .. "actively worked on" .. "for over 3 years" ..

Not to diminish the very real challenges in getting IPv6 implemented, but this
is an interesting turn of phrase.. especially because rolling out IPv6 would
actually _solve a whole class of problems_ (and I'm not even a particularly
big advocate of the need for IPv6, since most things should still be NATed
anyway.)

(And especially considering parent's phrase "baked into the DNA" at Azure.)

~~~
ggm
_(And especially considering parent's phrase "baked into the DNA" at Azure.)_

s/Azure/Microsoft/ - Azure has IPv6 but I think it's immature.

[https://azure.microsoft.com/en-au/updates/ipv6-for-azure-vms...](https://azure.microsoft.com/en-au/updates/ipv6-for-azure-vms/)
is about the underlying VM architecture, not support for Kubernetes network
models.

------
deepakjois
Not directly related, but can someone recommend a beginner's resource to
understand Kubernetes networking? There are some good ones out there that
explain basic Kubernetes concepts like pods, replicas etc. But networking
seems to be a more complicated topic, and most intro guides skip over it.

~~~
andrewstuart2
The networking model itself is amazingly simple, and part of what makes
Kubernetes so much easier to use. The rules are as follows:

All containers can communicate with all other containers without NAT.

All nodes can communicate with all containers (and vice-versa) without NAT.

The IP that a container sees itself as is the same IP that others see it as.

When using Docker by itself, you get into all sorts of complicated situations
because most running containers have an IP address that's host-specific and
not routable for any other machines. This makes networking across hosts a
giant pain. Kubernetes takes that away by making things behave exactly how
you'd hope they'd behave. My IP as I see it is reachable by anybody in the
cluster who has it (policy permitting).

The simplicity of working in this networking model means that there's a little
more work for the networking infrastructure to handle, making sure that IPs
are allocated without collision and that routes are known across many hosts.
Several technologies exist to build these bridges, including old-school tech
that has solved these exact problems for decades like BGP (see Calico/canal).

Ultimately, there's no silver bullet. I'd recommend giving the k8s networking
page a read. [1]

[1] [https://kubernetes.io/docs/concepts/cluster-administration/n...](https://kubernetes.io/docs/concepts/cluster-administration/networking/)

~~~
scurvy
That's amazingly simple?

I notice that you conveniently left out the "ingress" component. Stuff in K8s
talking with other K8s stuff is easy. Getting the flows into K8s apps from
outside the K8 network is amazingly clunky in its current state.

~~~
atombender
Network is simple from the container's point of view. It's less simple outside
the container.

But outside the container, the strategy is still much simpler than other
solutions (most of which predate Kubernetes). Kubernetes chooses to give every
pod its own IP. This means choosing an internal network such as 10.x.x.x, and
giving each machine a slice of it. This way, one single cluster shares the
same big, flat space of IP addresses; not only do pods have the same IP inside
the container, but they can talk to other pods using the other pod's IP, too.
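
To make the "slice per machine" idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. The 10.0.0.0/8 range and the /24 slice size are just the examples used in this thread, and `node_slices` is a made-up helper name, not anything Kubernetes itself exposes:

```python
import ipaddress
import itertools


def node_slices(cluster_cidr, node_prefix, node_count):
    """Return the first `node_count` per-node subnets carved out of cluster_cidr.

    Hypothetical helper for illustration only; a real cluster hands these
    out through its own allocator.
    """
    cluster = ipaddress.ip_network(cluster_cidr)
    subnets = cluster.subnets(new_prefix=node_prefix)
    return list(itertools.islice(subnets, node_count))


# Assumed example values: a 10.0.0.0/8 cluster range, one /24 per node.
slices = node_slices("10.0.0.0/8", 24, 3)
for index, subnet in enumerate(slices):
    print(f"node-{index}: {subnet}")
```

Every pod on a given node gets an address from that node's slice, so a single route per node is enough to say where a pod IP lives.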

But a key point is that Kubernetes is designed to take care of most of it. One
part of it is the iptables proxy magic that it does to allow services to have
dynamically assigned IPs, too, with simple load-balancing between them. The
second part is the many built-in plugins for different, more complicated
overlay strategies. Kubernetes' automatic configuration works out of the box
on, say, AWS, without anything magical — Kubernetes natively talks to AWS to
set up a routing table so that packets end up where they should. You don't
need more complex overlay networking stacks such as Calico, Flannel or Weave
right away.

As for ingress, it has absolutely been Kubernetes' weakest point for several
years, and the Kubernetes team knows this perfectly well. That said, it's not
_complicated_ , thanks to the above. Once you have, say, Nginx listening on a
port, routing traffic into the cluster is a matter of setting up a load
balancer (at least on clouds like GCP, DigitalOcean and AWS), something which
Kubernetes even can do automatically for you. The weak links are the ingress
controller — the Nginx one is popular because it's stable and supports common
features such as TLS, whereas others such as Voyager and Traefik are lagging —
as well as the impedance mismatch with cloud LBs such as the Google Load
Balancer.

So far, Kubernetes' ingress support has been generic: One ingress object can
be used to "drive" different HTTP servers. The problem, of course, is that
HTTP implementations all have different settings (timeouts, TLS certs, CDN
functionality) and concerns that the current, simple ingress format cannot
support. I'm expecting this to change soon. Ingress portability really isn't
an important concern, and the generic ingress format is a bottleneck for the
ingress functionality to mature.

~~~
scurvy
>This way, one single cluster shares the same big, flat space of IP addresses;
not only do pods have the same IP inside the container, but they can talk to
other pods using the other pod's IP, too.<

Why is having a big, flat namespace important? Routers route. Clos L3 networks
are no longer a fancy thing. They're commonplace now. I don't see any
advantage of having a flat network.

> One part of it is the iptables proxy magic that it does to allow services to
> have dynamically assigned IPs, too, with simple load-balancing between
> them.<

Ah yes, the iptables "magic". We call this slowness and obfuscation. People
who understand how to run networks don't like handwavy magic. We like simple,
elegant concepts. Kubernetes networking is very far from simple and elegant.
It's blackbox "magic".

>You don't need more complex overlay networking stacks such as Calico, Flannel
or Weave right away.<

I run a native L3 network so have no need for an overlay network on top of it.
That said, I'd argue that the overlay junk is probably easier for
non-networking-fluent developers to set up and run compared to routing in AWS.

Kubernetes networking can be summed up thusly: Great for developers who know
nothing about networking but want to run at hyperscale. Terrible for people
who actually know how to run networks properly.

Kubernetes ingress is garbage. Stop apologizing for it.

IPv6 would also get rid of 99% of these overly complicated hand-wavy solutions
that Kubernetes proponents constantly tout as strong points. Give each node a
/64, and you're set.

~~~
lobster_johnson
I was replying in the context of the newbie who was asking for assistance.
Your reply is rather tone-deaf in that regard.

You're arguing against the value of a "big, flat namespace", yet you're also
arguing for IPv6, which itself is a big, flat namespace? Do you see the
contradiction, perhaps?

Dedicated CIDR for pods is important because it's _simple_. The symmetry is
simple to explain, simple to understand; the same simplicity you'd get from
IPv6.

Moreover, it's an abstraction that can be implemented however you want (custom
routing on L3, SDN overlay, BGP). Not everyone has a native L3 network. If
you're on Google Cloud Platform, you get a virtual L3, but with other clouds,
the networking is a bit more old hat. So again, simplicity and convenience. As
for "overlay junk", the entirety of the Google Cloud itself is virtualized
over what is probably the world's most sophisticated SDN overlay, so, well,
some people's junk is other people's ragingly successful business, I suppose.

I'm not sure why you categorize the automatic iptables rules that Kubernetes
sets up as slow or obfuscated. It's only magical in the sense that Kubernetes
automatically makes its cluster IPs load-balanced, a convenient system that
you are in no way forced to use. If you have a better setup, feel free to use
it instead.

We use Kubernetes ingress. It works. It could be better, but it's not
"garbage". I really recommend against putting everything in such categorical
terms. Everything in your comment is "junk" and "garbage", and the people who
designed it (Google!) are morons who don't understand networking, somehow.
That kind of arrogance on HN just makes you look foolish.

~~~
KaiserPro
I'm struggling to understand why you'd want to manually assign a /24 to _each_
node. That seems very 1990s.

Can't each container be bound to a virtual network interface (macvlan) and use
DHCP? That allows the network to configure and manage the address pool.

No fiddling with routing tables (well, not for each node) and it allows simple
peering of VPCs.

~~~
lobster_johnson
/24 per node is one option, but not the only option. But that gives you max
254 pods per node.

The simplest option is to just use routing [1]. You don't _have_ to use an
SDN. Not sure if DHCP is one of the officially supported options.

I know there are people out there who use MACvlan/IPvlan. Some people
discourage these types of virtualized networks because the packet manipulation
can be inefficient (unless the NIC explicitly supports it; I believe some
support VXLAN?) and can hamper the kernel's scheduling.

[1] [https://medium.com/@rothgar/no-sdn-kubernetes-5a0cb32070dd](https://medium.com/@rothgar/no-sdn-kubernetes-5a0cb32070dd)

~~~
KaiserPro
With respect, coordinating loads of route tables when it's a flat network is
nothing short of ludicrous.

_Statically_ assigning an address range to each node is utter madness.
Firstly it limits the containers you can have. Secondly it's terribly
inflexible; it's perfectly possible for a beefy server to have more than 254
containers running.

Thirdly it ties up huge address ranges with _no_ flexibility. If you have
nodes assigned to certain duties (like DB pods) then each can only
realistically have a few containers. So the rest of the address range is wasted.

What is so frustrating is that all of this is automatically taken care of
using DHCP and macvlan.

In the example that's linked, why isn't there a second adaptor on a different
VLAN? That's a far simpler and more visible way of linking things together. I
just don't see why you'd willingly fiddle with routing tables when on a
normal flat network it's done for you, automatically.

~~~
theptip
> firstly it limits the containers you can have.

This is a config value; if you want more containers per node, use a /23 or a
/22 instead. It's entirely up to the operator, there's nothing magical about
the default choice of /24 (except for it being easier to perform arithmetic
on).

> Thirdly it ties up huge address ranges with _no_ flexibility.

If you're using 10/8, then you have 16 bits' worth of /24 subnets, so 65k
nodes by default. It's true that there are some companies in the world that
have to worry about this limit, but for almost everybody I don't think this is
a real problem.
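
The arithmetic behind those numbers, as a quick sketch. It assumes plain IPv4 with the network and broadcast address of each block left unused, which is where the 254 figure quoted upthread comes from; the function names are made up for illustration:

```python
import ipaddress


def usable_addresses(prefix_len):
    """Addresses in an IPv4 block of this prefix length, minus network and broadcast."""
    return 2 ** (32 - prefix_len) - 2


def node_blocks(cluster_cidr, node_prefix):
    """How many per-node blocks of size /node_prefix fit in the cluster range."""
    cluster = ipaddress.ip_network(cluster_cidr)
    return 2 ** (node_prefix - cluster.prefixlen)


for prefix in (24, 23, 22):
    print(f"/{prefix}: up to {usable_addresses(prefix)} pod IPs per node")

# 10/8 split into /24s: 2**16 node-sized blocks.
print(node_blocks("10.0.0.0/8", 24))
```

Widening the per-node block to /22 still leaves 2**14 node blocks in a 10/8, so the trade-off between pods per node and node count is easy to tune.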

~~~
KaiserPro
> This is a config value

Indeed, but it's something extra that _you_ have to think about after you've
set up your VPC (if you're on AWS). Not only does it mean you can
deploy/configure unroutable IPs by accident, it's using a mechanism that
_slows down_ your VPC and adds a minefield of configuration errors. It's just
madness.

It's just a LAN, why would you ever statically assign IPs? Especially at scale,
especially if you have a dynamic, ever-changing workload. Deploy a pod, two
network interfaces, macvlan & AWS does the rest. Put a CloudWatch alert on
DHCP exhaustion, or put a resource limit in for each AZ.

Put it this way: Why do you want to have to think about subnets _after_ you've
created your VPC? (unless you've reached a limit...)

------
muxator
For those wondering what's the difference between macvlan and ipvlan, the main
ipvlan paper [0] summarizes its raison d'être:

> This is especially problematic where the connected next-hop e.g. switch is
> expecting frames from a specific mac from a specific port.

e.g.: if the host is attached to a managed switch with a strict security
policy, macvlan would not work.

[0]
[https://www.netdevconf.org/0.1/sessions/28.html](https://www.netdevconf.org/0.1/sessions/28.html)

~~~
KaiserPro
This is what I'm trying to understand. Macvlan appears to be a much better
solution as it allows 1-1 mapping and piggybacking onto all the
automatic/set&forget mechanisms that AWS provides.

Obviously it needs a switch on the other side that can handle a huge and
quickly changing ARP table. Also, if you have MAC address limiting, typical on
edge switches, it's a non-flyer.

------
SEJeff
Just wanted to give a shout out to kube-router[1], a really fantastic solution
if you want to use BGP, that will soon support not needing BGP by implementing
a feature set similar to flannel's hostgw support. They are really good about
addressing things in the open on their GitHub[2]. BGP is, by definition, "web
scale" as it runs most routing for the internet. Lower latency and much higher
throughput than any sort of overlay network.

[1] [https://www.kube-router.io](https://www.kube-router.io)

[2] [https://github.com/cloudnativelabs/kube-router](https://github.com/cloudnativelabs/kube-router)

~~~
hueving
Saying BGP is "web scale" is a bit misleading because it has to be very
carefully aggregated for it to route the entire internet.

If you do something like advertise a /32 for each container you can very
quickly fill up TCAMs on your network hardware (in particular cheap top of
rack switches that are pervasive in data centers).

The entire v4 internet is something like 600k prefixes right now and the
routers that can handle that many prefixes at line rate are irritatingly
expensive. ToRs as of a couple of years ago when I last tested this would fall
over at 1-10k prefixes.

So be careful when looking at BGP solutions because it's very easy to have a
BGP topology that doesn't scale, despite it being the exchange protocol for
the Internet.

~~~
jlgaddis
In addition to what _SEJeff_ said, as long as you design your IP addressing
correctly you'll be fine. By that, I mean hierarchically. Just divide up
whatever IP network you're using (e.g. 10/8) and make sure you allocate
"enough" to each rack/whatever.

Assuming everything is nice and hierarchical, you can easily aggregate an
entire rack to a single prefix. Even the shitty ToR switches can usually
handle a couple thousand prefixes, which should be plenty if done correctly.
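
A small sketch of what "aggregate an entire rack to a single prefix" means, using `ipaddress.collapse_addresses` from the Python standard library. The addresses are invented for the example; the point is only that contiguous, hierarchically allocated blocks collapse and scattered ones do not:

```python
import ipaddress


def aggregate(prefixes):
    """Collapse a list of CIDR strings into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Hypothetical rack: four hosts, each holding one /26 out of the same /24.
rack_hosts = [f"10.1.7.{i * 64}/26" for i in range(4)]
print(aggregate(rack_hosts))

# 256 per-container /32s from one contiguous block also collapse to one route.
container_routes = [f"10.1.8.{i}/32" for i in range(256)]
print(aggregate(container_routes))

# Blocks handed out non-hierarchically cannot be merged.
scattered = ["10.1.7.0/26", "10.9.3.64/26"]
print(aggregate(scattered))
```

The last case is the one that eats routing table space: every block that does not sit next to its neighbours costs its own entry upstream.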

Obviously you shouldn't be advertising /32s.

> _The entire v4 internet is something like 600k prefixes right now ..._

Just checked my edge routers and it looks like we're up to ~671k prefixes here
and that number is still increasing every day.

~~~
X-Istence
You should be advertising /32's and /128's. From the hypervisor to the TOR,
then the TOR aggregates if possible and advertises to a spine.

At least, that's what you do if you use Calico and want to be able to use
hypervisor migration when using it with OpenStack.

Just make sure your TOR's can handle the amount of routes necessary, have a
default route from the hypervisor to the internet, and from the TOR to the
spine, and have the spine advertise a 0's route. So now the spine is the only
place where you need beefier routers that can support more than the TOR's in
terms of routes.

With some intelligence in the IPAM solution your host will get a /26 (or a
/64) and will advertise that entire range, and only a single /32 is advertised
if the VM/container moves to another hypervisor host (to support things like
live-migration).
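
Why the single /32 is enough: routers pick the most specific matching prefix, so one migrated address overrides the host's block without disturbing its neighbours. A toy longest-prefix-match lookup as a sketch; the addresses and next-hop names are invented, and a real router does this in hardware rather than with a linear scan:

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the most specific route covering destination."""
    address = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if address in net]
    if not matches:
        return None
    # Longest prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical ToR table: default to the spine, one /26 per host,
# plus a /32 for a container that moved from host-a to host-b.
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "spine",
    ipaddress.ip_network("10.1.7.0/26"): "host-a",
    ipaddress.ip_network("10.1.7.9/32"): "host-b",
}

print(lookup(routes, "10.1.7.8"))   # still on host-a
print(lookup(routes, "10.1.7.9"))   # migrated container
print(lookup(routes, "192.0.2.1"))  # falls through to the default route
```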

That being said, most TOR's can handle quite a large amount of routes these
days. When I was at a telco we had some gear that did up to 128k routes, so
splitting between IPv4/IPv6 we had 64k routes each. Which is plenty, even for
larger clusters.

~~~
jlgaddis
> _You should be advertising /32's and /128's. From the hypervisor to the TOR,
> then the TOR aggregates if possible and advertises to a spine._

Yeah, that's what I meant. I thought that was clear from "aggregate an entire
rack to a single prefix" but perhaps not.

When I said don't advertise /32s I meant "globally" and as a general rule (I
didn't realize what I said would be taken so literally). There can be
exceptions, of course, such as in the live-migration case you mentioned.

------
paxy
> Announcing cni-ipvlan-vpc-k8s

Rolls right off the tongue, doesn't it?

~~~
andrewstuart2
CIVK, pronounced civic?

~~~
frogperson
Are we doing acronyms of acronyms now?

~~~
andrewstuart2
Have you looked up PHP or GNU lately? :D

At least these terminate when resolved.

------
chris_marino
It's all about trade-offs. We've built a CNI for k8s and have looked into all of
the techniques described. It seems that Lyft's design is a direct reflection
of their requirements.

To the extent your requirements match theirs, this could be a good alternative.
The most significant in my mind is that it's meant to be used in conjunction
with Envoy. Envoy itself has its own set of design tradeoffs as well.

For example, Lyft currently uses 'service-assigned EC2 instances'. Not hard to
see how this starting point would influence the design. The Envoy/Istio model
of proxy per pod also reflects this kind of workload partitioning. Obviously,
a design for a small number of pods (each with their own proxy) per instance
is going to be very different from one that needs to handle 100 pods (and
their IPs), or more, per instance.

Another is that k8s network policy can't be applied since the 'Kubernetes
Services see connections from a node’s source IP instead of the Pod’s source
IP'. But I don't think this CNI is intended to work with any other network
policy API enforcement mechanism. Romana (the project I work on) and the other
CNI providers that use iptables to enforce network policy rely on seeing the
pod's source IP.

Again, this might be fine if you're running Envoy. On the other hand, L3
filtering on the host might be important.

Also, this design requires that 'CNI plugins communicate with AWS networking
APIs to provision network resources for Pods'. This may or may not be
something you want your instances to do.

FWIW, Romana lets you build clusters larger than 50 nodes without an overlay
or more 'exotic networking techniques' or 'massive' complexity. It does it via
simple route aggregation, completely standard networking.

~~~
warp_factor
Not all NetworkPolicy implementations base themselves on Source/Destination
IPs. I can think specifically of Trireme/Cilium that are using metadata in
order to enable policies.

~~~
chris_marino
I knew that. What I didn't know was if either of these could apply network
policy to these endpoints. Guessing that since they each require their own
CNI, there will be probs. So, whether the CNI uses iptables, or not, not clear
how network policy API can be enforced.

------
bogomipz
The author states:

>"Unfortunately, AWS’s VPC product has a default maximum of 50 non-propagated
routes per route table, which can be increased up to a hard limit of 100
routes at the cost of potentially reducing network performance."

Could someone explain why increasing from 50 to 100 non-propagated routes in a
VPC results in network performance degradation?

------
netingle
IIUC ENIs are limited to 2 per host on small instances, 15 per host on larger
ones. Doesn't this approach limit the number of Pods per host? I'm already
running about 20 pods per host, and I don't think more containers per host is
atypical.

------
tamalsaha001
How does it compare to AWS' own CNI plugin?
[https://github.com/aws/amazon-vpc-cni-k8s](https://github.com/aws/amazon-vpc-cni-k8s)

~~~
lambda
If you read the article, you'll see:

> Lincoln Stoll’s k8s-vpcnet, and more recently, Amazon’s amazon-vpc-cni-k8s
> CNI stacks use Elastic Network Interfaces (ENIs) and secondary private IPs
> to achieve an overlay-free AWS VPC-native solutions for Kubernetes
> networking. While both of these solutions achieve the same base goal of
> drastically simplifying the network complexity of deploying Kubernetes at
> scale on AWS, they do not focus on minimizing network latency and kernel
> overhead as part of implementing a compliant networking stack.

~~~
scurvy
How does it compare with Romana? They added a VPC router specifically for
large K8 clusters on AWS.

[https://github.com/romana/vpc-router](https://github.com/romana/vpc-router)

~~~
chris_marino
Both the Lyft and AWS CNIs use ENIs, Romana's CNI does not. But more
specifically, vpc-router works along with Romana's IPAM to aggregate routes so
that each VPC route can forward traffic for multiple instances. So, instead of
one route per instance, you need only 1 route per n instances. Where n is set
by how much aggregation you want (configurable).
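
Rough numbers, as a back-of-the-envelope sketch rather than anything taken from the vpc-router code. The 50-route figure is the default VPC limit quoted from the article elsewhere in this thread, the 200-instance cluster is an invented example, and the function names are hypothetical:

```python
import math


def vpc_routes_needed(instances, instances_per_route):
    """VPC routes consumed when each route covers `instances_per_route` instances."""
    return math.ceil(instances / instances_per_route)


def min_aggregation(instances, route_limit):
    """Smallest n (instances per route) that fits under the route limit."""
    return math.ceil(instances / route_limit)


cluster_size = 200   # assumed cluster size for illustration
route_limit = 50     # default non-propagated route limit quoted in the article

print(vpc_routes_needed(cluster_size, 1))        # one route per instance
print(min_aggregation(cluster_size, route_limit))
print(vpc_routes_needed(cluster_size, 4))        # with n = 4
```

With one route per instance the example cluster needs four times the default limit; with n = 4 it fits exactly.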

The net effect is that you can build large clusters without running out of VPC
routes and no overlay is needed when traffic crosses AZs.

When a route is used to forward traffic for multiple instances, the target
instance acts as router and forwards traffic to the final destination
instance. This works because instances within an AZ have routes installed on
them to the pod CIDRs on the other instances in the zone, so any one of them
can perform this forwarding function.

Romana only piggybacks routes when there are no more VPC routes available, so
for small clusters it's just like kubenet. For large clusters it uses all the
instances to forward traffic so that none of them becomes a bottleneck.

