
Kubernetes Networking: Behind the Scenes - onig1990
https://itnext.io/kubernetes-networking-behind-the-scenes-39a1ab1792bb?source=friends_link&sk=b5b666c68375f7c60f76faa8571baa21
======
schmichael
On the Nomad project we've taken a distinct approach to networking: treat the
IP layer as an implementation detail and attempt to abstract it away. This has
led to a few design decisions:

1\. Global node addressing - Given a region + node ID you can query any node
from any other node.

2\. Global allocation (pod) addressing - Given a region + allocation ID you
can interact with individual allocations and tasks (metrics, files, logs,
remote exec, etc).

3\. Global name-based service addressing (via Consul) - whether via Consul's
DNS interface, catalog API, or Connect service mesh, services discover one
another _by name_ instead of by IP or port. The service's address (ip+port) is
hopefully an opaque implementation detail that never needs to be externally
observed.
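
As a rough illustration of point 3, here is a minimal sketch of how a name for Consul's DNS interface can be put together. It assumes Consul's documented `[tag.]<service>.service[.<datacenter>].<domain>` lookup format and the default `consul` domain; the helper `consulDNSName` and the sample values are made up for this example and are not part of Nomad or Consul.

```go
package main

import (
	"fmt"
	"strings"
)

// consulDNSName is a hypothetical helper that builds a name of the form
// [tag.]<service>.service[.<datacenter>].consul
// ".consul" is Consul's default DNS domain; operators can change it.
func consulDNSName(service, tag, datacenter string) string {
	parts := []string{}
	if tag != "" {
		parts = append(parts, tag)
	}
	parts = append(parts, service, "service")
	if datacenter != "" {
		parts = append(parts, datacenter)
	}
	parts = append(parts, "consul")
	return strings.Join(parts, ".")
}

func main() {
	// "web", "foo", and "dc1" are illustrative values only.
	fmt.Println(consulDNSName("web", "", ""))
	fmt.Println(consulDNSName("web", "foo", "dc1"))
}
```

A client would hand the resulting name to an ordinary resolver pointed at Consul, so the caller never hard-codes an IP or port.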

With all 3 of these taken together there's no need for overlays, distinct
subnets, or unique IPs-per-allocation. Datacenters may even have overlapping
subnets and Consul Connect will still allow services to communicate via global
name based addressing. Even without Consul, almost all Nomad commands will
route to a remote region and execute on the correct node regardless of subnet
overlapping.

This does cause some friction at the edge: when translating from external IP
based routing to internal services (aka ingress). However, there are lots of
load balancers that integrate with Consul and therefore translate between IP
and name based routing adequately.

Really any service unhappy with our "route by name, not by ip" design is going
to be a bit awkward to use, but routable IPs and ports are selected and
exposed by Nomad's scheduler - or advertised in Consul if the address is
overridden at runtime.

------
raesene9
If getting a better understanding of exactly how Kubernetes networking works is
of interest, I'd also recommend this episode of TGIK
[https://www.youtube.com/watch?v=IhbJ3ll4usI](https://www.youtube.com/watch?v=IhbJ3ll4usI)

In general actually, I've found TGIK really useful in improving my
understanding of the underpinnings of k8s. They have a load of deep dive
episodes that focus on a specific part of the solution and generally have a
load of hands-on examples.

~~~
Already__Taken
Shame the audio got ruined somewhere along the way.

~~~
raesene9
It's fixed about 2:30-3:00 in.

------
londons_explore
I feel like k8s networking is 10x more complex than it needs to be...

I think it's because they still want to use the socket API, but to "pretend"
you can open a socket to a pod, or to a service, etc.

Instead, they should have defined a new API, like
"connect('k8s:service:foobar?selectionstrategy=leastloaded')"

Then they could have made simple LD_PRELOAD shims to get their magic API to
work with existing software without modification.

There would be no need for millions of IPtables rules, virtual IP addresses,
port remapping, proxy processes, etc.
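
To make the proposal concrete, here is a sketch of how a library might parse an address string like the one above. Everything in it is hypothetical: no such Kubernetes API exists, and `ServiceTarget` and `parseTarget` are names invented for this example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ServiceTarget is a hypothetical parsed form of an address such as
// "k8s:service:foobar?selectionstrategy=leastloaded".
type ServiceTarget struct {
	Kind     string // e.g. "service"
	Name     string // e.g. "foobar"
	Strategy string // value of the selectionstrategy parameter, if any
}

// parseTarget splits the address into its three colon-separated fields
// and reads the optional query string.
func parseTarget(raw string) (ServiceTarget, error) {
	addr, query, _ := strings.Cut(raw, "?")
	fields := strings.Split(addr, ":")
	if len(fields) != 3 || fields[0] != "k8s" {
		return ServiceTarget{}, fmt.Errorf("unsupported target %q", raw)
	}
	params, err := url.ParseQuery(query)
	if err != nil {
		return ServiceTarget{}, err
	}
	return ServiceTarget{
		Kind:     fields[1],
		Name:     fields[2],
		Strategy: params.Get("selectionstrategy"),
	}, nil
}

func main() {
	target, err := parseTarget("k8s:service:foobar?selectionstrategy=leastloaded")
	fmt.Println(target, err)
}
```

The parsing is the easy part; choosing a backend according to the strategy and opening the connection would still have to happen somewhere behind such a call.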

~~~
justinsaccount
That is basically how consul connect works:
[https://www.consul.io/docs/connect/native/go.html](https://www.consul.io/docs/connect/native/go.html)

~~~
atombender
Not as I understand it. It uses DNS resolution, so you connect to something
like foo.web.service.consul. Kubernetes's KubeDNS works exactly the same way
with Kubernetes service objects.

~~~
justinsaccount
From that page:

    // Create an instance representing this service. "my-service" is the
    // name of _this_ service. The service should be cleaned up via Close.
    svc, _ := connect.NewService("my-service", client)


And when HTTP is used:

> The hostname used in the request URL is used to identify the logical service
> discovery mechanism for the target. It's not actually resolved via DNS but
> used as a logical identifier for a Consul service discovery mechanism. It
> has the following specific limitations:

~~~
atombender
Ah, I see. That's presumably possible because Go has its own DNS resolver, and
it (net.DefaultResolver) can be overridden by libraries; no LD_PRELOAD needed.

~~~
justinsaccount
It's because connect.NewService is the Consul client, not something like
net.DialTCP.
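
For illustration, here is a minimal sketch of the general idea: a client library supplies its own dialer, so the hostname in the URL acts as a logical service name and is never sent to DNS. This is not Consul Connect's implementation (which also handles TLS identity, among other things); the `registry` map stands in for a real service catalog, and `newNamedClient`, `fetch`, and the name "my-backend" are invented for the example.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

// newNamedClient returns an HTTP client whose dialer looks up the URL host
// in registry (logical name -> "ip:port") instead of resolving it via DNS.
func newNamedClient(registry map[string]string) *http.Client {
	dialer := &net.Dialer{}
	transport := &http.Transport{
		DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
			// addr arrives as "host:port"; only the host is used as the name.
			name, _, err := net.SplitHostPort(addr)
			if err != nil {
				return nil, err
			}
			target, ok := registry[name]
			if !ok {
				return nil, fmt.Errorf("unknown service %q", name)
			}
			return dialer.DialContext(ctx, network, target)
		},
	}
	return &http.Client{Transport: transport}
}

// fetch starts a throwaway local backend, registers it under a made-up
// service name, and requests it by that name.
func fetch() (string, error) {
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from backend")
	}))
	defer backend.Close()

	client := newNamedClient(map[string]string{
		"my-backend": backend.Listener.Addr().String(),
	})
	resp, err := client.Get("http://my-backend/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetch()
	fmt.Println(body, err)
}
```

The catch is that every outbound connection has to go through a client built this way, which is the adoption cost of a native integration.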

~~~
atombender
Oh, they require all TCP connections to go through that. That's a bit
heavy-handed: it requires rewriting all client code to adapt to that pattern.
The nice thing about KubeDNS is that you don't have to. Same with Istio.

~~~
justinsaccount
I brought up the native client because you mentioned "they should have defined
a new API".

~~~
atombender
That wasn't me. I'd argue the opposite. Though I'd also maybe argue that SRV
records are what we should have standardized on for this stuff a long time ago.

------
Already__Taken
It's surprising that k8s, being so new, seems to have no concept of IPv6.

~~~
sascha_sl
Because ipv6 is not a simple addition, but a complex suite of new protocols
meant to replace v4, except that is never going to happen in our lifetimes.

v6 should've simply been v4 with more bits

~~~
forgot-my-pw
No backwards compatibility could be IPv6's biggest mistake. Now everyone is
forced to maintain 2 IP addresses.

~~~
gatherhunterer
I think that’s intended as a feature. IPv4 is just not good enough, we will
run out of addresses. I think they want to kill IPv4 and replace it rather
than improving upon it. At some point programs will ignore IPv4 altogether and
our IP addresses will not be fundamentally flawed anymore.

~~~
sascha_sl
layers are only ever added, never removed

------
k__
Did anyone do some measurements on k8s/container latency overhead in networks?

~~~
loriverkutya
Yes. And they are _really_ easy to find.
[https://machinezone.github.io/research/networking-solutions-...](https://machinezone.github.io/research/networking-solutions-for-kubernetes/)

[https://itnext.io/benchmark-results-of-kubernetes-network-pl...](https://itnext.io/benchmark-results-of-kubernetes-network-plugins-cni-over-10gbit-s-network-updated-april-2019-4a9886efe9c4)

Edit: added one more link

~~~
k__
I'm sorry that I'm an _idiot_.

Thanks for the links.

