
Container Networking with Vxlan, BGP and WireGuard - tobbyb
https://www.flockport.com/guides/advanced-container-networking
======
KaiserPro
One does not simply go from a flat network to overlays. Overlays are slow,
difficult, cause really odd failures and are often hilariously immature. They
are the experimental graph database of the network world.

Just have a segregated network, and let the VPC/dhcp do all the hard stuff.

Have your hosts on the default VLAN (or interface if you're in the cloud), with
its own subnet (subnets should only exist in one VLAN). Then if you are in
cloud land, have a second network adaptor on a different subnet. If you are
running real steel, then you can use a bonded network adaptor with multiple
VLANs on the same interface. (The need for a VLAN in a VPC isn't that critical
because there are other tools to impose network segregation.)

Then use macvtap or macvlan (or whichever thing gives each container a MAC
address) to give each container its own IP. This means that your container is
visible on that entire subnet, both inside the host and outside it.
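
For instance, plumbing a macvlan child interface into a container's network
namespace looks roughly like this (a sketch only; the interface names,
addresses, and the named netns "web1" are hypothetical):

```shell
# Create a macvlan child of eth0; it gets its own MAC on the parent's L2 segment.
ip link add mvlan0 link eth0 type macvlan mode bridge
# Move it into the container's network namespace (a named netns "web1" here).
ip link set mvlan0 netns web1
# Give it an address on the host's subnet and the same default gateway.
ip -n web1 addr add 192.168.1.50/24 dev mvlan0
ip -n web1 link set mvlan0 up
ip -n web1 route add default via 192.168.1.1
```

One caveat worth knowing: with macvlan the host itself can't reach the
container through the parent interface, so host-to-container management
traffic needs its own macvlan interface on the host side.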

There is no need to faff with routing; it comes for free with your VPC/network
or similar. Each container automatically has a hostname, IP, and route. It will
also be fast. As a bonus it can all be created at the start using
CloudFormation or TF.

You can have multiple adaptors on a host, so you can separate different
classes of container.

Look, the more networking that you can offload to the actual network the
better.

If you are ever re-creating DHCP/routing/DNS in your project, you need to take
a step back and think hard about how you got there.

70% of the networking modes in k8s are batshit insane. A large number are
basically attempts at vendor lock-in, or worse, someone's experiment that's got
out of hand. I know networking has always been really poor in docker land, but
there are ways to beat the stupid out of it.

The golden rule is this:

Always. Avoid. Network. Overlays.

~~~
stargrazer
I will have to take the other side of that golden rule. Not sure where it came
from. But when one has a decent handle on the tools at hand, they work
wondrously well.

I have bare metal servers tied together with L3 routing via Free Range Routing
running BGP/VxLAN. It Just Works.

No hard-coded VLANs between physical machines, just point-to-point L3 links.
VLANs are tortuous between machines as a Layer 2 protocol, given spanning tree
and all of its slow-to-converge madness.
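
For the curious, the FRR side of that kind of setup can be sketched roughly as
below. The ASN, interface name, and the BGP-unnumbered/EVPN details are
illustrative assumptions, not the poster's actual config:

```shell
# Sketch: BGP unnumbered over a point-to-point L3 link, advertising
# VXLAN VNIs via EVPN, entered through FRR's vtysh shell.
vtysh <<'EOF'
configure terminal
router bgp 65001
 neighbor eth1 interface remote-as external
 address-family l2vpn evpn
  neighbor eth1 activate
  advertise-all-vni
 exit-address-family
EOF
```

BGP unnumbered is what removes the hard-coded per-link addressing: the session
comes up over the interface's link-local address, so adding a machine is just
cabling it and naming the interface.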

Therefore a different Golden Rule:

Always. Overlay. Your. Network.

Leave a note if you'd like more details.

~~~
utopian3
OP was mostly talking about cloud + docker containers. Your use-case is
unrelated and seems to make sense. But I still agree with OP, and I believe
overlays in the cloud are generally an anti-pattern of unnecessary complexity.

------
exabrial
Site is having issues atm... but I'll throw something out there I'd really
like to see.

We encrypt 100% of our machine-to-machine traffic at the TCP level. There's a
lot of shuffling of certs around to get some webapp to talk to postgres, then
have that webapp serve https to haproxy, etc.

It'd be awesome if there was a way your cloud servers could just talk to each
other using WireGuard by default. We looked at setting it up, but it'd need to
be automated somehow for anything above a handful of systems :/
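
For a sense of scale: the per-host WireGuard setup itself is only a few
commands (the addresses, port, and key placeholders below are made up). The
part that needs automating is the last step, distributing every host's public
key and endpoint to every peer:

```shell
# Generate this host's keypair.
wg genkey | tee /etc/wireguard/privatekey | wg pubkey > /etc/wireguard/publickey
# Create the tunnel interface with an address on an internal overlay subnet.
ip link add wg0 type wireguard
ip addr add 10.10.0.1/24 dev wg0
wg set wg0 private-key /etc/wireguard/privatekey listen-port 51820
# Repeat per peer: this is the N-squared step that needs automation at scale.
wg set wg0 peer <peer-public-key> allowed-ips 10.10.0.2/32 endpoint 10.0.1.5:51820
ip link set wg0 up
```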

~~~
KaiserPro
> just talk to each other using WireGuard by default

I don't understand why you'd want to do this?

I use wireguard to join machines on disparate networks into one.

However, doing it inside the same VPC I just don't get. If you don't trust
your VPC, surely you need to be moving off the cloud?

~~~
ctrlc-root
I agree with your viewpoint but I'm also aware of several security standards
that explicitly specify all traffic between hosts needs to be encrypted.
Sometimes it's easier to meet the standard verbatim than try and justify an
exception. If you already use a configuration management tool it shouldn't be
a lot more overhead to install some certificates.

~~~
ownagefool
If you think about these things like physical networks, you can do things like
run an interface in promiscuous mode and sniff traffic.

Further, leaving your VM, you hit a shared NIC and network cables, so you
start to worry about physical-layer attacks.

Amazon specifically states they handle these issues, and indeed they likely
do, but how do you know? If you're able to easily encrypt by using something
like istio, then why not?

More specifically:

"Packet sniffing by other tenants: It is not possible for a virtual instance
running in promiscuous mode to receive or "sniff" traffic that is intended for
a different virtual instance. While customers can place their interfaces into
promiscuous mode, the hypervisor will not deliver any traffic to them that is
not addressed to them. This includes two virtual instances that are owned by
the same customer, even if they are located on the same physical host. Attacks
such as ARP cache poisoning do not work within EC2. While Amazon EC2 does
provide ample protection against one customer inadvertently or maliciously
attempting to view another’s data, as a standard practice customers should
encrypt sensitive traffic."

------
j0057
In my mind, a "layer 2 subnet" really doesn't mean anything. Subnets are
things that happen in IP, that is, layer 3, and layer 2 is the physical
connection, i.e. Ethernet or WLAN, which don't have the concept of subnets.

Edit: also the OSI layer model was specified in the eighties, and isn't all
that accurate in 2019 to describe how our networks actually work.

~~~
KaiserPro
I'd argue that the closest thing to a layer 2 subnet is a VLAN.

~~~
stargrazer
But there isn't a one to one relationship.

A subnet should only be in one vlan, but there are networks where there is
more than one subnet in a vlan.

Whether that is appropriate or not, that would be a different topic.

~~~
KaiserPro
Yes, but that's layer 3+.

A VLAN will isolate MACs so that only those adaptors in that VLAN can see each
other. Granted, there isn't really a concept of a netmask-based subnet, but
then that's because you don't really have control over one's physical address.

Now, you can have an adaptor in more than one VLAN, which is the point of
them. As I said, it's not a perfect analogy, but then they are there to achieve
different things based on different semantics.

------
chaz6
Can we have a version using IPv6 instead of legacy IPv4? It would make things
a lot simpler (no need for any fancy routing or nat).

~~~
geofft
IPv6 doesn't save you from any routing problems that IPv4 can't.
While IPv6 tries to hide the layer 2/layer 3 distinction from you, it doesn't
actually make your physical network magically work differently. Internally
IPv6 tries to implement this hiding using multicast - same as the VXLAN
suggestion in the article. If you overload your network infrastructure's
multicast support, at best you fall back to broadcast, which is just like
reconfiguring your physical network to bridge all your layer 2 segments into
one: if that won't work for you in IPv4, it won't work in IPv6. (And at worst,
it stops routing correctly.) If you don't have multicast support at all in
your network infrastructure, which as the article points out isn't common to
have on cloud networks, then IPv6 won't be able to help you. You'll still need
fancy routing and tunneling to make things work, whether you address machines
with IPv4 or IPv6.

In my experience, IPv4 has the strong advantage of being familiar and well-
supported, which means that _when_ (not if) your network infrastructure starts
to act up, it's easier to figure out what's going on. IPv6 works great if you
have robust, reliable multicast support on all your devices and nothing ever
goes wrong.

~~~
tialaramex
IPv4 numbering sucks. IPv6 lets you stop worrying about that.

In IPv4 you're going to need RFC1918 addresses, and then you're going to have
to make sure that _your_ RFC1918 addresses don't conflict with any _other_
RFC1918 addresses that inevitably absolutely everything else is using or else
you'll get hard-to-debug confusion. No need in IPv6, you should use globally
unique addresses everywhere, there are plenty and you will not run out.

Everybody who has ever used a single byte to store a value they were convinced
wouldn't need to be more than a few dozen, only to have it blow up because
somebody figured 300 ought to fit, already knows in their heart that they
shouldn't be using IPv4 in 2019.

~~~
geofft
Oh, yes, IPv6 saves you from worrying about addressing, which is a huge
headache in IPv4. I agree with that and IPv4 address conflicts are a personal
frustration. IPv6 doesn't save you from "fancy routing" and mostly does not
save you from "nat," though. That's what I was responding to.

I'm hesitant to use IPv6 because it is _not_ merely IPv4 + more addresses,
it's IPv4 + more addresses + a very clever design that hides the L2 vs. L3
distinction by relying heavily on multicast groups + a replacement for ARP + a
replacement for DHCP + etc. etc. etc. I know I shouldn't be using IPv4 in
2019, but I don't have a better option. I'm not excited about clever systems,
hiding, the assumption that multicast works reliably, losing the last few
decades of monitoring and debugging tools, happy eyeballs, etc., and I'm not
willing to subject my users to the resulting outages simply because it'll save
me the headache of thinking about numbering.

------
corndoge
This article uses Quagga - they really should be using FRRouting, which was
forked from Quagga in 2017 by the core Quagga developers and has 4 times as
many commits (16000[0] vs 4000[1]), far more features, bugfixes, etc. Quagga
has been dead for over a year.

[0] [https://github.com/FRRouting/frr](https://github.com/FRRouting/frr)

[1] [http://gogs.quagga.net/Quagga](http://gogs.quagga.net/Quagga)

~~~
kortilla
More commits, more features, and more bug(fixe)s are not really selling points
for something as critical as BGP routing.

Would you trust two compared TCP implementations using those stats as well?

For something simple like this post, using quagga is completely fine and
probably much better than using the latest Swiss Army knife.

~~~
orf
This comment completely misses the point. There is a distinction between
"complete" and "dead", to whatever degree any software can be called
"complete".

The Quagga source repo[1]'s certificate expired over 6 months ago. Looking at
the Bugzilla[2] report (also with an expired certificate) there are 14
blockers, 49 critical and 69 issues that have not been resolved.

So no, I'd agree with the parent comment that using a project as seemingly
dead as Quagga for something as critical as BGP routing is putting yourself on
shaky ground at the very least.

1\. [https://gogs.quagga.net/Quagga](https://gogs.quagga.net/Quagga)

2\.
[https://bugzilla.quagga.net/report.cgi?x_axis_field=bug_seve...](https://bugzilla.quagga.net/report.cgi?x_axis_field=bug_severity&y_axis_field=bug_severity&z_axis_field=&no_redirect=1&query_format=report-
table&short_desc_type=allwordssubstr&short_desc=&product=Quagga&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&resolution=---&longdesc_type=allwordssubstr&longdesc=&bug_file_loc_type=allwordssubstr&bug_file_loc=&keywords_type=allwords&keywords=&deadlinefrom=&deadlineto=&bug_id=&bug_id_type=anyexact&emailassigned_to1=1&emailtype1=substring&email1=&emailassigned_to2=1&emailreporter2=1&emailcc2=1&emailtype2=substring&email2=&emaillongdesc3=1&emailtype3=substring&email3=&chfieldvalue=&chfieldfrom=&chfieldto=Now&j_top=AND&f1=noop&o1=noop&v1=&format=table&action=wrap)

~~~
kortilla
You missed the point. It’s a demo doing trivial bgp stuff that hasn’t changed
for 15 years.

It’s like someone doing a demo on some text processing where they use grep and
the top comment is some jerk saying that map-reduce would be better because
some new large systems use it and it’s being actively developed.

------
alexandre_m
"Vxlan uses multicast which is often not supported on most cloud networks. So
its best used on your own networks."

Not entirely correct.

Linux has had unicast vxlan for quite some time.

Flannel is doing unicast and works pretty much anywhere.

See "Unicast with dynamic L3 entries" section:
[https://vincent.bernat.ch/en/blog/2017-vxlan-
linux](https://vincent.bernat.ch/en/blog/2017-vxlan-linux)
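
For reference, the unicast variant on plain Linux looks roughly like this.
The VNI, device names, and VTEP addresses are made up, and this is the
static-flooding flavour from that article rather than Flannel's dynamic
entries:

```shell
# Create a VXLAN device with no multicast group and no dynamic MAC learning.
ip link add vxlan0 type vxlan id 42 dstport 4789 local 10.0.0.1 nolearning
ip link set vxlan0 up
# Program static unicast FDB entries, one per remote VTEP; the all-zeros MAC
# acts as a flood entry, so BUM traffic is head-end replicated to each peer.
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 10.0.0.2
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 10.0.0.3
```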

~~~
YZF
VXLAN is just encapsulating L2 frames in UDP packets. Sounds like some
confusion about Linux implementation details.

~~~
alexandre_m
It depends on the implementation of the control plane and how you maintain the
mesh between the different servers (L2<=>L3 for ARP resolution, MAC learning).

Historically vxlan was a multicast thing, but not anymore.

Flannel (popular among container networking solutions) will maintain its
state in etcd by watching the Kubernetes resources, then program the Linux
data plane with static unicast entries for the neighbors.

