
Introducing the Fan – simpler container networking
https://www.markshuttleworth.com/archives/1471
======
tobbyb
This looks unbelievably simple, along the lines of "why hasn't this been done
before?"

So you ping 10.3.4.16, your host automatically 'knows' to just send it to the
host at 172.16.3.4, and that host, lying in wait, simply forwards it on to
10.3.4.16. I like it.
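To spell the mapping out (my own sketch, using the article's example prefixes:
a 10.0.0.0/8 overlay on a 172.16.0.0/16 underlay): overlay address 10.C.D.z
lives on the host whose underlay address is 172.16.C.D, so finding the
destination host is pure bit-slicing, no lookup table:

    # fan mapping sketch: overlay 10.C.D.z -> underlay host 172.16.C.D
    overlay_to_host() {
        local a c d z
        IFS=. read -r a c d z <<< "$1"
        echo "172.16.$c.$d"
    }
    overlay_to_host 10.3.4.16   # prints 172.16.3.4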

This is a vexing problem for container and even VM networking. If they are
behind NAT you need to create a mesh of tunnels across hosts, or you create a
flat network so they are all on the same subnet. But you can't do that for
containers in the cloud, where a host has a single IP and limited control of
the networking layer.

Current solutions include L2 overlays, L3 overlays, a big mishmash of GRE and
other types of tunnels, VXLAN with multicast (unavailable in most cloud
networks), or proprietary unicast implementations. It's a big hassle.

Ubuntu has taken a simple approach: no per-node database to maintain state,
and it uses common networking tools. More importantly, it seems fast, and
it's here now. That 6 Gbps figure suggests this doesn't compromise
performance the way a lot of other solutions do. It won't solve all
multi-host container networking use cases, but it will address many.

~~~
jpgvm
You can use any method to program the VXLAN forwarding table; you don't need
to use multicast.

This can even be done on the command line using iproute2 utilities:
[https://www.kernel.org/doc/Documentation/networking/vxlan.txt](https://www.kernel.org/doc/Documentation/networking/vxlan.txt)
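
For example (adapted from that doc; the VNI, device, and peer addresses here
are illustrative), a unicast VXLAN where the default destinations are
appended by hand:

    # VXLAN device with no multicast group: VNI 42, over eth0, IANA port
    ip link add vxlan0 type vxlan id 42 dev eth0 dstport 4789
    ip link set vxlan0 up
    # append unicast default destinations (the all-zeros MAC); traffic to
    # unknown MACs is then head-end replicated to each listed peer
    bridge fdb append to 00:00:00:00:00:00 dst 172.16.3.4 dev vxlan0
    bridge fdb append to 00:00:00:00:00:00 dst 172.16.3.5 dev vxlan0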

Though you should probably use netlink to do it programmatically. Personally I
like to combine netlink + ZooKeeper or similar to trigger edge updates via
watches.

~~~
tobbyb
Are you referring to the FDB tables? I tried that some months ago but it
didn't seem to work. Maybe it's changed now. I will give it a shot. Any tips?

I remember seeing a patch floating around that added support for multiple
default destinations in VXLAN unicast, but I think some objections were
raised and it hasn't made it through. At least it's not there in 4.1-rc7.
That would be quite nice to have.

[http://www.spinics.net/lists/netdev/msg238046.html#.VNs3rIdZmyY](http://www.spinics.net/lists/netdev/msg238046.html#.VNs3rIdZmyY)

~~~
jpgvm
Oh, multiple default destinations - that would be very cool!

Right now I am using netlink to manage FDB entries; last time I tried, the
iproute2 utility worked too...

The only tricky thing about doing it with netlink is that the FDB uses the
same API as the ARP table, specifically
RTM_NEWNEIGH/RTM_DELNEIGH/RTM_GETNEIGH. Apart from that it's pretty simple.
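
In iproute2 terms (the MAC and peer IP below are just placeholders), the same
three message types look like:

    bridge fdb add 00:16:3e:aa:bb:cc dev vxlan0 dst 172.16.3.4   # RTM_NEWNEIGH
    bridge fdb del 00:16:3e:aa:bb:cc dev vxlan0 dst 172.16.3.4   # RTM_DELNEIGH
    bridge fdb show dev vxlan0                                   # RTM_GETNEIGH (dump)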

------
frequent
I want to object and say "There is IPv6 in the cloud!".

We developed re6stnet in 2012. You can use it to create an IPv6 network on
top of an existing IPv4 network. It's open source, and we have been using it
ourselves internally and in client deployments ever since.

I wrote a quick blog post on it: [http://www.nexedi.com/blog/blog-re6stnet.ipv6.since.2012](http://www.nexedi.com/blog/blog-re6stnet.ipv6.since.2012)

The repo is here in case anyone is interested:
[http://git.erp5.org/gitweb/re6stnet.git/tree/refs/heads/master?js=1](http://git.erp5.org/gitweb/re6stnet.git/tree/refs/heads/master?js=1)

~~~
dustinkirkland
There's IPv6 in some clouds.
[https://forums.aws.amazon.com/thread.jspa?messageID=536049](https://forums.aws.amazon.com/thread.jspa?messageID=536049)

~~~
teraflop
From the link in the parent comment:

> Well, most importantly we have stable IPv6 everywhere - including on IPv4
> legacy networks.

------
kbaker
Sorry, I just gotta rant a bit... this is a really bad hack that I wouldn't
trust on a production system. Instead of doubling down and working on better
IPv6 support with providers and in software configuration, and defining best
practices for working with IPv6, they just gloss over it with a 'not
supported yet' and develop a whole system that will very likely break things
in random ways.

> More importantly, we can route to these addresses much more simply, with a
> single route to the “fan” network on each host, instead of the maze of
> twisty network tunnels you might have seen with other overlays.

Maybe I haven't seen the other overlays (they mention flannel), but how does
this not become a series of twisty network tunnels? Except now you have to
manually add the addresses (static IPv4 addresses!) of the hosts in the route
table? I see this as a huge step backwards... now you have to maintain
address-space routes amongst a bunch of container hosts?

Also, they mention having up to thousands of containers on laptops, but then
their solution scales only to 250 before you need to set up another route +
multi-homed IP? Or wipe out entire /8s?

> If you decide you don’t need to communicate with one of these network
> blocks, you can use it instead of the 10.0.0.0/8 block used in this
> document. For instance, you might be willing to give up access to Ford Motor
> Company (19.0.0.0/8) or Halliburton (34.0.0.0/8). The Future Use range
> (240.0.0.0/8 through 255.0.0.0/8) is a particularly good set of IP addresses
> you might use, because most routers won't route it; however, some OSes, such
> as Windows, won't use it. (from
> [https://wiki.ubuntu.com/FanNetworking](https://wiki.ubuntu.com/FanNetworking))

Why are they reusing IP address space marked 'not to be used'? Surely there
will be some router, firewall, or switch that will drop those packets
arbitrarily, resulting in very-hard-to-debug errors.

--

This problem is already solved with IPv6. Please, if you have this problem,
look into using IPv6. This article has plenty of ways to solve this problem
using IPv6:

[https://docs.docker.com/articles/networking/](https://docs.docker.com/articles/networking/)
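
For example, the daemon-level IPv6 setup from those docs boils down to
handing Docker a v6 subnet to allocate from (the prefix below is just a
documentation example):

    # give the daemon an IPv6 subnet; each container then gets its own
    # routable v6 address from that range
    docker daemon --ipv6 --fixed-cidr-v6="2001:db8:1::/64"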

If your provider doesn't support IPv6, please try a tunnel provider like
[https://tunnelbroker.net/](https://tunnelbroker.net/) to get your very own
IPv6 address space.

Spend the time to learn IPv6, you won't regret it 5-10 years down the road...

~~~
epistasis
And what about those environments where only IPv4 is available, as
specifically addressed in this article?

There are lots of overlay networks; are those all hacks too?

> Except now you have to manually add addresses (static IPv4 addresses!) of
> the hosts in the route table?

This does not appear to be true at all, based on the configuration that's
posted at the bottom of the article.

~~~
kbaker
The overlay networks are not necessarily hacks - just souped-up, more
distributed, auto-configured VPNs. Also, especially in flannel's case, you
hand it IPv4 address space to use for the whole network, so there is a bit
more coordination of which space gets used.

Even so, there are lots of ways to get IPv6 now. I would think that anywhere
you could use this fan solution to change firewall settings and route tables
on the host, you could also set up an IPv6 tunnel or address space, even with
some workarounds for not having a whole routed subnet, like using Proxy NDP.

It seems like a much more future-proof solution than working with something
like this. Just my 2c...

~~~
epistasis
You do not appear to be familiar with the problem domain this addresses, and
I think the fan device addresses the problem very, very well compared to its
competitors! It's nothing like a VPN; it's just IP encapsulation without any
encryption or authentication. And it's far, far less of a hack than the
distributed databases currently used for network overlays, like Calico or
MidoNet or all those other guys, IMHO. For example, take this sentence from
the article:

> Also, IPv6 is nowhere to be seen on the clouds, so addresses are more scarce
> than they need to be in the first place.

There are a lot of people using AWS who are not in control of the entire
network. If they _were_ in control of the entire network, they could just
assign a ton of internal IP space to each host. IPv6 is great, sure, but if
it's not on the table, it's not on the table.

We will be testing the fan mechanism very soon, and it will likely be used as
part of any LXC/Docker deploy, if we ever get to deploying them in production.

~~~
vosper
I don't know much about networking, but if you're on AWS and are using VPC
then don't you have full control of the entire (virtual) network?

~~~
epistasis
Sure, but it's IPv4 only:

> Additionally, VPCs currently cannot be addressed from IPv6 IP address
> ranges.

[http://aws.amazon.com/vpc/faqs/](http://aws.amazon.com/vpc/faqs/)

And then you still have the problem of only so many IPs per host, so it
doesn't help with lots of containers.

~~~
rwmj
Anyone got the inside story on why in 2015 Amazon doesn't support IPv6?

~~~
falcolas
IPv6 is hard. It's hard to optimize, it's hard to harden, and it's hard to
protect against.

One small example: How do you implement an IPv6 firewall which keeps all of
China and Russia out of your network? (My apologies to folks living in China
and Russia; I've just seen a lot of viable reasons to do this in the past.)

Another small example: How do you enable "tcp_tw_recycle" or "tcp_tw_reuse"
for IPv6 in Ubuntu?
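
(Both of those knobs live under the net.ipv4 sysctl tree; there is no
net.ipv6 counterpart to flip:)

    # v4-only tuning knobs; no net.ipv6 equivalent exists
    sysctl -w net.ipv4.tcp_tw_reuse=1
    sysctl -w net.ipv4.tcp_tw_recycle=1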

~~~
stormbrew
None of this really applies to VPC (which is a private virtual network for
only your own hosts, where access is restricted lower down than the IP
layer). You actually can have a public IPv6 address on AWS; it just has to go
through ELB.

~~~
teraflop
You actually have it a bit backwards: you can only assign an IPv6 address to
an ELB if it's _not_ in a VPC.

[http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-internet-facing-load-balancers.html](http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/elb-internet-facing-load-balancers.html)

Crazy, right? Especially since new customers are forced to use VPC and don't
even have the option of falling back to EC2-Classic.

~~~
stormbrew
To be clear, I was not saying that you can give an ELB in a VPC an IPv6
address. I was saying you can give a non-VPC ELB an IPv6 address. Basically I
was pointing out that, however imperfect, Amazon has chosen to prioritize
public access to IPv6 over private use of it.

~~~
teraflop
Ah, sorry for the misunderstanding then.

------
ademarre
I'd like to see a better explanation of how this compares to the various
Flannel backends
([https://github.com/coreos/flannel#backends](https://github.com/coreos/flannel#backends)),
and also how this would be plugged into a Kubernetes cluster.

~~~
dustinkirkland
IP-IP is the first encapsulation supported, but the Fan is engineered in a way
that any encapsulation scheme can be added easily. We'll be adding support for
VXLAN, GRE, and STP tunnels as well. My colleagues will have blog posts about
how to enable Kubernetes clusters with Fan Networking.

------
regularfry
Or you could go somewhere with IPv6. The number of places with an IPv4-only
restriction is only going to drop.

~~~
falcolas
That's been the promise for... how many years now? IIRC, it was before EC2
even existed, and we're obviously not there yet.

Also, it's worth noting that IPv6 is nowhere near as battle-hardened as IPv4;
there are too many optimization and security gaps to depend on it in
production. I've watched a few network gurus burn themselves out attempting
to harden a corporate network against IPv6 attacks while keeping it usable.

~~~
regularfry
> That's been the promise for... how many years now? IIRC, it was before EC2
> even existed, and we're obviously not there yet.

For some values of "we". If you're stuck on EC2, yeah, you've got a problem.

~~~
coldtea
But if you don't use EC2, but some bizarro provider no one really uses, then
you can have IPv6...

~~~
regularfry
Nobody ever got fired for buying IBM, right?

~~~
coldtea
Because AWS is some legacy dinosaur and not the world leader in such
infrastructure, right?

------
rsync
"Also, IPv6 is nowehre to be seen on the clouds, so addresses are more scarce
than they need to be in the first place."

We've[1] had IPv6-addressable cloud storage since 2006.

Currently our US (Denver), Hong Kong (Tseung Kwan O), and Zurich locations
have working IPv6 addresses.

[1] You know who we are.

~~~
nacs
What percentage of the traffic actually flows through the IPv6 endpoints
compared to v4 -- <10%, I'd guess? (Just curious.)

------
paulasmuth
I don't seem to get it. How is this different from just using a non-routed IP
per container?

~~~
maccam94
To build on epistasis's comment a bit: this creates a private network for all
containers on all hosts that reside in the same /16 network. So if you have a
VPC of up to ~65k machines, each machine can run up to ~250 containers, and
they can all talk directly to each other by relying on basic network routing.
This is better than your typical private NAT bridge networking because
containers on different hosts can talk to each other without having to set up
port forwarding or discover which port a particular application server is
running on.
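
The arithmetic lines up exactly:

    # 2^16 hosts (a /16 underlay) x 2^8 addresses per host (a /24 slice)
    # = 2^24 addresses, which is exactly one /8 overlay
    echo $(( 2**16 * 2**8 ))   # 16777216
    echo $(( 2**24 ))          # 16777216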

------
api
Probably doesn't matter much here, but 240.0.0.0/4 is _hard coded to be
unusable_ on Windows systems. It's in the Windows IP stack somewhere. Packets
to/from that network will simply be dropped.

~~~
dustinkirkland
You can use absolutely any /8 that you want. I used 250.0.0.0/8 in my
examples, but the FanNetworking wiki page uses 10.0.0.0/8. Have at it ;-)

~~~
api
On Windows?

------
stephengillie
I've read the article twice. Did they just reinvent putting DHCP behind a NAT?
What does that combination of systems not do that Fan does?

    
    
  * Remap 50 addresses from one range to another.
  * Dynamically assign those addresses to servers.
  * Special something that Fan does.
    

What's the benefit of using a full class A subnet when you are only using 250
addresses?

~~~
dustinkirkland
Sure, DHCP/NAT is used by each container to get out. But how does it route to
another container on some other host elsewhere in your cloud? That's what the
Fan addresses.
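
Roughly speaking (a sketch only; the device name below is made up, not the
actual Fan interface name): every host carries a single route that sends the
whole fan /8 into an IP-in-IP tunnel device, and the destination host is
computed from bits of the overlay address rather than looked up in any
database:

    # one route per host covers every container on every peer host;
    # no per-peer tunnel entries, no distributed state ("ftun0" is a
    # hypothetical device name)
    ip route add 250.0.0.0/8 dev ftun0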

~~~
stephengillie
> Sure, DHCP/NAT is used by each container, to get out.

So...Each physical host has its software networking pass through a NAT before
hitting the physical adapter? And it's already using DHCP to assign addresses
to its containers?

> But how does it route to another container on some other host elsewhere in
> your cloud? That's what the Fan addresses.

You use DNS to create a lookup table matching container to IP? Since that
isn't what's being done, it presumably doesn't work. What does Fan do instead?

Is there some other complicating factor of which I'm ignorant? Are we talking
about having multiple Kubernetes clusters inside containers inside VMs inside
a physical host?

Also, HOW does Fan address this problem? What does it use instead of one kind
of database lookup or another, like a distributed database system such as DNS?
Or does Fan use fancy subnet math?

------
geku
Seems to be a smart solution, but it only works when you have control over
the "real" /16 network, if I understand it correctly? E.g. having multiple
nodes on multiple cloud providers, with completely different IP addresses not
in the same /16 network, will not work, correct?

------
GauntletWizard
Why do people keep giving whole IP addresses to every little container? It's
a terrible management paradigm compared to service discovery and using host
ports for every address.

~~~
falcolas
Because the alternative methods for handling inter-container communication
are even bigger messes.

The Docker 1.6 solution involved a double NAT and relied on iptables,
resulting in some fairly serious bottlenecks and pathological edge cases. It
also required a third-party solution for discovering the IPs of services.

Opening up containers to access the host network interfaces breaks the
encapsulation promises of containers, and is thus not an option for everyone.
Conceptually, it also creates holes in the idempotent service model, since
services have to be aware of port conflicts.

The one-IP-per-container model (VXLAN, Flannel, Kubernetes, Docker 1.7, etc.)
is one of the more effective methods of countering the problem, at the cost
of guzzling IP address space and requiring a gateway to escape the virtual
network tunnel.

~~~
otterley
> Opening up containers to access the host network interfaces breaks the
> encapsulation promises of containers

Who made that promise? It was never a feature of Linux containers to
virtualize the NIC.

~~~
jpgvm
You would be surprised by the number of people who don't understand that
containers are a form of namespacing + isolation, not a form of
virtualisation.

------
rcarmo
This is very neat indeed, and I'd love to try it out, but the Launchpad links
are broken. Anyone know where I can get the package for Ubuntu armhf? Or the
source?

