
One way to make containers network: BGP - bartbes
http://jvns.ca/blog/2016/07/16/calico/
======
chrissnell
It's really not that difficult to network containers. We're using flannel [1]
on CoreOS. We're using flannel's VXLAN backend to encapsulate container
traffic. We're Kubernetes users, so every kube pod [2] gets its own subnet and
flannel handles the routing between those subnets, across all CoreOS servers
in the cluster.
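
To make that concrete, here's a rough sketch of the kind of address layout flannel manages, using Python's ipaddress module (the network, host names, and /24-per-host split are illustrative assumptions; real flannel keeps its leases in etcd):

    import ipaddress

    # Illustrative only: one /24 lease per host, carved out of a 10.1.0.0/16
    # cluster network; each pod then gets an IP out of its host's lease.
    cluster_net = ipaddress.ip_network("10.1.0.0/16")
    hosts = ["coreos-1", "coreos-2", "coreos-3"]

    leases = dict(zip(hosts, cluster_net.subnets(new_prefix=24)))

    for host, lease in leases.items():
        # With the VXLAN backend, traffic destined for this lease is
        # encapsulated and sent to the host that owns it.
        print(f"{host}: pod IPs come from {lease}")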

I was skeptical when we first deployed it but we've found it to be dependable
and fast. We're running it in production on six CoreOS servers and 400-500
containers.

We did evaluate Project Calico initially but discovered some performance tests
that tipped the scales in favor of flannel. [3] I don't know if Calico has
improved since then, however. This was about a year ago.

[1] [https://github.com/coreos/flannel](https://github.com/coreos/flannel)

[2] A Kubernetes pod is one or more related containers running on a single
server

[3] [http://www.slideshare.net/ArjanSchaaf/docker-network-performance-in-the-public-cloud](http://www.slideshare.net/ArjanSchaaf/docker-network-performance-in-the-public-cloud)

~~~
moondev
Is flannel used in Kubernetes for networking by default? Or is it something
that needs to be enabled and configured separately?

~~~
amouat
Kubernetes has a requirement that containers (more accurately "pods") can
connect via a "flat networking space". How this is achieved varies between
deployments; flannel, Calico and Weave are all common approaches. Kelsey
Hightower's "Kubernetes the Hard Way" simply configures it at the router
level: [https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/07-network.md](https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/07-network.md)
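
To give a feel for what "configured at the router level" means, here's a minimal sketch that prints one static route per node, pointing each node's pod CIDR at that node's address (the node IPs and CIDRs are illustrative assumptions; the hard-way guide does the equivalent with cloud-provider route entries):

    # Hypothetical node -> pod CIDR assignment.
    pod_cidr_by_node = {
        "10.240.0.20": "10.200.0.0/24",
        "10.240.0.21": "10.200.1.0/24",
        "10.240.0.22": "10.200.2.0/24",
    }

    for node_ip, pod_cidr in pod_cidr_by_node.items():
        # On a plain Linux router this is all the "flat networking space" needs.
        print(f"ip route add {pod_cidr} via {node_ip}")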

~~~
moondev
That makes sense. Thanks for the link as well, I've been looking for something
exactly like it. Looks like a great resource!

------
chris_marino
Another solution to this problem is Romana [1] (I am part of this effort). It
avoids overlays as well as BGP because it aggregates routes. It uses its own IP
address management (IPAM) to maintain the route hierarchy.

The nice thing about this is that nothing has to happen for a new pod to be
reachable. No /32 route distribution or BGP (or etcd) convergence, no VXLAN ID
(VNID) distribution for the overlay. At some scale, route and/or VNID
distribution is going to limit the speed at which new pods can be launched.
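
A tiny sketch of why aggregation means nothing has to propagate when a new pod starts (the addresses are made up, and real IPAM is of course more involved):

    import ipaddress

    # Each host advertises one aggregate prefix up front, and its local IPAM
    # only hands out pod addresses from inside that prefix.
    host_aggregate = ipaddress.ip_network("10.112.3.0/24")  # advertised once
    new_pod_ip = ipaddress.ip_address("10.112.3.17")        # allocated locally

    # Upstream routers already cover the aggregate, so no /32 route, BGP
    # update, or VNID mapping has to be distributed for the new pod.
    assert new_pod_ip in host_aggregate
    print(f"{new_pod_ip} is already covered by the route for {host_aggregate}")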

One other thing not mentioned in the blog post or in any of these comments is
network policy and isolation. Kubernetes v1.3 includes the new network APIs
that let you isolate namespaces. This can only be achieved with a back end
network solution like Romana or Calico (some others as well).

[1] romana.io

------
crb
On the topic of "why do we need a distributed KV store for an overlay
network?" from the blog: there's a good blog post about why Kubernetes doesn't
use Docker's libnetwork.

[http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-libnetwork.html](http://blog.kubernetes.io/2016/01/why-Kubernetes-doesnt-use-libnetwork.html)

~~~
jaytaylor
Thanks! I had almost forgotten about the container networking mayhem. Would
love to find out what progress has been made over the past six months since
the blog post was written.

~~~
bboreham
Kubernetes now uses CNI to configure interfaces in its most common "kubenet"
configuration, and (obviously) when you put it into "CNI" mode.

Most container network offerings - Calico, Flannel, Weave, etc. - ship with a
CNI plugin.

Docker have not altered their network plugin API.

(I work on Weave Net, including the plugins for both Docker and CNI)

------
jlgaddis
BGP seems a needlessly complex solution to this problem. VXLAN would, IMO, be
a much better fit.

(--Network engineer who manages BGP for an ISP)

~~~
ymse
I don't know. With VXLANs you need a whole lot of other infrastructure to
manage e.g. routes, load balancing or tenant-specific ACLs. By coercing BGP
into providing isolation, you can slap on routing, load balancing, ACLs "for
free", i.e. managed by the same control plane.

If you just need isolation I agree with you. But I actually find the Calico
solution rather elegant when looking at the whole package.

(--System administrator who manages VXLAN on a public cloud)

------
mrmondo
We're just about to switch to BGP internally using Calico (mentioned in
another comment; I believe performance is good now). We currently run around
300-600 containers on our own implementation built with Consul+Serf. We'll
drop a talk on it once we've made the switch, if anyone is interested. We're
deliberately avoiding flannel because of the tunnelled networking and added
complexity that we don't feel we want to introduce.

------
e12e
I've long wondered if anyone has successfully gone full IPv6-only with a
substantial container/VM roll-out. On paper it should have:

1) enough addresses. Just enough. For everything. For everyone. Google-scale
enough.

2) Good out-of-the-box dynamic assignment of addresses.

And finally, optional integration with IPsec, which I get might in the end be
over-engineered and under-used -- but wouldn't it be nice if you could just
trust the network? (You'd still have to bootstrap trust somehow, probably by
running your own x509 CA -- but how nice to be able to flip open any book on
networking from the 80s, replace the IPv4 addressing with IPv6, and just go
ahead and use plain rsh and /etc/hosts.allow as your entire infrastructure for
actually secure intra-cluster networking -- even across data-centres and what
not. [ed: and secure NFSv3! woo-hoo])

But anyway, has anyone actually done this? Does it work (for a meaningfully
large value of work)?

~~~
otterley
You don't really need IPv6 to do this. IPv4 is sufficient; you just need to
assign more than 1 IP address to your physical interfaces. This has been
possible since at least the early 1990s.

The problem is that many cloud providers (ahem EC2) don't make this trivially
easy like they should.
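
For comparison, on a plain Linux box adding extra addresses to a single interface is a one-liner per address. A minimal sketch with the third-party pyroute2 library (the interface name and addresses are assumptions, and it needs root):

    from pyroute2 import IPRoute

    ipr = IPRoute()
    idx = ipr.link_lookup(ifname="eth0")[0]

    # One extra address per container, all on the same physical interface --
    # the moral equivalent of `ip addr add 10.0.1.x/24 dev eth0`.
    for last_octet in range(100, 105):
        ipr.addr("add", index=idx, address=f"10.0.1.{last_octet}", prefixlen=24)

    ipr.close()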

~~~
tw04
They don't make it trivial because we're out of IPv4 address space. Which
would be the reason for doing this with IPv6. The automatic addressing of IPv4
is also nowhere near as simple as v6, not even close.

~~~
otterley
Both AWS and GCE can allocate private 10.0.0.0/8 subnets for internal networks
(e.g. VPCs). There is no address scarcity in such subnets.

~~~
e12e
There's a big difference between routable (on the Internet) addresses and non-
routable addresses. For one, you can trivially merge two different sets of
resources (say from two different organizations or projects) without address
conflicts, if they both have globally unique, routable addresses.

LAN networking is great. Internet networking is so much better; it has
effectively given birth to a new technological age.

~~~
otterley
I don't disagree on these particular points. My contention is solely that lack
of IPv6 is not a showstopper.

~~~
tw04
It absolutely is a showstopper if you don't have applications that are NAT
friendly. And in 2016 if you are forcing people to use NAT and claiming that's
a fix, you're doing the entire world a disservice.

~~~
otterley
I've been hearing this argument for 15 years now and it hasn't changed. Sure,
there's a public address shortage these days, but the incessant whining by
end-to-end purists who think NAT is an unbearable sin falls on deaf ears just
as much now as it ever did.

I'm not against V6 by any means, but internal IPv4 private subnets are simply
not a problem for the 99+% of us who still somehow manage to get things done in
the real world.

------
lobster_johnson
BGP looks really complex. Isn't OSPF (BGP's "little brother") a much more
attractive choice here? It's still complex, but should be much simpler.

Another attractive alternative to Flannel is Weave [1], run in the simpler
non-overlay mode. In this mode, it won't start an SDN, but will simply act as a
bridge/route maintainer, similar to Flannel.

[1] [https://www.weave.works/products/weave-net/](https://www.weave.works/products/weave-net/)

~~~
detaro
BGP IMHO is much simpler than OSPF. No different area types, support for
communities, no need to keep a link-state database for the entire network in
all nodes, ...

~~~
tptacek
People keep talking about link state database overhead, but how significant is
this in reality? The graphs we're talking about, even in huge deployments, are
small.

If you're running etcd or consul, I'm not sure you retain the right to call
LSA flooding "complicated". It's simple compared to Raft!

~~~
detaro
Size probably doesn't matter that much until you start to fill entire
datacenters; OSPF nowadays should work with hundreds of routers as well. It
would be interesting to see how the failure cases compare -- if I remember
right, one of the arguments for BGP in data-center fabrics was that the
updates following a failure stay more localized. (EDIT: a description of how
Microsoft uses it for really large networks; slide 11 talks about surges:
[http://www.janog.gr.jp/meeting/janog33/doc/janog33-bgp-nkposong-1.pdf](http://www.janog.gr.jp/meeting/janog33/doc/janog33-bgp-nkposong-1.pdf))

I find BGP easier to understand, and I don't see what benefit OSPF would have.
(Not that I really have non-trivial experience with either; I've only used
them at home and in toy networks.)

------
delinka
Have I misunderstood something here? We don't use BGP on local networks. Via
ARP, a node asks "who has $IP?" Something answers with a MAC address. The
packet for $IP is wrapped in an Ethernet frame for that MAC address. If the IP
isn't local to your network, your router answers with its own MAC, and the
packet is framed up for the router.
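
That "who has $IP?" step is easy to watch for yourself. A small sketch with the third-party scapy library (the address is made up, and it needs root):

    from scapy.all import ARP, Ether, srp

    # Broadcast "who has 10.0.1.104?" and see which MAC answers. Whoever
    # replies (the container's host, or a proxy-ARPing router) is where
    # frames for that IP get sent.
    answered, _ = srp(
        Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="10.0.1.104"),
        timeout=2,
        verbose=False,
    )
    for _, reply in answered:
        print(f"10.0.1.104 is at {reply.hwsrc}")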

BGP is the process by which ranges of IPs are claimed by routers. Is Calico
really used by docker containers in this way?

~~~
iheartmemcache
Yeah, have an upvote. This is totally a case of "using an industrial core
drill with tungsten carbide bits when all you need is an Ikea drill and a
Chinese cheese-grade bit".

Sounds like this guy just found out about a cool new tool and decided to blog
about it. BGP can be used on local networks, but it's total overkill for a
docker situation where all of your instances are likely in the same rack
(often on the same machine!). If you don't even have an AS from ARIN/RIPE
there's no reason to even touch this (as you pointed out, announcing to the
_public internet_ is all the protocol was designed for -- e.g. 'hey, I own
this AS which has rights to this net-block, direct packets in this fashion
please!'). Jeez.

I have no idea what the CPU overhead of running this is but I'm sure it's not
trivial, especially if the BGP daemon is tuned to retain any significant
amount of the whole BGP table (RAM/swap issues galore, I'd imagine). Granted,
the article is titled 'one way'... which is empirically true; it's a Rube
Goldberg way of going about networking.

(n.b., OSPF is what people use for "BGP" within your own intranet, even when
you have tens of thousands of boxes. It's called _"Border Gateway Protocol"_
for a reason...)

(Sorry, I don't use docker so I can't actually make a constructive comment
telling you what the canonical/_right_ way of doing it is, but I can assure
you, this is not it.)

~~~
jauer
(I didn't downvote you, but this might be why someone did...)

Eh, BGP on local networks is common enough that it was novel maybe a decade
ago. It's perfect for running on servers to announce /32 addresses upstream to
ToR switches. OSPF is actually more heavyweight & complex since you have to
carry link state, do neighbor elections, etc. Ref: NANOG talk in 2003:
[https://www.nanog.org/meetings/nanog29/presentations/miller....](https://www.nanog.org/meetings/nanog29/presentations/miller.pdf)

You don't even need an AS from your RIR for BGP to be useful on internal
networks, just pick one (or more) of the private ASNs and roll with it.

Current best practice for internal networks (on the SP side at least) is to
use OSPF to bootstrap BGP by enabling IP reachability for p2p & loopback
addresses. After that customer / service routes are carried by BGP, not by
OSPF. This is because BGP doesn't spam link state and has better policy knobs.
You get a double win because OSPF converges faster with fewer routes and if
you have redundant links your BGP sessions won't flap because the loopback
addresses stay reachable. Ref: NANOG talk in 2011:
[https://www.nanog.org/meetings/nanog53/presentations/Sunday/...](https://www.nanog.org/meetings/nanog53/presentations/Sunday/bgp-101-NANOG53.pdf)

CPU/RAM overhead is insignificant with a bgpd like BIRD or Quagga. They work
for internet-scale routing (currently over 610,000 prefixes) with trivial
load. An Atom C-series with a few GB of RAM can deal with internet BGP routing
updates (CPU becomes significant for packet forwarding, not so much
maintaining routing tables).

I'll take a boring 10-year-old routing setup -- iBGP on a cluster of route
reflectors running BIRD, with servers numbered out of a /24 per rack routed by
a ToR L3 switch, and each service using BGP to announce a /32 -- over new and
exciting L2 overlay networks any day. Troubleshooting is easier without having
to work through different layers of encapsulation, dealing with MTU, trusting
a novel control plane for whatever overlay network you're using, etc.
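
The "announce a /32 per service from the server itself" pattern is only a few lines with something like ExaBGP, which runs a helper process and reads announcements from its stdout. A rough sketch (the service IPs are made up, and a real setup still needs an ExaBGP config with a neighbor defined):

    import sys
    import time

    SERVICE_IPS = ["10.10.0.4", "10.10.0.5"]

    # Tell ExaBGP to originate a /32 for each service IP this box is serving.
    for ip in SERVICE_IPS:
        sys.stdout.write(f"announce route {ip}/32 next-hop self\n")
        sys.stdout.flush()

    # Stay alive; on a failed health check you'd emit the matching
    # "withdraw route <ip>/32 next-hop self" line instead.
    while True:
        time.sleep(60)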

~~~
gonzo
> CPU/RAM overhead is insignificant with a bgpd like BIRD or Quagga. They work
> for internet-scale routing (currently over 610,000 prefixes) with trivial
> load. An Atom C-series with a few GB of RAM can deal with internet BGP
> routing updates (CPU becomes significant for packet forwarding, not so much
> maintaining routing tables).

We've got a DPDK-enabled vRouter that will do over 12 Mpps on a C2758 (8-core)
with a full BGP table.

------
sargun
More on this here: [https://medium.com/@sargun/a-critique-of-network-design-ff8543140667#.2fwstossu](https://medium.com/@sargun/a-critique-of-network-design-ff8543140667#.2fwstossu)
-- BGP isn't just about containers. It's about signaling. It's a mechanism for
machines to influence the flow of traffic in the network.

This isn't container weirdness. This is because networks got stuck in 2008. We
still don't have IPv6 SLAAC. Many of us made the jump to layer 3 Clos fabrics,
but stopped after that. My belief is that this is because AWS EC2, Google GCE,
Azure Compute, and others consider this the gold standard.

IPv6 natively supports autoconfiguring multiple IPs per NIC / machine
automagically. This is usually on by default as part of the privacy
extensions, so in conjunction with SLAAC, you can cycle through IPs quickly.
It also makes multi-endpoint protocols relevant.

Containers and bad networking because of the lack of an IP per container is a
well-known problem; it's even touched on in the Borg paper, briefly:

> One IP address per machine complicates things. In Borg, all tasks on a
> machine use the single IP address of their host, and thus share the host's
> port space. This causes a number of difficulties: Borg must schedule ports as
> a resource; tasks must pre-declare how many ports they need, and be willing
> to be told which ones to use when they start; the Borglet must enforce port
> isolation; and the naming and RPC systems must handle ports as well as IP
> addresses.

> Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined
> networking, Kubernetes can take a more user-friendly approach that eliminates
> these complications: every pod and service gets its own IP address, allowing
> developers to choose ports rather than requiring their software to adapt to
> the ones chosen by the infrastructure, and removes the infrastructure
> complexity of managing ports.
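
A toy sketch of the "ports as a schedulable resource" complication described above (all names here are hypothetical):

    # Every task shares the host's IP, so the scheduler hands out ports and
    # tasks must use whatever they're given.
    class HostPorts:
        def __init__(self, low=20000, high=30000):
            self.free = set(range(low, high))

        def allocate(self, count):
            if len(self.free) < count:
                raise RuntimeError("host is out of ports")
            return [self.free.pop() for _ in range(count)]

        def release(self, ports):
            self.free.update(ports)


    host = HostPorts()
    task_ports = host.allocate(2)  # the task pre-declared it needs 2 ports
    # ...and the naming/RPC layer now has to carry these port numbers too.
    print("task must listen on", task_ports)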

But, I ask, what's wrong with the Docker approach of rewriting ports?
Reachability is our primary concern, and unfortunately BGP hasn't become the
lingua franca for most networks ("The Cloud"). I actually think ILA
([https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#section-4.5](https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#section-4.5))
and ILNP (RFC 6741) are the most interesting approaches here.

~~~
bboreham
> what's wrong with the Docker approach of rewriting ports

It requires that you rewrite the software trying to talk to that port, to make
it aware that you've put the new port number in a special environment
variable.

~~~
sargun
Have you looked at Docker bridge mode? And Mesosphere's VIPs?

What do you think of them?

~~~
bboreham
Docker bridge only works between containers on one machine; this is exactly
why we wrote Weave Net two years ago, to let you network simply between
containers running anywhere.

I hadn't considered using Virtual IPs to reverse out port-mapping. I guess it
would work provided you have good connectivity between hosts - it would be a
nightmare to try to configure a firewall where the actual ports in use jump
around all the time.

Also such schemes require that you know in advance which ports each component
listens on, and that there are no loops in the graph. Both of these
requirements can be constraining.

------
NetStrikeForce
Or you could NAT on the host and deploy simpler overlay networking:
[https://github.com/pjperez/docker-wormhole](https://github.com/pjperez/docker-wormhole)

You can deploy this on any machine (container or not) and have it always
reachable from other members of the same network, which could be, e.g., servers
on different providers (AWS, Azure, Digital Ocean, etc.).

~~~
q3k
You should probably mention that this is a PaaS.

(and maybe also that you are affiliated with them)

~~~
NetStrikeForce
Hi,

Sorry, I should have made it explicit. As it's my own repo and my profile's
email address gives away that I'm part of Wormhole, I didn't think to make a
statement in the comment; but you're right.

Thanks!

------
tptacek
Especially since there isn't really a policy-routing component to this, isn't
BGP pretty _extremely_ complicated for the problem Calico is trying to solve?

Stipulating that you need a routing protocol here (you don't, right? You can
do proxy ARP, or some more modern equivalent of proxy ARP), there's a whole
family of routing protocols optimized for this scenario, of which OSPF is the
best-known.

~~~
X-Istence
Calico is installing routes in the Linux kernel. Those routes are pulled out
and distributed using BIRD. BIRD can do OSPF instead if you'd like.

All Calico cares about is that routes are distributed across various systems,
they don't necessarily care how you do it (configure BIRD however you'd like).
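
For a sense of what those kernel routes look like, here's a minimal sketch with the third-party pyroute2 library (the interface name and addresses are assumptions; in a real Calico deployment the agent and BIRD do this, not a hand-written script):

    from pyroute2 import IPRoute

    ipr = IPRoute()

    # Local pod: a /32 route pointing at the pod's veth interface.
    cali = ipr.link_lookup(ifname="cali1234")[0]
    ipr.route("add", dst="10.0.1.104/32", oif=cali)

    # Remote pods: prefixes learned over BGP get installed as ordinary
    # next-hop routes towards the host that owns them.
    ipr.route("add", dst="10.0.2.0/24", gateway="192.168.1.21")

    ipr.close()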

BGP is surprisingly simple and easy to set up with BIRD. Set up a route
reflector with local hosts on the same L2 all peering with each other, and
suddenly you can route whatever IPs you want by announcing them to your peers.

Why do people think BGP is complicated?

~~~
hueving
>Why do people think BGP is complicated?

Read your own paragraph before this question. Why do I need to run another
process to exchange routes and configure a mesh or a route reflector? As an
admin that's just another mess of processes and communication to worry about.

Just because BGP is easy for you does not mean it's easy for most server
admins and devs without heavy networking backgrounds.

~~~
X-Istence
Wait, what? How else should we be exchanging routes? Should we shove them into
a distributed key-value store and then have each of the nodes pull out the
routes and install them?

> As an admin that's just another mess of processes and communication to worry
> about.

Yet we fully expect admins to understand and build HA redundant clusters for
databases, or how to manage and update all the machines under their control,
and a variety of other tasks.

There is nothing inherently different about running a BGP speaking daemon.
It's all config.

I don't have a heavy networking background at all. I'm a software engineer
that's currently working as a system architect, but even I can understand
something as simple as a route distribution system.

------
cthalupa
There's a lot of misinformation in this.

>A Linux container is a process, usually with its own filesystem attached to
it so that its dependencies are isolated from your normal operating system. In
the Docker universe we sometimes talk like it's a virtual machine, but
fundamentally, it's just a process. Like any process, it can listen on a port
(like 30000) and do networking.

A container isn't a process. It's an amalgamation of cgroups and namespaces. A
container can have many processes. Hell, use systemd-nspawn on a volume that
contains a Linux distro and your container is basically the entire userspace
of a full system.
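
A quick way to see the "namespaces, not a single process" point for yourself (assuming util-linux's unshare is installed and you're root):

    import subprocess

    # Start a shell in fresh PID and mount namespaces. Inside it, `ps` shows
    # the shell as PID 1 and it can spawn as many processes as it likes --
    # nothing about the "container" is tied to a single process.
    subprocess.run(
        ["unshare", "--pid", "--fork", "--mount-proc", "bash"],
        check=True,
    )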

>But what do I do if I have another computer on the same network? How does
that container know that 10.0.1.104 belongs to a container on my computer?

Well, BGP certainly isn't a hard requirement. Depending on how you've set up
your network, if these are in the same subnet and can communicate via layer 2,
you don't need any sort of routing.

>To me, this seems pretty nice. It means that you can easily interpret the
packets coming in and out of your machine (and, because we love tcpdump, we
want to be able to understand our network traffic). I think there are other
advantages but I'm not sure what they are.

I'm not sure where the idea that Calico/BGP is required to look at network
traffic for containers on your machine came from. If there's network traffic
on your machine, you can basically always capture it with tcpdump.

> I find reading this networking stuff pretty difficult; more difficult than
> usual. For example, Docker also has a networking product they released
> recently. The webpage says they're doing "overlay networking". I don't know
> what that is, but it seems like you need etcd or consul or zookeeper. So the
> networking thing involves a distributed key-value store? Why do I need to
> have a distributed key-value store to do networking? There is probably a
> talk about this that I can watch but I don't understand it yet.

I think not at all understanding one of the major players in container
networking is a good indication it might not yet be time to personally write a
blog about container networking. Also absent is simple bridging.

Julia generally writes fantastic blogs, and I know she doesn't claim to be an
expert on this subject and includes a disclaimer about how this is likely to
be more wrong than usual, but I feel like there was a lot of room for
additional research to be done to produce a more accurate article. I
understand the blog is mostly about what she has recently learned, and often
has lots of questions unanswered... But this one has a lot of things that are
answered, incorrectly :(

~~~
amouat
I've never read any of her blogs before (that I can recall) and I do agree
there are some misunderstandings. But it was clearly written as a brain dump
to help other people going through the same process, and it largely achieves
that goal. I really don't like the concept of "don't write a blog unless
you're an expert" -- we would lose out on lots of valuable discussions and
helpful articles, especially for beginners. For example, we wouldn't have this
HN thread with useful commentary if it wasn't for the author.

I think the best approach is to constructively comment and engage to improve
the article. The important thing is to do this in a positive manner so that
the author feels they have done something of value and started a conversation.
It is surprisingly difficult to do this, and I've certainly failed on
occasion, but it is definitely worth trying.

~~~
cthalupa
I'm certainly not suggesting the need to be an expert to write a blog - I just
personally would be more cautious with some statements if I were writing on
something I wasn't knowledgeable in.

------
philip1209
The internal OpenDNS docker system, Quadra, relies on BGP for a mix of on-prem
and off-prem hosting:

[http://www.slideshare.net/bacongobbler/docker-with-bgp](http://www.slideshare.net/bacongobbler/docker-with-bgp)

------
otterley
The real problem is that cloud providers don't provide out-of-the-box
functionality to assign more than one IP to a network interface. If they did
this, there wouldn't even be an issue.

I've been requesting this feature from the EC2 team at AWS for some time, to
no avail. You can bind multiple interfaces (ENIs) to an instance (up to 6, I
think, depending on the instance size), each with a separate IP address, but
not multiple IPs to a single interface.

BGP, flannel, vxlan, etc. are IMO a waste of cycles and add needless
complexity to what could otherwise be a very simple architecture.

~~~
feisuzhu
It's not a waste of cycles. You are just pushing your responsibilities to AWS;
the cycles and memory for routing your additional IPs are required anyway.

~~~
otterley
Why do you think that maintaining the network should be my responsibility, as
opposed to the provider's?

------
dozzie
Oh boy. And containers were supposed to make things _easier_.

~~~
api
You could just have every container get a magic IPv6 address that just works.

[https://www.zerotier.com/community/topic/67/zerotier-6plane-ipv6-addressing](https://www.zerotier.com/community/topic/67/zerotier-6plane-ipv6-addressing)

Full disclosure: this is ours.

~~~
catern
Or you could just give every container a real IPv6 address, no need for any
magic...

~~~
api
Nothing against that either but some people can't do it due to hybrid
deployments, providers that don't give you a /64, or providers that don't
offer V6 at all.

Currently Amazon, Google, and Azure have no native IPv6 support.

Also, many are allergic to the security implications. You have to be rigorous
with ip6tables and make sure everything speaks SSL or another encrypted
protocol using authentication in both directions. Many things do not support
SSL or don't support bidirectional auth.

Personally I doubt overlay networks are going away. Most backplane software
like databases, distributed caches and event servers, etc. offers literally no
security because it's all built with the assumption that it will run on a
secure backplane. I have personally railed against this for years but I've
found that it's a total waste of breath.

~~~
jeff_marshall
Which provider won't give you a /64 (that supports IPv6 at all)?

Even my residential cable provider gives me a /60. Anything less seems
absurd.

~~~
api
I've seen smaller ones do /128. Probably just cluelessness.

The bigger issue is the need for a secure backplane, which will remain until
all server software authenticates all sockets in a strong way.

~~~
jeff_marshall
I agree re: the secure backplane (though I would prefer to see it done outside
server software in a more comprehensive manner).

I'm surprised that the bar isn't set a bit higher inside the cloud provider
infrastructure for tenant separation at the network level. I suppose it boils
down to the lack of assurance at an even lower level (who trusts Xen these
days?) that seems unlikely to be fixed in the short term.

