I was skeptical when we first deployed it but we've found it to be dependable and fast. We're running it in production on six CoreOS servers and 400-500 containers.
We evaluated Project Calico initially, but some performance tests we found tipped the scales in favor of flannel. That was about a year ago, though, so I don't know if Calico has improved since then.
A Kubernetes pod is one or more related containers running on a single server.
However, if you run it on AWS, it can automatically configure a bridge (cbr0) and set up the VPC routing table for you.
GKE (Google's managed Kubernetes on Google Cloud) also handles this automatically.
There's also experimental support for Flannel built into K8s, which can be enabled with a flag. Not sure if it's worth using.
However, the OSS Kubernetes has code to configure routes on GCE, same as it does for AWS.
The nice thing about this is that nothing has to happen for a new pod to be reachable. No /32 route distribution or BGP (or etcd) convergence, no VXLAN ID (VNID) distribution for the overlay. At some scale, route and/or VNID distribution is going to limit the speed at which new pods can be launched.
One other thing not mentioned in the blog post or in any of these comments is network policy and isolation. Kubernetes v1.3 includes the new network APIs that let you isolate namespaces. This can only be achieved with a back end network solution like Romana or Calico (some others as well).
Most container network offerings (Calico, Flannel, Weave, etc.) ship with a CNI plugin.
Docker have not altered their network plugin API.
(I work on Weave Net, including the plugins for both Docker and CNI)
(--Network engineer who manages BGP for an ISP)
If you just need isolation I agree with you. But I actually find the Calico solution rather elegant when looking at the whole package.
(--System administrator who manages VXLAN on a public cloud)
That's not required here.
If you give me 10 machines on an L2 domain, I can set up a private network on top of those 10 machines and advertise which IP is where by sharing a routing table. I could of course manually add routes saying one /24 is located on Host 1 and another /24 is on Host 2, or...
What better way to share a routing table than with a route distribution protocol of some sort?
So plop BIRD with BGP on all of the machines, peer 'em, and have them pull routes out of the Linux routing table and insert routes as necessary.
Now if I spin up a container on Host 1, I advertise a /32 for that IP, and Host 2-10 can all know to forward packets for that /32 to Host 1. If I move that container or IP to Host 2, BGP announces it to all the other hosts and traffic starts flowing there instead.
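The setup described above can be sketched as a minimal BIRD 1.x config. This is a hedged illustration, not a vetted production config; the addresses, router ID, and private ASN are made up:

```
# Hypothetical BIRD config for Host 1 (10.0.0.1), iBGP-peering with Host 2.
router id 10.0.0.1;

protocol kernel {
    scan time 10;    # notice /32 routes added for new local containers
    import all;      # read container routes from the kernel table
    export all;      # install routes learned from peers
}

protocol device {
    scan time 10;
}

protocol bgp host2 {
    local as 64512;               # private ASN (RFC 6996)
    neighbor 10.0.0.2 as 64512;   # iBGP session to Host 2
    import all;
    export all;
}
```

With one such `protocol bgp` stanza per peer (or a single session to a route reflector), every host learns where every container /32 lives without any manual route management.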
There is no requirement that Calico (or BIRD rather) peer with any existing BGP infrastructure... you can of course do that, but there is no requirement.
Stop making BGP sound like it's some bad evil thing that's difficult to understand.
1) enough addresses. Just enough. For everything. For everyone. Google-scale enough.
2) Good out-of-the box dynamic assignment of addresses.
And finally, optional integration with ipsec, which I grant might in the end be over-engineered and under-used -- but wouldn't it be nice if you could just trust the network? You'd still have to bootstrap trust somehow, probably running your own x509 CA -- but how nice to be able to flip open any book on networking from the 80s, replace the ipv4 addressing with ipv6, and just go ahead and use plain rsh and /etc/hosts.allow as your entire infrastructure for actually secure intra-cluster networking -- even across data centres and what not. [ed: and secure nfsv3! wo-hoo]
But anyway, have anyone actually done this? Does it work (for a meaningfully large value of work)?
The problem is that many cloud providers (ahem EC2) don't make this trivially easy like they should.
People suffer from a serious case of Stockholm syndrome wrt. ipv4 addresses and non-routable networks and what-not. There are some (very few) good use cases for NAT -- in almost all other cases it just makes everything more complicated, for no real gain (well, you get to avoid buying networking equipment that supports ipv6...). And don't get me started on the "but it helps with security" crowd... If you need a firewall, get a firewall. Stop conflating it with accidental features of limited address space.
I would have experimented more with ipv6 on my personal server already, if only I could get broadband that actually supported ipv6 (apparently my DSLAM is from the 90s). But now I'm moving, so hopefully I can get that last bit sorted. If nothing else it would appear most 4g networks support ipv6.
For non-personal, non-limited use, one would probably need to set up a fleet of ipv4 proxies/load-balancers -- but I'd be more than happy if I could just move to ipv6 and stop caring about the rest of the luddites ;-)
The main feature draw (on paper) of ipv6 isn't that it enables anything new, it's just that it allows simple stuff to be simple (again). And radical simplicity can be a great feature.
LAN networking is great. Internet networking is so much better, it has effectively given birth to a new technological age.
I'm not against V6 by any means, but internal IPv4 private subnets are simply not a problem for the 99+% of who still somehow manage to get things done in the real world.
Another attractive alternative to Flannel is Weave, run in the simpler non-overlay mode. In this mode, it won't start an SDN, but will simply act as a bridge/route maintainer, similar to Flannel.
If you're running etcd or consul, I'm not sure you retain the right to call LSA flooding "complicated". It's simple compared to Raft!
I find BGP easier to understand, and I don't see what benefit OSPF would have. (Not that I really have non-trivial experience with either, have only used them at home and toy networks)
BIRD supports OSPF, so if you'd like to import/export routes using OSPF you can.
BGP is the process by which ranges of IPs are claimed by routers. Is Calico really used by Docker containers in this way?
Kubernetes enforces a specific rule: Each pod (a group of containers) must be allocated its own cluster-routable IP address. This vastly simplifies Docker setups: In a way, it containerizes the network, just like Docker containerizes processes. It's the only sane way to manage containers, in my opinion.
This system requires something that can hand out IPs and ensure that they're routable on every machine. That something can be done in different ways, ranging from extremely simple to rather complex. For example, you could have something that acts like a bridge and coordinates with other nodes to find available IPs, and simply keeps the routing table on the nodes themselves in sync with this shared database (Flannel can run in this mode). Or you could use an SDN-defined overlay network (e.g. Weave).
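To make the "simple" end of that spectrum concrete, here is a rough sketch of what a host-gateway-style backend does: given a shared map of node to pod subnet (the kind of data Flannel keeps in etcd), derive the host routes each node needs. The function name, node names, and subnet layout are all invented for illustration:

```python
# Hypothetical sketch of a host-gw style route sync: each node installs a
# route for every OTHER node's pod subnet, pointing at that node's IP.

def routes_for(me: str, subnets: dict[str, str],
               node_ips: dict[str, str]) -> list[str]:
    """Return the `ip route` commands this node would need to run."""
    cmds = []
    for node, subnet in sorted(subnets.items()):
        if node == me:
            continue  # the local subnet is reached via the bridge, not a route
        cmds.append(f"ip route add {subnet} via {node_ips[node]}")
    return cmds

subnets = {"host1": "10.244.1.0/24", "host2": "10.244.2.0/24"}
node_ips = {"host1": "192.168.0.11", "host2": "192.168.0.12"}
print(routes_for("host1", subnets, node_ips))
# ['ip route add 10.244.2.0/24 via 192.168.0.12']
```

The whole job reduces to keeping a handful of kernel routes in sync with a shared table, which is why this mode needs no encapsulation and no routing protocol at all (as long as all nodes share an L2 segment).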
Using a real routing protocol also immediately gives you access to traffic control, shaping, monitoring and redundancy tools, hardware support and knowledge that network administrators have been applying for years.
Sounds like this guy just found out about a cool new tool and decided to blog about it. BGP can be used on local networks, but it's total overkill for a Docker situation where all of your instances are likely in the same rack (often on the same machine!). If you don't even have an AS from ARIN/RIPE there's no reason to even touch this (as you pointed out, announcing to the public internet is what the protocol is designed for -- e.g. "hey, I own this AS which has rights to this net-block, direct packets in this fashion please!"). Jeez.
I have no idea what the CPU overhead of running this is, but I'm sure it's not trivial, especially if the BGP daemon is tuned to retain any significant amount of the full BGP table (RAM/swap issues galore, I'd imagine). Granted, the article is titled "one way"... which is technically true, but it's a Rube Goldberg way of going about networking.
(n.b., OSPF is what people use instead of BGP within your own intranet, even when you have tens of thousands of boxes. It's called "Border Gateway Protocol" for a reason...)
(Sorry, I don't use docker so I can't actually make a constructive comment telling you what the canonical/right way of doing it is, but I can assure you, this is not it.)
Eh, BGP on local networks is common enough that it was novel maybe a decade ago. It's perfect for running on servers to announce /32 addresses upstream to ToR switches. OSPF is actually more heavyweight & complex since you have to carry link state, do neighbor elections, etc. Ref: NANOG talk in 2003: https://www.nanog.org/meetings/nanog29/presentations/miller....
You don't even need an AS from your RIR for BGP to be useful on internal networks, just pick one (or more) of the private ASNs and roll with it.
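"Just pick one of the private ASNs" is simpler than it sounds: RFC 6996 reserves 64512-65534 (16-bit) and 4200000000-4294967294 (32-bit) for private use, so any value in those ranges is safe for internal peering. A small sketch of the check:

```python
# Private-use ASN ranges per RFC 6996; anything in these ranges will never
# conflict with a publicly assigned AS number.

def is_private_asn(asn: int) -> bool:
    return 64512 <= asn <= 65534 or 4_200_000_000 <= asn <= 4_294_967_294

print(is_private_asn(64512))   # True
print(is_private_asn(65535))   # False: reserved, not private-use
print(is_private_asn(13335))   # False: a publicly assigned ASN
```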
Current best practice for internal networks (on the SP side at least) is to use OSPF to bootstrap BGP by enabling IP reachability for p2p & loopback addresses. After that customer / service routes are carried by BGP, not by OSPF. This is because BGP doesn't spam link state and has better policy knobs. You get a double win because OSPF converges faster with fewer routes and if you have redundant links your BGP sessions won't flap because the loopback addresses stay reachable. Ref: NANOG talk in 2011: https://www.nanog.org/meetings/nanog53/presentations/Sunday/...
CPU/RAM overhead is insignificant with a bgpd like BIRD or Quagga. They work for internet-scale routing (currently over 610,000 prefixes) with trivial load. An Atom C-series with a few GB of RAM can deal with internet BGP routing updates (CPU becomes significant for packet forwarding, not so much maintaining routing tables).
I'll take a boring 10-year-old routing setup -- iBGP on a cluster of route reflectors running BIRD, servers numbered out of a /24 per rack routed by a ToR L3 switch, each service using BGP to announce a /32 -- over new and exciting L2 overlay networks any day. Troubleshooting is easier without having to work through different layers of encapsulation, dealing with MTU, trusting a novel control plane for whatever overlay network you're using, etc.
We've got a DPDK-enabled VRouter that will run over 12Mpps on a C2758 (8 core) with a full BGP table.
And you are categorically wrong. This is a very good way to do it.
This isn't container weirdness. This is because networks got stuck in 2008. We still don't have IPv6 SLAAC. Many of us made the jump to layer 3 Clos fabrics, but stopped after that. My belief is because AWS EC2, Google GCE, Azure Compute, and others consider this the gold standard.
IPv6 natively supports autoconfiguring multiple IPs per NIC/machine automagically. This is usually on by default as part of the privacy extensions, so in conjunction with SLAAC, you can cycle through IPs quickly. It also makes multi-endpoint protocols relevant.
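For the curious, the classic SLAAC mechanism is mechanical enough to sketch: the host derives a modified EUI-64 interface identifier from its MAC (RFC 4291, Appendix A) and appends it to the advertised /64 prefix. The prefix below is from the 2001:db8::/32 documentation range, and the MAC is made up:

```python
# Sketch of SLAAC address derivation: flip the universal/local bit of the
# MAC's first octet, splice ff:fe into the middle, append to the /64 prefix.
import ipaddress

def eui64_interface_id(mac: str) -> int:
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02                                  # flip universal/local bit
    eui = octets[:3] + [0xFF, 0xFE] + octets[3:]       # insert ff:fe
    return int.from_bytes(bytes(eui), "big")

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    net = ipaddress.IPv6Network(prefix)
    return net[eui64_interface_id(mac)]

print(slaac_address("2001:db8::/64", "00:11:22:33:44:55"))
# 2001:db8::211:22ff:fe33:4455
```

(The privacy extensions mentioned above, RFC 4941, replace this stable identifier with periodically rotated random ones, which is exactly what lets a host cycle through addresses.)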
Containers and bad networking because of the lack of IP / container is a well-known problem, it's even touched on in the Borg paper, briefly:
One IP address per machine complicates things. In Borg, all tasks on a machine use the single IP address of their host, and thus share the host's port space. This causes a number of difficulties: Borg must schedule ports as a resource; tasks must pre-declare how many ports they need, and be willing to be told which ones to use when they start; the Borglet must enforce port isolation; and the naming and RPC systems must handle ports as well as IP addresses.

Thanks to the advent of Linux namespaces, VMs, IPv6, and software-defined networking, Kubernetes can take a more user-friendly approach that eliminates these complications: every pod and service gets its own IP address, allowing developers to choose ports rather than requiring their software to adapt to the ones chosen by the infrastructure, and removes the infrastructure complexity of managing ports.
But, I ask, what's wrong with the Docker approach of rewriting ports? Reachability is our primary concern, and unfortunately BGP hasn't become the lingua franca for most networks ("The Cloud"). I actually think ILA (https://tools.ietf.org/html/draft-herbert-nvo3-ila-00#sectio...) / ILNP (RFC6741) are the most interesting approaches here.
It requires that you rewrite the software trying to talk to that port, to make it aware that you've put the new port number in a special environment variable.
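To illustrate what that rewrite looks like in practice (the variable name `SERVICE_PORT` is a made-up convention here, not a Docker standard): with port mapping, a client can no longer assume a well-known port and has to resolve the published one at startup.

```python
# Hedged sketch: under port mapping, the port a service is actually
# reachable on is injected via the environment, so every client needs
# a lookup like this instead of a hard-coded constant.
import os

def service_port(default: int = 30000) -> int:
    """Resolve the port the service was actually published on."""
    return int(os.environ.get("SERVICE_PORT", default))
```

With per-pod IPs, by contrast, the service just listens on its well-known port and this whole layer of indirection disappears.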
What do you think of them?
I hadn't considered using Virtual IPs to reverse out port-mapping. I guess it would work provided you have good connectivity between hosts - it would be a nightmare to try to configure a firewall where the actual ports in use jump around all the time.
Also such schemes require that you know in advance which ports each component listens on, and that there are no loops in the graph. Both of these requirements can be constraining.
You can deploy this on any machine (container or not) and have it always reachable from other members of the same network, which could be e.g. servers on different providers (AWS, Azure, Digital Ocean, etc)
(and maybe also that you are affiliated with them)
Sorry, I should have made it explicit. Since it's my own repo, and my profile's email address gives away that I'm part of Wormhole, I didn't think to add a statement to the comment; but you're right.
Stipulating that you need a routing protocol here (you don't, right? You can do proxy ARP, or some more modern equivalent of proxy ARP), there's a whole family of routing protocols optimized for this scenario, of which OSPF is the best-known.
Opinions vary whether this is a real concern, or just a way for the networking team to maintain their relevance.
All Calico cares about is that routes are distributed across the various systems; it doesn't necessarily care how you do it (configure BIRD however you'd like).
BGP is surprisingly simple and easy to set up with BIRD. Set up a route reflector, or have local hosts on the same L2 all peer with each other, and suddenly you can route whatever IPs you want by announcing them to your peers.
Why do people think BGP is complicated?
Read your own paragraph before this question. Why do I need to run another process to exchange routes and configure a mesh or a route reflector? As an admin that's just another mess of processes and communication to worry about.
Just because BGP is easy for you does not mean it's easy for most server admins and devs without heavy networking backgrounds.
> As an admin that's just another mess of processes and communication to worry about.
Yet we fully expect admins to understand and build HA redundant clusters for databases, or how to manage and update all the machines under their control, and a variety of other tasks.
There is nothing inherently different about running a BGP speaking daemon. It's all config.
I don't have a heavy networking background at all. I'm a software engineer that's currently working as a system architect, but even I can understand something as simple as a route distribution system.
This isn't complicated, it's config management. You can ignore 99% of what BGP can do in this use case.
>A Linux container is a process, usually with its own filesystem attached to it so that its dependencies are isolated from your normal operating system. In the Docker universe we sometimes talk like it's a virtual machine, but fundamentally, it's just a process. Like any process, it can listen on a port (like 30000) and do networking.
A container isn't a process. It's an amalgamation of cgroups and namespaces, and it can contain many processes. Hell, use systemd-nspawn on a volume that contains a Linux distro and your container is basically the entire userspace of a full system.
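You can see the namespace side of this directly on any Linux box: each process's namespaces appear as symlinks under /proc/&lt;pid&gt;/ns, and two processes are "in the same container's network", say, exactly when their `net` links match. A small sketch:

```python
# Sketch (Linux-only): inspect a process's namespaces. Processes sharing a
# namespace see identical identifiers in these symlinks.
import os

def namespaces(pid: str = "self") -> dict:
    """Map namespace name -> identifier for a given process."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}

if os.path.isdir("/proc/self/ns"):
    print(namespaces())  # e.g. {'net': 'net:[4026531992]', 'pid': ...}
```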
>But what do I do if I have another computer on the same network? How does that container know that 10.0.1.104 belongs to a container on my computer?
Well, BGP certainly isn't a hard requirement. Depending on how you've setup your network, if these are in the same subnet and can communicate via layer 2, you don't need any sort of routing.
>To me, this seems pretty nice. It means that you can easily interpret the packets coming in and out of your machine (and, because we love tcpdump, we want to be able to understand our network traffic). I think there are other advantages but I'm not sure what they are.
I'm not sure where the idea that calico/BGP are required to look at network traffic for containers on your machine came from. If there's network traffic on your machine, you can basically always capture it with tcpdump.
> I find reading this networking stuff pretty difficult; more difficult than usual. For example, Docker also has a networking product they released recently. The webpage says they're doing "overlay networking". I don't know what that is, but it seems like you need etcd or consul or zookeeper. So the networking thing involves a distributed key-value store? Why do I need to have a distributed key-value store to do networking? There is probably a talk about this that I can watch but I don't understand it yet.
I think not at all understanding one of the major players in container networking is a good indication it might not yet be time to personally write a blog about container networking. Also absent is simple bridging.
Julia generally writes fantastic blogs, and I know she doesn't claim to be an expert on this subject and includes a disclaimer about how this is likely to be more wrong than usual, but I feel like there was a lot of room for additional research to be done to produce a more accurate article. I understand the blog is mostly about what she has recently learned, and often has lots of questions unanswered... But this one has a lot of things that are answered, incorrectly :(
I think the best approach is to constructively comment and engage to improve the article. The important thing is to do this in a positive manner so that the author feels they have done something of value and started a conversation. It is surprisingly difficult to do this, and I've certainly failed on occasion, but it is definitely worth trying.
Honestly, I struggle with when to publish things a lot -- I practically never write about things I understand well, but I do usually write about things that I understand a little better than this post. Consider it an ongoing experiment :)
I really appreciate factual corrections like "a container isn't a process", and I think comment threads like this are a good place for that. I fixed up a few of the more egregious incorrect things.
I've been requesting this feature from the EC2 team at AWS for some time, to no avail. You can bind multiple interfaces (ENIs) to an instance (up to 6, I think, depending on the instance size), each with a separate IP address, but not multiple IPs to a single interface.
BGP, flannel, vxlan, etc. are IMO a waste of cycles and add needless complexity to what could otherwise be a very simple architecture.
Full disclosure: this is ours.
Currently Amazon, Google, and Azure have no native IPv6 support.
Also many are allergic to the security implications. You have to be rigorous with ip6tables and making sure everything speaks SSL or another encrypted protocol using authentication in both directions. Many things do not support SSL or don't support bidirectional auth.
Personally I doubt overlay networks are going away. Most backplane software like databases, distributed caches and event servers, etc. offers literally no security because it's all built with the assumption that it will run on a secure backplane. I have personally railed against this for years but I've found that it's a total waste of breath.
Even my residential cable provider gives me a /60. Anything less seems absurd.
The bigger issue is the need for a secure backplane, which will remain until all server software authenticates all sockets in a strong way.
I'm surprised that the bar isn't set a bit higher inside the cloud provider infrastructure for tenant separation at the network level. I suppose it boils down to the lack of assurance at an even lower level (who trusts Xen these days?) that seems unlikely to be fixed in the short term.