I hope all of these Docker overlay networks start using the in-kernel overlay network technologies soon. User-space promiscuous capture is obscenely slow.
Take a look at GRE and/or VXLAN and the kernel's multiple routing table support. (This is precisely why network namespaces are so badass, btw.) Feel free to ping me if you are working on one of these and want some pointers on how to go about integrating more deeply with the kernel.
It's worth mentioning these protocols also have reasonable hardware offload support, unlike custom protocols implemented on UDP/TCP.
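For anyone curious what the in-kernel approach looks like, here is a rough sketch with plain iproute2 - the device name, VNI and addresses are made up for illustration:

    # VXLAN device (VNI 42) riding on eth0, standard UDP port 4789
    ip link add vxlan42 type vxlan id 42 dev eth0 dstport 4789 local 192.0.2.10
    ip addr add 10.42.0.1/24 dev vxlan42
    ip link set vxlan42 up

    # teach the device about a peer VTEP (all-zero MAC = default flood entry)
    bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.20

    # keep overlay routes in their own routing table
    echo "100 overlay" >> /etc/iproute2/rt_tables
    ip route add 10.42.0.0/16 dev vxlan42 table overlay
    ip rule add from 10.42.0.0/16 lookup overlay

All of the encap/decap then happens in the kernel, and NICs that understand VXLAN can offload it.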
OpenContrail can be used as an overlay network for docker: the overlay is implemented as a kernel module and comes very close to the theoretical maximum iperf performance on a server with 2x10G links.
This script https://github.com/pedro-r-marques/opencontrail-netns/blob/m... can be used to associate any docker container created with "--net=none" with an overlay network. Better yet you get all the semantics of the OpenStack neutron API: floating-ip, dhcp options, source-nat, LBaaS.
The kernel module also collects flow records of all the traffic and there is a web-ui that can display the analytics of all the traffic flows in your network.
Install guide: https://github.com/Juniper/contrail-controller/wiki/OpenCont...
Support on freenode.net #opencontrail.
If you're going down the path of VXLAN support in Docker, I'd love to talk. The company I founded built a Linux distribution for commodity hardware switches that can do VXLAN encap/decap in hardware at 2+ Tbit/sec. The same configuration that works in a Linux container host or a hypervisor works on the switches.
You have to pick either kernel or user space, not both: either implement it purely in the kernel or purely in user space. In reality, pure user space is faster; just look at Snabb Switch https://github.com/SnabbCo/snabbswitch/wiki
Snabb and DPDK aren't magic though. Because they poll you have to dedicate a whole core to the vSwitch. Containers are a different case than VMs because the packets start in the kernel TCP/IP stack; to get into a userspace vSwitch they'd have to exit the kernel.
Since you seem to have some kernel expertise, do you know if there is an easy way (via an iptables/ebtables plugin or some such) to get packets to switch namespaces? It seems like you could do a whole lot with just simple kernel packet rewriting if you could have an in-container-namespace rule to jump into another namespace before routing. You could do some analog of this with a veth device, but it seems like it would be much faster to just switch the namespace.
"just switching namespaces" isn't easy, since a packet (in the kernel represented by an SKB) has to have an interface it came in on. The main role an veth pair has is to move the packet between namespaces, and to provide a new in interface, one that is visible in the new namespace.
Unless someone did something crazy, traversing a veth pair should just be doing a little bookkeeping on the SKB, no data copies at all.
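For reference, the veth dance is just a few iproute2 commands (names and addresses below are illustrative):

    # a network namespace standing in for a container
    ip netns add ctr1

    # create a veth pair and push one end into the namespace;
    # crossing the pair only rewrites skb->dev, no payload copy
    ip link add veth-host type veth peer name veth-ctr
    ip link set veth-ctr netns ctr1

    ip addr add 10.0.0.1/24 dev veth-host
    ip link set veth-host up
    ip netns exec ctr1 ip addr add 10.0.0.2/24 dev veth-ctr
    ip netns exec ctr1 ip link set veth-ctr up
    ip netns exec ctr1 ping -c1 10.0.0.1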
This looks like a great idea. For me this was a missing piece two months ago when playing with Docker.
However I have strong doubts about the network performance, not only the overhead of the UDP encapsulation (that should be quite small), but mostly the capturing of packets with pcap and then handling them in user-mode. Looks like a lot of context-switches, copying and parsing with non-optimal code paths. Are there any benchmarks available?
My feeling is that this will consume large amounts of CPU under moderate network load and thus be unusable with most NoSQL-type systems that benefit from clustering across hosts?
re benchmarks...publishing some is on the TODO list. See https://github.com/zettio/weave/issues/37. Informally, weave is pretty fast but it's not saturating Gbit Ethernet. As you say, capturing with pcap and handling packets in user space carries an appreciable overhead.
We've got some issues filed to look at pcap alternatives and also generally aim to improve performance.
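Anyone who wants an informal number in the meantime can measure the overhead themselves; something along these lines (hostnames and addresses are placeholders):

    # baseline: host-to-host over the physical network
    # on host B:
    iperf -s
    # on host A:
    iperf -c <hostB-ip>

    # then repeat with the iperf server and client running in containers
    # attached to the weave network, pointing the client at the server
    # container's weave-assigned IP; the difference between the two runs
    # is roughly the encapsulation + pcap/userspace cost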
re suitability for NoSQL clustering... depends on where the bottlenecks are; if you want to cluster for HA rather than scale, i.e. there aren't any real bottlenecks, then weave will work well. Same if you want to cluster because of CPU or memory bottlenecks. If, otoh, networking is the bottleneck then adding weave into the mix isn't going to improve matters.
Ok, good to know. I think the challenge in taking a route other than pcap is that you would need to do complex tricks with the existing network stack. Because, if I understand the way Weave works, you would really only need to do processing at the beginning of a connection and for some ARP requests etc., while you don't need to do anything to existing TCP streams apart from encapsulating and forwarding?
To retain the essence of how weave operates, this would likely not just be complex but impossible, short of kernel hackery.
> you would really only need to do processing at the beginning of a connection and for some ARP request
Weave needs to look at every Ethernet packet. Well, the headers at least. It's a virtual Ethernet switch. It doesn't even really know about IP, let alone TCP streams. See https://github.com/zettio/weave#how-does-it-work
This seems very nice. What would be the pros and cons of using Weave instead of Tinc? I have used Tinc for a while[0] and the end result looks very similar (i.e. there is not a nice command-line tool dedicated to using Tinc with Docker, but the high-level description matches).
Just having a look at Tinc... one difference is completely non-technical - Tinc is GPL but Weave is Apache licensed, so it aligns better with the whole Docker ecosystem. More comments to come.
RabbitMQ has always been an extraordinarily good piece of software, from the very first versions. I think this is very good news for anyone using Docker.
as someone who is more developer than ops, I feel like the docker stuff is still changing fast and that the way you would use docker today will be very different a year from now; but containers seem to be the way of the future. If I have no pressing need to change my server architecture, does it make sense to wait for things to settle, or would it be more beneficial to get in and learn now and experience the changes and why they were necessary?
Docker hopefully will be a little different in a year as it will offer solid security isolation (which is not the case now).
Right now, if you run something as root inside a Docker container and that application gets compromised, the attacker can break out of that container and alter the master (due to the way Docker links container-root and system-root, a root user in a container is effectively a root user on the whole system).
Docker is working on allowing containers to run entirely in user mode (thanks to improvements in LXC). This would mean that you can run a process as root within a container, and if it gets compromised there is near zero chance of leveraging that into damaging the master OS (since it will just have normal user privileges).
Here's an article about their progress (to usermode):
> However, it has been pointed out that if a kernel vulnerability allows arbitrary code execution, it will probably allow to break out of a container — but not out of a virtual machine.
In other words, right now, a root process is likely able to escape a Docker container. You can use SELinux, AppArmor, and similar to somewhat mitigate that when it happens, but none of them are nearly as powerful as having that usermode isolation on the master.
If Docker is able to get usermode containers working, it will be very difficult for a Docker container to either alter other Docker containers or the master system (other than over the network, maybe).
It has nothing to do with LXC, but with the Linux kernel. LXC simply bundles all of the Linux kernel's namespace features together into one shiny thing to make containers. Conceptually, docker, via libcontainer, does the exact same thing.
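You can poke at the relevant kernel feature directly with util-linux's unshare, no LXC or Docker involved (assuming a kernel that allows unprivileged user namespaces):

    # new user namespace, mapping the current uid to root inside it
    unshare --user --map-root-user bash

    # inside the namespace: id reports uid 0 ("root"), but any access to
    # files owned by the real root is still checked against the ordinary
    # uid you had outside, so e.g. writing /etc/shadow is denied
    id
    touch /etc/shadow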
It is definitely changing fast, but that is not a reason to wait before trying out the technology in order to learn about it. If nothing else, the community will welcome your feedback :-)
I'd like to see a post/writeup (or even an essay!) on how to do the magic "backing services" (http://12factor.net/backing-services) with Docker. And that's where I find Deis and other Docker orchestration systems lacking very much.
Sure, you can run MySQL in Docker, but it's a far cry from running it on native xfs with aligned partitions and whatever fancy configuration you feel like. And since docker containers are very reusable, whereas backing data should be persistent by default, my impression is that it's too easy to accidentally remove a docker container.
At the expense of sounding like I'm just plugging my own company, this is what we're working on in the open-source project Flocker (https://github.com/ClusterHQ/flocker). We think that data services like databases, queues and key-value stores, and anything else with state, should be able to run inside docker containers too. Yes, you can already run a database in a docker container, but from an ops perspective this is a nightmare and very far from what you want to be able to do in a production system. Would love your feedback on what we're building, and even more for you to get involved. Flocker is licensed under Apache 2.0, so feel free to get involved.
You say that running a database in a Docker container is "a nightmare and very far from what you want to be able to do in a production system," but you do not explain why. Perhaps you might substantiate such a claim?
Generally, you want to be able to answer these questions when it comes to operating your databases:
What are the failure points?
What is the impact of each failure point?
What are the SINGLE points of failure?
What is my recovery pattern?
What is my upgrade experience?
What is the operational overhead in the applications running ON the product?
What is my DR strategy?
What is my HA strategy?
Neither pure Docker nor any other tool in the docker/container ecosystem that we are aware of provides really good answers to these questions when it comes to databases. That is what I meant when I said running databases in containers is a nightmare. It is possible today, for sure, but it is extremely complex operationally, and that is why it is so rare to see databases running in containers in production today.
Nothing is stopping you from running MySQL in Docker on native xfs with aligned partitions: bind-mount whatever partition you want into the container by defining a volume in Docker.
This will be persistent, and will survive when you destroy the container. I use this to e.g. share a /home directory between the dozen experimental dev containers I use to run my various projects - each container ensures I keep track of the exact dependencies for each individual project, while I get to have a nice "comfortable" swiss-army-knife container with my dev tools and all project files.
I also run a number of database containers which use volumes where I bind mount host directories to ensure persistence so I can wipe and rebuild the containers themselves without worrying about touching data.
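For concreteness, the pattern looks something like this (paths and image names are just examples):

    # host directory on whatever filesystem you've tuned (e.g. your xfs partition)
    mkdir -p /data/mysql

    # bind-mount it into the container as the MySQL data directory
    docker run -d --name mydb -v /data/mysql:/var/lib/mysql mysql

    # wipe and rebuild the container; /data/mysql is left untouched
    docker rm -f mydb
    docker run -d --name mydb -v /data/mysql:/var/lib/mysql mysql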
Last time I tried was around 0.8, and aufs was not happy with xfs. I haven't tried having /var/lib/docker on ext4 and "/volumes" on xfs, I'll try if the need arises.
I run a few MongoDBs with volumes, but I'm not confident that I won't accidentally start two with the same volume, or that someone won't accidentally delete the volume, or .. or .. or.
As I've written to a sibling comment, I don't consider it a hard problem, but it hasn't been taken care of .. yet!
Am intrigued (and, again, showing my current early-stage understanding of LXCs), can you link the same data store container to multiple application containers? As in, have both a beta application and production application pulling data from the same core DB?
And do you simply define the container as a volume to ensure it stays persistent? That was the feeling I got from the docs, but again, might just be flagging how little I know at the minute...
Yes, you can definitely share volumes between any number of containers, it's a common usage pattern.
You don't need to make a container persistent: Docker, by design, will never remove anything unless you explicitly ask it to. If you want to separate the lifecycle of a directory within your container, so that it stays behind after you explicitly remove the container, or to share it between containers - that's when volumes are useful.
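A minimal sketch of the sharing pattern, using the data-only container idiom (names below are illustrative):

    # a container whose only job is to own the volume
    docker run --name dbdata -v /var/lib/postgresql busybox true

    # any number of containers can mount that same volume
    docker run -d --name db --volumes-from dbdata postgres
    docker run --rm --volumes-from dbdata ubuntu \
        tar czf - /var/lib/postgresql > pg-backup.tgz

    # "docker rm db" leaves the volume behind; it only goes away if you
    # explicitly remove it (docker rm -v) from the last container using it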
Unfortunately I haven't read enough 12fa, but I know I can address most of your questions with one factoid: Volumes. You are absolutely right that Docker containers are meant to be disposable, and should not contain backing data. That is what Volumes are for. I haven't done enough with volumes to give you a real primer on the use of them, but volumes can run on whatever backing store you want and they are not so intertwined with the container that they would be deleted along with it.
It looks like volumes have evolved significantly since the feature was introduced; you might want these links - sorry, I haven't reviewed them myself:
(I actually do keep my backing data in the containers, we have institutionalized backups where all of the important data is already kept in git anyway, so instance clones are in fact disposable for me even though they have all of the important backing data in them.)
I'm familiar with volumes, and here's how I see the problem:
On a docker host you have the docker daemon, whatever auxiliary stuff you need to orchestrate either the containers or the host (update docker itself, and so on), space for /var/lib/docker, and that's it. Volumes are always somewhere under /host/data. That means you have to make up a scheme and convention, cook up scripts, and add it all to your already quite dynamic mental model.
If you go and want to manage volumes, you need something for that. And currently everyone and their cats have their own solutions (because there is one they claim to use and one they use, and one they hack on to use later). I'm not claiming it's a hard problem, just that it's not taken care of yet.
Maybe Flocker will deliver, I haven't checked it since it was posted 5 minutes ago :)
Docker's design makes it incredibly easy - or at least it makes it difficult to treat your backing services as anything BUT attached resources, therefore forcing you into some sort of 12factor-esque design. There's no magic to "backing services".
So, basically, you're saying that you don't want to mess with that setup and want to use existing stuff... which you could achieve with anything, as long as someone else does the setup for you. I don't see where Docker is better for a developer in this case... providing a vagrant box is exactly the same.
Startup time for a docker container is way way faster than a VM. Also, you could run the exact binary state of production, which is helpful if you run into "works on my machine" types of problems.
Agreed, but my point was that Docker doesn't reduce dev setup time. Give a good vagrant config file to a dev and tell him to do vagrant up and you have the same result as what you're saying. You can replicate production state with vagrant too (and bash scripts if we stretch this) and avoid "works on my machine" problems.
I'm not saying that vagrant > docker. The way I see it, docker is great if your infrastructure is using it all the way. If your prod setup is not dockerized, using docker in dev seems to me counterproductive compared to spinning up a VM and provisioning it with ansible or puppet to achieve production replication. As @netcraft said, I don't see why I should "change my server architecture" to use docker in dev.
If you have a complex stack (multiple services, different versions of Ruby/Python/etc, DB, search engine, etc), it's a real pain to shove them all into a single VM. Once you have 2 VMs running, you have already lost to Docker on memory/space efficiency and start-up time.
I have yet to see real, complex, and distributed applications that share the exact same config in dev and production. I know that having the same versions of system libs in dev and prod can be a problem in some contexts and docker can help with that, but it's not the only solution and does not take care of the whole landscape (e.g., npm package.json, pip requirements.txt, etc.).
I totally agree that startup time of a container is far less than a VM, but I don't see how docker "removes all the trouble of running applications that you need for your development: databases, application servers, queues"
You still need to install, configure these services, make sure that the containers can talk to each other in a reliable and secure way, etc.
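The wiring part, at least, is fairly mechanical with stock images and container links; a rough sketch (image names and the "db" alias are just examples):

    # pull and start a stock postgres image - nothing installed on the host
    docker run -d --name pg postgres

    # the app container reaches it via the link alias "db": Docker injects
    # DB_PORT_5432_TCP_ADDR / DB_PORT_5432_TCP_PORT env vars and an
    # /etc/hosts entry for "db"
    docker run -d --name web --link pg:db myorg/web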
First, I'm a dilettante. I haven't used docker in production. I've really only set up a handful of containers.
That said, all of those fiddly library dependencies are where I struggle the most at work. If I could just build a docker image and hand that off, it would save me a lot of grief with regard to getting deployment machines just right.
I do have a great deal of experience with legacy environments, and it seems like the only way to actually solve problems is to run as much as possible on my machine. Lowering that overhead would be valuable. Debugging simple database interaction is fine on a shared dev machine. A weblogic server that updates oracle that's polled by some random server that kicks off a shell script... ugh. Even worse when you can't log into those machines and inspect what a dev did years ago.
If you've got a clean environment, there's probably not as much value to you.
I hear you about legacy systems. Two years ago, I had to support a Python 2.4 system that used a deprecated crypto C library and I did not want to "pollute" my clean production infrastructure. Containers would definitely help with this scenario. The thought never occurred to me that docker could be used to reproduce/encapsulate legacy systems, thanks!
At the company I work for, we went through all the trouble of getting our distributed backend application running in Vagrant using Chef so that we could have identical local, dev and production environments.
In the end, it's just so slow that nobody uses it locally. Even on a beefy Macbook Pro, spinning up the six VMs it needs takes nearly 20 minutes.
We're looking at moving towards docker, both for local use and production, and so far I'm excited by what I've seen but multi-host use still needs work. I'm evaluating CoreOS at the moment and I'm hopeful about it.
I don't see how Docker solves the speed problem without a workflow change that could already be accomplished with Vagrant.
* Install your stack from scratch in 6 VMs: slow
* Install your stack from scratch via 6 Dockerfiles: slow
* Download prebuilt vagrant boxes with your stack installed: faster
* Download prebuilt docker images with your stack installed: fastest
The main drawback of Vagrant is that afaik it has to download the entire box each time instead of fetching just the delta. That may not matter much on a fast network.
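A quick sketch of that difference (image/box names are made up):

    # Vagrant: updating a box means re-downloading the whole box image
    vagrant box update

    # Docker: pulling a new tag only fetches layers you don't already have;
    # the shared base layers (OS, runtime) are downloaded once and reused
    docker pull myorg/app:v2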
I have to disagree, although I'll admit that what's "trivial" is subjective. Sure, a container means you don't have to run another kernel. If the container is single-purpose like docker encourages, you skip running some other software like ssh and syslog as well. That software doesn't use much CPU or memory though. I just booted Ubuntu 12.04 and it's using 53MB of memory. Multiplied across 6 VMs that's 318MB, not quite 4% of the 8GB my laptop has. I'd call that trivial.
On the last project where I had to regularly run many VMs on my laptop, the software being tested used more than 1GB. Calling it 1GB total per VM and sticking with the 53MB overhead, switching to containers would have reduced memory usage by 5%. Again, to my mind that's trivial.
This is really interesting. I've been looking for a way to build in support for networking between Docker hosts in my clocker.io software, to simplify deploying applications into a cloud-hosted Docker environment. I'd been toying with adding Open vSwitch, but am going to try weave as the network layer in the next release. Will there be any problems running in a cloud where I have limited control over the configuration of the host network interfaces and the traffic they can carry, such as AWS only allowing TCP and UDP between VMs?
Question for weavenetwork: are containers addressable by hostname from other containers? Is there a good way to do that? I didn't see anything about it in the readme.
I suppose service discovery is out-of-scope for this project but having some sort of weave-wide hostsfile would certainly simplify it. Am I misunderstanding the project?
Weave itself does not provide addressability beyond IP. That is the situation now, but this area is very much high on the agenda for us - service discovery is definitely in scope for weave.
Meanwhile, two points of note:
1) In weave the IP addresses can be much "stickier" than in other network setups, i.e. moving a container from one host to another can retain the container's IP. That means it is quite amenable to relatively static name resolution configurations, e.g. via /etc/hosts files.
2) Since weave creates a fully-fledged L2 Ethernet network between app containers, name resolution technologies like mDNS that rely on multicast should work just fine.
So, in summary, while weave currently does not have any built-in service discovery, existing solutions and technologies for that should be relatively easy to deploy inside weave application networks, until weave itself grows these capabilities.
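So, until then, even something as crude as a shared hosts-file fragment works, precisely because weave IPs stay put (names and addresses below are invented):

    # appended to /etc/hosts in each app container (or baked into the image)
    10.2.1.10   db1.weave.local
    10.2.1.11   db2.weave.local
    10.2.1.20   web1.weave.local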
We are certainly aware of consul, and have indeed been thinking of weaving weave into it. Would love to see an experiment along those lines, if there are any volunteers.
The most significant conceptual difference is that in Rudder, sub-nets are tied to hosts, so containers on different hosts will always be on different sub-nets. By contrast, in weave, containers belonging to the same application reside in the same sub-net, regardless of which host they are running on. In other words, weave makes the network topology fit the application topology, not the other way round.
Rudder has a central configuration (via etcd); Weave communicates network changes across all nodes by itself - as long as each new node knows how to contact at least one other node, it will learn of all other nodes and connect to as many as it can.
Also, Rudder is Layer 3 while Weave is Layer 2, and Weave can encrypt traffic.