

Announcing Docker 1.7: Multi-host networking, plugins and orchestration updates - eloycoto
https://blog.docker.com/2015/06/announcing-docker-1-7-multi-host-networking-plugins-and-orchestration-updates/

======
hmottestad
Awesome awesomeness. I read a blog post here about VXLAN adding some 10%
overhead in terms of both latency and bandwidth, which seems very acceptable to
me :)

CoreOS has had this for a while in flannel (see the article linked below), so
it's not like it's a completely new thing, but if you prefer going with the
crowd, then having overlay networking built right into Docker will make for an
awesome experience when using Docker Compose and Docker Swarm.
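
For anyone who wants to try it, the 1.7 experimental channel exposes this
through the new `docker network` commands; a rough sketch from memory (the
flags were still in flux at the time, the network name is made up, and you
need a key-value store like Consul backing the cluster):

    # Create an overlay network spanning multiple hosts, then attach a
    # container to it (experimental 1.7-era syntax; details may differ)
    $ docker network create -d overlay multihost
    $ docker run -d --publish-service=web.multihost nginx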

Edit:

Article:
[http://www.generictestdomain.net/docker/weave/networking/stu...](http://www.generictestdomain.net/docker/weave/networking/stupidity/2015/04/05/weave-is-kinda-slow/)

And the results: Flannel's VXLAN backend got 96.55% of baseline bandwidth at
140.52% of baseline latency. Without the overlay, latency was 91.8µs; with
VXLAN it was 129µs.

~~~
Nrsolis
I can't say I'm particularly happy with the choice of VxLAN as a tunneling
technology but I'm not particularly surprised either.

~~~
Nrsolis
THIS is what I'm talking about:

[http://cloudarchitectmusings.com/2013/01/03/word-of-caution-...](http://cloudarchitectmusings.com/2013/01/03/word-of-caution-about-overextending-the-use-of-vxlan/)

VxLAN is a technology with a very limited horizon in terms of functionality.
If you're intending to really leverage containers in a super dense use case,
then you'd be well advised to adopt a technology that has ready-made ASIC
support for crossing the physical-to-virtual network boundary.

~~~
wmf
That article says VXLAN isn't good for DCI (data center interconnect), but DCI
isn't a good idea anyway[1] and most people aren't trying to use Docker that
way. And VXLAN is pretty much the only encapsulation format that has good ASIC
support in both NICs and switches.

[1] [http://blog.ipspace.net/2014/10/vxlan-and-otv-saga-continues...](http://blog.ipspace.net/2014/10/vxlan-and-otv-saga-continues.html)
[http://blog.ipspace.net/2013/09/sooner-or-later-someone-will...](http://blog.ipspace.net/2013/09/sooner-or-later-someone-will-pay-for.html)
[http://blog.ipspace.net/2015/01/latency-killer-of-spread-out...](http://blog.ipspace.net/2015/01/latency-killer-of-spread-out.html)

~~~
Nrsolis
Uh, no.

MPLS has had SOLID ASIC support for maybe two decades now. Everyone who is
putting in VxLAN-capable hardware is doing so because they have oodles of ESXi
deployed and VxLAN is the only technology that's supported by VMware.

If you're not wedded to the ESXi hypervisor, you can deploy either MPLS or
MPLS-over-GRE-over-IP (for your non-MPLS-capable endpoints) and capture a whole
stack of goodness, including interworking across an entire WAN infrastructure
for DC-to-DC VM movement and EVPN support.

MPLS has been doing L2VPN support for ages. Man, NSX/VxLAN didn't even have a
real control plane until recently. They were going to use MULTICAST for their
table updates! MULTICAST!

~~~
jpgvm
So much FUD.

First of all, VxLAN is already nicely offloaded by the majority of NICs, and if
it isn't being terminated on a switch it doesn't require hardware support in
switching equipment.

VxLAN carries almost no lock-in. Currently, to do MPLS in any shape or form you
are going to be locked into proprietary network gear.

MPLS has almost no software implementations, whereas Linux has native VxLAN
support in vanilla bridging mode and OVS.

Forget MPLS on Windows.

MPLS is great tech, don't get me wrong; however, it's restricted to the carrier
space because it doesn't play well with others.
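
To illustrate how accessible the native Linux support is, here's a minimal
sketch with iproute2 (the interface names and the VNI are invented):

    # Create a VXLAN interface (VNI 42) riding over eth0 on the standard port
    $ ip link add vxlan42 type vxlan id 42 dev eth0 dstport 4789
    # Bridge it together with your container/VM interfaces
    $ ip link add br42 type bridge
    $ ip link set vxlan42 master br42
    $ ip link set vxlan42 up && ip link set br42 up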

~~~
Nrsolis
You're thinking of a very narrow use-case for VxLAN.

What happens when your VM needs to talk to legacy networking gear? What
happens when your VM(s) need to talk to a legacy security infrastructure
that's mandated because of regulatory concerns? What happens when you need to
route between two VxLAN domains?

You're going to need to provide the VxLAN VTEP functionality someplace and
that requires hardware support in the network _someplace_. Waving your hands
and saying "that's not a concern" won't cut it. I have a customer that's
facing this issue RIGHT NOW.

And no, MPLS isn't a proprietary standard. It's open and it's available on
oodles of networking hardware. It's interoperable and it's a proven technology
that scales.

Developing a tunneling protocol is the _easy_ part. Developing a scalable
control plane to handle RIB and FIB state is another matter entirely. Getting
all of that to operate on a chipset that can do 2-4Tbps per slot is another
thing altogether.

Go check out some of the HUGE cloud infrastructure guys. They aren't using
BrandX networking hardware to build their data-centers. They're using the
Big-3 players.

~~~
jpgvm
No, they are not using the Big-3 players.

ISPs are still using Alcatel and Juniper, sure, and big enterprise shops are
still swearing by Cisco.

Big cloud infrastructure has all moved to Linux on merchant silicon, either
stuff like Quanta and Pica8, or even more DIY like OCP.

As for terminating VxLAN on legacy gear, yeah that is probably a bad idea and
someone probably made a bad decision to end up with an architecture that
requires that.

As for the control plane: most people using VxLAN at scale have their own
control planes. They are actually stupidly easy to build because the edge is so
easy to work with. I built my own that integrates with the Linux native VxLAN
implementation, using netlink to program the forwarding table.
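
What a control plane programs over netlink you can also do by hand with
iproute2's bridge tool; a rough sketch (the MAC and VTEP addresses are
invented, and vxlan42 is a VXLAN device set up as in the earlier sketch):

    # Tell the kernel which remote VTEP hosts a given MAC
    $ bridge fdb add 02:42:ac:11:00:02 dev vxlan42 dst 192.0.2.11
    # The all-zeros entry acts as a flood destination for unknown MACs
    $ bridge fdb append 00:00:00:00:00:00 dev vxlan42 dst 192.0.2.12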

At the end of the day VxLAN isn't a great replacement for MPLS but it's a
great encapsulation system for fully software defined datacenters that have
already invested into a full control plane for all compute and networking
(think Mesos, Kubernetes).

It's also a good choice for smaller scale stuff because it scales down nicely.
Multicast forwarding might suck in a real DC situation but for a lot of
newbies playing around it's a good way to get into doing L2oL3.
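
For example, the multicast flood-and-learn mode needs no control plane at all;
a one-liner sketch (group address and names invented):

    # Unknown MACs and broadcast get flooded to the multicast group,
    # and the kernel learns remote MACs from what it receives
    $ ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth0 dstport 4789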

I don't think the WAN cases matter that much. At the point where it's a problem
you have the people around to make said problem go away. Specifically, trying
to do cross-DC IP address mobility is dumb in the first place. Most models that
actually work well cross-datacenter simply terminate the workload in one place
and bring it up on an entirely new set of resources in the new DC. This is much
easier and is usually shared-nothing (except maybe the dataset, but that is
usually stored in S3/HDFS/Ceph/some other distributed store anyway).

Long story short, it does its job fine. Use it for something it wasn't built
for and yes, it will hurt you.

~~~
Nrsolis
You're downvoting me but you're so wrong on so many points I don't know where
to begin.

1\. Cisco, Juniper, Alcatel, and Arista all have merchant silicon platforms.
They marry those to their own control planes because the customers want that
kind of continuity and support. Pica8 and the like are a FRACTION of the
marketplace for switching (including datacenter switching). My own company has
so many racks full of gear at AMZN and GOOG that it's hard to understand why
you feel like you know that infrastructure better.

2\. Writing a scalable control plane isn't as easy as you are representing.
GOOG took many years to write theirs and it's still problematic for them in
lots of their use cases. I don't know why you think that quite literally the
entire Internet, along with all of the protocol work that has built it, is a
drag on novel architectures.

3\. Mesos/Kubernetes are vanishingly small parts of the larger ecosystems out
there. You can't just wave your hands and disregard quite literally the
billions of dollars of infrastructure that are deployed each and every year by
major service providers and corporations. Tiny cloud providers are not the
largest slice of the pie when it comes to dollars spent on networking gear
right now.

4\. "At the point where it's a problem you have the people around to make said
problem go away." HUH???? I can think of maybe 8 or 9 use-cases off hand that
absolutely REQUIRE this kind of functionality because billions of dollars of
transactions are handled by the infrastructure within and between those
datacenters.

Sorry buddy. You're way wrong here. Either you've never built an
infrastructure that anyone cares about losing for an hour or two while you
figure out what went wrong or you're so tanked up on "cloud" kool-aid that
you've forgotten how we got to this point and have failed to understand how
large systems scale up from small ones.

Show me a large multinational bank that's storing their transaction data on S3
or Ceph and I'll show you a bank that's not "systemically important". Most
large enterprises don't have the luxury of simply discarding 100% of their
working, proven architectures on the promise of a few small startups hoping to
cash in big.

Things like operational stability, redundancy, fault isolation, and monitoring
are not "nice to haves"; they are mission-critical requirements. They aren't
subject to the whim of some bright-eyed CTO, but to the watchful eye of umpteen
nations of regulators.

~~~
parasubvert
_GOOG took many years to write theirs and it's still problematic for them in
lots of their use cases._

I think that's a creative interpretation of the facts. GOOG is operating at
enormous scales pushing the limits of operational knowledge. I think it's
quite acceptable and natural that vendors are packaging up their learnings
from 2007 into products now for most enterprises (which are starting to really
eat them up).

 _" Mesos/Kubernetes are vanishingly small parts of the larger ecosystems out
there. You can't just wave you hands and disregard quite literally the
billions of dollars of infrastructure that are deployed each and every year by
major service providers and corporations. Tiny cloud providers are not the
largest slice of the pie when it comes to dollars spent on networking gear
right now."_

While I agree with your broader point about VxLAN vs. MPLS (I think), the above
isn't even wrong. Mesos/Kube aren't vanishingly small; that would imply they're
shrinking, rather than being small startups/projects growing at astonishing
rates. You're also confusing billions of dollars of low-to-mid-margin hardware
with potential billions of dollars of mid-to-high-margin software that's aiming
at IBM, HP, CA, Oracle, and Microsoft's application servers and management
tooling.

Secondly, Mesos/Kube aren't cloud providers; they're the startups that
represent the next generation (along with Cloud Foundry, OpenShift, and
whatever Docker comes up with) of data center operating systems that are going
to run the bulk of enterprise systems the way VMware does today.

That said, there's a belief that all of this requires SDN/Overlay Networking
like NSX or VxLAN that will magically fix network problems by bundling it with
the app platform and waving a wand. Here I agree ... they won't. The secret
behind good software defined networking is solid hardware defined networking
;)

_Sorry buddy. You're way wrong here. Either you've never built an
infrastructure that anyone cares about losing for an hour or two while you
figure out what went wrong or you're so tanked up on "cloud" kool-aid that
you've forgotten how we got to this point and have failed to understand how
large systems scale up from small ones._

I dunno. Stepping back, I remember James Hamilton from Amazon at re:Invent
clearly aiming directly at network vendors as the last bastion of costly,
proprietary mainframe-style thinking that will be commodified by software-
defined cloud services on commodity hardware. It will take time. But they're
pretty jazzed about it.

 _" Show me a large multinational bank that's storing their transaction data
on S3 or Ceph and I'll show you a bank that's not "systemically important"._

Ceph, I agree.

Amazon OTOH has won the object storage game. S3 is the de facto API for all
object storage now, whether it's from EMC, NetApp, etc. And mission-critical
banks are definitely using it, at humongous scale. I have no idea why you'd
think S3 is appropriate for transactional data; it's an object store.

~~~
Nrsolis
Right now most of my focus is in the financial industry. I would say "in the
US" but the truth is that my clients are multi-national behemoths. I'm
regularly on plane flights to EMEA and APAC.

Believe me when I tell you that they are still running systems that were
around in the 70's. They have a significant investment in code that can only
properly run in a mainframe environment and isn't going to get thrown away
anytime soon.

This is not to say that they don't have any interest in "cloud" technologies...
quite the contrary... they are deploying just about ALL of them: ESX,
OpenStack, CloudStack, etc.

But what often emerges as a barrier to deployment are the operational details:
things like upgrades to infrastructure, minimization of downtime, security, and
integration with the rest of the network and computing infrastructure.

They don't have the luxury of starting from scratch and they certainly can't
just forget about how to make things work with their larger infrastructure.
Does OpenStack even have a way to upgrade from Juno to Kilo with ZERO
downtime? Questions like that are a huge part of the testing and design that
go into their thinking.

These guys spend $1B EVERY YEAR on computing.

And here is ANOTHER barrier: they can't readily do business with startups. It
just doesn't work for them. They can't accept that a critical part of their
infrastructure might depend on the fortunes of a group of maybe 50 people being
successful.

And it's not enough to say "well they have the source code" and can support it
themselves. That doesn't work for them when the auditors come out and need to
identify WHO is responsible for taking care of support and the lifecycle of
the code. They write the code for their applications; you can't expect them to
code big parts of their OS too.

SO... please take my comments in the spirit they are intended. You can't be
successful unless you are able to sell your solutions to the broader market
that includes lots of customers that aren't GOOG or AMZN.

Don't forget staffing either. If your infrastructure requires a CS PhD to
support/upgrade, you're going to have a hard time selling it out there.
Handling of outages tends to be business-specific so NoOPS style models don't
work everywhere. It's fine for FB, but not NYSE.

As for storage, EMC is the standard. Transactional databases are in far more
places than you'd expect. If your app has scaling issues with accessing a non-
virtualized database or datastore, then that's going to be a problem for you.
If your OS can't handle redundant datapaths or confuses the Ops people about
which piece of physical hardware is causing the issues you're seeing at the
virtual layer, then you're going to have even more problems.

So I'm not drinking the Kool-Aid just yet. I love virtual infrastructure but
there are still too many open questions that need answers before it's going to
be a complete solution for mission critical stuff.

~~~
parasubvert
_" Right now most of my focus is in the financial industry. I would say "in
the US" but the truth is that my clients are multi-national behemoths. I'm
regularly on plane flights to EMEA and APAC."_

I'm in a similar situation, though more North America focused.

 _" Believe me when I tell you that they are still running systems that were
around in the 70's. They have a significant investment in code that can only
properly run in a mainframe environment and isn't going to get thrown away
anytime soon."_

Yep, I agree. Though I've witnessed at least one that actually ditched the
mainframe completely... for SAP core banking. It wasn't pretty.

 _" Does OpenStack even have a way to upgrade from Juno to Kilo with ZERO
downtime?"_

I'm not one to defend OpenStack. :)

 _" And here is ANOTHER barrier: they can't readily do business with startups.
It just doesn't work for them. "_

The brokerages in particular have a long history of working with startups at
certain layers of the stack. Retail banks, I tend to agree with you, but it
really depends.

Startups of a certain size and maturity (100+ people, a few years old) in many
cases tend towards even better support than larger companies, because they
actually CARE about the outcome and aren't bogged down by the bureaucracy of
fighting divisions. (How many times does the pre-sales team have to fly in to
fix the screw-ups of the consulting group, or vice versa? etc.)

Every stodgy bank on the planet wants to work with Docker (150 employees now
btw), for example, once they have a product to sell.

 _" And it's not enough to say "well they have the source code" and can
support it themselves. That doesn't work for them when the auditors come out
and need to identify WHO is responsible for taking care of support and the
lifecycle of the code. They write the code for their applications; you can't
expect them to code big parts of their OS too."_

I'm not sure anyone is realistically expecting that. Pivotal (not exactly a
startup at 1500+ employees, but sometimes feels like one) for example supports
the OS inside Cloud Foundry for the customer, providing patches, upgrades,
minimal downtime rolling updates, etc. It's all open source but has 24x7
enterprise support.

 _" Handling of outages tends to be business-specific so NoOPS style models
don't work everywhere. It's fine for FB, but not NYSE."_

I'm not sure I agree here. Having an operating platform handle self-healing and
auto-recovery is sort of standard with VMware DRS/HA (admittedly not everyone
runs with it turned on). All these cloud platforms are doing is similar stuff
for load-balanced application containers and the VMs they run on. I think this
really marks a major shift away from bespoke CMDB-driven "how do we recover the
service? get 20 people on a concall" towards systems that reorganize
themselves. Yes, there is a legacy that's not going to get this and needs its
small armies... but we've seen shifts away from that before when Java and .NET
hit the market.

 _" Transactional databases are in far more places than you'd expect."_

I'd expect them to be everywhere and anywhere.

 _" I love virtual infrastructure but there are still too many open questions
that need answers before it's going to be a complete solution for mission
critical stuff"_

There's a difference between virtual infra and cloud. Virtual infra already
handles mission-critical stuff in most of the world. Yes, plenty of bare metal
and big iron too, but that's actively shrinking. Cloud runs lots of mission-
critical stuff too, but not with companies born prior to 1990... though that's
changing. I agree there is a risk calculation to be made here, but I don't
believe it is going to take more than a few years. We are talking orders of
magnitude in economic difference for time/cost in many cases. Speed with safety
is addictive. Most believe in the speed of cloud; the safety part we're
verifying now as an industry, and it's happening everywhere.

------
bkeroack
Rather than run all network traffic through their single daemon, my feeling is
that it's better to abstract your containers into pods and discoverable
services the way Kubernetes does.

Increasingly my feeling regarding Docker, Inc. is "too little, too late". They
seem to be chasing every market (enterprise, startup, developer) and are
therefore mediocre at a lot of things while expert at none.

~~~
shykes
Docker's new networking model is exactly what you describe. In fact it was
developed with a lot of feedback from the Google team, so if you like
Kubernetes's networking model you'll be in known territory.

It's true that the Docker daemon bundles these functionalities by default. But
under the hood, the networking system is actually a separate binary called
"dnet", so it's going to be very easy to rip it out and make the daemon less
monolithic.

------
jpallen
The volume plugins look very interesting for our use case, but I can't find any
documentation about how to actually write a plugin. Do I have to reverse
engineer it from the Flocker example provided?

Edit: Nevermind, I found it
([https://github.com/docker/docker/blob/master/experimental/pl...](https://github.com/docker/docker/blob/master/experimental/plugin_api.md))!
Three clicks in, through various blog articles with lots of other links. Maybe
consider making this more obvious?

------
general_failure
Still no user namespace support? I don't get how one can use this for
production websites without it, especially if you run arbitrary containers from
the Docker registry. Or is this not the suggested model anymore?

~~~
davexunit
Do any other container projects support user namespaces? I'm curious about
what the implementation would look like. Several things change when a user
namespace is added to the mix, such as the container not being able to create
new device nodes. Would it be possible for unprivileged users to create
containers that had network access or is a daemon running as root still
necessary?

~~~
wmf
pflask (which I don't think anyone has heard of) and the latest version of
nspawn.

~~~
davexunit
I have heard of pflask. It's been a great resource for learning how containers
really work, but I haven't been able to figure out how the user namespace stuff
works. A container may have N users, root and a bunch of others; do they all
need to be mapped to users on the host? I just don't know how to manage it.
Enlightenment appreciated.

~~~
ghedo
pflask author here (I'm a bit surprised to see it mentioned here really).

To answer your question, no, you don't need to map all the users inside the
container to users on the host.

pflask user namespace support is quite limited right now: with the --user
option you tell pflask to map the outside user that is running pflask to the
inside user specified by the option. Let's say you run something like:

    
    
        $ sudo pflask --user=some_user ...
    

Since you are running pflask as root (sudo ...), pflask will map the "root"
user outside of the container to the "some_user" user inside the container.

The whole point of this feature was to make it possible to run pflask as non-
root: you can map a normal user on the host to the root user inside the
container and still be able to call mount() (although there are several
limitations). So it's only possible to map one user right now; however, it
shouldn't be difficult to add another option to map additional users (feel
free to open a GitHub issue if you need this).
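
Under the hood this all boils down to writing the uid_map/gid_map files for the
new namespace; a rough sketch of the unprivileged case (the PID variable and
IDs are illustrative):

    # Map uid 0 (root) inside the container to unprivileged uid 1000 on the
    # host; the file format is "inside-uid outside-uid count"
    $ echo "0 1000 1" > /proc/$CHILD_PID/uid_map
    # Newer kernels require denying setgroups before gid_map can be written
    $ echo deny > /proc/$CHILD_PID/setgroups
    $ echo "0 1000 1" > /proc/$CHILD_PID/gid_map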

~~~
davexunit
Thanks! I've been writing my own container implementation for the GNU Guix
project and your code has been a wonderful reference. Guix allows unprivileged
package management, so I was hoping that my container tool could offer
unprivileged containers via user namespaces.

------
shred45
They've also RUINED their docs pages. The typography hurts my eyes and the
link layout makes it a pain to find anything specific. Takes me 10 times as
long to find anything.

:(

~~~
falcolas
Yeah. I'm actually missing their previous documentation, which was much more
readable and had some links on the left hand side down to anchors within a
page. Now it's a bit of an inconsistent mess to find something.

------
falcolas
I'm sorry, I have to be a bit negative on this one.

\- No release notes! (as of 1 hour after story posted)

\- Network stack re-written. They kinda flubbed networking the first time,
creating asymmetric routing and double-NAT conditions. And within one release,
there's a complete re-write with a new networking model? This is going to take
a lot of proving to ensure that it's actually going to work for production
systems. In the meantime, we at least are going to continue using --net=host
for our containers.

\- Another storage format. Can we simply get one that's stable, please?

\- Volume re-write. Ditto networking.

\- [EDIT2] Disregard the previous edit entirely - now you're simply disallowed
from using devicemapper entirely if you're using the _officially compiled
Docker binary_ on Ubuntu 14.04 and 15.04? [4] What the actual fuck is going on
here, Docker? AUFS has significant performance issues (not even mentioning the
deprecated part), OverlayFS requires a release-candidate Linux kernel, ZFS as a
storage backend is brand spanking new, and btrfs is as stable as a three-legged
chair. By the way, this also affects CentOS and RHEL.

[EDIT]: 1.7 appears to have fixed the superficial problem - using the wrong
devicemapper drivers. Would still prefer to have a proper package.

[ORIGINAL] The whole "devicemapper on Ubuntu 14.04" [1] snafu appears to still
be in full force [2]. Why can't they offer properly compiled OS packages?
They're distributing them as OS/architecture-specific packages...

Note - AUFS is not a real choice for (at least) Node applications; we ran into
an issue back on 1.6 where there was a low-level mutex limiting concurrency on
AUFS which did not appear on devicemapper. [3]
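
If you want to check or pin what you're actually running, a quick sketch
(1.7-era daemon syntax; driver availability depends on your kernel and
distro):

    # See which storage driver the daemon selected
    $ docker info | grep "Storage Driver"
    # Force a specific driver when starting the daemon
    $ docker -d --storage-driver=devicemapper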

-

[1]
[https://github.com/docker/docker/issues/4036](https://github.com/docker/docker/issues/4036)

[2]
[https://github.com/Capgemini/Apollo/issues/315](https://github.com/Capgemini/Apollo/issues/315)

[3]
[https://github.com/docker/docker/issues/13268](https://github.com/docker/docker/issues/13268)

[4]
[https://github.com/docker/docker/issues/14035](https://github.com/docker/docker/issues/14035)
(look down for "vbatts" comment with the bolded summary header)

~~~
zupancik
Bit new to Docker - so does this mean Docker 1.7 won't work with RHEL?
Specifically, I'm working on RHEL 6.6 machines, and I'm already stuck with
Docker 1.5, so I'm curious to know whether this will affect me.

~~~
falcolas
I'm not running RHEL myself, so I can't authoritatively speak about it. The
backstory is that Docker, as statically compiled and released, uses the 1.02
release of the devicemapper driver, which has a bug in it. The OS has a 1.02.1
version (and has for some time; it was patched in Dec 2013), but you have to
make your own Docker build which will dynamically link to it.

Take a look at your version of libdevmapper and you should be able to see if
there will be a problem; what I've heard is that CentOS and RHEL also ship a
fixed version of devicemapper, which causes the mismatch and creates the
problem outlined in the #4036 issue.
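
Concretely, you can compare the two versions side by side; a quick sketch
(output formats vary by distro):

    # The device-mapper library version your OS ships
    $ dmsetup --version
    # The library version the Docker binary was built against: look for
    # "Library Version" under the devicemapper section of the output
    $ docker info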

The ideal fix is to actually compile Docker on an Ubuntu (RHEL) OS when
building the Ubuntu (RHEL) packages, but what's actually happening is that it's
being built in a minimal Linux container, which links it against a different
version.

------
andyl
I can't keep up with the churn. Will come back to Docker in a year or two to
see if things have stabilized.

~~~
shykes
FWIW we add a lot of new features, but we break very few APIs. API stability
is extremely important to us, so you can actually build stable things on
Docker and trust that we won't break it down the road.

~~~
jpallen
I've had APIs subtly break on me between at least 2 releases (I think it's
actually 3, but one time may have been user error). The main culprits were
changing the parameters which are passed on container creation vs on container
start, and changing error messages/response codes.
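
For example, the kind of shift I mean, sketched against the remote API from
memory (field names and endpoints as I recall them, so treat this as
illustrative rather than exact):

    # Older API versions took HostConfig (binds, ports, etc.) on start:
    $ curl -s -XPOST -H "Content-Type: application/json" \
        --unix-socket /var/run/docker.sock \
        http://localhost/containers/<id>/start \
        -d '{"Binds": ["/data:/data"]}'
    # Newer versions expect it on create instead, which silently breaks
    # clients that still pass it on start:
    $ curl -s -XPOST -H "Content-Type: application/json" \
        --unix-socket /var/run/docker.sock \
        http://localhost/containers/create \
        -d '{"Image": "ubuntu", "HostConfig": {"Binds": ["/data:/data"]}}'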

~~~
cpuguy83
Yes, we have changed response codes when the old ones were just wrong. Like
returning a 500 when really it should be a 404, that kind of stuff.

As far as start/create params, these should not be changed at all, other than
that you can now pass options to create that you could once only pass on
start. If something has broken here, it was most likely a mistake and should be
reported on GH. Even if we change the underlying configuration, we do version
the API and make sure we are sending/receiving the same structs over the wire.

TL;DR: if something broke in the API, please report it.

