
Disclaimer: I work at AWS, but on a product which does not compete with Docker or its orchestration tools in any way shape or form. My opinions are my own.

I wouldn't even limit this to just the swarm feature. We've been running Docker in production for a year, using it in dev environments a year before that, and we've had major problems nearly every release. We had to upgrade directly from Docker 1.7 to 1.11 because every release in between was too unstable or had glaring performance regressions. We ended up running a custom build and backporting features which were worth the risk.

Speaking of 1.12, my heart sank when I saw the announcement. Native swarm adds a huge level of complexity to an already unstable piece of software. DockerCon this year was just a spectacle to shove these new tools down everyone's throats and really made it feel like they saw the container parts of Docker as "complete." One of the keynote slides literally read "No one cares about containers." I get the feeling we'll be running 1.11 for quite some time...



To provide some weight the other way: we've been using Docker in production for about 3 years now and have not had any big issues. Obviously, you guys probably have more extreme use cases at AWS. The things that bug us are generally missing features, but those gradually get added over the years, though some get less love than others.

For example, for some reason it's still not possible to ADD something and change its permissions/ownership in the same layer, which basically doubles the size of such layers.
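
Concretely, something like this (paths made up for illustration) ends up storing the files twice:

    ADD app.tar.gz /opt/app
    # chown can't be expressed as part of ADD, so this rewrites every file into a new layer
    RUN chown -R app:app /opt/app

Each instruction is its own layer, so the chown layer contains a second full copy of /opt/app.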

I wouldn't go as far as saying it's in any kind of a 'sad' state though. It's a neat wrapper over some cool Linux kernel features, and it's been that way since before 1.0.

I'm curious how you even get performance issues from Docker. Which feature caused performance problems for you?


Always fun to hear experiences from other production veterans. Glad to hear things are working well for you guys.

Our use case involves rapid creation and destruction of containers. Granted, this use case was pretty unheard of when we first adopted Docker, but it is becoming much more common.

Before Docker moved over to containerd, the docker daemon was riddled with locks, which resulted in frequent deadlocks under load as well as poor performance. Thankfully, containerd now uses an event loop and is lock-free. This was a huge motivating factor for us to move forward to Docker 1.11.

To me, the sad state has more to do with Docker the company pushing new features out as quickly as possible and leaving stabilization to contributors. There are some days where it really feels like Docker is open-source so that Docker Inc can get free QA. To most users things may not feel in a sad state, but it can really suck for contributors.


> the docker daemon was riddled with locks, which resulted in frequent deadlocks under load as well as poor performance

I second this. We use Docker in a similar scenario for a distributed CI, so we spawn between 70k and 90k containers every day. Up until very recently we were running 1.9 and saw a staggering 9% failure rate due to various Docker bugs.

It's getting better though, since we upgraded to 1.12 a few days ago we're down to a more manageable 4%, but I'd still consider this very unreliable for an infrastructure tool.

edit: my metrics were slightly flawed, we're down to 4% not 0.5%


You were likely seeing the bug that kept us from deploying 1.9, which was related to corruption of the bitmask that managed IP address allocation. We saw failure rates very similar to yours with that issue.


How is this acceptable?


You have to design for those failures. In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.

In the end, at this scale even with three or four nines of reliability you'd still have to deal with 80 or 8 failures every day, so we would have to be resilient to those crashes anyway.

However, it's a lot of wasted compute that we'd love to get back. But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.

Now maybe another container technology is more reliable, but at this point our entire infrastructure works with Docker because, besides those warts, it gives us other advantages that make the overall thing worth it. So we stick with the devil we know ¯\_(ツ)_/¯.


> In our case we spawn 200 containers for one build; if 9% of those crash, we still have a satisfactory experience.

You spawn 200 containers for one build‽ Egad, we really are at the end of days.

> But even with those drawbacks our Docker-based CI still runs 2 to 3 times faster than our previous one, because containers make heavy CI parallelism quite trivial.

Since containers are just isolated processes, wouldn't just running processes be just as fast (if not slightly faster), without requiring 200 containers for a single build?


> wouldn't just running processes be just as fast

The applications we test with this system have dependencies, both system packages and datastores. Containers allow us to isolate the test process together with all the dependent datastores (MySQL, Redis, Elasticsearch, etc.).

If we were to use regular processes we'd both have to ensure the environment is properly set up before running the tests, and also fiddle with tons of port configurations so we can run 16 MySQLs and 16 Redises on the same host.
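
With containers, each job just does something along these lines (names invented), with no host port juggling at all:

    docker run -d --name mysql-job42 -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:5.7
    docker run -d --name redis-job42 redis:3.2
    docker run --rm --link mysql-job42:mysql --link redis-job42:redis app-tests bin/run-tests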

See my other comment for more details https://news.ycombinator.com/item?id=12366824


CI can just recover from these errors by retrying/restarting containers.


Not from dead containers (ones that failed their post-shutdown cleanup).


"move fast and do'break shit" philosophy.


Where do you run the CI containers? AWS?


Yes, on a pool of c4.8xlarge EC2 instances with up to 16 containers per instance.

But very few of our failures are attributable to AWS; restarting the Docker daemon "fixes" most of them.


For a newbie, what is the reason you didn't use hosted CI, like Travis CI?


Initially we were using a hosted CI (which I won't name), but it had tons of problems we couldn't fix, and we were up against the wall in terms of performance.

To put it simply, when you run a distributed CI your build time is:

    setup_time + (test_run_time / parallelism)
So when you have a very large test suite, you can speed up the `test_run_time` part by increasing the parallelism, but the `setup_time` is a fixed cost you can't parallelize.
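
For example (numbers purely illustrative): with a 5 minute setup and 200 minutes worth of tests, 20-way parallelism gives 5 + 200/20 = 15 minutes of wall time, while doubling to 40-way only gets you to 10 minutes, because the fixed setup cost starts to dominate.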

By setup_time I mean installing dependencies, preparing the DB schema and similar things. On our old hosted CI, we would easily end up with jobs spending 6 or 7 minutes setting up, and then 8 or 9 minutes actually running tests.

Now with our own system, we are able to build and push a Docker image with the entirety of the CI environment in under 2 minutes, then all the jobs can pull and boot the Docker image in 10-30 seconds and start running tests. So we were able both to make the setup faster and to centralize it, so that our workers can actually spend their time running tests and not pointlessly installing the same packages over and over again.
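
The flow is essentially this (registry name and test entry point invented):

    # once per build, on a single builder:
    docker build -t registry.example.com/app-ci:$GIT_SHA .
    docker push registry.example.com/app-ci:$GIT_SHA

    # on each of the N workers:
    docker pull registry.example.com/app-ci:$GIT_SHA
    docker run --rm registry.example.com/app-ci:$GIT_SHA bin/run-tests "$CHUNK"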

In the end for pretty much the same price we made our CI 2 to 3 times faster (there is a lot of variance) than the hosted one we were using before.

But all this is for our biggest applications; our small ones still use a hosted CI for now, as it's much lower maintenance for us, and I wouldn't recommend going through this unless CI speed becomes a bottleneck for your organization.


You didn't include the maintenance cost of managing your infrastructure and container platform, which you don't need to worry about with a hosted service.


Even with that it was still worth it. A couple of people maintaining the CI is nothing if you can make the builds of the 350 other developers twice as fast.

Also, it's not like hosted CI is maintenance-free: if you don't want it to be totally sluggish, you have to use some quite complex scripts and caching strategies that need to be maintained.


> "To me, the sad state has more to do with Docker the company pushing new features out as quickly as possible and leaving stabilization to contributors."

Side note: I'm a production AWS user, with no plans to change, but I feel like AWS does this exact same thing with each re:Invent. The announced products actually become available 6-12 months later, and become "usable" and reliable 2 years later...


You can spin this as them releasing "MVP" software and letting early users drive direction. I mean, that's what I've heard.


Yeah.. Except that's not how they spin it.

In practice, they wrap it in marketing speak to paint it as something to revolutionize your stack.

Then you jump in and spend several days of engineering time diving into it, only to find late in the game that the one (or several) critical details you can't find in the documentation, the ones essential to building an end-to-end production-ready pipeline, are not actually implemented yet...

And won't be for many months


Alright, that makes sense; most of our containers are long-running, usually for months. Only the containers for our apps that are under active development see multiple rollovers per day.

Now that I think about it, we did have one semi-serious bug in Docker, though that was also our own fault. Our containers would log a lot, and we hadn't configured our rsyslog very well, so under some circumstances its buffers would fill up and log writes would become blocking and really slow. When this happened, some commands like `docker ps` would totally lock up, which messed with our (hand-rolled) orchestration system. It wasn't until one of us noticed the logs were minutes behind that we discovered killing rsyslog would make Docker responsive again, and thus found out what was happening.
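
Roughly, our containers were logging to the local rsyslog via the syslog log driver, i.e. something like (service name invented):

    docker run -d --log-driver=syslog myservice

so when rsyslog's buffers filled up, the daemon's log writes blocked right along with it.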

Since it didn't actually affect our service I didn't remember it as particularly bad, but I can imagine that if our service had depended on fast interactions with Docker, that would have hurt badly. IIRC they did recognize the severity of the issue and quickly had a fix ready.

I bet Docker Inc. has a tough mission, building out Docker services far enough to compete with the dozens of platforms that integrate Docker, such as AWS or OpenStack, so they can actually make money off the enterprise.


If you don't mind me asking, what is your use case? The company I work for is also spinning up and destroying containers constantly, and we've had to develop a "spinup daemon" in order to deal with Docker's slow spin-up time (1-2 seconds is unacceptable to me).

I'm curious if it'd be worth it to create some shim layer over runC (or add the functionality) in order to have a copy-on-write filesystem that could be used to discard all changes when you're done with the container, similar to how you can do a "docker run --rm -v /output:/output/34 mycontainer myapp" and all changes except those within the mounted volume get thrown away.

The use case at my job needs the security of SELinux plus cgroup/filesystem/network isolation. At first glance, it looks like runC may handle most of the containerization bits, but not the copy-on-write filesystem stuff that I currently need. :s
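
Something in this spirit is what I'm imagining (paths, names and layout all invented):

    # throwaway copy-on-write rootfs via overlayfs, discarded after the run
    mkdir -p /tmp/job34/upper /tmp/job34/work
    mount -t overlay overlay \
      -o lowerdir=/images/myapp,upperdir=/tmp/job34/upper,workdir=/tmp/job34/work \
      /bundles/myapp/rootfs
    runc run --bundle /bundles/myapp job34    # config.json points at ./rootfs
    umount /bundles/myapp/rootfs && rm -rf /tmp/job34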


I can't go into details on our use case, but if it can work for you, I highly recommend the new --tmpfs flag. If you know exactly where your application writes data and are okay with it being in memory, you can reuse your containers with a simple stop and start rather than waiting for the full setup of a new container.
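
A minimal sketch (image name and path are placeholders):

    docker run -d --name worker --tmpfs /scratch:rw,size=256m myimage myapp
    docker stop worker
    docker start worker    # reuses the same container; /scratch comes back empty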

With runC you can mount whatever filesystem you want, but it is up to you to set up that filesystem. So yes, you would need some kind of shim to set up your filesystem.


I've been using Docker in production for the past two and a half years (at two different companies), and even with no extreme use cases we've had problems with volume/devicemapper performance and with random breaking bugs: the daemon restarting without warning or errors in 1.4, containers being randomly killed in 1.9, and having to restart the daemon in 1.8 when it hung pulling images (consequently killing the containers in the process).

I still like Docker and can see myself, my team, and my company using it for a long time if nothing MUCH better shows up (rkt is promising to take some of the complexity pain away, but we are not diving into it yet). But I have been bitten enough that we avoid upgrading Docker unless it's needed, and we follow a rule of only upgrading to ".1" releases, as most of our problems have been with ".0" ones.


My favorite was that `docker exec -it $container bash` would cause a nil pointer dereference in Docker 1.6.0 and kill the Docker daemon. We've seen gobs of bugs since, but that was the most WTF-gnarly one.


I'd recommend looking at using runC (which is the underlying runtime underneath Docker). Currently we're heading for a 1.0, and the Open Container Initiative is working on specifications that will make container runtimes interoperable and eventually provide tooling that works with all OCI runtimes. If you have anything to contribute, I would hope that you can give us a hand. :D


I'm a huge fan of the work being done on runC and would love to give you guys a hand! You'll probably see me around soon :)


We had the same experiences, so we switched to using rkt, supervised by upstart (and now systemd).

We have an "application" state template in our salt config and every docker update something would cause all of them to fail. Thankful the "application" state template abstracted running container enough were we switched from docker -> rkt under the covers without anybody noticing, except now we no longer fearing of container software updates.


An example of changed behavior that broke us not too long ago: https://github.com/docker/distribution/issues/1662 . By the time this happened we were already working on the transition; it was just more motivation.


Hi mtanski,

How did you replace Docker with rkt? Do you have a howto that you can share?


I haven't replaced Docker with rkt on a big scale (or ran Docker on a big scale), but I recently changed over some Docker containers to rkt.

First off, this and the rest of the rkt docs are a good starting point: https://coreos.com/rkt/docs/latest/rkt-vs-other-projects.htm...

Second, rkt runs Docker images without modifications, so you can swap over really easily https://coreos.com/rkt/docs/latest/running-docker-images.htm...
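
For example (image name just an example):

    # fetched from Docker Hub and converted on the fly; Docker images aren't signed,
    # hence the --insecure-options=image flag
    rkt run --insecure-options=image docker://redis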

rkt uses acbuild (a build tool for the App Container specification, see https://github.com/appc/spec) to build images, and I had a very tiny Docker image that just ran a single Go process.

I just created a shell script that ran the required acbuild commands to get a similar image.
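
The script is basically just this (name and paths made up):

    acbuild begin
    acbuild set-name example.com/myapp
    acbuild copy ./myapp /usr/bin/myapp
    acbuild set-exec /usr/bin/myapp
    acbuild write myapp.aci
    acbuild end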

A good place to get started is the getting started guide https://coreos.com/rkt/docs/latest/getting-started-guide.htm...

Docker runs as a daemon, and rkt doesn't (which is one of the benefits). I just start my rkt container using systemd, so I have a systemd file with 'ExecStart=/usr/bin/rkt run myimage:1.23.4', but you can start the containers with whatever you want.
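
The whole unit is not much more than this (KillMode=mixed is what I believe the rkt docs suggest; the rest is just my defaults):

    [Unit]
    Description=myimage container

    [Service]
    ExecStart=/usr/bin/rkt run myimage:1.23.4
    KillMode=mixed
    Restart=always

    [Install]
    WantedBy=multi-user.target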

It's also possible to use rkt with Kubernetes, but I have not tried that yet. http://kubernetes.io/docs/getting-started-guides/rkt/


Not to mention that in 1.11 the restart timer was never reset to 0, even after the container had been running fine for more than 10 seconds (i.e. after a few restarts, your container could be waiting hours to start!).

That, and I can attest to the 1.12 problems listed in this article.

Can't remember the specifics with 1.10, but basically nothing ever really works as promised, which first makes people waste a lot of time trying to get something to work when it can't, and second doesn't inspire much trust in the product's stability.

I really wish they would collaborate a lot more and split their solutions into smaller modules while keeping everything simple. I think they have a great product, but too many growing pains.


If I were you I would add a disclaimer when criticizing Docker, mentioning that you work at AWS, since the products are competitive in some ways, like EC2 Container Registry vs Docker Hub. It would be great for AWS if Docker simply focused on open source bug-fixing and let AWS provide the profitable services....


Added a disclaimer. However, the product I work on does not compete with Docker in any way. We actually rely on Docker quite heavily. No conspiracy here.


"It'd be great if Docker wasn't profitable."

I sort of agree, but that's not entirely realistic :)


Why the anti-capitalist sentiment? We programmers want to get paid for our work, right?


No. Programmers want huge paychecks but everyone ELSE should be FOSS and code for us for free


ahem Free software does not need to be gratis. There are several examples of companies which charge money for free software.


ahem That's obviously not what is being discussed here.

> It would be great for AWS if Docker simply focused on open source bug-fixing and let AWS provide the profitable services....


I was responding to the specific, sarcastic, wording of this line "Programmers want huge paychecks but everyone ELSE should be FOSS and code for us for free".


And I said FOSS, not gratis, with intent. Perhaps I could have made it FOSS and gratis.

Devs are mad iTunes is closed source. Mad Windows is closed source. But happy to get a big paycheck if they work at Microsoft or Apple.


> But happy to get a big paycheck if they work at Microsoft or Apple.

I wouldn't ever want to work for a proprietary software company. But I admit that I'm on the extreme end on this debate.


Would rkt be a worthwhile alternative to try?


I really haven't looked at rkt as much as I should, but we're more likely to invest in looking at lower level tools like runC moving forward.


Amazing, thank you! I need to choose a containerization tech in the next month, and I am pretty worried about going with Docker because I hear many stories about how it is not really production-ready. Thanks for mentioning runC, I will check it out.


It all depends on your use case. runC could be way too low-level for what you need, and Docker may be production-ready for your specific use case.


No, it covers my use case perfectly. I need a _reliable_ containerization tool that does not run any additional service on my boxes. I am working on an extremely low-overhead orchestration layer for our cluster so we can avoid Swarm entirely.


Disclaimer: I work at Mesosphere.

There are alternative runtime implementations (such as Mesos/Mesosphere DC/OS) that let you have the best of both worlds: developers can still use Docker and produce Docker images, but you get production-grade container orchestration (running those same Docker images) without using the Docker daemon for your actual service deployment.


runC isn't an orchestration solution. It's a low-level component that can be (or is already) used by higher-level orchestration technologies.


We're actually working on getting OCI support into Kubernetes. It's a long way away, but we're very determined to get large orchestration engines to provide support for OCI runtimes (runC being the canonical example of such a runtime).


Great, but I do not need any orchestration solution at all. I need a container runtime that can encapsulate any software we develop (Java, C#, Node.js, etc.). And now we are approaching my problem with Docker: I believe it is a misconception for it to compete with already existing tools like systemd. I especially do not want any mediocre orchestration solution in my infrastructure that introduces overhead and complexity I do not need. One thing I learned managing large clusters (5K+ nodes) is that Swarm-like frameworks are extremely error-prone. If you flip the problem and build a startup script that pulls the container configuration down from S3, for example, and the container itself has code that attaches the instance to the right service (ELB, HAProxy, etc.), you can achieve the same thing without introducing services whose sole purpose is to maintain state you do not need.
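
As a rough sketch (bucket, paths and tooling all invented, and the rootfs is assumed to be baked into the machine image already):

    #!/bin/sh
    # instance boot: fetch this node's container config from S3, then start it
    aws s3 cp "s3://my-bucket/nodes/$(hostname)/config.json" /bundles/app/config.json
    runc run --bundle /bundles/app app
    # the app registers itself with the right service (ELB, HAProxy, ...) on startup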


If you want a container-like technology that already has large-cluster management and scaling built in, and that is ideal for software whose source code you control, think about trying Kubernetes (and/or any of its competitors).


Sounds like you want a PaaS.

Cloud Foundry is currently running real applications with 10k+ containers per installation. We are on track to test it with 250k app instances.

Plus it's been around for, in internet terms, eternity. Garden predates Docker, Diego predates Kubernetes, BOSH predates Terraform or CloudFormation and so on. Used by boring F1000 companies, which is why it's not talked about much on HN.

Disclosure: I work for Pivotal, we are the majority donors of engineering to Cloud Foundry.


I really do wonder what it would take to get the ecosystem to get behind rkt or something else. The present situation feels to me like it's held together by a shared desire to keep the Docker brand going, and that conflicts between Docker Inc. and basically everybody else just won't stop, because there is a lot of money involved for all sides.


For me the question is more like: why should we bundle containerization together with anything else? Why couldn't we follow the Unix philosophy and have the container runtime work together with orchestration software rather than be tied to it? CoreOS seems to keep the two fairly independent. Our biggest blocker is the lack of RPMs for CentOS/Red Hat for rkt.


> I need to chose a containerization tech

You are more likely to run into issues with containerization itself (cgroups, namespaces, etc.) than with the abstractions on top of it.

Unless you are doing some sort of orchestration on top of containers, you can't go wrong with any of the container abstractions.


Hm, I don’t think so. It just doesn’t have enough momentum and as such there aren’t many containers available. Of course, if you want to roll your own, that may not be relevant.

I tried it with the nginx and php-fpm Docker containers, but it wouldn't work, because those containers assume specific process hierarchies (to log to the console where you issued `docker run`) that just aren't present when using rkt. The advertised Docker compatibility only goes so far.

I still think rkt is a great idea, but I’m too lazy to develop my own containers. The documentation isn’t that good either.


To be fair, you really have to roll your own containers for applications, and it's not hard at all if you already know how the applications are hosted on a Linux server.

I've found the single-service Docker containers from the Hub are useful for development (MySQL, Redis, etc.), but the "official" language run-time Docker containers that I've looked at are basically demoware. They are built to give you something that runs with the minimum of effort, rather than being efficient or anything else.


They need to implement support for the Dockerfile format if they want to win. People value inertia. The switching costs have to be low if you expect anyone to switch; this only stops being true when the incumbent becomes intolerably useless to the general user base.


rkt can run any docker image built by a Dockerfile: https://coreos.com/rkt/docs/latest/running-docker-images.htm...

I agree that we need a better ecosystem of build tools, and that is something we are looking to help build out. But with rkt what we are trying to do is build an excellent runtime; we think the build side is an important but orthogonal problem.


There is no proper ecosystem for rkt. All I have seen is marketing hype and I don't know anyone who uses it. Just go to any meetup.


We're in the process of converting our 100+ cloud nodes to Docker+k8s and I have a lot of the same reservations -- the space is very immature and the tooling has a lot of kinks to work out, not only functionally but also aesthetically. It's already been a nightmare and we're not even deployed to prod yet.


If "cloud" is AWS, you should join the kubernetes slack sig-aws channel. Lots of community people figuring out those kinks together.


It's obligatory for me to recommend Cloud Foundry here. We've already built the platform, there's no need to build and maintain your own.

It just works. Really well, actually.

Disclosure: I work for Pivotal, we donate the majority of engineering to Cloud Foundry.


If you haven't heard of Giant Swarm, I encourage you to contact them. They have a scalable microservices provisioning solution that can use either Docker or Kubernetes. German company. Disclaimer: I worked for them last year. Holler if you need an intro.



