From 30 to 230 Docker containers per host (stormbind.net)
197 points by UkiahSmith 71 days ago | 53 comments

So at one point we were doing scale testing for our product where we needed to simulate systems running our software connected back to a central point. The idea was to run as many docker containers as we could on a server with 2x24 cores and 512GB of RAM. The RAM needed for each container was very small. No matter what, the system would start to break around ~1000 containers (this was 4 years ago). After many hours of the normal debugging we did not see anything on the network stack or linux limits side that we had not already tweaked (so we thought). So out comes strace! Bingo! We found out that the system could not handle the ARP cache with so many end points. Playing with net.ipv4.neigh.default.gc_interval and the stuff associated with it got us up to 2500+ containers.
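A minimal sketch of how one might inspect those neighbour-table settings before tuning them. The comment only names gc_interval; including the gc_thresh1/2/3 limits as "the stuff associated with it" is my assumption, and both helper names are hypothetical. A hard limit (gc_thresh3) of 1024 entries is a commonly documented default, which would at least be consistent with trouble starting around ~1000 containers, though that link is a guess.

```python
from pathlib import Path

# Assumed knobs: the comment names only gc_interval; the gc_thresh* values
# are the related neighbour (ARP) table limits exposed next to it.
NEIGH_KEYS = ("gc_interval", "gc_thresh1", "gc_thresh2", "gc_thresh3")


def read_neigh_settings(base="/proc/sys/net/ipv4/neigh/default"):
    """Read the neighbour-table settings under `base` as integers."""
    return {key: int((Path(base) / key).read_text().strip()) for key in NEIGH_KEYS}


def neigh_table_fits(settings, expected_entries):
    """Return True if the hard limit (gc_thresh3) leaves room for the entries."""
    return expected_entries < settings["gc_thresh3"]
```

On a Linux host, `neigh_table_fits(read_neigh_settings(), 2500)` gives a quick yes/no on whether the current hard limit could hold one entry per simulated endpoint.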

512GB RAM/2500 containers is still 500MB per container. In former days™ this was enough for a computer to run a complete desktop environment with a web browser and 20 tabs open (source: I had a PC with physically 500MB RAM). Is this really the limit for such a decently equipped machine? (I guess a server grade 48 cores, 512GB RAM should be less than 5kEUR nowadays)

He said "The RAM needed for each container was very small" - the RAM is _not_ the limit here. The point is that running many containers is a very different (hard to compare directly) type of load than the main, well optimized use case of running a single large desktop environment.

Networking in particular: with thousands of containers, now there are lots more interfaces, routes, conntrack entries, "background" traffic, iptables rules etc.

Some problems are algorithmic: for example historically a lot of code has been written with the assumption that the number of network interfaces is small. With >1k interfaces, suddenly O(n) lookups take time. Similarly, iptables rules are run sequentially. etc.
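As a toy illustration of that algorithmic point (not a model of iptables itself), the sketch below walks a rule list in order and counts comparisons, then does the same lookup through a hash map. The rule layout, port numbers, and the `first_match` helper are all made up for the example.

```python
def first_match(rules, packet_port):
    """Walk rules in order, as a sequential ruleset does; count comparisons."""
    for checked, (port, action) in enumerate(rules, start=1):
        if port == packet_port:
            return action, checked
    return "default", len(rules)


# One hypothetical rule per container, each keyed by a distinct port.
rules = [(10000 + i, f"container-{i}") for i in range(2500)]
by_port = dict(rules)  # hash lookup: cost does not grow with the rule count

print(first_match(rules, 12499))  # ('container-2499', 2500)
print(by_port[12499])             # container-2499
```

A packet that matches the last of 2500 rules pays for 2500 comparisons in the list version, while the dictionary lookup costs about the same regardless of how many rules exist.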

Some resources have limits. Exceeding these may impact performance. In his case the load blew up the ARP cache. Nice!

>512GB / 2500 is still 500mb per container

Am I missing something? 512gb / 2500 is around 200mb per container.
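For reference, the division works out like this; the only choice is whether a GB is counted as 1000 MB or 1024 MiB.

```python
# Per-container share of 512 GB split across 2500 containers.
containers = 2500

decimal_mb = 512 * 1000 / containers  # treating 1 GB as 1000 MB
binary_mib = 512 * 1024 / containers  # treating 1 GB as 1024 MiB

print(f"{decimal_mb:.1f} MB or {binary_mib:.1f} MiB per container")
# prints "204.8 MB or 209.7 MiB per container"
```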

I think he miscalculated. His point still stands, just not with 20 browser tabs.

The important bit is "running our software" - depending on the complexity of said software, 200MB is pretty meh.

wait, i had windows xp with firefox and all at 256 mb ram ! which i got later upgraded to 384 mbs because i reused another 128 mb ram chip :)

In ~2003, I replaced my desktop that had been Intel-based with a Duron 800MHz system, only I didn't have enough budget to get it the RAM it required (new/different slot iirc), so I only had the 128MB it came with (whereas my old machine had 768MB cobbled together from like six dimms).

I figured that one hop over 100Mbit Ethernet to remote memory was going to be faster than swapping to spinning rust (remember this was before consumer SSDs, and onions on our belts), so I made a ramdisk on the old machine and mounted it over the network with the nbd (network block device) kernel driver, ran swapon on the nbd and boom, extra "512MB" of RAM.

It worked amazingly well, and (knock on wood) none of my roommates ever tripped over the Ethernet cables.

So you're the guy who actually managed to "download more RAM". Congrats!

On a more serious note, gonna add that trick to my book - still plenty of rust spinnin' round

Infiniband QDR adapters are amazingly cheap and RDMA-aware software can use DMA to poke directly at memory or devices in the other system.

Yes. Needed ~150MB per container. It was just simulating end points - ie, TCP connections back to a central system.

You're assuming that all the memory was used, even though it's not specified in the post. 512GB/2500 is the upper limit. Anything between 0 and that could be the case.

What was the remaining 12 MB of RAM used for?

Nice! But how do you tell that from strace? Wouldn't it just show the guests stuck in connect()? With perf, sure, but strace?

It's been a while, but IIRC we saw the net process for ARP being called and getting stuck.

Is there any talk of increasing these defaults in higher memory systems? The low defaults feel like foot guns that people stumble into rather than something needed for optimal performance.

I've given up on expecting sane defaults from every piece of software. Some packages work perfectly fine out of the box, or simply work not at peak performance if not tuned slightly. Other software has so many dangerous default settings that it's hard to understand the rationale.

Case in point for me is Docker itself. By default it will write logs to json-file and not truncate or rotate these logs. Packages distributed for Ubuntu et al also don't set these limits, so unless you manually make it a system-wide default or set it for each container you run individually it will eventually eat your disk-space. This is a very dangerous setting since you're probably using Docker in a lot of cases where all you do is run ephemeral, stateless workloads. Having out-of-disk creep up on you on these kinds of hosts is most unexpected. The same could be said of having the -p port forwarding option default to binding to 0.0.0.0 instead of 127.0.0.1. Also a footgun and an exposed ElasticSearch with PII waiting to happen...

I install PHP on my server (yeah I know) to run some PHP-based web software. By default PHP is configured in a very insecure manner. You would think with all the security problems that they would ship it with a set of secure defaults.

The problem with setting secure defaults is that most of the world's PHP would stop working properly.

Probably but you can still have the configuration secure as default and people would be aware of the security implications when enabling insecure features.

No, that would be a breaking change.

Java did that with modules and it appears to have worked ok for them.

You're vastly overestimating the technical capabilities of the average person installing or creating software.

I am talking about whoever is packaging PHP for the OS. The default php.ini that comes with PHP on CentOS is insecure by default (I can't remember off the top of my head which settings were set to something insecure).

We are talking about an ini file. This isn't rocket science.

Right, but they need to be conscious of their end user. If they secure by default, and someone upgrades, their software stops working. Should PHP have had these defaults to begin with, yes absolutely. But now we're all stuck with a million miles of code that will break if register_globals is turned off. That's the point. Everything you've stated above there might as well be an alien language to the majority of people using this stuff.

No it should be secure by default and people will have to enable insecure features. It doesn't stop old software from working as the person will be able to simply re-enable whatever the insecure feature is.

However they will now be aware that said feature is insecure and should know the consequences of enabling it.

Could you point to some of the sane configuration for docker? We are planning to run some of these in production.

This is really the million dollar question. Right now I'm not aware of any single cookbook example of how to tune your server for an optimal docker load. It's all buried inside engineering organizations, or blog posts like the one here.

One of the things the MySQL developers did originally was to ship the code with three examples of the my.cnf config file for small, medium and large memory systems. I wish there was something like that for docker container density and OS tuning parameters.

I think what we need is basically a matrix of OS settings which will max out your docker density for N CPUs and M gigs of memory. There are also probably settings for network configs which depend on your link capacity also.

I can dream of a day when you can spin up an Ubuntu Docker-flavor server which optimizes itself to give the highest density of containers given the hardware (or VM) it's running on.

For the logging aspect it's funny because the Docker manual itself contains a good snippet with reasonable settings: https://docs.docker.com/config/containers/logging/configure/
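A rough sketch of what such a system-wide default could look like, generated from Python here. The `log-driver`, `max-size` and `max-file` keys are the json-file options described on that page; the specific values (10m, 3) are placeholders of mine, not a recommendation from the thread.

```python
import json

# Illustrative values only; tune the size and file count for your hosts.
daemon_config = {
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "10m",  # rotate a container's log once it reaches 10 MB
        "max-file": "3",    # keep at most three log files per container
    },
}

# The result is meant for /etc/docker/daemon.json (the usual Linux location).
print(json.dumps(daemon_config, indent=2))
```

As far as I know, the daemon has to be restarted to pick this up, and already-created containers keep the logging settings they were started with.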

For the port-binding thing, I'd just remember that it binds to 0.0.0.0 when not explicitly specified otherwise, and then use docker network and not port-forwards unless absolutely needed. For example, if you have an application and a couple of backing services (database, redis, ElasticSearch), then only your application needs a port forwarded from the host, the rest can live within the docker network.
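When a forward is needed, the host address can be pinned explicitly, since Docker publishes on all host interfaces unless one is given. A small sketch that builds the `-p` value in the `host_ip:host_port:container_port` form; the `publish_arg` helper, the ports, and the nginx image are just examples.

```python
def publish_arg(host_port, container_port, host_ip="127.0.0.1"):
    """Build a value for `docker run -p` that pins the host address."""
    return f"{host_ip}:{host_port}:{container_port}"


# Only reachable from the host itself:
print("docker run -p", publish_arg(8080, 80), "nginx")
# prints "docker run -p 127.0.0.1:8080:80 nginx"
```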

Setting up a firewall like ufw could prevent accidental port mapping to 0.0.0.0. I really don't like that this is Docker's default.

Last time I checked (~year ago) Docker used different iptables chain(s) than ufw or added itself before ufw rules, so ufw was useless in securing access to ports exposed by containers.

Ok, thanks for the heads up, I'm going to test this.

Would love to change this default but doing so will end up breaking Kubernetes users at the very least

Maybe we could change the default based on api version.

It is really up to package maintainers, so rather than give up, why not try to contribute back?

The big bottleneck we had with docker containers per host was not sustained peak but simultaneous start. This was with 1.6-1.8 but we’d see containers failing to start if more than 10 or so (sometimes as low as 2!) were started at the same time.

Hopefully rootless docker completely eliminates the races by removing the kernel resource contention.

Rootless docker uses user namespaces; it is all still happening in the kernel.

"Access was initially fronted by nginx with consul-template generating the config. When it did not scale anymore nginx was replaced by Traefik."

Wonder why Nginx didn't scale.

If I were to guess, reloads triggered from config changes.

Consul-template writes a config and then does an action. In the case of nginx, I would assume the action is to send a SIGHUP. I think haproxy would have also been an option here; it has better support for updating from SRV records and the like.

Where I am at the moment we're running clusters of 400-800 containers sitting behind nginx instances and even though we own nginx+ licenses, we've found the consul-template + SIGHUP route to be totally fine. Even at a churn of maybe a dozen containers a minute everything still seems to be working fine. If a particularly busy node dies then we occasionally see a few requests get errors back, but Nginx's passive healthchecking (ie. checking response codes and not sending traffic to an upstream with a ton of 500's being returned) seems to handle all of that ok.

The only time our tried and tested consul-template + SIGHUP method is ever unsuccessful (and we've ended up just having to put processes in place to stop this) is if we have the same nginx handling inbound connections to the cluster under high load and we try and respawn all the containers at once. Then things start to go wrong for 5 minutes or so, then back to normal.

While "the occasional error response" isn't perfect, I suspect that for most use cases it's good enough, so I'd still be interested in knowing more specifically what happened to that nginx...

nginx behaves in an RFC-conformant way. So if you send it a SIGHUP it will try to respawn all workers by closing (from the server side) all open connections. The problem is that this behaviour confuses some HTTP libs/connection poolers more than others. For example OkHTTP seems to be able to deal with it, but others not so much. Once you reach like 6-12 reloads per second you run into latency issues because you have to establish a new connection for every request, and if you're still running with HTTP/1.1 every benefit of idle connections and connection pooling is defeated. Examples like Traefik (or more old school the F5 BigIP LTM) split frontend and backend handling of connections, and deal with so many reloads more gracefully. Besides avoiding issues with HTTP libs, it at least improves your latency.

"With /proc/sys/kernel/pid_max defaulting to 32768 we actually ran out of PIDs. We increased that limit vastly, probably way beyond what we currently need, to 500000. Actuall limit on 64bit systems is 222"

Time to start thinking about 128bit systems!

Copy-paste error: it's 2^22, which is 4194304.
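A quick way to check both numbers, as a sketch: the constant reproduces the 2^22 arithmetic, and the reader pulls the live value from the /proc path quoted above. The helper names are hypothetical.

```python
from pathlib import Path

PID_MAX_LIMIT_64BIT = 2 ** 22  # 4194304, the ceiling quoted in the thread


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Return the current kernel.pid_max value as an integer."""
    return int(Path(path).read_text().strip())


def within_pid_ceiling(value):
    """True if a proposed pid_max does not exceed the 64-bit ceiling."""
    return value <= PID_MAX_LIMIT_64BIT
```

With these, `within_pid_ceiling(500000)` confirms that the article's chosen value sits well under the ceiling.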

Well... I'm running 100~150 per EC2 with Kubernetes... ¯\_(ツ)_/¯

what’s your instance type? t3.medium in EKS gives me 11 pods capacity by default.

m5.4xlarge currently but we are migrating to r5.2xlarge (it will better fit our memory/cpu ratio).

They could easily double that density with Go, or quadruple with C++ or Rust. Why people still use JRE I fail to understand.

Because doubling, quadrupling, etc. the number of servers is quick and has a well-known cost compared to going for a complete re-write in another language?

I'm a massive go and rust fan, but I don't expect the entire world to be re-written in them any time soon.

The density per server savings is probably a drop in the hat compared to the cost of the engineers themselves... also by the sounds of it memory usage isn't the issue here which is the only thing I think you'd get from C++. I've seen well written java applications do amazing things performance wise even expert C programmers couldn't match (without an obtuse amount of effort).

  >> The density per server savings is probably a drop in 
  >> the hat compared to the cost of the engineers themselves
For C++ and Rust yes, unless scale is huge. For Go, emphatically no. Go is simpler to work with, and on typical programs it uses half the RAM and fewer threads. Although even for C++ and Rust at medium scale I'd rather do proper engineering, and pay my (rather than somebody else's) people better. "Hard" languages tend to select better engineers. Go is a bit of an odd one in this regard because it selects better engineers by not allowing the kind of mindless GoF masturbation one often sees in Java programs, at the language feature level. Can't abuse OOP if there's no OOP.

  >> [memory] is the only thing I think you'd get from C++
This is far, far from the "only" thing you'd get out of the languages I mentioned. Quarter of the memory, as few threads as you'd like, access to vector instruction sets, easier performance optimization where the code calls for it, not having to tune your GC for smaller heap, not having to new up a class every time you do anything (I know you don't have to, but that wouldn't be idiomatic Java), etc, etc.

Because of the existing code and libraries which are not available in other languages. For instance if I am writing some NLP related code I would write it in python.

Ironically, that'd still be better than JRE, because your NLP libs end up spending most of their time in high performance, memory efficient C++ code that underpins such libs.

No man I wouldn't want to call those horrific c functions compared to those sweet python functions ! and the libs calling those C functions do much more than just being a wrapper. they add a lot of tooling around.
