
Nomad Million Container Challenge - minimaxir
https://www.hashicorp.com/c1m.html
======
meteorfox
It's really impressive that it can handle this many container placements.

But, honest question: what's the value of determining how fast we can
schedule a million containers? This question is not just for Nomad but also
for the other cluster managers that have recently published similar benchmarks.

I see the value of scheduling thousands to perhaps hundreds of thousands of
containers across many nodes, but millions seem excessive.

I think it is more valuable to measure what happens after you have 1 million
containers running on your cluster, such as:

- What is the overhead of keeping track of that many containers?
- How do they impact the responsiveness of other API calls (list, delete)?
- What happens when nodes go down and you suddenly lose a considerable number of containers? Can it recover quickly?
- How does it impact the performance of the containers already running in the cluster?

Also, there are other important factors to test for:

- What about image size? How does it impact scheduling time when the image is not cached?
- Container density per node
- Number of nodes
- What about scheduling the other workloads that Nomad supports, like VMs and other runtimes?

~~~
illumin8
With any system of sufficient scale, you're bound to hit artificial
(software-design-inflicted) limits to maximum scale or performance.

The reason a good software company tests extreme limits (1 million
containers) that most customers will never reach is to assure those customers
that they will not hit a scale limitation.

From my experience running large private cloud infrastructure (>14,000 virtual
servers at once), you will always hit some crazy limit that the vendor never
anticipated. "14,000 VMs? We've only tested with 10,000" (not a real example,
but an idea of what type of problem you'll run into)

Proving 1 million containers in 5 minutes is just designed to assure regular
customers that they're fine. I doubt anyone really needs that many containers
for any current workload...

------
wmf
Great work, although I wish they hadn't reused the terms C100K and C1M that
have been used to refer to holding many TCP connections open.

The timing is an interesting coincidence given recent work on K8s scheduler
scalability: [https://coreos.com/blog/improving-kubernetes-scheduler-performance.html](https://coreos.com/blog/improving-kubernetes-scheduler-performance.html)

------
mitchellh
Hi everyone, let me do what I normally do and answer the FAQs whenever I see
something about us on HN. I'm one of the founders of HashiCorp and the creator
of many of our tools, so I have the authority to respond here, as well as the
bias.

In no particular order, the feedback and questions we commonly get about this
are below. If you have any more, feel free to ask and I'll try to answer. I'm
flying today, so I may be in and out.

1. "Scheduling 1M containers isn't realistic"

Nobody is going to schedule 1M containers right now in a short period of time.
I don't argue that and we don't make that claim anywhere.

HashiCorp has always believed that if you build something that scales UP, it
always scales DOWN. Now you know Nomad can schedule containers at ~4,000
containers/sec with 5 servers. You may only need a few per second, but when
you get a bunch of batch jobs you want to run (data processing, queueing,
etc.) and maybe have thousands to submit in a short period, you can now be
confident that Nomad is going to be very, very okay.

I fly a lot, and I've always been in awe of airplane wings. Did you know a
Boeing's wings can bend enough to nearly touch each other at the top before
they snap? I fly over 100 times per year (for 6 years now). Have I ever seen
wings even get 10% of the distance to that? Nope. But because they CAN, I feel
really confident in that plane [under those circumstances].

In addition to that, I think schedulers will only get more loaded over time.
We allude to this in our conclusion: think about Lambda, or using schedulers
for basic queueing, etc. All of these require a scheduler, and the load that
they'll put on a scheduler is easily 10x if not 100x more than what we put on
schedulers today (in practice). Knowing _today_ that Nomad can handle this
load allows us to design for the future. Maybe it's never needed, maybe.

2. "But how does Nomad act under failure scenarios?"

This challenge actually demonstrated this! During our 1M container launch we
hit hardware and network failures, and even found a bug in the Docker engine
itself. Amidst all of this, Nomad self-healed and relaunched the failed jobs;
as a result, we actually launched 1.003M containers to complete our jobs.

3. "Starting containers isn't impressive, keeping them running is."

The goal of this benchmark and this post was to show the speed at which we can
binpack under constraints and high pressure with relatively few servers
(five). However, it isn't every day that we get access to a 5,000 node Nomad
lab, so we ran and measured a _lot_ more than what we published.

We kept the cluster running for some time. Nomad continued to self-heal any
failed jobs, and the 5 servers kept up just fine. Nomad clients monitor the
health of their own tasks, so that is never very expensive, and the clients
and servers heartbeat periodically.
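
To make that concrete, here's a rough Go sketch of what such a heartbeat loop looks like. The names are hypothetical and this is not our actual implementation, just the shape of the idea: the client watches its own tasks and the servers only receive compact periodic reports.

```go
// Hypothetical sketch of a client-side heartbeat loop; these names are
// illustrative and do not come from Nomad's actual codebase.
package main

import (
	"fmt"
	"time"
)

// TaskStatus is a cheap summary of one locally supervised task.
type TaskStatus struct {
	ID      string
	Healthy bool
}

// heartbeat periodically reports a small summary of task health to the
// servers. The cost stays low because the payload is a compact summary,
// not per-container polling by the servers.
func heartbeat(statuses func() []TaskStatus, interval time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			report := statuses()
			fmt.Printf("heartbeat: reporting %d tasks\n", len(report))
		case <-done:
			return
		}
	}
}

func main() {
	done := make(chan struct{})
	go heartbeat(func() []TaskStatus {
		return []TaskStatus{{ID: "redis-1", Healthy: true}}
	}, time.Second, done)
	time.Sleep(2500 * time.Millisecond) // let a couple of heartbeats fire
	close(done)
}
```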

We didn't publish any of that because it's just so boring: Nomad just kept
things running, CPU was within bounds, network was quiet, etc.

We're currently planning a more strenuous benchmark around disaster scenarios.
We now know Nomad can scale; now we're designing tests to observe how
resilient it can be.

~~~
rsync
"I fly a lot, and I've always been in awe of airplane wings. Did you know a
Boeing's wings can bend enough to nearly touch each other at the top before
they snap? I fly over 100 times per year (for 6 years now)."

I thought this was amazing, so I looked it up:

[https://www.youtube.com/watch?v=rak2HldVp9M](https://www.youtube.com/watch?v=rak2HldVp9M)

It is indeed impressive, but by no means nearly touching each other.

~~~
recuter
Never let the truth get in the way of a good story. ;) His metaphor was
evocative and the point came across well.

------
jgrowl
"Nomad 0.3.1 was released earlier today. In the next major release, we will
focus on data volumes to enable Nomad to run more stateful applications."

What is the current approach for dealing with persistent data?

~~~
markbnj
Data is the remaining elephant in the room. Containers have made code and
config ephemeral and trivially replicable. But then you get to data, which
represents billions of actually valuable magnetized bits of platter or flipped
nand gates or whatever. It's like butterflies and battleships.

One approach, as another reply suggested, is to mount external volumes into
the container, e.g. a persistent disk on GCP or Amazon EBS (or even a GCS or
S3 bucket). This works, but of course the container, hosting the software that
makes the data meaningful, is now bolted to a physical thing. It can no longer
be moved, auto-scaled, etc. It can still be monitored and restarted, and you
still get all the benefits of dependency isolation and repeatability. I'd
still much, much rather run a DB this way than install it directly onto an
instance, but there's no use pretending that we can do with data what we can
now do with code and config. It's still a cement block chained to our ankles.
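
To illustrate that bolting-on (with a made-up host path and image), here's a minimal Go sketch that shells out to Docker to bind a database's data directory to an attached disk:

```go
// Hedged sketch of the approach above: run a database container with its
// data directory bound to an externally attached disk. The host path,
// container name, and image are all made up for illustration.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Equivalent to:
	//   docker run -d --name db \
	//     -v /mnt/pd0/pgdata:/var/lib/postgresql/data postgres:9.5
	// The container stays disposable, but it can now only run on a host
	// that has /mnt/pd0 attached, so it can't be freely rescheduled.
	out, err := exec.Command("docker", "run", "-d", "--name", "db",
		"-v", "/mnt/pd0/pgdata:/var/lib/postgresql/data",
		"postgres:9.5").CombinedOutput()
	if err != nil {
		log.Fatalf("docker run failed: %v\n%s", err, out)
	}
	log.Printf("started container %s", out)
}
```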

------
thomaskauth
I'd be curious whether they spent any time on getting multiple containers on a
single machine to start faster. When I tried this recently, it appeared that
starting Docker containers on a single machine is inherently serial. Has
anybody else experienced this?
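
For anyone who wants to poke at it, here's a rough sketch of the experiment I mean (illustrative only): start N trivial containers concurrently and see whether the batch time scales with N.

```go
// Illustrative experiment: start N trivial containers concurrently and
// time the whole batch. If total wall time grows roughly linearly with
// N, the daemon is serializing container starts.
package main

import (
	"fmt"
	"os/exec"
	"sync"
	"time"
)

func main() {
	const n = 10
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// busybox's "true" exits immediately, so the measurement is
			// dominated by create/start overhead, not the workload.
			out, err := exec.Command("docker", "run", "--rm", "busybox", "true").CombinedOutput()
			if err != nil {
				fmt.Printf("container %d failed: %v\n%s", i, err, out)
			}
		}(i)
	}
	wg.Wait()
	fmt.Printf("started %d containers in %s\n", n, time.Since(start))
}
```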

------
darawk
Nomad seems cool, but this benchmark is almost completely meaningless as far
as I can tell. There is no reason to test this with 5,000 servers and 1
million containers; it just seems like marketing nonsense. All you really care
about is how fast it can provision 1 container and 1 machine. The rest is just
parallelism, which, when dealing with multiple discrete servers, should scale
perfectly without any actual effort.

~~~
wmf
Starting containers can be done in parallel, but scheduling cannot be
trivially parallelized because it needs to take into account the resources
consumed by existing containers. Situations like a rack failure can cause
scheduling storms where many workloads need to be rescheduled as quickly as
possible.
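
A toy first-fit sketch of that serialization point (assumed for illustration, not Nomad's actual design):

```go
// Toy first-fit scheduler showing why placement can't be trivially
// parallelized: every placement must read and update one shared view of
// free resources, or concurrent placements would double-book nodes.
package main

import (
	"fmt"
	"sync"
)

type Node struct {
	Name    string
	FreeMHz int // remaining CPU capacity on this node
}

type Scheduler struct {
	mu    sync.Mutex
	nodes []*Node
}

// Place reserves capacity on the first node that fits. The lock is the
// part that resists parallelism: two placements running concurrently
// without it could both see the same free capacity and overcommit.
func (s *Scheduler) Place(mhz int) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, n := range s.nodes {
		if n.FreeMHz >= mhz {
			n.FreeMHz -= mhz
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	s := &Scheduler{nodes: []*Node{{"node1", 2000}, {"node2", 2000}}}
	// A burst of rescheduling work, e.g. after a rack failure.
	for i := 0; i < 5; i++ {
		node, ok := s.Place(900)
		fmt.Printf("task-%d -> %s (placed=%v)\n", i, node, ok)
	}
}
```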

