
How we use HashiCorp Nomad - jen20
https://blog.cloudflare.com/how-we-use-hashicorp-nomad
======
ianlevesque
After trying out most of the Kubernetes ecosystem in pursuit of a
declarative language to describe and provision services, I found Nomad a
breath of fresh air. It is so much easier to administer, and the Nomad job
specs were exactly what I was looking for. I also noticed a lot of k8s apps
encourage the use of Helm or even shell scripts to set up key pieces, which
defeats the purpose if you are trying to be declarative about your deployment.
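
For anyone who hasn't seen one, a job spec is a single declarative HCL file.
A minimal sketch (the job name, image, and resource numbers here are made up):

    job "web" {
      datacenters = ["dc1"]

      group "app" {
        # Desired instance count; Nomad converges the cluster to this.
        count = 2

        task "server" {
          driver = "docker"

          config {
            image = "nginx:1.17"
          }

          resources {
            cpu    = 500 # MHz
            memory = 256 # MB
          }
        }
      }
    }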

~~~
ognyankulev
Helm charts are a declarative way of deploying apps and their accompanying
resources.

~~~
q3k
Helm, however, is objectively terrible with its yaml-based templating language
and zero practical modularity.

~~~
namelosw
Sometimes I even wish they could embed a JavaScript interpreter... After all,
YAML is almost equivalent to JSON, and the perfect templating language for
JSON is, tbh, JavaScript.

Otherwise people have to keep inventing half-baked things.

~~~
q3k
The problem isn't JSON or YAML: it's text-templating serialization formats
instead of just building the data structures and marshalling/serializing them.

------
heipei
Great writeup, even if I'm longing for more details about things like Unimog
and their configuration management tool.

Pay close attention to the section where they describe why they went with
Nomad (simple, single binary, can schedule different workloads and not just
Docker, integration with Consul). Nomad is so simple to run that you can run a
cluster in dev mode with one command on your laptop (even natively on macOS,
where you can try scheduling non-Docker workloads). I'd go so far as to say
that it would pay off to use Nomad as a job scheduler even when you only have
one machine and might otherwise have used systemd. You can wget the binary,
write a small systemd unit for Nomad, then deploy any additional workloads
with Nomad itself (see the sketch below). By the time you have to scale to
multiple machines you just add them to the cluster and don't have to rewrite
your job definitions from systemd.
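
For the single-machine case, a non-Docker workload under the exec driver might
look something like this sketch (the binary, paths, and restart numbers are
hypothetical):

    job "metrics-agent" {
      datacenters = ["dc1"]
      type        = "service"

      group "agent" {
        # Roughly what Restart=on-failure in a systemd unit would give you.
        restart {
          attempts = 3
          interval = "5m"
          delay    = "15s"
          mode     = "delay"
        }

        task "run" {
          # exec runs a plain binary with light isolation; no Docker required.
          driver = "exec"

          config {
            command = "/usr/local/bin/metrics-agent"
            args    = ["-config", "/etc/metrics-agent.hcl"]
          }
        }
      }
    }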

------
candiddevmike
The biggest hurdle with adopting nomad is the kubernetes ecosystem and related
network effects. Things like operators only run on Kubernetes, and they're
driving an entirely new paradigm of infrastructure and application management.
HashiCorp is doing their best to keep up while supporting standard kube
interfaces like CSI/CNI/CRI, but I don't know how they can possibly stay
relevant with Kubernetes momentum.

In my opinion, HashiCorp should look at what Rancher did with K3s and offer
something like that, integrated with the entire Hashi stack. The only reason
most people choose nomad is the (initial) simplicity of it (which quickly goes
away once you realize how "on an island" you are with solutions for ingress
etc). Deliver kube with that simplicity and integration and it's a much more
compelling story than what Nomad delivers today.

~~~
chucky_z
This is what Nomad is though... It works without their other products and also
natively integrates into them. Deploying Vault, Consul, and Nomad gives you a
very nice experience.

Also, with Consul 1.8 and Nomad 0.11 you’ll get Consul Connect with ingress
gateways, which solve some of those problems you mentioned.

~~~
shaklee3
Nomad uses consul as the kv store, so it doesn't work without their other
products.

~~~
chucky_z
Please make an attempt to understand a technology before making comments like
this. Nomad has _no_ requirement for a KV store, nor for secrets storage. It's
an entirely separate thing from Consul and Vault; it simply integrates with
them. It even runs its own Raft store, so each product's backend is totally
separate (Vault dropped its reliance on external backends as of 1.4 and can
run its own Raft store now).

Nomad template stanzas are simply consul-template
([https://github.com/hashicorp/consul-template#plugins](https://github.com/hashicorp/consul-template#plugins));
you could use `{{ with secret "" }}` and never touch Consul. You also have
every function beyond `service` and the `key*` set of functions. You could
build pretty static or dynamic configurations using these blocks without ever
touching Consul. On top of that, both Terraform templates and Levant work well
for templating out the job specs themselves, which contain template stanzas.
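
As a concrete sketch of a Vault-only template stanza (the image, Vault path,
and policy name here are assumptions for illustration):

    task "app" {
      driver = "docker"

      config {
        image = "example/app:1.0" # hypothetical image
      }

      # Gives the task a Vault token scoped to the named policy.
      vault {
        policies = ["app"]
      }

      # Rendered by the embedded consul-template; never touches Consul.
      template {
        destination = "secrets/app.env"
        env         = true
        data        = <<-EOT
          {{ with secret "secret/data/app" }}
          DB_PASSWORD={{ .Data.data.password }}
          {{ end }}
        EOT
      }
    }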

As an example of something helpful: if you wanted to drop a very small binary
that changes with each deploy, you could use
[https://github.com/hashicorp/consul-template#base64decode](https://github.com/hashicorp/consul-template#base64decode)
and just change the contents each time you deploy the job.

If you wanted to use Redis keys in your consul-template, simply drop a binary
on each server as a plugin to consul-template, then, for example: `{{ "my-key"
| plugin "redis" "get" }}`.

Why not try running it yourself?

`nomad agent -dev` will get you a server running, then `nomad job init` will
give you a full spec which you can run with `nomad job run example.nomad`.

~~~
shaklee3
I understand nomad, and what you're saying is not how anyone runs it. Directly
from their documentation:

"Nomad schedules workloads of various types across a cluster of generic hosts.
Because of this, placement is not known in advance and you will need to use
service discovery to connect tasks to other services deployed across your
cluster. Nomad integrates with Consul to provide service discovery and
monitoring."

So one of the very basic features of an orchestrator, which is service
discovery, requires Consul. Sure, I can use it without Consul to just start
jobs that don't communicate, but obviously that's not the normal use case of
it, and you can see that by looking at their issues list.

------
solatic
So basically, when CloudFlare was making the decision to adopt Nomad, they had
already adopted Consul and had already built a custom in-house scheduler
(Unimog) for their customer traffic.

It's rather disingenuous to compare to Kubernetes-based installations by
pointing out that Nomad is a single Go binary that's easy to install, because
they're also running Consul and Unimog. For what it's worth, kubelet (which
schedules jobs, like Nomad) is also a single Go binary that's easy to install,
as is kube-proxy (which helps services running on a node to send traffic to
the right node, so, roughly analogous to Consul).

If anything, the author practically makes the case against Nomad: the only
reason CloudFlare adopted it is that the stars aligned on their stack. Most
companies will actually reduce complexity by adopting Kubernetes instead,
compared to running two different schedulers for jobs plus externally-managed
service discovery.

~~~
shaklee3
The single binary thing is a tired trope. Kubernetes has had hyperkube, a
single binary, for many years.

But really, a single binary should be at the bottom of your list of concerns.

~~~
jordanbeiber
Hyperkube is several binaries bundled in one container, no?

~~~
mdaniel
It's often _distributed_ as a container, but it ~~is~~ was a single monolithic
Go binary. However, while I was looking for supporting evidence for that
claim, I discovered that it recently[1] was evicted from the github.com/k/k
repo; the Docker image that is now built uses a shell script named hyperkube,
and the Dockerfile just untars the underlying binaries into /usr/local/bin.

I wasn't able to immediately track down the alleged new repo for hyperkube,
though.

[1]: [https://github.com/kubernetes/kubernetes/pull/83454](https://github.com/kubernetes/kubernetes/pull/83454)

------
schoolornot
I'm surprised Hashicorp hasn't repositioned this product.

Terraform was a huge breath of fresh air after Cloudformation. If you ask me,
deploying apps via k8s is even better.

Everyone wants a free lunch, and for me, momentum + cloud vendor support +
ultimately the fact that Nomad's Enterprise-only features come free with K8s
made the choice easy.

~~~
rossmohax
TF can still often feel suffocating. Pulumi runs circles around it when it
comes to the user experience of writing complex, modular, composable and
reusable configurations.

~~~
tetha
Except now you have to deal with JavaScript. It's hard enough to turn some
operators towards an IaC approach; throwing JavaScript at them isn't going to
help.

~~~
jen20
Pulumi does not require JavaScript at all - it is one of a range of options,
including JavaScript, TypeScript, Python, Go, C# and F# (currently).

Disclaimer: I worked on Terraform at HashiCorp and am a contributor to Pulumi
also.

~~~
gavinray
Pulumi is so good (even if it is just typed SDKs over Terraform). It's the
only sane approach to IaC IMO: using an actual programming language with types
and IDE integration.

Really dig using it with Cloud Run.

~~~
jen20
It’s also not typed SDKs over Terraform. The best Pulumi providers are native
and offer a much richer experience than the wrapped Terraform providers (see
the Kubernetes support for example). I for one look forward to those rolling
out across the board!

~~~
gavinray
I actually have one burning question if you don't mind answering:

Pulumi has two different k8s modules: the standard one (pulumi-kubernetes) and
"kubernetes-x", which has a much nicer API that reduces boilerplate:

[https://github.com/pulumi/pulumi-kubernetesx](https://github.com/pulumi/pulumi-kubernetesx)

The second one doesn't seem to be used in the Pulumi tutorial content and has
fewer stars/less recognition.

Are these interchangeable? If so, why not use @pulumi/kubernetesx instead of
the standard k8s module everywhere?

~~~
jen20
They’re not interchangeable - kubernetes-x builds higher-level abstractions on
top of the pulumi-kubernetes API, which is faithful to the OpenAPI spec
supplied by Kubernetes.

To my mind, the real power of having general-purpose languages build the
declarative model instead of a DSL is that libraries like k-x (or indeed
AWS-x) can actually be built!

------
aprdm
Do people who aren't at Cloudflare scale really see the need for Kubernetes
and/or Nomad?

Of the two, Nomad seems much more sane because it does one thing only and is
much simpler to manage and deploy.

That said, having used it, we are mostly moving away from it. Consul +
Docker/Docker Compose with systemd in "a service per VM" model has proved much
easier to administer at our scale (a couple of datacenters, around the ~1k VM
mark). It is actually so simple that it is boring. Instead of fiddling with
infrastructure, the developers spend time solving business problems...

Our stack is really boring; developers can find their way around Ansible and
share Galaxy roles to configure/provision infrastructure... Other cloud-native
projects like Prometheus and Fluentd give us all the visibility we need in a
very straightforward and boring way.

p.s: Great article!

~~~
stitched2gethr
From my personal experience, yes, but not just for scale. I've used Kubernetes
in a 300k-person organization and now in a 35-person organization. Kubernetes
isn't the simplest solution, but it's reliability in a box. You can control
the network, the servers, and the load balancing all through the API, which
means for most things I can install a fully clustered solution with a handful
of commands, set the number of instances I want to run, and forget the whole
thing exists. Suddenly I'm cloud-agnostic by default, and the underlying K8s
management is handled by the cloud provider.

In the 300k-person org it was for scale and reliability. In the 35-person org
it's so that one guy can manage everything without trying.

~~~
aprdm
I see. My last few roles have involved introducing DevOps workflows into
companies that already had a decent-sized tech organization (~200 people) and
the traditional IT/Dev divide.

I think the approach I mention ends up being a good compromise for both sides.
If I were starting a company from scratch I would certainly consider it
differently.

------
chucky_z
Cloudflare: make sure you upgrade to 0.11.3; the new scheduling behavior is
awesome for large clusters.

Also, a massive warning to anyone wanting to use hard CPU limits and cgroups
(they do work in Nomad, it's just not trivial): they don't work like anyone
expects and need to be heavily tested.
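
(For reference: hard CPU limits in Nomad's docker driver are opt-in per task,
along the lines of the sketch below; the image is hypothetical. The warning
stands: test how CFS throttling behaves under your real load before trusting
it.)

    task "app" {
      driver = "docker"

      config {
        image          = "example/app:1.0" # hypothetical
        cpu_hard_limit = true              # enforce a CFS quota, not just shares
      }

      resources {
        # With cpu_hard_limit set, this is a hard ceiling rather than a weight,
        # so throttling can kick in even while the host has idle CPU.
        cpu = 1000 # MHz
      }
    }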

~~~
waynesonfire
are you serious? you've got a major company touting a technology and a key
component of it is broken?

~~~
rbjorklin
I haven’t tested hard CPU quotas with Nomad but I suspect the issue mentioned
above is due to cgroups/CFS and is also applicable to Kubernetes. See these
slides for more details:
[https://www.slideshare.net/mobile/try_except_/optimizing-kub...](https://www.slideshare.net/mobile/try_except_/optimizing-kubernetes-resource-requestslimits-for-costefficiency-and-latency-highload)

~~~
Pirate-of-SV
I recall there have been some improvements to CFS to fix this since 2018.

~~~
chucky_z
There's been a ton of improvements over time. That doesn't make it any less
foot-gun-y, unfortunately.

I had the exact same problem with Kubernetes (and even just straight-up
containers). I just want to make sure folks are extremely aware of how big of
a footgun it is, and that they really, really, really test it well.

------
justincormack
Nice that they did a kernel patch to fix the pivot_root-on-initramfs problem,
but sadly, looking at the replies, it won't get merged:
[https://lore.kernel.org/linux-fsdevel/20200305193511.28621-1...](https://lore.kernel.org/linux-fsdevel/20200305193511.28621-1-ignat@cloudflare.com/)

------
rcar
Article is missing the title's leading "How"

------
jFriedensreich
One point I never get for companies operating their own hardware: if your
problem was having a number of well-known servers for the internal management
services, and you move to a Nomad cluster or Kubernetes to schedule them
dynamically, you end up with the same problem as before for scheduling the
well-known Nomad servers or Kubernetes masters. So is the only advantage here
that the Nomad server images update less often than the images of the
management services?

~~~
dmwilcox
Not associated with CloudFlare but I've built similar stuff with Nomad.

The cost/maintenance trade-off works when you have more SPOF management hosts
than Nomad servers (5). You get host images down to two, Nomad server and
Nomad client, versus N management images.

Though it does sound a bit like they're using config management rather than
pre-built images.

Bonus: Nomad servers are more failure-resistant, thanks to Raft consensus,
than any N management hosts. And for discovery, I found the optimal pattern is
to put all of the Nomad servers in a "cluster" A record for clients to join
easily (the pattern works well for Consul too); see the sketch below.
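
A minimal sketch of that pattern in a Nomad client's agent config (the DNS
name is hypothetical; it would be an A record resolving to all of the
servers):

    client {
      enabled = true

      server_join {
        # One name, N server addresses behind it; clients retry until joined.
        retry_join     = ["nomad-servers.example.com"]
        retry_max      = 0     # retry indefinitely
        retry_interval = "30s"
      }
    }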

------
pier25
Off topic, but... I've always wondered how Cloudflare can get away with not
charging for bandwidth. Even the free plan is super generous (CDN, SSL, etc).

How are they making a profit when, AFAIK, all other CDNs charge you for
bandwidth (and I assume they have to pay their providers for it)?

~~~
nielsole
See the second answer for a statement from CF:
[https://webmasters.stackexchange.com/questions/88659/how-can...](https://webmasters.stackexchange.com/questions/88659/how-can-cloudflare-offer-a-free-cdn-with-unlimited-bandwidth)

~~~
pier25
I'm familiar with those terms but still.

There are people with free accounts moving GBs every month through their
network and I imagine those free users must account for a very large
percentage of their traffic.

~~~
fach
I’d imagine at this point they are heavily peered in most markets, driven by
said free users, so there isn't a significant opex hit bandwidth-wise.
Space/power opex plus network/compute hardware capex probably dominates their
spend.

------
jeffbee
"... here is the CPU usage over a day in one of our data centers where each
time series represents one machine and the different colors represent
different generations of hardware. Unimog keeps all machines processing
traffic and at roughly the same CPU utilization."

Still a mystery to me why "balancing" has SO MUCH mindshare. This is almost
certainly not the optimal strategy for user experience. It is going to be much
better to drain traffic away from older machines while newer machines stay
fully loaded, rather than running every machine at equal utilization factor.

~~~
dpw
I'm an engineer at Cloudflare, and I work on Unimog (the system in question).

You are right that even balancing of utilization across servers with different
hardware is not necessarily the optimal strategy. But keeping faster machines
busy while slower machines are idle would not be better.

This is because the time to service a request is only partly determined by the
time it takes while being processed on a CPU somewhere. It's also determined
by the time that the request has to wait to get hold of a CPU (which can
happen at many points in the processing of a request). As the utilization of a
server gets higher, it becomes more likely that requests on that server will
end up waiting in a queue at some point (queuing theory comes into play, so
the effects are very non-linear).
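
(A standard queueing-theory illustration of that non-linearity, not anything
specific to Unimog: for an M/M/1 queue with arrival rate λ and service rate μ,
the mean time a request spends in the system is

    W = \frac{1}{\mu - \lambda} = \frac{1}{\mu\,(1 - \rho)}, \qquad \rho = \frac{\lambda}{\mu}

which grows without bound as utilization ρ approaches 1 -- so a lightly loaded
slow server can beat a heavily loaded fast one on end-to-end latency.)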

Furthermore, most of the increase in server performance in the last 10 years
has been due to adding more cores and to non-core improvements (e.g. cache
sizes). Single-thread performance has increased, but more modestly.

Putting those things together, if you have an old server that is almost idle,
and a new server that is busy, then a connection to the old server will
actually see better performance.

There are other factors to consider. The most important duty of Unimog is to
ensure that when the demand on a data center approaches its capacity, no
server becomes overloaded (i.e. its utilization goes above some threshold
where response latency starts to degrade rapidly). Most of the time, our data
centers have a good margin of spare capacity, and so it would be possible to
avoid overloading servers without needing to balance the load evenly. But we
still need to be confident that if there is a sudden burst of demand on one of
our data centers, it will be balanced evenly. The easiest way to demonstrate
that is to balance the load evenly long before it becomes strictly necessary.
That way, if the ongoing evolution of our hardware and software stack
introduces some new challenge to balancing the load evenly, it will be
relatively easy to diagnose it and get it addressed.

So, balancing the load evenly might not be the optimal strategy, but it is a
good and simple one. It's the approach we use today, but we've discussed more
sophisticated approaches, and at some point we might revisit this.

~~~
jeffbee
Thanks for the detailed reply. It would be interesting to see your plots of
latency broken down by hardware class and plotted as a function of load. I'd
be pretty surprised if optimal latency was achieved near idle, since in my
experience latency is a U-shape with respect to utilization: bad at 100% but
also bad at 0% since it takes time to wake resources that went to sleep.

I'm sure your system has its benefits; I just get triggered by "load
balancing" since it is so pervasive while also being a highly misleading and
defective metaphor.

------
fideloper
What Grafana dashboards do you all use/like for Prometheus + Nomad?

------
ospider
Actually, I think K8s is much simpler to operate than Nomad.

With Nomad you have too many options: you can run your service/job as a plain
process or within containers. With Kubernetes there is only one way to do
things -- containers -- which is a really simple choice to make.

Besides, Kubernetes has etcd built in, so I don't have to deploy and maintain
Consul.

Last but not least, I still see containers mysteriously gone, with no idea how
Nomad did that. With Kubernetes, such a thing never happened.

------
jugg1es
I love nomad and architected a large system that used it for several years. We
are finally moving away from it, however, because of its lack of native
support for autoscaling. I know there are third party solutions for it, but
that doesn't work for us. I suspect a large number of k8s users could use
nomad instead with way less overhead.

~~~
jambay
I'm curious if you tried the new autoscaler [1] that is in Tech Preview? Is
there something it's missing for your scenarios?

[1] [https://www.hashicorp.com/blog/hashicorp-nomad-autoscaling-t...](https://www.hashicorp.com/blog/hashicorp-nomad-autoscaling-tech-preview/)

~~~
jugg1es
As you probably know, decisions like changing infrastructure architecture is a
team decision and are made with the information available at the time. The
nomad autoscaler just wasn't announced/released in time for us to seriously
consider it. That coupled with the uncertainty around the future of Fabio made
us look elsewhere. We remain staunch supporters of Hashicorp products overall.
We will continue to use Terraform, Consul and a little Vault.

------
technological
Hope they allow users to access documentation for past versions of Nomad.
Currently there is no easy way to find out whether a particular configuration
option mentioned in the documentation is available or not.

------
MuffinFlavored
What can Nomad do that Terraform can't? I'm pretty sure you can blue-green
rolling deploy services inside of Docker containers with Terraform too.

~~~
jen20
They’re completely different classes of tool - Nomad is a runtime component,
Terraform is not.

~~~
MuffinFlavored
But Terraform can talk to the Docker daemon and orchestrate containers.

What can you achieve with Nomad that you can’t achieve with Terraform?

~~~
jen20
\- run a non-dockerised job

\- have a restart policy outside of what the docker daemon offers

\- automatically schedule across multiple machines based on constraints
including affinity and anti-affinity, bin packing etc

\- “dispatch” jobs

The Terraform Docker provider effectively provides a tiny subset of the
functionality of Nomad, which provides similar functionality across many
drivers, not just Docker.
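
For example, the constraint-based scheduling has no Terraform analogue; a
sketch of a job spec using it (names and values here are hypothetical):

    job "app" {
      datacenters = ["dc1"]

      # Hard requirement: only place this job on Linux hosts.
      constraint {
        attribute = "${attr.kernel.name}"
        value     = "linux"
      }

      # Soft preference: favour a particular node class when available.
      affinity {
        attribute = "${node.class}"
        value     = "high-memory"
        weight    = 50
      }

      group "app" {
        count = 3

        task "server" {
          driver = "docker"
          config {
            image = "example/app:1.0" # hypothetical
          }
        }
      }
    }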

------
4636760295
I worked at a hedge fund that also used nomad. The problem, however, is not
how well it scales or whatever, but the fact that all the accompanying
deployment info and literature is for kubernetes, and k8s has far more
features.

I like the quality of products from HashiCorp, but k8s is far, far, ahead of
where nomad is.

What I _really_ want is better integration between Terraform and Kubernetes.
The current TF k8s support leaves much to be desired: too many things are
missing or broken, and I find there are several bugs that result in deployment
flapping (i.e., constantly re-deploying the same thing when there are no
changes).

~~~
api
Which features is Nomad missing? Feature count comparisons are meaningless
unless the features are tied to actual important use cases. Lots of software
is encrusted with rarely used features that just add complexity.

~~~
q3k
The free version of Nomad is missing, IIUC:

\- quotas for teams/projects/organizations

\- preemption (i.e. a higher-priority job preempting a lower-priority one)

\- namespacing

So generally it's somewhat useless in organizations where there are multiple
different teams that should be able to coexist on a cluster without stepping
on each other's toes, or even where you want a CI system to access the cluster
in a safe manner.

~~~
schmichael
Preemption is going OSS!
[https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md](https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md)

We fully intend to migrate more features to OSS in the future -- especially as
we build out more enterprise features. As you can imagine, building a
sustainable business is quite the balancing act, and there's constant internal
discussion.

(I'm the Nomad Team Lead at HashiCorp)

~~~
tetha
This is going to make our AI team very happy, because they can just dump
experiments into the cluster at low priority and those will be done when
they're done.

It's also going to make the operators very unhappy, because it'll be harder to
monitor actual memory utilization (allocated memory vs. memory really in use)
in order to plan cluster extensions. Are there tools around, or work planned,
to make this kind of scaling and utilization planning easier?

~~~
schmichael
I was going to say "yes!" because we do have some telemetry/metrics
improvements coming up, but then I realized those won't address your use case.
It seems like aggregating cluster resource usage by job priority is what we
need for that. Our UI might be able to calculate that, but I don't think our
metrics include job priority so there's no way for your existing tooling to
display it.

Please file an issue or link me to an existing issue if you have time. This
seems really compelling!

------
JaggerFoo
Thanks for sharing this excellent article.

