
Open-sourcing Clusterman, a cluster autoscaler for Kubernetes and Mesos - drmorr
https://engineeringblog.yelp.com/2019/11/open-source-clusterman.html
======
ones_and_zeros
What led to the decision to not contribute enhancements to the k8s cluster
autoscaler? Seems like a lot of the features would be useful and in scope of
the cluster autoscaler...

~~~
ssk2
Former Yelp engineer here (I left recently). Clusterman was built about 18
months ago for our primarily Mesos and Marathon based workloads. At the time,
the Kubernetes cluster autoscaler was in its infancy and there was no
comparable solution that worked for Mesos.

Yelp is currently migrating to Kubernetes which is why support for Kubernetes
autoscaling was recently added to Clusterman.

~~~
jacques_chester
I call this phenomenon "Not Invented Yet Syndrome".

~~~
zapita
More like “real production workloads, can’t wait for vendors to catch up
syndrome”.

~~~
jacques_chester
That's more or less what I was trying to convey. Responsible engineers will
look around for mature solutions to their problems. Sometimes there aren't any
and sometimes it will be the most economically-defensible decision to roll
your own.

Of course, we notoriously underestimate the costs of rolling your own. So a fair
and common question is: why did you roll your own? What cost was so
compelling?

What sometimes happens though is that the roll-your-own decision, made at time
A, is later critiqued based on the options available at time B. If the
decision was being made at time B, then the roll-your-own decision might be a
symptom of "Not Invented Here Syndrome". But accusing someone of NIHS without
accounting for _when_ the original decision was made is unfair.

Hence my joke name, "Not Invented Yet Syndrome". Why didn't you use the
alternative? Because it didn't exist or wasn't applicable.

------
filereaper
>Kubernetes allows us to run workloads (Flink, Cassandra, Spark, and Kafka,
among others) that were once difficult to manage under Mesos (due to local
state requirements).

Just wanted to clarify that running Cassandra & Kafka on Mesos is much easier
now with the DC/OS Commons SDK.[1] Spark has always been supported on
Mesos.[2]

[1][https://github.com/mesosphere/dcos-commons](https://github.com/mesosphere/dcos-commons)

Cassandra:[https://docs.d2iq.com/mesosphere/dcos/services/cassandra/2.7...](https://docs.d2iq.com/mesosphere/dcos/services/cassandra/2.7.0-3.11.4/)

Kafka:[https://docs.d2iq.com/mesosphere/dcos/services/kafka/2.8.0-2...](https://docs.d2iq.com/mesosphere/dcos/services/kafka/2.8.0-2.3.0/)

[2][https://docs.d2iq.com/mesosphere/dcos/services/spark/2.9.0-2...](https://docs.d2iq.com/mesosphere/dcos/services/spark/2.9.0-2.4.3/)

~~~
ssk2
Small point of clarification: the DC/OS SDK works well for these applications
when running on DC/OS. It did not work well when run on open source Mesos,
which is what Yelp uses.

~~~
perfect_kiss
DC/OS is also open source, Apache 2.0, same as Mesos.

In fact, DC/OS includes open source Mesos with no modifications. What they add
is a packaged bunch of different API providers running as Mesos apps (Marathon,
DNS, etc.) and a somewhat dumb installer.

Autoscaling has long been possible for Marathon/Mesos. It is described on the
DC/OS website, though it does not involve any of the DC/OS additions:
[https://docs.d2iq.com/mesosphere/dcos/2.0/tutorials/autoscal...](https://docs.d2iq.com/mesosphere/dcos/2.0/tutorials/autoscaling/)

In my org, we have run open source Marathon/Mesos deployments for years without
any DC/OS additions, along with slightly modified versions of the tooling below:

[https://github.com/mesosphere/marathon-autoscale](https://github.com/mesosphere/marathon-autoscale)
[https://github.com/mesosphere/marathon-lb-autoscale](https://github.com/mesosphere/marathon-lb-autoscale)

There are also several other open-source autoscalers available for
Marathon/Mesos apart from these two.

This new one from Yelp seems promising, since it's battle-tested within an
organization that has many different kinds of workloads, so I would give it a
try on a new cluster.

------
gravypod
It would be really cool to see the features in this mainlined.

Side question: I know in the article it was mentioned that they liked their
signal approach to scaling because it allows them to preemptively scale. I'm
just not sure why that wouldn't be achievable by scaling the replica counts of
your deployments based on signals.

If scaling happens based on pending pods then just scale your pods so that
they're configured to handle your predicted traffic. Then the cluster will
obtain your desired state. Am I missing something?

~~~
ssk2
Scaling happens at two levels: both for the deployments and for the size of
the cluster itself. Clusterman operates on the cluster itself.

If you scale just the number of pods and there isn't enough available capacity
in the cluster, they are unable to deploy until that capacity is brought online.
By emitting a signal to increase the number of nodes in the cluster just
before we think that capacity is about to be needed, we can ensure that the
new pods are launched near instantly when the deployment is actually scaled
up. This is mostly useful when you're scaling by large increments (e.g.
hundreds of pods) that far exceed the spare capacity available in your
cluster.
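
To make the arithmetic concrete, here is a minimal sketch of that idea. It is
not Clusterman's actual signal API: `ClusterState`, `spare_capacity`,
`nodes_to_add`, and the pods-per-node figure are hypothetical names and numbers
chosen for illustration, and it assumes every node fits the same number of pods.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of a cluster; not a Clusterman type."""
    nodes: int          # nodes currently in the cluster
    pods_per_node: int  # assumption: every node fits the same number of pods
    running_pods: int   # pods currently scheduled


def spare_capacity(state: ClusterState) -> int:
    """Number of additional pods the cluster can run right now."""
    return state.nodes * state.pods_per_node - state.running_pods


def nodes_to_add(state: ClusterState, forecast_extra_pods: int) -> int:
    """Nodes to request ahead of time so the forecast pods fit on arrival."""
    shortfall = forecast_extra_pods - spare_capacity(state)
    if shortfall <= 0:
        return 0  # existing spare capacity already covers the forecast
    # Round up: a partially filled node still has to be provisioned.
    return math.ceil(shortfall / state.pods_per_node)


if __name__ == "__main__":
    state = ClusterState(nodes=10, pods_per_node=20, running_pods=180)
    # A deployment is expected to grow by 300 pods shortly.
    print(nodes_to_add(state, forecast_extra_pods=300))  # prints 14
```

A signal would emit that node count shortly before the deployment is scaled up,
so the nodes are already online when the pods arrive. A purely reactive
autoscaler would start the same calculation only once the 280 pods that don't
fit are already pending.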

~~~
gravypod
What benefit does that distinction give you? If the cluster scales to be able
to run all of your pending pods, and scales down when you have extra room, and
you scale your pods correctly (preemptively & with custom metrics), what do
you gain?

~~~
ssk2
Mostly time. For workloads that are pseudo interactive (e.g. continuous
integration), you're saving developers 10-20 minutes here or there. That can
add up to be quite a bit for a medium to large organisation.

------
madsbuch
Initial commit 18 days ago: I wonder if they did not use git before open
sourcing, or if they wanted to clean the commit log before publishing.

~~~
sixo
Everyone cleans commit logs before open sourcing

~~~
ramilexe
Why?

~~~
trevor-e
Because who knows what kind of private company information might possibly be
leaked otherwise in commit messages. From a legal point of view, it's a lot
easier to start with a clean slate.

~~~
sargram01
Not everyone does. Unreal Engine doesn't, for example; you can see project
code words in the commit history.

~~~
o-__-o
Not as sensitive as accidentally dropping information about your internal
network. an attacker could then take the long-troll method of infiltrating an upstream provider
to attack a juicy target (build system of fortune 500? yes please)

or maybe catching wind of some dev keys that really are root keys..

many reasons to sanitize git history before open sourcing. in fact many
organizations i have worked with still maintain two separate repos, one
internal and one open source using fancy magic (either with git or with
additional tools) to sanitize and sync commits between the two. i've seen code
commits to a large organization that are then packaged up and inspected for
license and security violations in an untrusted environment.. many reasons to
keep two (or more) running copies

------
sargram01
So when does Yelp see spikes in traffic needing a custom Autoscaler? I guess
just the daily time zone based sine wave type load?

~~~
ssk2
Usually around dinner time in our big metro areas :)

~~~
sargram01
That makes sense, do you have any annual spikes too? Like Uber has Halloween,
which I wouldn’t have guessed as being the worst.

~~~
ssk2
Mother's Day was a surprising one for me! A lot of people take their parents
out for brunch. Same goes for Valentine's Day.

------
therobot24
yelp doing something not shady or bullying? i'm very skeptical...

~~~
skyyler
were software engineers ever the ones at yelp doing shady things or bullying?

~~~
ar_lan
I'm not a Facebook fan, but I am hugely grateful for their contributions of
React and GraphQL.

I think engineers should be considered dis-associated.

~~~
meddlepal
Engineers are special innocent snowflakes that would never do a bad thing?

~~~
strig
Engineers on the ground developing technologies generally aren't responsible
for overarching business decisions, no.

~~~
InterestBazinga
I feel like this is an intricate argument.

Why are German soldiers held in a negative light even though the war decisions
were made by their commanders?

~~~
itake
They aren't held in a negative light... After the war, they were forgiven and
I think society in general has accepted that they were just following orders.
The leaders are definitely thought of in a negative light.

~~~
lxw
This is not true. "Just following orders" was held to be not a valid defense
in the Nuremberg trials:
[https://en.m.wikipedia.org/wiki/Superior_orders](https://en.m.wikipedia.org/wiki/Superior_orders)

