
Borg: The Next Generation [pdf] - blopeur
https://www.eurosys2020.org/wp-content/uploads/2020/04/slides/49_muhammad_tirmazi_slides.pdf
======
blopeur
Paper :
[https://dl.acm.org/doi/abs/10.1145/3342195.3387517](https://dl.acm.org/doi/abs/10.1145/3342195.3387517)

Video :
[https://www.youtube.com/watch?v=1qUQsZwbu6s](https://www.youtube.com/watch?v=1qUQsZwbu6s)

------
jeffbee
This is a really valuable contribution, if only to benchmark whatever you are
doing against what's possible, much as the large datacenter operators have
shown everyone that a PUE of 1.1 is achievable. This data shows that you can
achieve >60% utilization of both compressible and incompressible resources
and overcommit both kinds by 200%, while scheduling task arrivals at over 100
per second. It really is an extremely valuable glimpse into how the bigs
operate.

------
justicezyx
Was involved in this research project for a few months in early 2019. Feel
free to post some questions that are not sensitive to internal details, and I
can answer them here.

Edit: not one of the coauthors, since I left the team in April 2019.

~~~
sandGorgon
How do people who work on Borg look at Kubernetes? Is it the real "next
generation"?

~~~
justicezyx
When comparing Borg vs. K8s:

* Borg is primarily for managing hardware resources; K8s, in comparison, is for managing applications. I often call them complementary twins: both are container-based cluster managers, but their focuses are opposite. I also describe Borg as "machine oriented" and k8s as "application oriented".

* Borg emphasizes scheduling capabilities, performance, and scalability, plus the integration points for wiring together the baseline software foundation and interfacing with hardware. For example, SDN, security, and the name service all need to hook into Borg to interact with applications.

* K8s emphasizes modeling application semantics and mapping them onto containers. It provides abstractions for onboarding apps and a lot of tooling.

In summary:

* K8s would be considered a layer on top of Borg, if it were ever incorporated into Google production.

* As of now, there is nothing in sight that could be considered a next-generation Borg, just as there isn't a next-generation Linux. Software like Borg and Linux can only be supplanted, not succeeded; their vast range of capabilities defies succession. In other words, if you set out to build a better Borg (or Linux), you are doing it wrong. K8s did not try to outdo Mesos, OpenStack, or Borg per se; it's just a different thing.

Edit: There are numerous differences between Borg and Kubernetes. Send an
email to the address listed in my profile if you are really into the gory
details, and we can set up a discussion.
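To make "application oriented" concrete, here is a minimal Deployment
manifest (just a sketch; the name, image, and numbers are all made up): you
declare the app and its resource requests, and k8s maps it onto machines.

```yaml
# Hypothetical Deployment: describe the application (image, replicas,
# resource requests) and let the cluster handle placement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-server
  template:
    metadata:
      labels:
        app: hello-server
    spec:
      containers:
      - name: server
        image: example.com/hello-server:1.0   # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 256Mi
```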

~~~
sterlind
There's still nothing really like Borg in the open-source world. Linux can't
be replaced because it's open-source; Borg is secret sauce. Amazon and
Microsoft built their infrastructure around long-lived VM tenants, while Borg
based Google around tiny containers. It's doubtful another cloud will come
along, or that any of us giants will rebuild from the ground up, so I guess
it's unlikely anyone will ever need a Borg besides Google. I find it sad
though, because I fell in love with it from the rumors I heard and papers I
read when I redesigned Azure's cluster scheduler algorithm.

I think it's sad that 99% of the resources are used by hogs, though. I always
thought it'd be neat to build on tiny containers; this suggests that you
really can be fine with a few giant tenants and minimal colocation.

 _sigh_ some day I'll get to design my dream Borg successor, even if it's a
TempleOS-like art project after I've descended into dementia.

~~~
thu2111
Don't be fooled by the 99% figure. The paper says that in 2011 the figure
wasn't much different. Well, that was a bit surprising to me, because when I
was using Borg in 2011 it didn't feel that way.

Borg is vast. Unimaginably vast. Any engineer can start a job in the free tier
for a personal web server in minutes, all the way up to saturating 100,000
cores to process some dataset. You can browse cluster job lists forever and
never reach the bottom. 1% of jobs is still a huge number of different jobs.

The "hogs" are going to be jobs like web search serving, indexing and related,
logs processing etc.

The "mice" are going to be the long tail of jobs. Remember that this sort of
paper presents Borg as a kind of exemplar of what large cluster systems look
like, but as you already observe, that's not really true. Borg is really
unique to Google and Google is really unique. In particular due to the
personalities of its founders and its financial position Google has a massive
long tail of products and web sites that don't see much usage, or that see
decent usage by non-Google standards but get lost in the noise of search,
ads, YouTube, etc.

So the hogs vs mice phenomenon is telling us more about the nature of Google
and how it does projects than something fundamental about job scheduler
systems.

~~~
alexeldeib
Interesting point re: the long tail, but isn't the same true of all the
hyperscalers? It's certainly true for Amazon and Microsoft, who both have a
better reputation for keeping the long tail of products alive than Google. So
if anything, that would point at Google having optimized for this scenario
better and earlier.

------
epistasis
Does Borg also track and allocate network usage? That has been an issue for me
in compute environments in the past for high IO situations.

~~~
jeffbee
Google's network tracks and manages network usage at the edge. I'm not sure if
experts would say that was done "by borg" or not. On each host there is a
"host enforcer" which manages flows to and from that host. You can read all
about it at [1]. I don't think there are any publications suggesting that Borg
uses network flow rate for scheduling.

1:
[https://research.google/pubs/pub43838/](https://research.google/pubs/pub43838/)

------
remus
The resource usage section at the end was really interesting, and surprising
to me. 1% of jobs use 99% of resources! It would be interesting to understand
how this pattern came about, and whether there are particular engineering
decisions that tend to produce this situation: a handful of incredibly
resource-intensive jobs and loads of very lightweight ones.

~~~
jeffbee
That seems more related to a weird choice of denominator than to any
underlying facts of large-scale cluster management. Large services have stable names and
run forever. Search is just "search" and bigtable is just "bigtable" (some
irrelevant details have been elided for clarity). If I run a batch logs
analysis job though it is transient and has a unique name every time I run it.
So there's a huge long tail of transient job _names_ and a few dominant
services with permanent names, which makes the number-of-jobs denominator
quirky.
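The denominator effect can be illustrated with made-up numbers: a handful of
permanent service names dwarf thousands of uniquely-named transient runs, so
the top 1% of names accounts for ~99% of usage (all figures below are
invented for the sketch, not from the paper):

```python
# Made-up resource totals (arbitrary units): a few permanent service
# names vs. thousands of transient batch runs, each with a unique name.
permanent = {"search": 40_000.0, "bigtable": 30_000.0, "indexing": 29_000.0}
transient = {f"logs-analysis-{i}": 0.01 for i in range(9_700)}

jobs = {**permanent, **transient}
total = sum(jobs.values())

# Resource share of the top 1% of job *names*.
top = sorted(jobs.values(), reverse=True)
top1pct = top[: max(1, len(jobs) // 100)]
share = sum(top1pct) / total
print(f"{len(top1pct)} of {len(jobs):,} names use {share:.1%} of resources")
```

With these numbers the top 97 names (out of 9,703) hold over 99.9% of the
usage, even though the big services barely register in the name count.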

~~~
yeshengm
Yeah. Likely a small number of analytics jobs take up most of the memory/CPU
footprint, while critical prod OLTP jobs use comparatively few resources.

------
kgraves
Interesting, I can only wish to run this sort of tech for my own small
company.

~~~
GauntletWizard
You can! I have all the equivalent graphs and most of the equivalent features
on the k8s clusters I run. Google has commoditized borg, at least for the
scales that most companies will run.

~~~
justicezyx
Not really: Borg is only really Borg at the scale it operates at. It's like
saying Michael Jordan isn't a basketball god if he's only ever allowed to
play with 5-year-old kids.

~~~
GauntletWizard
Some scale is needed to see the true awesomeness of Borg, and you can't build
a K8s cluster anywhere near as big as a Borg cell, but many of the same "cool
features" can be built on K8s as well as they can on Borg. Things like
rolling deploys and stateless, self-healing systems work at K8s scale.

I recently got back onto a project that had been in maintenance mode for a
year, but because it had a nice K8s setup, it had more or less managed itself
through OS upgrades and rebuilds with very little user intervention.

------
polskibus
How much different/better/worse is this compared to Kubernetes?

~~~
fh973
Much simpler to use, as a user. I'd say the k8s API is very much overdesigned
and has artifacts that you don't find in other job description languages. Can
anyone explain to me the need for StorageClasses, PVCs, and PVs with CSI
mixins just to access some files?

As a scheduler it's much more powerful in capabilities, scalability, and
performance. It was designed to react to different workload demands in huge
clusters very quickly. Imagine a data center goes offline and you need to
spin up the search engine while killing a MapReduce with 20k workers. A Borg
cluster is a war zone; priorities and quotas win.

~~~
jrockway
At Google, the vast majority of jobs persisted data ("wrote files") over an
RPC API, so the container orchestrator didn't have to care about storage. Out
in the real world, that seems to be extremely uncommon. Best case: cloud
providers give you a block device tied to an AZ and you're on your own (or
you rewrite your app to use S3 or something that implements a similar API).
On prem, people use all kinds of crazy things, ranging from NFS to
multi-million-dollar black boxes. Kubernetes has to support them all.

CSIs exist because there are a billion vendors of "cloud storage" solutions,
and apps need a way to decouple themselves from the details. If you just need
a block device, your Kubernetes app can run against anything that implements
that API; the app you developed on DigitalOcean will Just Work on GKE. That's
the idea behind those.
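For example, the decoupling looks roughly like this (a sketch; the claim and
StorageClass names are made up, and which classes exist depends entirely on
the cluster): the app asks for 10Gi of some named class of storage, and
whatever CSI driver backs that class provisions it.

```yaml
# Hypothetical PersistentVolumeClaim: the app states what it needs,
# not which vendor's storage fulfills it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # cluster-specific; assumed for the sketch
  resources:
    requests:
      storage: 10Gi
```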

For workload management, Borg feels very close to StatefulSets; your replica
count could increase and decrease as you demanded it, but each task was
individually named (0.your-job.cell). I ran all my jobs with semantics similar
to Deployments (tasks did not care which id they had; they all showed up as
load balancer targets) and I think Borg had different abstractions for things
in Kubernetes like ReplicaSets (i.e., a particular [config,code] tuple that is
running in production that logically makes up a deployment).
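For what it's worth, that stable per-task naming maps roughly onto a
StatefulSet (a sketch; the name and image are made up): each replica gets an
individually addressable identity like your-job-0, your-job-1, much like
Borg's 0.your-job.cell.

```yaml
# Hypothetical StatefulSet: replicas keep stable ordinal names across
# restarts, reachable via the headless Service's per-pod DNS entries.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: your-job
spec:
  serviceName: your-job   # headless Service providing per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      app: your-job
  template:
    metadata:
      labels:
        app: your-job
    spec:
      containers:
      - name: task
        image: example.com/your-job:1.0   # placeholder image
```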

You didn't ask so I won't go into the details as it gets very complicated, but
networking is vastly different. (Kubernetes tries to provide IPAM and
connection-balancing that is transparent to workloads; Google preferred
client-side "smart" load-balancing libraries. That meant that load-balancing
complexity was outside the scope of Borg, but for Kubernetes, it's very much
in scope. Often with confusing results, but I digress.)
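As a sketch of the client-side style (everything here is hypothetical: the
name-service lookup, the addresses, and the plain round-robin policy), the
client resolves the backend list itself and picks a target per request,
rather than sending traffic through a transparent cluster load balancer:

```python
import itertools

def resolve_backends(service_name: str) -> list[str]:
    # Stand-in for a name-service lookup; the addresses are made up.
    return [f"10.0.0.{i}:8080" for i in range(1, 4)]

class ClientSideBalancer:
    """Toy "smart" client: owns the backend list and the pick policy."""

    def __init__(self, service_name: str):
        self._rr = itertools.cycle(resolve_backends(service_name))

    def pick(self) -> str:
        # Plain round-robin; real libraries also weigh load,
        # locality, and backend health.
        return next(self._rr)

lb = ClientSideBalancer("your-service")
targets = [lb.pick() for _ in range(6)]
print(targets)
```

The point is that the balancing logic lives in the client library, so the
orchestrator never has to see it.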

My experience, having used both extensively, is that they are about the same
to the end user. Before I worked at Google, I always let "someone else" handle
deployment and maintenance of production. At Google, I felt like I could write
and release a piece of software to Borg in the same day (and did once!) After
Google, I struggled with a bunch of orchestrators until I landed on
Kubernetes, which gives me similar confidence and ease of use. Kubernetes has
more API surface, probably because it has more users. At Google you could
declare by fiat how jobs were to be run. In the real world, people aren't
going to use your thing if you don't support their favorite feature. I have
tried the opinionated container orchestrators in the real world (Heroku,
Convox, ECS) and they didn't make me happy, while Kubernetes did.

------
gautamcgoel
Is Borg essentially Google's version of Condor?

~~~
justicezyx
What is condor?

~~~
scott_s
They probably mean the Condor project which started in academia in the late
'90s as a part of what was called grid computing:
[https://research.cs.wisc.edu/htcondor](https://research.cs.wisc.edu/htcondor)
They also have an extensive list of publications over the years:
[https://research.cs.wisc.edu/htcondor/publications.html](https://research.cs.wisc.edu/htcondor/publications.html)

------
throwaway29303
What is Borg written in? C, C++ or Go?

~~~
boulos
Borgmaster itself is written in C++, though like many internal Google systems
using protobufs for defining a lot of the interfaces and RPC bits.

It’s been around a long time, before Go existed :).

------
pwg
Also a name clash with Borg the backup utility (which is the #5 item returned
from a google search for "borg" for me when I just tested what would come
back):

[https://borgbackup.readthedocs.io/en/stable/](https://borgbackup.readthedocs.io/en/stable/)

~~~
Youden
Because I was curious:

- The Borg are a fictional race of aliens in Star Trek: The Next Generation.
They were first shown on TV in 1989. This appears to be the basis for the
software projects' naming.

- Borg, Google's cluster management system, was introduced at some point
prior to 2006 (per [0], published in 2016, which states Borg was introduced
"over a decade ago").

- Borg the backup system was forked from Attic in 2015 [1]

[0]: [https://cloud.google.com/blog/products/gcp/from-google-to-
th...](https://cloud.google.com/blog/products/gcp/from-google-to-the-world-
the-kubernetes-origin-story)

[1]:
[https://borgbackup.readthedocs.io/en/stable/changes.html#ver...](https://borgbackup.readthedocs.io/en/stable/changes.html#version-0-23-0-2015-06-11)

