Hacker News new | past | comments | ask | show | jobs | submit login
Borg: The Next Generation [pdf] (eurosys2020.org)
104 points by blopeur 29 days ago | hide | past | web | favorite | 57 comments

This is a really valuable contribution, just to benchmark whatever you are doing vs what's possible, in a similar way that the large datacenter operators have shown everyone that PUE of 1.1 is achievable. This data shows that you can achieve > 60% utilization of both compressible and incompressible resources, overcommit both kinds of resources by 200%, while scheduling task arrivals at over 100 per second. It really is an extremely valuable glimpse into how the bigs operate.

Was involved in this research project for a few months in early 2019. Feel free to post some questions that are not sensitive to internal details, and I can answer them here.

Edit: not one of the coauthors, since I left the team in April 2019.

How do people who work on Borg look at kubernetes? Is it the real "next generation"

When comparing Borg vs. K8s:

* Borg is primarily for managing hardware resources. K8s, in comparison, is for managing applications. I often call them a complementing twins. They both are container based cluster manager, but their focuses are opposite. I also use "machine oriented" for Borg, and "application oriented" for k8s.

* Borg emphasize on scheduling capabilities, performance, and scalability. And various integration capabilities for wiring together the base line software foundation, and interfacing with hardwares. For example, SDN, security, name service, needs to hook up with Borg, to interact with all applications.

* K8s emphasize on modeling application semantics and how to map them into containers. K8s provides abstractions for on boarding apps and a lot of toolings.

In summary:

* K8s is considered a layer on top of Borg, if it can ever be incorporated into Google production.

* As of now, there is nothing in sight that can be considered next generation Borg. Just like there isn't any next generation Linux. Software like Borg, and Linux, can only be supplanted, not succeeded. Their vast amount capabilities defy succeeding. In other words, if one want to build a better Borg (or Linux) they are doing it wrong. K8s did not try to better mesos or open stacks, or Borg perse, it's just a different thing.

Edit: There are numerous difference between Borg and kubernetes. Send me an email listed in my profile, if you are really into the gory details. We can setup some discussion.

There's still nothing really like Borg in the open-source world. Linux can't be replaced because it's open-source, Borg is secret sauce. Amazon and Microsoft built their infrastructure to be based around long-lived VM tenants, Borg based Google around tiny containers. It's doubtful another cloud will come along or any of us giants will rebuild from the ground up, so I guess it's unlikely anyone will ever need a Borg besides Google. I find it sad though, because I fell in love with it from the rumors I heard and papers I read when I redesigned Azure's cluster scheduler algorithm.

I think it's sad that 99% of the resources are used by hogs, though. I always thought it'd be neat to build on tiny containers; this suggests that you really can be fine with a few giant tenants and minimal colocation.

sigh some day I'll get to design my dream Borg successor, even if it's a TempleOS-like art project after I've descended into dementia.

Don't be fooled by the 99% figure. It says in 2011 the figure wasn't much different. Well that was a bit surprising to me because when I was using Borg in 2011 it didn't feel that way.

Borg is vast. Unimaginably vast. Any engineer can start a job in the free tier for a personal web server in minutes, all the way up to saturating 100,000 cores to process some dataset. You can browse cluster job lists forever and never reach the bottom. 1% of jobs is still a huge number of different jobs.

The "hogs" are going to be jobs like web search serving, indexing and related, logs processing etc.

The "mice" are going to be the long tail of jobs. Remember that this sort of paper presents Borg as a kind of exemplar of what large cluster systems look like, but as you already observe, that's not really true. Borg is really unique to Google and Google is really unique. In particular due to the personalities of its founders and its financial position Google has a massive long tail of products and web sites that don't see much usage, or see decent usage by non-Google standards but which gets lost in the noise of search, ads, YouTube etc.

So the hogs vs mice phenomenon is telling us more about the nature of Google and how it does projects than something fundamental about job scheduler systems.

Interesting point re: the long tail, but isn't the same true of all the hyperscalars? It's certainly true for Amazon and Microsoft, who both have a better reputation for keeping the long tail of products alive than Google. So if anything, that would point at Google having optimized for this scenario better and earlier.

Mind to connect me at info@nascentcore.ai?

We might have something to show.

I tried, but your email server is down! Gmail is saying it's timing out.

It's info@nascentcore.com Sorry about that...

from what i read as comments on this post, there is a bit of a gray area here

It seems that K8s takes networking under its purview, Borg does not and leaves it to the client (another comment here). K8s also implements StatefulSets for storage.

Does Borg also track and allocate network usage? That has been an issue for me in compute environments in the past for high IO situations.

Google's network tracks and manages network usage at the edge. I'm not sure if experts would say that was done "by borg" or not. On each host there is a "host enforcer" which manages flows to and from that host. You can read all about it at [1]. I don't think there are any publications suggesting that Borg uses network flow rate for scheduling.

1: https://research.google/pubs/pub43838/

For the most part, other services take care of network resources. Borg might have some host level network limits, but I haven't seen them used much.

Yes. Not directly Borg but another infrastructure component does.

The resource usage section at the end was really interesting, and surprising to me. 1% of jobs use 99% of resources! It would be interesting to try and understand how this pattern came about and if there's particular engineering decisions that tend to lead to this situation where you have a handful of incredibly resource intensive jobs and loads of very lightweight jobs.

That seems more related to weird choice of denominator than to any underlying facts of large scale cluster management. Large services have stable names and run forever. Search is just "search" and bigtable is just "bigtable" (some irrelevant details have been elided for clarity). If I run a batch logs analysis job though it is transient and has a unique name every time I run it. So there's a huge long tail of transient job _names_ and a few dominant services with permanent names, which makes the number-of-jobs denominator quirky.

Yeah. Likely a small amount of analytics jobs taking up most memory/CPU footprint, while critical prod OLTP jobs take very little resource comparably.

Interesting, I can only wish to run this sort of tech for my own small company.

I think you shouldn't - it's like wishing you could drive an 18-wheeler to do grocery. Most people don't need that kind of firepower ever, and that's a good thing - where are you going to park your 18-wheeler, for starters?

You can! I have all the equivalent graphs and most of the equivalent features on the k8s clusters I run. Google has commoditized borg, at least for the scales that most companies will run.

Not really, Borg is itself only at the scale it operates at. It's like Michael Jordan is not the baseball god, if it's only allowed to playing with 5 year old kiddos.

Some scale is needed to see the true awesomeness of Borg, and you can't built a K8s cluster anywhere near as big as a Borg cell, but many of the same "Cool features" can be built on K8s as well as they can with borg. Things like rolling deploys and stateless, self-healing systems work at k8s scale.

I recently got back on a project that was on maintenance mode for a year - But because it had a nice K8s setup, it had more or less managed itself through OS upgrades and rebuilds with very little user intervention.

Can't you run Kubernetes? (which is the open-source project from Google inspired by Borg)

Kubernetes is very different from Borg.

Can you share how? I’ve heard of borg previously as a inspiration for open source k8s but I’m surprised to hear that it’s very different.

As a former Google SRE: Kubernetes is very different for Borg, but it's also not. Most of the core concepts apply (A Borg "Alloc" is a Kubernetes "Pod", a "Task" in an alloc is a "Container" inside a pod, they're often implemented very differently but the abstraction can be reasoned about in much the same ways.

However, the developer experience is hugely different, both in the things that Kubernetes does have (Configuration Canaries) and the things it doesn't have (Global Load Balancing). There's patterns and hacks around most of these, but they're often created by developers who didn't understand the tradeoffs.

Most importantly: Borg is very well sanded. Borg was a production-ready system ten years ago, and while it's had to grow and change, there's many things that "work together" just nicely in the Google Ecosystem, that just don't yet outside. Partially because they support many more alternative implementations that don't quite have a standard API, but mostly because there's just rough spots about how you deal with marginal/edge cases that aren't found and documentation written on how to avoid them.

It's still an inspiration.

K8s was meant to fix borg's broken application management semantics and toolings.

How much different/better/worse is this compared to Kubernetes?

Much simpler to use as a user. I'd say the k8s API is very much over designed and has artifacts that you don't find in other job description languages. Or can anyone explain me the need for StorageClasses, PVCs and PVs with CSI mixins just to access some files?

As a scheduler it's much more powerful in capabilities, scalability and performance. It was designed to react to different workload demands in huge clusters very quickly. Imagine a data center goes offline and you need to spin up the search engine while killing a map reduce with 20k workers. A borg cluster is war zone, priorities and quotas win.

At Google, a vast majority of jobs persisted data ("wrote files") over an RPC API, so the container orchestrator didn't have to care about storage. Over in the real world, that seems to be extremely uncommon. Best case: cloud providers give you a block device tied to an AZ, and you're on your own (or you rewrite your app to use S3, or something that implements a similar API). On prem, people use lots of crazy things ranging from NFS to multi-million-dollar black boxes. Kubernetes has to support them all.

CSIs exist because there are a billion vendors of "cloud storage" solutions, and apps need a way to decouple themselves from the details. If you just need a block device, your Kubernetes app can run against anything that implements that API; your app you developed on DigitalOcean will Just Work on GKE. That's the idea behind those.

For workload management, Borg feels very close to StatefulSets; your replica count could increase and decrease as you demanded it, but each task was individually named (0.your-job.cell). I ran all my jobs with semantics similar to Deployments (tasks did not care which id they had; they all showed up as load balancer targets) and I think Borg had different abstractions for things in Kubernetes like ReplicaSets (i.e., a particular [config,code] tuple that is running in production that logically makes up a deployment).

You didn't ask so I won't go into the details as it gets very complicated, but networking is vastly different. (Kubernetes tries to provide IPAM and connection-balancing that is transparent to workloads; Google preferred client-side "smart" load-balancing libraries. That meant that load-balancing complexity was outside the scope of Borg, but for Kubernetes, it's very much in scope. Often with confusing results, but I digress.)

My experience having used both extensively is that there are about the same to the end user. Before I worked at Google, I always let "someone else" handle deployment and maintenance of production. At Google, I felt like I could write and release a piece of software to Borg in the same day (and did once!) After Google, I struggled with a bunch of orchestrators until I landed on Kubernetes, which gives me similar confidence and ease of use. Kubernetes has more API surface, probably because it has more users. At Google you could declare by fiat how jobs were to be run. In the real world, people aren't going to use your thing if you don't support their favorite feature. I have tried the opinionated container orchestrators in the real world (Heroku, Convox, ECS) and they didn't make me happy, while Kubernetes did.

while I do agree that k8s is quite overengineered you're not comparing apples to apples: on Borg you don't have persistent storage, because that's solved at the application level with libraries that talk directly to Colossus etc. A lot of complexity in k8s is the consequences of making it adaptable to different environments

k8s and borg are similar in the same way as gRPC and stubby: very similar in conceptual architecture, quite different in objective outcomes. Read the linked article to see that Borg cells have over ten thousand machines each, then note that k8s falls apart with 5000 machines. k8s supports up to 100 pods per node, and you query this trace data to see how that compares to borg. k8s in my experience can schedule about 5 pods per second, and this paper gives much higher figures for borg.

Yes for sure, in terms of expansion/adoption. It's not so certain in terms of function/utility. As a Google outsider, gRPC really does seems like Stubby for the rest of us (with balancing left as tradeoff for the community). Kubernetes does not seem functionally at all, to be a Borg/Omega. it's more like a porcelain for running Heroku/12-factor/Nanoservice style workloads on top of a Borg-like (that's no small thing, but it is just a thing) after learning what Amazon learned out the gate on AWS, that developers will not be constrained on framework choices (ie they're not ready to settle on a PaaS).

To that extent, there's a hole left dealing with things that do need to consider state versus run networked API services. Kubernetes seems to have no good story here, whether it's the evolutionary progress happening around StatefulSets/PVC/PV, or a per appliance operator for you, and you, and you, which punch a hole as big as you like in the scheduling abstraction. Streaming for example is a notable pain, but pretty much every OSS project created in this century that has state needs a compensating tool, typically an operator, to function on Kubernetes. I'm not even sure at this point whether state is a design consideration that can be retrofitted—that's not a criticism of Kubernetes, but it is a complication and investment factor for stateful workloads. So what may happen should Kubernetes be one of those things that does in fact end up being a long term technology, is the entire software industry offloads state management to vendors (ie to a handful of cloud services), or something in open source reacts and is created to fill the infrastructure gap for state management.

I don't think you can count on Google solving stateful for k8s, because within Google all storage devices and data thereupon are, to a fair approximation, totally disposable. There is nothing at Google considered a "stateful service" the way k8s community members mean it, e.g. a mysql server with critical local files.

In my personal opinion it is more valuable to adopt the Google model where no local file is considered critical, than it is to try to cram statefulness into an otherwise cloud-native stack. I feel that if you still care about specific files on specific disks then you really haven't fully adopted the meaning of cloud-native.

I agree with pretty much all of this. The point I'm making and not conveying well is that the state eventually has to be held somewhere. So not that it has to colocate with application services, but that if you want to store something with something, then Kubernetes as the substrate makes things harder. Even where you have say replication built in and are not reliant on 'the' file or disk, you tend to need to bolt on an operator to handle replication/placement. An example is Etcd used within Kubernetes. You don't get to just run Etcd on K8s itself; it needs an operator which was about 9KLOC last time I looked.

How Google handles disk storage? From what I remember of the papers, GFS depends on a disk service, that exposes the disk resources. And how YouTube store MySQL data? Uses GFS or some other mechanism?

Not worked at Google, but from what I read, only GFS or Colossus worry about disk storage, every other application can write only to GFS and have no dependency on local disks.

That's not quite true, or at least not how it worked 5 years ago.

Jobs can write to local disk. Where do you think their executables came from? They can also use local disk as a scratch space for various things. The main use was writing out logs. Logs get written then picked up by a co-located job and saved out to disk clusters like Colossus, where they are then in turn picked up for processing.

However it's true that local disk was considered to be used only for transient data, most of the time. You could configure Borg to not work like that, so you could certainly run MySQL clusters on it and things, but that was a very unusual setup and not at all recommended. After all, then you'd have to manage backups and machine failure yourself.

This practice worked amazingly well when writing software entirely in house. So it was great for things like web search where there was no open source stack waiting to be adopted. It was disastrous for backwards compatibility with pre-written software, and I think this sort of thing contributed to Google's notoriously strong NIH syndrome. Using open source software at Google is hard; there are processes around it that must be satisfied, but more importantly, that software will expect POSIX file APIs to actually work and on Borg they don't. They appear to work, right up until your job gets evicted and then the data goes poof. To solve this you need to work with files using Google's own proprietary file VFS APIs that are backed by RPC clustered storage. GFS then Colossus presented a FS API that was only vaguely related to POSIX, so you couldn't even just hack together a quick bridge.

Net/net it was worth importing small in-memory libraries from the open source world, very rarely, but anything more substantial like a server - forget it. This could lead to absurd outcomes. I worked on a project there many years ago that would have really benefited from using Apache to do file serving, but Apache didn't integrate with Borg so I ended up using "static GWS" as it was called at the time. But static GWS had just enough features to serve websites written exactly how Googlers wrote it and nothing more, so it was a total fail at serving a third party developed website we'd agreed to host. Much pain.

One of the problems with Docker and container architectures in my mind is that they ultimately evolved out of Borg. The cgroups work in the kernel and other kernel features were developed by Paul Menage and others on the Borglet team for Google's internal use. Then the Docker guys picked them up and created this notion of containers that contain a whole OS - not at all how Google does it - but they kept the notion of transient local storage. Well that's a disaster, because normal software expects local disk to stick around. In my post-Google career I have witnessed more than one disaster caused by Docker of the form "whoops, we misconfigured our Dockerfiles and just erased our private key". There's no good justification for that. For most people systemd with some simple use of static linking would work just as well. That's what Borglets used to do - set up cgroups, have a simple base OS and then run statically linked binaries.

> local disk. Where do you think their executables came from?

Just FYI, a lot changes in 5 years.

how is it different now? there is no local disk? is it more like a serverless appr oach?

Why is the community getting such a bad deal then? Why does kubernetes have to be so much worse than Borg? Is it time to rewrite Kubernetes?

That seems a bit unfair. The community is getting a pretty square deal with k8s, few people need to manage cells with more than 5000 machines, and those that do can probably afford to write their own schedulers that meet their scalability and performance needs. Google can't just throw Borg over the wall because it's pretty hairy, relies on other parts of Google that also aren't public, and because of the general belief that "the community" is not ready to deal with a large, high-performance C++ program of the type that Google tends to write, which is partly condescension but also pretty much true.

Does Go vs C++ account for some of the performance difference? My limited experience with Go suggests it's way more noisy with strace, which can't be great for throughput.

Go programs are slow compared to C++, the compiler optimises for its own speed and simplicity rather than fast output binaries.

But there are big architectural differences too. Borg can split its master across multiple machines for example. The scheduler is a very tightly optimised piece of code. Borg clusters are managed by a central, dedicated specially trained team so there's basically no limit to the operational complexity that a Borg cluster can take on as long as they're all consistent - which they are.

It's not worse overall, just different.

Early on (changing somewhat now), large scale and throughput were not big goals for K8S. Given most uses are with VMs, and pretty much all uses are tiny compared to Borg, this was a totally reasonable prioritization.

K8S fixes many many serious issues in Borg that are mostly unfixable at this point (I worked on Borg for many years), it's better in many ways.

For scale specifically, Borg has a big advantage in that the state database in internal (the master maintains it directly on a local disk). This is naturally much faster then using etcd (at least etcd today).

Is Borg essentially Google's version of Condor?

Borg is more like Kubernetes than Condor. It's for managing multiple computing jobs across many machines, instead of managing one highly parallel job across many machines.

What is condor?

They probably mean the Condor project which started in academia in the late '90s as a part of what was called grid computing: https://research.cs.wisc.edu/htcondor They also have an extensive list of publications over the years: https://research.cs.wisc.edu/htcondor/publications.html

What is Borg written in? C, C++ or Go?

Borgmaster itself is written in C++, though like many internal Google systems using protobufs for defining a lot of the interfaces and RPC bits.

It’s been around a long time, before Go existed :).

Also a name clash with Borg the backup utility (which is the #5 item returned from a google search for "borg" for me when I just tested what would come back):


Because I was curious:

- The Borg are a fictional race of aliens in Star Trek: Next Generation. They were first shown on TV in 1989. This appears to be the basis for the software projects' naming.

- Borg, Google's cluster management system, was introduced at some point prior to 2006 (per [0], published in 2016, which states Borg was introduced "over a decade ago").

- Borg the backup system was forked from Attic in 2015 [1]

[0]: https://cloud.google.com/blog/products/gcp/from-google-to-th...

[1]: https://borgbackup.readthedocs.io/en/stable/changes.html#ver...

Googler here.

Borg @ Google predates Borg the backup utility by several years.

Nah, Borg as a cluster manager way cooler and appropriate than Borg the backup utility.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact