

Large-scale cluster management at Google with Borg - mrry
http://blog.acolyer.org/2015/05/07/large-scale-cluster-management-at-google-with-borg/

======
mandeepj
Can anyone shed some light on what Google is doing with tens of thousands of
machines in these clusters? Are they hosting crawlers on them?

~~~
WestCoastJustin
They host pretty much everything -- Maps, Gmail, Search, etc. I highly suggest
reading the Wired article below, as it covers this in detail [1], but if you
want a more technical look, check out the YouTube video [2] or the Borg paper
[3]. One really interesting development is that Mesos + Marathon + Chronos +
Docker puts a very similar system into the hands of your average IT team
today [4].

[1] [http://www.wired.com/2013/03/google-borg-twitter-mesos/](http://www.wired.com/2013/03/google-borg-twitter-mesos/)

[2]
[https://www.youtube.com/watch?v=0ZFMlO98Jkc](https://www.youtube.com/watch?v=0ZFMlO98Jkc)

[3]
[https://research.google.com/pubs/archive/43438.pdf](https://research.google.com/pubs/archive/43438.pdf)

[4]
[https://www.youtube.com/watch?v=hZNGST2vIds](https://www.youtube.com/watch?v=hZNGST2vIds)

------
gtaylor
I've been hesitant to spend a ton of time playing with Kubernetes yet, since
it is still in such a state of flux (and the docs are woefully inadequate).
With that said, if anyone can make something like this awesome, I'd put my
money on Google.

Exciting times!

~~~
samkone
Try Mesos/Marathon in addition to Kubernetes. Been running Kubernetes on Mesos
and it's just great.

~~~
throwaway1979
Swarm isn't shabby either.

------
_dark_matter_
Can anyone explain how MapReduce is built on top of Borg, when Borg doesn't
support scheduling with data locality? I'm having a hard time reconciling
that. One of the principal use cases of the MapReduce framework is reducing
data movement...
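For context, "scheduling with data locality" just means the scheduler prefers
machines that already hold replicas of a task's input blocks. A minimal sketch
of that preference (hypothetical names, not Borg's or Hadoop's actual
scheduler):

```python
# Hypothetical sketch of a locality-preferring scheduler: score each
# candidate machine by how many of the task's input blocks it already
# holds, and fall back to any machine when none match.

def pick_machine(task_blocks, block_locations, machines):
    """task_blocks: block ids the task reads;
    block_locations: block id -> set of machines holding a replica."""
    def local_blocks(machine):
        return sum(1 for b in task_blocks
                   if machine in block_locations.get(b, ()))
    # Highest locality wins; ties broken by machine name for determinism.
    return max(machines, key=lambda m: (local_blocks(m), m))

locations = {"b1": {"m1", "m2"}, "b2": {"m2"}, "b3": {"m3"}}
print(pick_machine(["b1", "b2"], locations, ["m1", "m2", "m3"]))  # -> m2
```

A scheduler without this preference (the question's premise) would pick
machines purely on free resources, so every map task might read its input
over the network.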

~~~
thrownaway2424
You seem to be suggesting that using local i/o to read inputs would be
preferable to using remote i/o, but I don't think that's a supportable
statement in a datacenter with a high-performance network.

~~~
wmf
But it's a fact that the original MapReduce/Borg used data locality (and
networks weren't as fast 15 years ago).

~~~
thrownaway2424
Yes, but time marches on. The amount of compute power available on a commodity
server node has grown far faster than the available disk I/O bandwidth.

~~~
VikingCoder
What makes you think the data is on disk?

No, seriously.

Picture if all of your data is in RAM, somewhere. If you can move your program
to that machine quickly, then that's a huge win.

~~~
thrownaway2424
It just doesn't seem very likely to me. It seems a lot more likely that the
data would be spread around on thousands of individual hosts, one megabyte at
a time. The only processes by which I imagine the entire contents of one shard
of a mapreduce ending up in memory of a single host are 1) a miracle, 2) the
previous stage of the mapreduce wrote it there, in which case it is
effectively remote access before the fact, and 3) the data lives there all the
time for some other reason, in which case you don't have a mapreduce, you have
a coprocessor.

~~~
VikingCoder
I agree about the one megabyte at a time. I disagree that that implies it's
likely evicted from RAM.

For one, if I'm running a MapReduce, it's likely either on the same data
someone else is investigating... or I'll have to re-run my MapReduce a few
times, refining exactly what I'm looking for.

So, I may be cheating by saying "Only the first time is the data cold. Every
other time, the data is likely to be hot in RAM."

Because you'd be right to say, "Well, duh. I'm talking about the FIRST time.
And since I'm talking about the first time, moving the data across the network
wouldn't be that bad. After that, yes, of course it will still likely be hot,
just as you describe."

My only real response is: I think it depends on how much data versus how much
code you're talking about. And maybe you can even broadcast the code to the
right nodes, to use the network even more effectively.
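A back-of-envelope version of that "how much data versus how much code"
tradeoff (the bandwidth and sizes here are illustrative assumptions, not
Google's numbers):

```python
# Illustrative sketch: compare shipping a program binary to the data
# against pulling a data shard across the network to the program.

NET_BYTES_PER_SEC = 10e9 / 8        # assume a 10 Gb/s NIC, fully utilized

def transfer_seconds(nbytes):
    """Time to move nbytes across the assumed network link."""
    return nbytes / NET_BYTES_PER_SEC

code_time = transfer_seconds(50e6)  # ~50 MB binary shipped to the data
data_time = transfer_seconds(64e9)  # ~64 GB shard pulled to the code
print(code_time, data_time)         # the binary is ~1000x cheaper to move
```

Under these assumptions, moving the code wins by three orders of magnitude,
which is the intuition behind broadcasting programs to the nodes that already
hold the data.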

"in which case you don't have a mapreduce, you have a coprocessor."

...a coprocessor which uses the MapReduce API. Yes, I think that is exactly
what you have. And it works whether the data is local or remote, cold or hot.
One unified API for all of those things, and it's probably optimized to work
local and hot, because I'm betting more than half of all MapReduces actually
work out to be local and hot.

------
dcsommer
Does Kubernetes use chroot jails like Borg, or has it advanced to LXC
containers? Nowadays LXC isolation seems strictly preferable.

~~~
wffurr
Borg also uses cgroups, just like LXC.

From the paper: "We use a Linux chroot jail as the primary security isolation
mechanism between multiple tasks on the same machine… all Borg tasks run
inside a Linux cgroup-based resource container."
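The recipe the paper quotes combines two independent mechanisms: chroot for
the filesystem jail and a cgroup for the resource container. A sketch of the
equivalent steps (assembled as shell strings rather than executed, since the
real calls need root; cgroup v1 paths assumed, names illustrative, not
Borg's):

```python
# Hypothetical sketch: the shell steps that would cap a task's memory via
# a cgroup and confine its filesystem view via chroot, per the quoted
# isolation recipe. This only builds the command list; running it needs root.

def isolation_steps(task, mem_limit_bytes, rootfs, binary):
    cg = f"/sys/fs/cgroup/memory/{task}"
    return [
        f"mkdir -p {cg}",
        f"echo {mem_limit_bytes} > {cg}/memory.limit_in_bytes",  # resource cap
        f"echo $$ > {cg}/cgroup.procs",   # move this shell into the cgroup
        f"chroot {rootfs} {binary}",      # filesystem jail, then exec the task
    ]

for step in isolation_steps("task42", 512 * 2**20, "/jail/task42", "/server"):
    print(step)
```

Note the two mechanisms are orthogonal: the cgroup limits what the task can
consume, while chroot limits what it can see.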

~~~
samkone
Remember that Borg is an old system. They probably started out with chroot
jails at the time because cgroups weren't ready yet. But Kubernetes uses
Docker, and Mesos uses cgroups.

~~~
KaiserPro
Cgroups have been around for a long time. Wikipedia says 8 years.

I think it was in RHEL 6.2, possibly even 6.0.

------
AdmiralAsshat
Does anyone else feel like calling your massive infrastructure software "Borg"
might be tempting fate? I mean, I'm not superstitious, but I would be wary of
calling my satellite network "Skynet."

~~~
oesmith
Skynet, like the UK military satellite comms network?

[https://en.m.wikipedia.org/wiki/Skynet_(satellites)](https://en.m.wikipedia.org/wiki/Skynet_\(satellites\))

